Class: Traject::NokogiriReader
- Inherits:
-
Object
- Object
- Traject::NokogiriReader
- Includes:
- Enumerable
- Defined in:
- lib/traject/nokogiri_reader.rb
Overview
A Traject reader that reads XML and yields zero to many Nokogiri::XML::Document objects as source records in the traject pipeline.
It processes the entire input document with Nokogiri::XML.parse (DOM parsing),
so it holds the entire input document in RAM until iteration completes.
(There is a separate ExperimentalStreamingNokogiriReader available, but it is
experimental, half-finished, and problematic; it may disappear or change in
backwards-incompatible ways at any time, and is not recommended for production use.)
You can have it yield the entire input XML as a single traject source record
(default), or you can use the setting nokogiri.each_record_xpath to split
the source up into separate records to yield into the traject pipeline -- each one
will be its own Nokogiri::XML::Document.
Settings
- nokogiri.namespaces: Set namespace prefixes that can be used in other settings, including extract_xpath from NokogiriMacros. (This is the key the #default_namespaces method reads.)
- nokogiri.each_record_xpath: If set to a string xpath, takes all matching nodes from the input doc and yields them individually as source records to the pipeline. If you need to use namespaces here, they must be registered with nokogiri.namespaces. If your source docs use namespaces, you DO need to use them in your each_record_xpath.
- nokogiri_reader.extra_xpath_hooks: Experimental and in progress, see below.
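To make the relationship between the namespace setting and each_record_xpath concrete, here is a minimal sketch. It models the settings as a plain Hash, using the "nokogiri.namespaces" key that the #default_namespaces method on this page reads. The helper unregistered_prefixes is hypothetical (it is not part of traject) and uses a deliberately simple regex that does not account for colons inside xpath string literals.

```ruby
# Settings as a plain Hash, keyed the way the reader's accessor methods expect.
reader_settings = {
  "nokogiri.namespaces" => { "oai" => "http://www.openarchives.org/OAI/2.0/" },
  "nokogiri.each_record_xpath" => "//oai:record"
}

# Hypothetical helper (not part of traject): list the prefixes used in an
# xpath that have no entry in the namespaces hash.
def unregistered_prefixes(xpath, namespaces)
  # Match "prefix:" but skip axis syntax such as "ancestor::".
  xpath.scan(/([A-Za-z_][\w.-]*):(?!:)/).flatten.uniq - namespaces.keys
end

missing = unregistered_prefixes(
  reader_settings["nokogiri.each_record_xpath"],
  reader_settings["nokogiri.namespaces"]
)
puts missing.inspect # => []
```

An xpath such as "//oai:record/dc:title" would report "dc" as missing under these settings, which is the situation the settings above warn about.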
nokogiri_reader.extra_xpath_hooks: For handling nodes outside of your each_record_xpath
What if you want to use each_record_xpath to yield certain nodes as source documents, but you also need information from other parts of the input document? This came up when developing the OaiPmhNokogiriReader, which yields "//oai:record" as pipeline source documents, but also needed to look at "//oai:resumptionToken" to scrape the entire result set.
There is a semi-finished/in-progress feature that meets that use case -- unclear if it will meet all use cases for this general issue.
The setting nokogiri_reader.extra_xpath_hooks
can be set to a Hash where the keys are xpaths (if they use
namespaces, the prefixes must be registered with nokogiri.namespaces),
and each value is a lambda/proc/callable object taking two arguments.
    provide "nokogiri_reader.extra_xpath_hooks", {
      "//oai:resumptionToken" =>
        lambda do |node, clipboard|
          clipboard[:resumption_token] = node.text
        end
    }
The first arg is the matching node. The second arg is a thread-safe Hash "clipboard" that you can store things in, and later access via reader.clipboard. Why a clipboard? Well, what are you gonna do with what you get out of there, that you can do in a thread-safe way in the middle of nokogiri processing?
There's no great thread-safe way to get reader.clipboard in a normal nokogiri pipeline though (the reader can change in multi-file handling, so there can be a race condition if you try naively; don't!), which is why this feature needs some work for general applicability. The OaiPmhNokogiriReader manually creates its readers outside the usual nokogiri flow, so it can use it.
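The hook-and-clipboard contract can be sketched in plain Ruby. This is an illustration only, not the reader's implementation: FakeNode stands in for a matched Nokogiri node, the matches hash pretends an xpath query has already run, and a plain Hash replaces the Concurrent::Hash or Concurrent::Map the constructor creates. All of those names are assumptions made for this example.

```ruby
# Stand-in for a matched XML node; only #text is needed here (hypothetical).
FakeNode = Struct.new(:text)

# Same shape as the extra_xpath_hooks setting: xpath => callable(node, clipboard).
hooks = {
  "//oai:resumptionToken" => ->(node, clipboard) {
    clipboard[:resumption_token] = node.text
  }
}

# A plain Hash for illustration; the real reader uses a thread-safe structure.
clipboard = {}

# Hypothetical: pretend these nodes matched each xpath in the input document.
matches = { "//oai:resumptionToken" => [FakeNode.new("token-123")] }

# Call each hook once per matching node, passing the shared clipboard.
hooks.each_pair do |xpath, callable|
  matches.fetch(xpath, []).each { |node| callable.call(node, clipboard) }
end

puts clipboard[:resumption_token] # => token-123
```

A caller that owns the reader instance, as the OAI-PMH reader is described as doing, would read the stored value back from reader.clipboard once iteration has finished.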
Instance Attribute Summary
-
#clipboard ⇒ Object
readonly
Returns the value of attribute clipboard.
-
#input_stream ⇒ Object
readonly
Returns the value of attribute input_stream.
-
#path_tracker ⇒ Object
readonly
Returns the value of attribute path_tracker.
-
#settings ⇒ Object
readonly
Returns the value of attribute settings.
Instance Method Summary
- #default_namespaces ⇒ Object
- #each ⇒ Object
- #each_record_xpath ⇒ Object
- #extra_xpath_hooks ⇒ Object
-
#initialize(input_stream, settings) ⇒ NokogiriReader
constructor
A new instance of NokogiriReader.
Constructor Details
#initialize(input_stream, settings) ⇒ NokogiriReader
Returns a new instance of NokogiriReader.
    # File 'lib/traject/nokogiri_reader.rb', line 61
    def initialize(input_stream, settings)
      @settings = Traject::Indexer::Settings.new settings
      @input_stream = input_stream
      @clipboard = Traject::Util.is_jruby? ? Concurrent::Map.new : Concurrent::Hash.new

      default_namespaces # trigger validation
      validate_xpath(each_record_xpath, key_name: "each_record_xpath") if each_record_xpath
      extra_xpath_hooks.each_pair do |xpath, _callable|
        validate_xpath(xpath, key_name: "extra_xpath_hooks")
      end
    end
Instance Attribute Details
#clipboard ⇒ Object (readonly)
Returns the value of attribute clipboard.
    # File 'lib/traject/nokogiri_reader.rb', line 59
    def clipboard
      @clipboard
    end
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
    # File 'lib/traject/nokogiri_reader.rb', line 59
    def input_stream
      @input_stream
    end
#path_tracker ⇒ Object (readonly)
Returns the value of attribute path_tracker.
    # File 'lib/traject/nokogiri_reader.rb', line 59
    def path_tracker
      @path_tracker
    end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
    # File 'lib/traject/nokogiri_reader.rb', line 59
    def settings
      @settings
    end
Instance Method Details
#default_namespaces ⇒ Object
    # File 'lib/traject/nokogiri_reader.rb', line 81
    def default_namespaces
      @default_namespaces ||= (settings["nokogiri.namespaces"] || {}).tap { |ns|
        unless ns.kind_of?(Hash)
          raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
        end
      }
    end
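The validation in #default_namespaces can be tried in isolation. The sketch below copies the same tap-and-raise pattern into a free-standing method; namespaces_from is a hypothetical name chosen for this example, and a plain Hash stands in for the reader's settings object.

```ruby
# Hypothetical stand-alone version of the lookup shown above: default to an
# empty Hash, and reject any value that is not a Hash.
def namespaces_from(settings)
  (settings["nokogiri.namespaces"] || {}).tap do |ns|
    unless ns.kind_of?(Hash)
      raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
    end
  end
end

# A Hash of prefix => namespace URI passes through unchanged.
ok = namespaces_from({ "nokogiri.namespaces" => { "oai" => "http://www.openarchives.org/OAI/2.0/" } })

# A missing setting falls back to an empty Hash.
empty = namespaces_from({})

# Anything else (here a String) raises ArgumentError.
error_message = begin
  namespaces_from({ "nokogiri.namespaces" => "oai" })
  nil
rescue ArgumentError => e
  e.message
end

puts ok.keys.inspect # => ["oai"]
puts empty.inspect   # => {}
puts error_message   # => nokogiri.namespaces must be a hash, not: "oai"
```

Because the constructor calls default_namespaces eagerly, a misconfigured value of this kind surfaces when the reader is created rather than partway through iteration.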
#each ⇒ Object
    # File 'lib/traject/nokogiri_reader.rb', line 89
    def each
      whole_input_doc = Nokogiri::XML.parse(input_stream)

      if each_record_xpath
        whole_input_doc.xpath(each_record_xpath, default_namespaces).each do |matching_node|
          # We want to take the matching node, and make it into root in a new Nokogiri document.
          # This is tricky to do as performant as possible (we want to re-use the existing libxml node),
          # while preserving namespaces properly (especially in jruby). Some uses of noko api that seem
          # like they should work don't, esp in jruby.
          child_doc = Nokogiri::XML::Document.new

          reparent_node_to_root(child_doc, matching_node)

          yield child_doc

          child_doc = nil # hopefully make things easier on the GC.
        end
      else
        # caller wants whole doc as a traject source record
        yield whole_input_doc
      end

      run_extra_xpath_hooks(whole_input_doc)

    ensure
      # hopefully make things easier on the GC.
      whole_input_doc = nil
    end
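The record-splitting idea in #each (parse the whole input, then yield each matching node as the root of its own document) can be sketched without traject or Nokogiri. The example below swaps in REXML, which ships with Ruby, purely for illustration. It deep-copies each node instead of reparenting it as the real method does, and each_record_document is a hypothetical helper name.

```ruby
require "rexml/document"

SAMPLE_XML = <<~XML
  <records>
    <record><title>One</title></record>
    <record><title>Two</title></record>
  </records>
XML

# Hypothetical helper: parse the whole input with a DOM parser, then yield one
# new document per node matching record_xpath.
def each_record_document(xml, record_xpath)
  return enum_for(:each_record_document, xml, record_xpath) unless block_given?

  whole_doc = REXML::Document.new(xml)
  REXML::XPath.each(whole_doc, record_xpath) do |node|
    child_doc = REXML::Document.new
    # Copy the node so it becomes the root element of its own document.
    child_doc.add_element(node.deep_clone)
    yield child_doc
  end
end

titles = each_record_document(SAMPLE_XML, "//record").map do |doc|
  doc.root.elements["title"].text
end

p titles # => ["One", "Two"]
```

As with the real reader, the whole input is parsed up front, so memory use scales with the size of the input document rather than with the size of one record.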
#each_record_xpath ⇒ Object
    # File 'lib/traject/nokogiri_reader.rb', line 73
    def each_record_xpath
      @each_record_xpath ||= settings["nokogiri.each_record_xpath"]
    end
#extra_xpath_hooks ⇒ Object
    # File 'lib/traject/nokogiri_reader.rb', line 77
    def extra_xpath_hooks
      @extra_xpath_hooks ||= settings["nokogiri_reader.extra_xpath_hooks"] || {}
    end