Class: Traject::ExperimentalNokogiriStreamingReader

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/traject/experimental_nokogiri_streaming_reader.rb

Overview

An EXPERIMENTAL HALF-FINISHED implementation of a streaming/pull reader using Nokogiri. Not ready for use, not stable API, could go away.

This was my first attempt at a NokogiriReader implementation. It didn't work out, at least not without a lot more work. I think we'd need to redo it to build the Nokogiri::XML::Nodes by hand as the source is traversed, instead of relying on #outer_xml: because outer_xml returns a string, every record ends up being parsed twice, with the expected 50% performance hit. Peccadilloes in Nokogiri's namespace handling under JRuby don't help.

All in all, something workable might be possible here with a lot more work, but Nokogiri's antipathy to namespaces could also keep getting in the way.

Defined Under Namespace

Classes: PathTracker

Constructor Details

#initialize(input_stream, settings) ⇒ ExperimentalNokogiriStreamingReader

Returns a new instance of ExperimentalNokogiriStreamingReader.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 17

def initialize(input_stream, settings)
  @settings = Traject::Indexer::Settings.new settings
  @input_stream = input_stream
  @clipboard = Traject::Util.is_jruby? ? Concurrent::Map.new : Concurrent::Hash.new

  if each_record_xpath
    @path_tracker = PathTracker.new(each_record_xpath,
                                      clipboard: self.clipboard,
                                      namespaces: default_namespaces,
                                      extra_xpath_hooks: extra_xpath_hooks)
  end

  default_namespaces # trigger validation
  validate_limited_xpath(each_record_xpath, key_name: "each_record_xpath")

end

Instance Attribute Details

#clipboard ⇒ Object (readonly)

Returns the value of attribute clipboard.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15

def clipboard
  @clipboard
end

#input_stream ⇒ Object (readonly)

Returns the value of attribute input_stream.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15

def input_stream
  @input_stream
end

#path_tracker ⇒ Object (readonly)

Returns the value of attribute path_tracker.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15

def path_tracker
  @path_tracker
end

#settings ⇒ Object (readonly)

Returns the value of attribute settings.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15

def settings
  @settings
end

Instance Method Details

#default_namespaces ⇒ Object



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 77

def default_namespaces
  @default_namespaces ||= (settings["nokogiri.namespaces"] || {}).tap { |ns|
    unless ns.kind_of?(Hash)
      raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
    end
  }
end
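
The memoize-and-validate idiom used by #default_namespaces can be tried in isolation. The following is a minimal sketch, not the reader itself: NamespaceSettingsSketch is a hypothetical stand-in class that takes a plain Hash instead of Traject::Indexer::Settings, and the example.org namespace URI is made up for illustration. Only the "nokogiri.namespaces" key and the error message come from the source above.

```ruby
# Hypothetical stand-in class; it only mirrors the validation idiom shown above.
class NamespaceSettingsSketch
  def initialize(settings)
    @settings = settings # a plain Hash here, an assumption for the sketch
  end

  # Read the namespaces setting, default to {}, and validate its type.
  # The raise happens inside tap, before ||= assigns, so an invalid value
  # is never memoized and every later call raises again.
  def default_namespaces
    @default_namespaces ||= (@settings["nokogiri.namespaces"] || {}).tap do |ns|
      unless ns.kind_of?(Hash)
        raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
      end
    end
  end
end

# Illustrative prefix and URI, not taken from the source.
good = NamespaceSettingsSketch.new("nokogiri.namespaces" => { "ex" => "http://example.org/ns" })
good.default_namespaces

bad = NamespaceSettingsSketch.new("nokogiri.namespaces" => "not a hash")
2.times do
  begin
    bad.default_namespaces
  rescue ArgumentError => e
    puts "rejected: #{e.message}"
  end
end
```

Because validation only runs when the method is first called, the constructor's bare call to default_namespaces ("trigger validation") appears to be there so that a bad setting fails at construction time even when no each_record_xpath is configured.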

#each ⇒ Object



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 85

def each
  unless each_record_xpath
    # forget streaming, just read it and return it once, done.
    yield Nokogiri::XML.parse(input_stream)
    return
  end

  reader = Nokogiri::XML::Reader(input_stream)

  reader.each do |reader_node|
    if reader_node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
      path_tracker.push(reader_node)

      if path_tracker.match?
        yield path_tracker.current_node_doc
      end
      path_tracker.run_extra_xpath_hooks

      if reader_node.self_closing?
        path_tracker.pop
      end
    end

    if reader_node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
      path_tracker.pop
    end
  end
end
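
The push/match/pop bookkeeping in #each can be illustrated without Nokogiri. The sketch below is an assumption-heavy simplification: PathStackSketch is a hypothetical stand-in for PathTracker (whose source is not shown here), the "limited XPath" is assumed to be an absolute, slash-separated list of element names with no namespaces or predicates, and the parser events are simulated as plain arrays.

```ruby
# Hypothetical stand-in for PathTracker: a stack of currently open element
# names, compared against an absolute slash-separated path.
class PathStackSketch
  def initialize(record_path)
    # "/root/record" becomes ["root", "record"]
    @target = record_path.split("/").reject(&:empty?)
    @stack = []
  end

  def push(name)
    @stack.push(name)
  end

  def pop
    @stack.pop
  end

  def match?
    @stack == @target
  end
end

# Simulated pull-parser events: [type, element name, self-closing?]
EVENTS = [
  [:element, "root", false],
  [:element, "record", false],
  [:element, "title", true],
  [:end_element, "record", false],
  [:element, "record", true],
  [:element, "other", false],
  [:element, "record", false], # /root/other/record, should not match
  [:end_element, "record", false],
  [:end_element, "other", false],
  [:end_element, "root", false]
].freeze

# Return the indexes of the events at which the record path matches.
def matching_event_indexes(events, record_path)
  tracker = PathStackSketch.new(record_path)
  matches = []
  events.each_with_index do |(type, name, self_closing), index|
    case type
    when :element
      tracker.push(name)
      matches << index if tracker.match?
      # As in #each above, a self-closing element is popped immediately
      # because no separate end event is expected for it.
      tracker.pop if self_closing
    when :end_element
      tracker.pop
    end
  end
  matches
end

MATCHES = matching_event_indexes(EVENTS, "/root/record")
```

In this simulation the two records directly under root match, including the self-closing one, while the record nested under other does not. The real reader additionally has to turn each matched position into a document (path_tracker.current_node_doc), which is where the double parsing described in the Overview comes in.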

#each_record_xpath ⇒ Object



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 34

def each_record_xpath
  @each_record_xpath ||= settings["nokogiri.each_record_xpath"]
end

#extra_xpath_hooks ⇒ Object



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 38

def extra_xpath_hooks
  @extra_xpath_hooks ||= begin
    (settings["nokogiri_reader.extra_xpath_hooks"] || {}).tap do |hash|
      hash.each_pair do |limited_xpath, callable|
        validate_limited_xpath(limited_xpath, key_name: "nokogiri_reader.extra_xpath_hooks")
      end
    end
  end
end