Class: Traject::ExperimentalNokogiriStreamingReader::PathTracker

Inherits:
Object
Defined in:
lib/traject/experimental_nokogiri_streaming_reader.rb

Overview

Initialized with a specification (a very small subset of XPath) that says which records to yield. Tests whether a Nokogiri::XML::Reader node matches that spec.

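A spec is either floating, as in '//record', or anchored to the root, as in '/body/head/meta'. An anchored spec may also be written with a leading './' or with no leading slash at all ('./body/head/meta', 'head/meta').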
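The spec forms above can be sketched as a small normalizer. This is an illustration only: `sketch_parse_path` is a hypothetical helper, not the library's own `parse_path` (whose body is not shown on this page), and keeping one leading slash on a floating spec is an assumption made here.

```ruby
# Hypothetical helper (not the library's parse_path).
# Returns [bare_path, floating] for a spec string.
def sketch_parse_path(str_spec)
  if str_spec.start_with?("//")
    # Floating spec: assumed here to keep a single leading slash.
    [str_spec[1..-1], true]
  else
    bare = str_spec.sub(/\A\./, "")                 # drop a leading "."
    bare = "/" + bare unless bare.start_with?("/")  # anchor to the root
    [bare, false]
  end
end

p sketch_parse_path("//record")
p sketch_parse_path("/body/head/meta")
p sketch_parse_path("./body/head/meta")
p sketch_parse_path("body/head/meta")
```

Under these assumptions the three anchored spellings normalize to the same string, so a later comparison needs to handle only one form.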

Elements in a spec can carry XML namespace prefixes, and must carry them in order to match namespaced elements. A prefix is recognized if and only if it is registered in the nokogiri.namespaces setting.
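A minimal sketch of why registration matters, assuming namespaces are registered as a prefix-to-URI hash (the constructor shown further down inverts such a hash). The prefixes and URIs below are illustrative values, not settings taken from this page.

```ruby
# Prefix => URI, as a caller might register them (illustrative values).
registered = {
  "oai" => "http://www.openarchives.org/OAI/2.0/",
  "dc"  => "http://purl.org/dc/elements/1.1/"
}

# URI => prefix, so an element can be looked up by the URI it carries.
by_uri = registered.invert

# The document may bind the same URI to any prefix; only the URI is used.
puts by_uri["http://purl.org/dc/elements/1.1/"]
# An unregistered URI has no prefix to match against.
puts by_uri["http://example.org/unregistered"].inspect
```

Looking up by URI means the prefix written in the spec is the registered one, whatever prefix the XML document itself happens to use.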

Sadly, JRuby Nokogiri is incompatible with MRI Nokogiri here: it does not preserve our namespaces in outer_xml. Under JRuby we therefore have to track namespaces ourselves and then run yet another parse in Nokogiri, which may make this reader even less performant on JRuby.

Constructor Details

#initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {}) ⇒ PathTracker

Returns a new instance of PathTracker.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 133

def initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {})
  @inverted_namespaces  = namespaces.invert
  @clipboard = clipboard
  # We're guessing using a string will be more efficient than an array
  @current_path         = ""
  @floating             = false

  @path_spec, @floating = parse_path(str_spec)

  @namespaces_stack = []


  @extra_xpath_hooks = extra_xpath_hooks.collect do |path, callable|
    bare_path, floating = parse_path(path)
    {
      path: bare_path,
      floating: floating,
      callable: callable
    }
  end
end

Instance Attribute Details

#clipboard ⇒ Object (readonly)

Returns the value of attribute clipboard.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def clipboard
  @clipboard
end

#current_path ⇒ Object (readonly)

Returns the value of attribute current_path.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def current_path
  @current_path
end

#extra_xpath_hooks ⇒ Object (readonly)

Returns the value of attribute extra_xpath_hooks.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def extra_xpath_hooks
  @extra_xpath_hooks
end

#inverted_namespaces ⇒ Object (readonly)

Returns the value of attribute inverted_namespaces.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def inverted_namespaces
  @inverted_namespaces
end

#namespaces_stack ⇒ Object (readonly)

Returns the value of attribute namespaces_stack.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def namespaces_stack
  @namespaces_stack
end

#path_spec ⇒ Object (readonly)

Returns the value of attribute path_spec.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def path_spec
  @path_spec
end

Instance Method Details

#current_node_doc ⇒ Object



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 195

def current_node_doc
  return nil unless @current_node

  # yeah, sadly we got to have nokogiri parse it again
  fix_namespaces(Nokogiri::XML.parse(@current_node.outer_xml))
end

#fix_namespaces(doc) ⇒ Object

No-op unless running under JRuby. Under JRuby, we use our namespace stack to add the correct namespace definitions to the Nokogiri::XML::Document, because JRuby's outer_xml on the Reader does not do it for us. :(



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 241

def fix_namespaces(doc)
  if is_jruby?
    # Only needed in JRuby: Nokogiri's JRuby implementation does not handle
    # namespaces the same way the MRI one does. We need to keep track of
    # the namespaces declared in outer contexts ourselves, and then check
    # ourselves whether each one is actually needed. :(
    namespaces = namespaces_stack.compact.reduce({}, :merge)
    default_ns = namespaces.delete("xmlns")

    namespaces.each_pair do |attrib, uri|
      ns_prefix = attrib.sub(/\Axmlns:/, '')

      # Make sure the prefix is actually used in the doc, so we do not
      # add it unnecessarily. GAH.
      if doc.xpath("//*[starts-with(name(), '#{ns_prefix}:')][1]").empty? &&
         doc.xpath("//@*[starts-with(name(), '#{ns_prefix}:')][1]").empty?
        next
      end
      doc.root.add_namespace_definition(ns_prefix, uri)
    end

    if default_ns
      doc.root.default_namespace = default_ns
      # OMG nokogiri, really?
      default_ns = doc.root.namespace
      doc.xpath("//*[namespace-uri()='']").each do |node|
        node.namespace = default_ns
      end
    end

  end
  return doc
end
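The stack handling at the top of fix_namespaces can be tried without Nokogiri. The sketch below assumes each stack entry is a hash keyed by "xmlns" or "xmlns:prefix" (the key shapes the method above relies on), or nil where an element declared nothing; the URIs are made up.

```ruby
# One entry per open element, outermost first; nil where nothing was declared.
namespaces_stack = [
  { "xmlns" => "http://example.org/default", "xmlns:dc" => "http://example.org/outer-dc" },
  nil,
  { "xmlns:dc" => "http://example.org/inner-dc" }
]

# Later (inner) declarations win, because merge keeps the right-hand value.
in_scope = namespaces_stack.compact.reduce({}, :merge)

# The default namespace is handled separately from the prefixed ones.
default_ns = in_scope.delete("xmlns")

# Strip the "xmlns:" marker to get plain prefix => URI pairs.
prefixed = in_scope.transform_keys { |attrib| attrib.sub(/\Axmlns:/, "") }

p default_ns
p prefixed
```

Merging outermost-first gives inner declarations precedence, which mirrors normal XML namespace scoping.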

#floating? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 212

def floating?
  !!@floating
end

#is_jruby? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 170

def is_jruby?
  Traject::Util.is_jruby?
end

#match? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 216

def match?
  match_path?(path_spec, floating: floating?)
end

#match_path?(path_to_match, floating:) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 220

def match_path?(path_to_match, floating:)
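  # Branch on the floating: argument, so that each extra hook is matched
  # by its own setting rather than by the record spec's floating? flag.
  if floating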
    current_path.end_with?(path_to_match)
  else
    current_path == path_to_match
  end
end
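The two comparison modes can be exercised on plain strings. This is a standalone sketch: `sketch_match?` is a hypothetical function that takes the current path as an argument and branches on its floating: argument. The sample paths are invented.

```ruby
# Sketch only: compares an accumulated path string against a bare spec.
def sketch_match?(current_path, path_to_match, floating:)
  if floating
    current_path.end_with?(path_to_match)  # suffix match anywhere in the tree
  else
    current_path == path_to_match          # exact match from the root
  end
end

path = "/OAI-PMH/ListRecords/record"
puts sketch_match?(path, "/record", floating: true)
puts sketch_match?(path, "/record", floating: false)
puts sketch_match?(path, "/OAI-PMH/ListRecords/record", floating: false)

# Plain suffix comparison: without a leading slash in the spec,
# a longer element name ending in the same letters also matches.
puts sketch_match?("/root/myrecord", "record", floating: true)
```

The last call shows a property of string suffix matching in general. Whether it can arise in practice depends on how the bare spec is stored, which this page does not show.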

#pop ⇒ Object

Removes the last slash-separated component from current_path.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 203

def pop
  current_path.slice!( current_path.rindex('/')..-1 )
  @current_node = nil

  if is_jruby?
    namespaces_stack.pop
  end
end
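The in-place truncation used by pop can be seen on a plain string. A minimal sketch with an invented path:

```ruby
# Start from a mutable copy of a path built by earlier pushes.
current_path = "/OAI-PMH/ListRecords/record".dup

# Cut from the last "/" to the end of the string, modifying it in place.
current_path.slice!(current_path.rindex("/")..-1)
after_first = current_path.dup

current_path.slice!(current_path.rindex("/")..-1)
after_second = current_path.dup

puts after_first
puts after_second
```

Each call removes exactly one component, so pops are expected to pair with earlier pushes. The sketch does not cover popping an already empty path.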

#push(reader_node) ⇒ Object

Adds a component to the slash-separated current_path, with its namespace prefix.



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 175

def push(reader_node)
  namespace_prefix = reader_node.namespace_uri && inverted_namespaces[reader_node.namespace_uri]

  # gah, reader_node.name has the namespace prefix in there
  node_name = reader_node.name.gsub(/[^:]+:/, '')

  node_str = if namespace_prefix
    namespace_prefix + ":" + node_name
  else
    reader_node.name
  end

  current_path << ("/" + node_str)

  if is_jruby?
    namespaces_stack << reader_node.namespaces
  end
  @current_node = reader_node
end
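How a path component is chosen can be tried without a real reader. In this sketch a Struct stands in for the Nokogiri reader node (only name and namespace_uri are modeled), `sketch_component` is a hypothetical helper, and the URIs are illustrative.

```ruby
# Stand-in for a reader node: just the two fields push looks at.
fake_node = Struct.new(:name, :namespace_uri)

# Hypothetical helper mirroring the component logic in push.
def sketch_component(node, inverted_namespaces)
  prefix = node.namespace_uri && inverted_namespaces[node.namespace_uri]
  local_name = node.name.gsub(/[^:]+:/, "")  # drop the document's own prefix
  prefix ? "#{prefix}:#{local_name}" : node.name
end

inverted = { "http://purl.org/dc/elements/1.1/" => "dc" }

nodes = [
  fake_node.new("record", nil),                                  # no namespace
  fake_node.new("x:title", "http://purl.org/dc/elements/1.1/"),  # registered URI
  fake_node.new("z:other", "http://example.org/unregistered")    # unregistered URI
]

current_path = "".dup
nodes.each { |node| current_path << ("/" + sketch_component(node, inverted)) }
puts current_path
```

A registered URI swaps the document's prefix for the registered one, while an element with an unregistered namespace keeps its name exactly as the document wrote it.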

#run_extra_xpath_hooks ⇒ Object



# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 228

def run_extra_xpath_hooks
  return unless @current_node

  extra_xpath_hooks.each do |hook_spec|
    if match_path?(hook_spec[:path], floating: hook_spec[:floating])
      hook_spec[:callable].call(current_node_doc, clipboard)
    end
  end
end
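The hook dispatch can be sketched with plain Ruby objects. Assumptions: hooks are hashes with :path, :floating and :callable keys (the shape built in the constructor), a Hash stands in for the clipboard, and strings stand in for the parsed node documents. The paths are invented.

```ruby
clipboard = {}

# Hook specs in the shape the constructor produces.
hooks = [
  { path: "/OAI-PMH/responseDate", floating: false,
    callable: ->(doc, clip) { clip[:response_date] = doc } },
  { path: "/setSpec", floating: true,
    callable: ->(doc, clip) { (clip[:sets] ||= []) << doc } }
]

# Call every hook whose path matches the current position in the document.
run_hooks = lambda do |current_path, doc|
  hooks.each do |hook|
    matched = hook[:floating] ? current_path.end_with?(hook[:path]) : current_path == hook[:path]
    hook[:callable].call(doc, clipboard) if matched
  end
end

run_hooks.call("/OAI-PMH/responseDate", "doc for responseDate")
run_hooks.call("/OAI-PMH/ListRecords/record/header/setSpec", "doc for first setSpec")
run_hooks.call("/OAI-PMH/ListRecords/record/header/identifier", "doc for identifier")

p clipboard
```

Passing the same clipboard object to every callable lets a hook stash values seen outside a record so that later processing can read them.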