Class: Traject::OaiPmhNokogiriReader

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/traject/oai_pmh_nokogiri_reader.rb

Overview

Reads an OAI-PMH feed over HTTP and feeds it directly to a traject pipeline. You don't have to use this reader to process OAI-PMH: you could instead fetch the OAI-PMH responses yourself, store them on disk, and then process them as ordinary XML.

Example command line:

    traject -i xml -r Traject::OaiPmhNokogiriReader -s oai_pmh.start_url="http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc" -c your_config.rb

Settings

  • oai_pmh.start_url: Required, e.g. "http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc"
  • oai_pmh.timeout: (default 10). Timeout for http.rb, in seconds
  • oai_pmh.try_gzip: (default true). Ask server for gzip response if available
  • oai_pmh.http_persistent: (default true). Use persistent HTTP connections.
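
The way the documented defaults combine with user-supplied values can be sketched with plain hashes. This is only an illustration: `OAI_PMH_DEFAULTS` and `effective_settings` are hypothetical names, and the reader itself builds a Traject::Indexer::Settings object rather than a bare Hash.

```ruby
# Defaults as documented in the Settings list above.
OAI_PMH_DEFAULTS = {
  "oai_pmh.timeout"         => 10,
  "oai_pmh.try_gzip"        => true,
  "oai_pmh.http_persistent" => true
}

# Hypothetical helper (not part of traject): user-supplied values win over
# the defaults, and the one required setting is checked up front.
def effective_settings(user_settings)
  merged = OAI_PMH_DEFAULTS.merge(user_settings)
  unless merged["oai_pmh.start_url"]
    raise ArgumentError, "needs a setting 'oai_pmh.start_url'"
  end
  merged
end

settings = effective_settings(
  "oai_pmh.start_url" => "http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc",
  "oai_pmh.timeout"   => 30
)
```

Here the timeout is overridden while try_gzip and http_persistent keep their defaults; omitting oai_pmh.start_url raises an ArgumentError.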

JRUBY NOTES:

  • Does not work with JRuby 9.2 until http.rb does: https://github.com/httprb/http/issues/475
  • The JRuby version definitely reads the whole HTTP response into memory before parsing; the MRI version may or may not do the same.

TO DO

This would be a lot more useful with some sort of built-in HTTP caching.

Constructor Details

#initialize(input_stream, settings) ⇒ OaiPmhNokogiriReader

Returns a new instance of OaiPmhNokogiriReader.



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 33

def initialize(input_stream, settings)
  namespaces = (settings["nokogiri.namespaces"] || {}).merge(
    "oai" => "http://www.openarchives.org/OAI/2.0/"
  )


  @settings = Traject::Indexer::Settings.new(
      "nokogiri_reader.extra_xpath_hooks" => extra_xpath_hooks,
      "nokogiri.each_record_xpath" => "/oai:OAI-PMH/oai:ListRecords/oai:record",
      "nokogiri.namespaces" => namespaces
    ).with_defaults(
      "oai_pmh.timeout" => 10,
      "oai_pmh.try_gzip" => true,
      "oai_pmh.http_persistent" => true
    ).fill_in_defaults!.merge(settings)

  @input_stream = input_stream
end

Instance Attribute Details

#input_stream ⇒ Object (readonly)

Returns the value of attribute input_stream.



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 31

def input_stream
  @input_stream
end

#settings ⇒ Object (readonly)

Returns the value of attribute settings.



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 31

def settings
  @settings
end

Instance Method Details

#each ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 72

def each
  url = start_url

  resumption_token = nil
  last_resumption_token = nil
  pages_fetched = 0

  until url == nil
    resumption_token = read_and_parse_response(url) do |record|
      yield record
    end
    url = resumption_url(resumption_token)
    (last_resumption_token = resumption_token) if resumption_token
    pages_fetched += 1
  end

  logger.info("#{self.class.name}: fetched #{pages_fetched} pages; last resumptionToken found: #{last_resumption_token.inspect}")
end
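
The paging loop in #each can be tried out without a network connection. The sketch below is an assumption-laden stand-in: `FAKE_PAGES`, `fake_read_page`, and `each_fake_record` are hypothetical names, and canned hashes take the place of real HTTP responses. It only shows the loop shape, in which each page yields its records and returns the token for the next page, until no token comes back.

```ruby
# Canned "responses" standing in for HTTP fetches: each page holds some
# records and the resumption token (if any) that leads to the next page.
FAKE_PAGES = {
  "start" => { records: %w[r1 r2], token: "page2" },
  "page2" => { records: %w[r3],    token: "page3" },
  "page3" => { records: %w[r4 r5], token: nil }
}

# Hypothetical stand-in for read_and_parse_response: yields each record
# and returns the page's resumption token (nil on the last page).
def fake_read_page(key)
  page = FAKE_PAGES.fetch(key)
  page[:records].each { |record| yield record }
  page[:token]
end

# Same loop shape as #each: keep fetching until no token comes back,
# and report how many pages were read.
def each_fake_record
  key = "start"
  pages_fetched = 0
  until key.nil?
    key = fake_read_page(key) { |record| yield record }
    pages_fetched += 1
  end
  pages_fetched
end

seen = []
pages = each_fake_record { |record| seen << record }
```

In this simplified version the token doubles as the lookup key for the next page; the real method turns the token into a URL via #resumption_url first.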

#extra_xpath_hooks ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 60

def extra_xpath_hooks
  @extra_xpath_hooks ||= {
    "//oai:resumptionToken" =>
      lambda do |doc, clipboard|
        token = doc.text
        if token && token != ""
          clipboard[:resumption_token] = token
        end
      end
  }
end
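
To see what the "//oai:resumptionToken" hook is looking for, here is a standalone sketch that pulls the token out of a sample response. It deliberately swaps in REXML (bundled with Ruby) for Nokogiri so it runs without extra gems; `SAMPLE_RESPONSE` and `extract_resumption_token` are hypothetical names, and the sample XML is a made-up minimal response, not output from a real server.

```ruby
require "rexml/document"

# Same prefix-to-namespace mapping the reader registers for "oai".
OAI_NS = { "oai" => "http://www.openarchives.org/OAI/2.0/" }

# Made-up minimal ListRecords response carrying a resumption token.
SAMPLE_RESPONSE = <<~XML
  <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
    <ListRecords>
      <record><header><identifier>oai:example:1</identifier></header></record>
      <resumptionToken>abc123</resumptionToken>
    </ListRecords>
  </OAI-PMH>
XML

# Hypothetical helper: returns the token text, or nil when the element is
# missing or empty (mirroring the hook's `token && token != ""` check).
def extract_resumption_token(xml)
  doc   = REXML::Document.new(xml)
  node  = REXML::XPath.first(doc, "//oai:resumptionToken", OAI_NS)
  token = node && node.text
  (token.nil? || token == "") ? nil : token
end

token = extract_resumption_token(SAMPLE_RESPONSE)
```

An empty resumptionToken element, which an OAI-PMH server sends on the final page of a list, yields nil here and so ends the paging.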

#logger ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 105

def logger
  @logger ||= (@settings[:logger] || Yell.new(STDERR, :level => "gt.fatal")) # null logger
end

#resumption_url(resumption_token) ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 91

def resumption_url(resumption_token)
  return nil if resumption_token.nil? || resumption_token == ""

  # resumption URL is just original verb with resumption token, that seems to be
  # the oai-pmh spec.
  parsed_uri = URI.parse(start_url)
  parsed_uri.query = "verb=#{CGI.escape start_url_verb}&resumptionToken=#{CGI.escape resumption_token}"
  parsed_uri.to_s
end
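
The URL rewriting above can be reproduced as a standalone function. This sketch is an approximation, not the reader's code: `build_resumption_url` is a hypothetical name, and it uses URI.encode_www_form in place of CGI.escape so that only the `uri` library is needed. The behaviour shown is the same in spirit: the original query string is discarded and replaced with just the verb and the token.

```ruby
require "uri"

# Hypothetical standalone version of #resumption_url. Returns nil when there
# is no token, otherwise the start URL with its query replaced by
# verb + resumptionToken (both form-encoded).
def build_resumption_url(start_url, verb, resumption_token)
  return nil if resumption_token.nil? || resumption_token == ""

  uri = URI.parse(start_url)
  uri.query = URI.encode_www_form(
    "verb"            => verb,
    "resumptionToken" => resumption_token
  )
  uri.to_s
end

next_url = build_resumption_url(
  "http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc",
  "ListRecords",
  "abc/123"
)
# next_url is "http://example.com/oai?verb=ListRecords&resumptionToken=abc%2F123"
```

Note that metadataPrefix is dropped from the follow-up request: in OAI-PMH, resumptionToken is an exclusive argument that may only accompany the verb.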

#start_url ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 52

def start_url
  settings["oai_pmh.start_url"] or raise ArgumentError.new("#{self.class.name} needs a setting 'oai_pmh.start_url'")
end

#start_url_verb ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 56

def start_url_verb
  @start_url_verb ||= (array = CGI.parse(URI.parse(start_url).query)["verb"]) && array.first
end
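
The verb lookup can also be written out step by step. The following is a sketch under assumptions: `verb_from_url` is a hypothetical name, and it uses URI.decode_www_form rather than CGI.parse. It returns the first `verb` parameter of the URL, and returns nil when the URL has no query string or no `verb` parameter.

```ruby
require "uri"

# Hypothetical helper: extract the value of the first "verb" query parameter.
def verb_from_url(url)
  query = URI.parse(url).query.to_s          # "" when the URL has no query
  pair  = URI.decode_www_form(query).assoc("verb")
  pair && pair.last
end

verb = verb_from_url("http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc")
# verb is "ListRecords"
```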

#timeout ⇒ Object



# File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 101

def timeout
  settings["oai_pmh.timeout"]
end