Class: GeoCombine::GeoBlacklightHarvester

Inherits:
Object
Defined in:
lib/geo_combine/geo_blacklight_harvester.rb

Overview

A class to harvest and index results from GeoBlacklight sites. You can configure the sites to be harvested via a configure command:

  GeoCombine::GeoBlacklightHarvester.configure do
    {
      SITE: { host: 'https://example.com', params: { f: { dct_provenance_s: ['SITE'] } } }
    }
  end

The class configuration also allows for various other things to be configured:

- A debug parameter to print out details of what is being harvested and indexed
- Crawl delays for each page of results (globally or on a per-site basis)
- Solr's commitWithin parameter (defaults to 5000)
- A document transformer proc to modify a document before indexing (defaults to removing _version_, score, and timestamp)

Example: GeoCombine::GeoBlacklightHarvester.new(:SITE).index
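
The options listed above could sit alongside the site entries in the hash returned from the configure block. The following is a minimal sketch under stated assumptions: only the :debug key is confirmed by the source shown on this page, while :crawl_delay and :commit_within are assumed key names, and crawl_delay_for is a hypothetical helper that illustrates the "globally or per site" fallback.

```ruby
# Sketch only: a configuration hash of the shape the overview describes.
# :debug is read by #index on this page; :crawl_delay and :commit_within
# are ASSUMED key names, not confirmed here.
HARVEST_CONFIG = {
  debug: true,
  crawl_delay: 1,          # assumed global delay, in seconds
  commit_within: '10000',  # assumed override of the 5000 default
  SITE: {
    host: 'https://example.com',
    crawl_delay: 5,        # assumed per-site override
    params: { f: { dct_provenance_s: ['SITE'] } }
  }
}.freeze

# Hypothetical helper: prefer the per-site delay, fall back to the global one.
def crawl_delay_for(config, site_key)
  site = config.fetch(site_key)
  site[:crawl_delay] || config[:crawl_delay]
end

puts crawl_delay_for(HARVEST_CONFIG, :SITE)
```

With the real class, a hash like this would be the return value of the block passed to configure.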

Defined Under Namespace

Classes: BlacklightResponseVersionFactory, LegacyBlacklightResponse, ModernBlacklightResponse

Constructor Details

#initialize(site_key) ⇒ GeoBlacklightHarvester

Returns a new instance of GeoBlacklightHarvester.

Raises:

  • (ArgumentError)

# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 44

def initialize(site_key)
  @site_key = site_key
  @site = self.class.config[site_key]

  raise ArgumentError, "Site key #{@site_key.inspect} is not configured for #{self.class.name}" unless @site
end
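
Because the constructor looks the key up with a plain hash access, the key type matters when the configuration is an ordinary Ruby Hash with symbol keys. The sketch below mirrors the guard using a stand-in hash and a hypothetical lookup_site helper; it does not load the real class.

```ruby
# Stand-in for the hash returned from the configure block (symbol keys).
SITES = { SITE: { host: 'https://example.com' } }.freeze

# Hypothetical helper mirroring the guard in #initialize:
# a key with no entry raises ArgumentError.
def lookup_site(config, site_key)
  site = config[site_key]
  raise ArgumentError, "Site key #{site_key.inspect} is not configured" unless site

  site
end

lookup_site(SITES, :SITE) # returns the site hash

begin
  lookup_site(SITES, 'SITE') # a String key does not match the Symbol key
rescue ArgumentError => e
  puts e.message
end
```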

Class Attribute Details

.document_transformer ⇒ Object


# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 32

def document_transformer
  @document_transformer || ->(document) do
    document.delete('_version_')
    document.delete('score')
    document.delete('timestamp')
    document
  end
end
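
A custom transformer only needs to respond to call, accept one document hash, and return the hash to index. The sketch below is illustrative: STRIP_AND_TAG and the 'harvested_from_s' field are invented names, and it assumes the class also exposes a writer for this attribute, which this page does not show.

```ruby
# Hypothetical custom transformer: same field removal as the default,
# plus an illustrative 'harvested_from_s' field (an invented field name).
STRIP_AND_TAG = lambda do |document|
  %w[_version_ score timestamp].each { |field| document.delete(field) }
  document['harvested_from_s'] = 'SITE'
  document
end

doc = {
  'id' => 'abc-123',
  '_version_' => 1,
  'score' => 0.9,
  'timestamp' => '2020-01-01T00:00:00Z'
}
p STRIP_AND_TAG.call(doc)

# Assumed usage, if a writer exists on the class:
#   GeoCombine::GeoBlacklightHarvester.document_transformer = STRIP_AND_TAG
```

Like the default, this lambda mutates the document in place and returns it.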

Instance Attribute Details

#site ⇒ Object (readonly)

Returns the value of attribute site


# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 43

def site
  @site
end

#site_key ⇒ Object (readonly)

Returns the value of attribute site_key


# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 43

def site_key
  @site_key
end

Class Method Details

.config ⇒ Object


# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 28

def config
  @config || {}
end

.configure(&block) ⇒ Object


# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 24

def configure(&block)
  @config = yield block
end
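
Since configure assigns the block's return value directly to @config, a second call replaces the earlier configuration rather than merging into it. The sketch below copies the two class methods into a stand-in class (HarvesterSketch is a hypothetical name) to show that behavior without loading the gem.

```ruby
# Stand-in class (hypothetical name) copying the two class methods listed here.
class HarvesterSketch
  class << self
    def config
      @config || {}
    end

    def configure(&block)
      @config = yield block
    end
  end
end

HarvesterSketch.configure do
  { A: { host: 'https://a.example' } }
end

HarvesterSketch.configure do
  { B: { host: 'https://b.example' } }
end

# Only the hash from the most recent call is kept.
p HarvesterSketch.config.keys
```

To harvest several sites, return all of them from a single configure block.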

Instance Method Details

#index ⇒ Object


# File 'lib/geo_combine/geo_blacklight_harvester.rb', line 51

def index
  puts "Fetching page 1 @ #{base_url}&page=1" if self.class.config[:debug]
  response = JSON.parse(Net::HTTP.get(URI("#{base_url}&page=1")))
  response_class = BlacklightResponseVersionFactory.call(response)

  response_class.new(response: response, base_url: base_url).documents.each do |docs|
    docs.map! do |document|
      self.class.document_transformer.call(document) if self.class.document_transformer
    end.compact

    puts "Adding #{docs.count} documents to solr" if self.class.config[:debug]
    solr_connection.update params: { commitWithin: commit_within, overwrite: true },
                           data: docs.to_json,
                           headers: { 'Content-Type' => 'application/json' }

    sleep(crawl_delay.to_i) if crawl_delay
  end
end
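
As the listing reads, the result of compact is not assigned back to docs, so a transformer that returned nil for a document would leave a nil entry in the batch. The following is a minimal sketch of the transform-and-serialize step that filters such entries in place; prepare_batch, SKIP_SUPPRESSED, and the 'suppressed_b' field are hypothetical names, and the Solr request itself is left out.

```ruby
require 'json'

# Hypothetical transformer that drops suppressed records by returning nil.
SKIP_SUPPRESSED = lambda do |document|
  next nil if document['suppressed_b'] # 'suppressed_b' is an invented field

  document.delete('score')
  document
end

# Hypothetical helper: transform one page of documents and build the JSON body.
def prepare_batch(docs, transformer)
  docs.map! { |document| transformer.call(document) }
  docs.compact! # in place, so nil entries never reach the JSON payload
  docs.to_json
end

batch = [
  { 'id' => 'a', 'score' => 1.0 },
  { 'id' => 'b', 'suppressed_b' => true }
]
puts prepare_batch(batch, SKIP_SUPPRESSED)
```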