Module: Krikri::Harvester
Overview
Harvester is the abstract interface for aggregating records from a source. Harvesters need to be able to:
- Enumerate record ids (#record_ids)
- Enumerate records (#records)
- Retrieve individual records (#get_record)
Implementations of Enumerators in subclasses should be lazy, avoiding loading large numbers of records into memory or sending unnecessary requests to providers. The following example should be safe:
my_harvester.record_ids.take(100)
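One way to meet the laziness requirement is to build #record_ids on an Enumerator that fetches a page only when the consumer asks for more. The following is a minimal sketch, not Krikri's implementation: FakeProvider, its page method, and SketchHarvester are hypothetical stand-ins so that the example runs without Krikri or a network.

```ruby
# Hypothetical stand-in for a paginated provider; not part of Krikri.
class FakeProvider
  attr_reader :requests

  def initialize(total:, page_size:)
    @total = total
    @page_size = page_size
    @requests = 0
  end

  # Returns one page of ids and counts each simulated request.
  def page(number)
    @requests += 1
    first = number * @page_size
    return [] if first >= @total

    last = [first + @page_size, @total].min
    (first...last).map { |i| "id-#{i}" }
  end
end

# Illustrative harvester-like class; a real one would include Krikri::Harvester.
class SketchHarvester
  def initialize(provider)
    @provider = provider
  end

  # Yields ids page by page. A page is requested only when the consumer
  # has used up the previous one.
  def record_ids
    Enumerator.new do |yielder|
      page_number = 0
      loop do
        ids = @provider.page(page_number)
        break if ids.empty?

        ids.each { |id| yielder << id }
        page_number += 1
      end
    end.lazy
  end
end

provider = FakeProvider.new(total: 10_000, page_size: 50)
harvester = SketchHarvester.new(provider)

# With a lazy enumerator, take(100) is itself lazy; to_a forces only
# the first 100 ids rather than the full 10,000.
first_hundred = harvester.record_ids.take(100).to_a
```

In this sketch the provider is asked for only the first few pages, so the cost of take(100) does not grow with the size of the collection.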
This parent class implements a few generic methods based on the services outlined above:
- #count. This assumes that lazy counting is implemented on the
Enumerable returned by #record_ids. If not, it is strongly
recommended to override this method in your subclass with
an efficient implementation.
- #run. Wraps persistence of each record returned by #records.
Runs a full harvest, inserting Original Records into the database,
given the options passed to #initialize.
When including this module in a harvester class, add a call to Krikri::Harvester::Registry.register() so that the harvester can be looked up in the registry. See lib/krikri/engine.rb.
Constant Summary
Constants included from SoftwareAgent
Instance Attribute Summary
- #name ⇒ Object
Returns the value of attribute name.
- #uri ⇒ Object
Returns the value of attribute uri.
Class Method Summary
- .expected_opts ⇒ Object abstract
Instance Method Summary
- #count ⇒ Integer
A count of the records expected in the harvest.
- #get_record(_) ⇒ Krikri::OriginalRecord abstract
- #initialize(opts = {}) ⇒ Object
Accepts options for a generic harvester: uri, a URI for the harvest endpoint or provider; name, a name for the harvester or provider, which SHOULD be supplied when the provider does not use universally unique identifiers (optional).
- #record_ids ⇒ Enumerable<String> abstract
Provide a low-memory, lazy enumerable for record ids.
- #records ⇒ Enumerable<Krikri::OriginalRecord> abstract
The harvested records.
- #run(activity_uri = nil) ⇒ Boolean
Run the harvest.
Methods included from SoftwareAgent
Instance Attribute Details
#name ⇒ Object
Returns the value of attribute name.
# File 'lib/krikri/harvester.rb', line 37
def name
  @name
end
#uri ⇒ Object
Returns the value of attribute uri.
# File 'lib/krikri/harvester.rb', line 37
def uri
  @uri
end
Class Method Details
.expected_opts ⇒ Object
Return initialization options for the harvester. The harvester will expect to receive these upon instantiation (in the opts argument to #initialize), in the form:
{
  key: <symbol for this harvester>,
  opts: {
    option_name: {type: :type, required: <boolean>,
                  multiple_ok: <boolean, default false>}
  }
}
where type could be :uri, :string, :int, etc., and multiple_ok states whether the value is allowed to be an array. For example, for OAI this might be:
{key: :oai,
 set: {type: :string, required: false, multiple_ok: true},
 metadata_prefix: {type: :string, required: true}}
The actual type token values and how they will be used are still to be determined, but something should exist for providing validation guidelines to a client so that it does not need inside knowledge of the harvester’s code.
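To make the idea of client-side validation concrete, here is a rough sketch of how a client might check a set of options against a spec in the documented form. It is an assumption-laden illustration, not part of Krikri: the EXPECTED constant and the option_problems helper are hypothetical, and type checking is left out because the type tokens are not yet defined.

```ruby
# Hypothetical spec in the documented form; the option names are illustrative.
EXPECTED = {
  key: :oai,
  opts: {
    set: { type: :string, required: false, multiple_ok: true },
    metadata_prefix: { type: :string, required: true }
  }
}.freeze

# Returns a list of human-readable problems. An empty list means the
# given options satisfy the required and multiple_ok rules of the spec.
# The :type rule is deliberately ignored in this sketch.
def option_problems(spec, given)
  rules_by_name = spec.fetch(:opts)
  problems = []

  rules_by_name.each do |name, rules|
    value = given[name]
    if value.nil?
      problems << "#{name} is required" if rules[:required]
      next
    end

    if value.is_a?(Array) && !rules.fetch(:multiple_ok, false)
      problems << "#{name} does not accept multiple values"
    end
  end

  # Flag options the spec does not mention.
  (given.keys - rules_by_name.keys).each do |name|
    problems << "#{name} is not a recognized option"
  end

  problems
end

valid = option_problems(EXPECTED, metadata_prefix: 'oai_dc', set: %w[maps photos])
missing = option_problems(EXPECTED, set: 'maps')
```

Returning a list of problems rather than raising on the first one lets a client report every issue at once.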
The options will vary between harvesters. Some are constant for the whole harvest; others are lists that get iterated over, such as a set or collection. An ingestion event may need multiple jobs enqueued, one per set or collection, in which case the one option (a list) that varies from harvest job to harvest job might be ‘set’ (in the case of OAI). This method doesn’t solve how that will happen; it simply provides, as a convenience, the options that the harvester wants to see.
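Since the source leaves the per-set fan-out undecided, the following is only one possible approach, sketched under assumptions: the per_job_opts helper is hypothetical, and the options hash follows the nested form used by #initialize, with harvester-specific options under the harvester's key.

```ruby
# Splits one list-valued option under harvester_key into one options hash
# per value. All other options are copied unchanged into each result, and
# the input hash is not modified.
def per_job_opts(opts, harvester_key, list_option)
  specific = opts.fetch(harvester_key, {})
  values = Array(specific[list_option])
  return [opts] if values.empty?

  values.map do |value|
    opts.merge(harvester_key => specific.merge(list_option => value))
  end
end

base = {
  uri: 'http://example.org/oai',
  oai: { metadata_prefix: 'oai_dc', set: %w[maps photos] }
}

# One options hash per set, each suitable for its own harvest job.
jobs = per_job_opts(base, :oai, :set)
```

Each resulting hash could then be handed to a separate harvest job; how jobs are actually enqueued is outside the scope of this sketch.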
# File 'lib/krikri/harvester.rb', line 147
def self.expected_opts
  raise NotImplementedError
end
Instance Method Details
#count ⇒ Integer
Override if #record_ids does not implement a lazy Enumerable#count.
Returns a count of the records expected in the harvest.
# File 'lib/krikri/harvester.rb', line 78
def count
  record_ids.count
end
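The default implementation walks every id, which can be expensive for a large provider. The sketch below contrasts that default with an override that reads a total the provider already knows. It is illustrative only: CountingSketch and its stored total are hypothetical stand-ins for a real harvester and a real provider response.

```ruby
# Illustrative harvester-like class; not part of Krikri.
class CountingSketch
  attr_reader :ids_enumerated

  def initialize(total)
    @total = total          # stands in for a total reported by the provider
    @ids_enumerated = 0     # tracks how many ids were actually generated
  end

  def record_ids
    Enumerator.new do |yielder|
      @total.times do |i|
        @ids_enumerated += 1
        yielder << "id-#{i}"
      end
    end.lazy
  end

  # Mirrors the default behaviour: counting forces every id to be enumerated.
  def default_count
    record_ids.count
  end

  # Override: answer from the known total without enumerating any ids.
  def count
    @total
  end
end

harvester = CountingSketch.new(1_000)
cheap = harvester.count
enumerated_after_cheap = harvester.ids_enumerated
expensive = harvester.default_count
enumerated_after_expensive = harvester.ids_enumerated
```

Both methods return the same number here; the difference is that the override never touches the id enumerator.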
#get_record(_) ⇒ Krikri::OriginalRecord
Get a single record by identifier.
# File 'lib/krikri/harvester.rb', line 94
def get_record(_)
  raise NotImplementedError
end
#initialize(opts = {}) ⇒ Object
Accepts options for a generic harvester:
- uri: a URI for the harvest endpoint or provider.
- name: a name for the harvester or provider; SHOULD be supplied when the provider does not use universally unique identifiers (optional).
- record_class: Class of records to generate (optional; defaults to Krikri::OriginalRecord).
- id_minter: Module to create identifiers for generated records (optional; defaults to Krikri::Md5Minter).
Pass harvester-specific options to inheriting classes under a key for that harvester, e.g. { uri: my_uri, oai: { metadata_prefix: :oai_dc } }
# File 'lib/krikri/harvester.rb', line 53
def initialize(opts = {})
  @uri = opts.fetch(:uri)
  @name = opts.delete(:name)
  @record_class = opts.delete(:record_class) { Krikri::OriginalRecord }
  @id_minter = opts.delete(:id_minter) { Krikri::Md5Minter }
end
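The option handling relies on two standard Hash behaviours: fetch raises KeyError when the key is absent, and delete with a block returns the block's value when the key is absent. The sketch below mirrors the listing with hypothetical stand-ins (InitSketch, DefaultRecord, DefaultMinter) so that it runs without Krikri; the readers for record_class and id_minter are added only to make the result inspectable.

```ruby
# Stand-ins so the sketch runs without Krikri loaded; both are hypothetical.
DefaultRecord = Class.new
DefaultMinter = Module.new

class InitSketch
  attr_reader :uri, :name, :record_class, :id_minter

  def initialize(opts = {})
    @uri = opts.fetch(:uri)       # raises KeyError when :uri is absent
    @name = opts.delete(:name)    # nil when :name is absent
    # Hash#delete with a block falls back to the block's value.
    @record_class = opts.delete(:record_class) { DefaultRecord }
    @id_minter = opts.delete(:id_minter) { DefaultMinter }
  end
end

opts = {
  uri: 'http://example.org/oai',
  name: 'example',
  oai: { metadata_prefix: :oai_dc }
}
harvester = InitSketch.new(opts)

# delete removes the generic keys from the caller's hash, so what remains
# is :uri plus any harvester-specific keys such as :oai.
remaining_keys = opts.keys
```

Because delete mutates the hash that was passed in, a caller who needs the original options afterwards may want to pass a copy.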
#record_ids ⇒ Enumerable<String>
Provide a low-memory, lazy enumerable for record ids.
The following usage should be safe:
record_ids.each do |id|
  some_operation(id)
end
# File 'lib/krikri/harvester.rb', line 70
def record_ids
  raise NotImplementedError
end
#records ⇒ Enumerable<Krikri::OriginalRecord>
Provide a low-memory, lazy enumerable for records.
Returns the harvested records.
# File 'lib/krikri/harvester.rb', line 85
def records
  raise NotImplementedError
end
#run(activity_uri = nil) ⇒ Boolean
Run the harvest. This should be idempotent so it can be safely retried on errors.
# File 'lib/krikri/harvester.rb', line 103
def run(activity_uri = nil)
  log :info, 'harvest is running'
  records.each { |rec| rec.save(activity_uri) }
  log :info, 'harvest is done'
  true
end
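Whether #run is idempotent depends on how each record's save behaves. The sketch below shows one way that property can hold: saves are keyed by record id, so a retry overwrites existing entries instead of duplicating them. Everything here is a hypothetical stand-in (STORE, SketchRecord, RunSketch); it does not describe how Krikri::OriginalRecord persists records, and logging is omitted.

```ruby
# In-memory stand-in for a database, keyed by record id.
STORE = {}

SketchRecord = Struct.new(:id, :content) do
  # Keyed write: saving the same record twice leaves a single entry.
  def save(activity_uri = nil)
    STORE[id] = { content: content, activity: activity_uri }
  end
end

class RunSketch
  # Lazy enumerable of records, built only as the consumer iterates.
  def records
    %w[a b c].lazy.map { |id| SketchRecord.new(id, "body of #{id}") }
  end

  # Same shape as the listing above, without logging.
  def run(activity_uri = nil)
    records.each { |rec| rec.save(activity_uri) }
    true
  end
end

harvester = RunSketch.new
first_result = harvester.run('http://example.org/activity/1')
size_after_first = STORE.size

# Simulate a retry after an error: the store ends up in the same state.
second_result = harvester.run('http://example.org/activity/1')
size_after_retry = STORE.size
```

A save that appended rows instead of writing by key would make a retry create duplicates, which is the failure mode the idempotency requirement guards against.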