Module: Krikri::Harvester

Overview

Harvester is the abstract interface for aggregating records from a source. Harvesters need to be able to:

- Enumerate record ids (#record_ids)
- Enumerate records (#records)
- Retrieve individual records (#get_record)

Implementations of Enumerators in subclasses should be lazy, avoiding loading large numbers of records into memory or sending unnecessary requests to providers. The following example should be safe:

my_harvester.record_ids.take(100)
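
As a rough illustration of why that call can be safe, here is a minimal sketch of a lazily paginated id enumerator in plain Ruby. `FakeProvider` and `lazy_record_ids` are hypothetical names invented for this sketch and are not part of Krikri; the provider simply counts how many page requests it receives.

```ruby
# Hypothetical stand-in for a paginated provider; counts the requests made.
class FakeProvider
  attr_reader :requests

  def initialize(total:, page_size:)
    @total = total
    @page_size = page_size
    @requests = 0
  end

  # Returns one page of ids, or an empty array past the end.
  def page(number)
    @requests += 1
    first = number * @page_size
    return [] if first >= @total
    last = [first + @page_size, @total].min
    (first...last).map { |i| "id-#{i}" }
  end
end

# Builds an Enumerator that fetches a page only when the consumer needs it.
def lazy_record_ids(provider)
  Enumerator.new do |yielder|
    page_number = 0
    loop do
      ids = provider.page(page_number)
      break if ids.empty?
      ids.each { |id| yielder << id }
      page_number += 1
    end
  end
end

provider = FakeProvider.new(total: 10_000, page_size: 50)
first_ids = lazy_record_ids(provider).take(100)
```

With a page size of 50, taking 100 ids should touch only the first two pages rather than all 200, which is the property the paragraph above asks implementations to preserve.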

This parent class implements a few generic methods based on the services outlined above:

- #count. This assumes that lazy counting is implemented on the
  Enumerable returned by #record_ids. If not, it is strongly
  recommended to override this method in your subclass with
  an efficient implementation.
- #run. Wraps persistence of each record returned by #records.
  Runs a full harvest, processing the created `record_class`
  instances through `harvest_behavior`, given the options passed
  to #initialize.

When including this module in a class, call Krikri::Harvester::Registry.register() to add the class to the registry so that it can be looked up.

Defined Under Namespace

Classes: IdentifierError

Constant Summary

Registry =

An application-wide registry of defined Harvesters.

Class.new(Krikri::Registry)

Instance Attribute Summary

Attributes included from SoftwareAgent

#entity_behavior

Class Method Summary

Instance Method Summary

Methods included from SoftwareAgent

#agent_name

Instance Attribute Details

#name ⇒ Object

Returns the value of attribute name.



# File 'lib/krikri/harvester.rb', line 37

def name
  @name
end

#uri ⇒ Object

Returns the value of attribute uri.



# File 'lib/krikri/harvester.rb', line 37

def uri
  @uri
end

Class Method Details

.expected_opts ⇒ Object

This method is abstract.

Return initialization options for the harvester. The harvester expects to receive these upon instantiation (in the opts argument to #initialize), in the form:

{
  key: <symbol for this harvester>,
  opts: {
    option_name: {type: :type, required: <boolean>,
                  multiple_ok: <boolean, default false>}
  }
}

where type could be :uri, :string, :int, etc., and multiple_ok indicates whether the value is allowed to be an array. For example, for OAI this might be:

{key: :oai,
 set: {type: :string, required: false, multiple_ok: true},
 metadata_prefix: {type: :string, required: true}}

TODO:

The actual type token values and how they’ll be used is to be determined, but something should exist for providing validation guidelines to a client so it doesn’t have to have inside knowledge of the harvester’s code.

Note:

The options are going to vary between harvesters. Some options are going to be constant for the whole harvest, and some are going to be lists that get iterated over. For example, a set or collection. There will be an ingestion event where we want multiple jobs enqueued, one per set or collection. The one option (a list) that would vary from harvest job to harvest job might be ‘set’ (in the case of OAI). This method doesn’t solve how that’s going to happen, but simply provides, as a convenience, the options that the harvester wants to see.
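
To make the intended use of such a description more concrete, here is a minimal sketch of how a client might check user-supplied options against it. Everything in it is an assumption for illustration: `SPEC` is a made-up description in the general form shown above, and `option_problems` is a hypothetical helper, not a Krikri method. Type tokens are deliberately not checked, since their semantics are still to be determined.

```ruby
# Hypothetical spec in the documented general form; not taken from a
# real harvester.
SPEC = {
  key: :oai,
  opts: {
    set: { type: :string, required: false, multiple_ok: true },
    metadata_prefix: { type: :string, required: true }
  }
}.freeze

# Returns a list of human-readable problems; an empty list means the
# given options satisfy the required and multiple_ok rules of the spec.
def option_problems(spec, given)
  problems = []
  spec[:opts].each do |name, rule|
    value = given[name]
    if value.nil?
      problems << "#{name} is required" if rule[:required]
    elsif value.is_a?(Array) && !rule.fetch(:multiple_ok, false)
      problems << "#{name} must not be a list"
    end
  end
  problems
end

ok = option_problems(SPEC, { metadata_prefix: 'oai_dc', set: %w[a b] })
bad = option_problems(SPEC, { metadata_prefix: %w[oai_dc mods] })
missing = option_problems(SPEC, {})
```

The point of the sketch is only that a client can work from the returned description alone, without inside knowledge of the harvester's code.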

Raises:

  • (NotImplementedError)


# File 'lib/krikri/harvester.rb', line 183

def self.expected_opts
  raise NotImplementedError
end

Instance Method Details

#count ⇒ Integer

Note:

Override if #record_ids does not implement a lazy Enumerable#count.

Returns:

  • (Integer)

    A count of the records expected in the harvest.



# File 'lib/krikri/harvester.rb', line 103

def count
  record_ids.count
end
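
Because Enumerable#count iterates the whole collection by default, the note above matters for large harvests. The following is a minimal sketch, under the assumption that the provider reports a total up front, of an id collection whose #count answers without iterating. `CountedIds` is a hypothetical class written for this example, not part of Krikri.

```ruby
# Hypothetical id collection whose total is known from provider metadata,
# so #count can answer without iterating.
class CountedIds
  include Enumerable

  attr_reader :iterations

  def initialize(total)
    @total = total
    @iterations = 0
  end

  def each
    return to_enum(:each) unless block_given?
    @total.times do |i|
      @iterations += 1
      yield "id-#{i}"
    end
  end

  # Overrides Enumerable#count for the no-argument case only; counting
  # with an argument or a block still has to iterate.
  def count(*args, &block)
    return super if block || !args.empty?
    @total
  end
end

ids = CountedIds.new(1_000_000)
total = ids.count
```

A harvester whose #record_ids returned an object like this could keep the inherited #count; otherwise overriding #count in the harvester itself is the simpler route.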

#get_record(_) ⇒ Krikri::OriginalRecord

This method is abstract.

Get a single record by identifier.

Parameters:

  • identifier (#to_s)

    the identifier for the record to be retrieved

Returns:

  • (Krikri::OriginalRecord)

Raises:

  • (NotImplementedError)


# File 'lib/krikri/harvester.rb', line 119

def get_record(_)
  raise NotImplementedError
end

#initialize(opts = {}) ⇒ Object

Accepts options for a generic harvester:

uri: a URI for the harvest endpoint or provider
name: a name for the harvester or provider, SHOULD be supplied when the
      provider does not use universally unique identifiers (optional).
record_class: Class of records to generate (optional; defaults to
              Krikri::OriginalRecord).
id_minter: Module to create identifiers for generated records (optional;
           defaults to Krikri::Md5Minter)
harvest_behavior: A behavior object implementing `#process_record`
                  (optional; defaults to
                  Krikri::Harvesters::BasicSaveBehavior).

Pass harvester-specific options to inheriting classes under a key for that harvester, e.g. { uri: my_uri, oai: { metadata_prefix: :oai_dc } }

Parameters:

  • opts (Hash) (defaults to: {})

    a hash of options



# File 'lib/krikri/harvester.rb', line 70

def initialize(opts = {})
  @uri = opts.fetch(:uri)
  @name = opts.delete(:name)
  @record_class = opts.delete(:record_class) { Krikri::OriginalRecord }
                  .to_s.constantize
  @id_minter = opts.delete(:id_minter) { Krikri::Md5Minter }
               .to_s.constantize
  @harvest_behavior = opts.delete(:harvest_behavior) do
    Krikri::Harvesters::BasicSaveBehavior
  end.to_s.constantize
  @entity_behavior = self.class.entity_behavior
end
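
The listing above uses Hash#delete with a block to supply defaults, which also removes the generic keys from the hash that was passed in, while :uri is read with fetch and so stays in place and is required. A minimal sketch of that pattern in plain Ruby follows; `DefaultRecord`, `CustomRecord`, and `extract_generic_options` are hypothetical names, and Object.const_get stands in for ActiveSupport's constantize.

```ruby
# Hypothetical record classes used only for this sketch.
class DefaultRecord; end
class CustomRecord; end

# Pulls generic options out of opts, leaving harvester-specific keys behind.
def extract_generic_options(opts)
  {
    # fetch raises KeyError when :uri is missing and does not remove it
    uri: opts.fetch(:uri),
    # delete returns nil when the key is absent and no block is given
    name: opts.delete(:name),
    # the block supplies the default when :record_class is absent
    record_class: Object.const_get(
      opts.delete(:record_class) { DefaultRecord }.to_s
    )
  }
end

opts = { uri: 'http://example.org/oai', record_class: 'CustomRecord',
         oai: { metadata_prefix: :oai_dc } }
generic = extract_generic_options(opts)
```

One consequence worth keeping in mind is that the caller's hash is mutated, so a hash reused across several harvesters would lose its generic keys after the first call.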

#record_ids ⇒ Enumerable<String>

This method is abstract.

Provide a low-memory, lazy enumerable for record ids.

The following usage should be safe:

record_ids.each do |id|
   some_operation(id)
end

Returns:

  • (Enumerable<String>)

    The identifiers included in the harvest.

Raises:

  • (NotImplementedError)


# File 'lib/krikri/harvester.rb', line 95

def record_ids
  raise NotImplementedError
end

#records ⇒ Enumerable<Krikri::OriginalRecord>

This method is abstract.

Provide a low-memory, lazy enumerable for records.

Returns:

  • (Enumerable<Krikri::OriginalRecord>)

    The harvested records.

Raises:

  • (NotImplementedError)


# File 'lib/krikri/harvester.rb', line 110

def records
  raise NotImplementedError
end

#run(activity_uri = nil) ⇒ Boolean

Run the harvest.

Individual records are processed through `#process_record`, which is delegated to the harvester's `@harvest_behavior` by default.

Returns:

  • (Boolean)

See Also:

  • Krikri::Harvesters::HarvestBehavior


# File 'lib/krikri/harvester.rb', line 131

def run(activity_uri = nil)
  records.each do |rec|
    next if rec.nil?
    begin
      process_record(rec, activity_uri)
    rescue => e
      Krikri::Logger.log :error, "Error harvesting record:\n" \
                                 "#{rec.content}\n\twith message:\n"\
                                 "#{e.message}"
      next
    end
  end
  true
end
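
The control flow of #run can be tried in isolation with a small stand-in. The sketch below mirrors the loop above under simplifying assumptions: `SketchHarvester` is a hypothetical class, records are plain strings, and errors are collected in an array instead of being sent to Krikri::Logger.

```ruby
# Hypothetical stand-in that mirrors the loop in #run: nil records are
# skipped, and an error in one record is recorded without stopping the rest.
class SketchHarvester
  attr_reader :processed, :errors

  def initialize(records)
    @records = records
    @processed = []
    @errors = []
  end

  # Stands in for the delegated behavior; fails on empty records.
  def process_record(rec)
    raise ArgumentError, 'empty record' if rec.empty?
    @processed << rec
  end

  def run
    @records.each do |rec|
      next if rec.nil?
      begin
        process_record(rec)
      rescue => e
        @errors << e.message
        next
      end
    end
    true
  end
end

harvester = SketchHarvester.new(['a', nil, '', 'b'])
result = harvester.run
```

As in the original method, the return value is true even when individual records fail, so callers that need to know about failures have to consult the log (or, in this sketch, the errors list).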