Description

This gem lets you harvest records concurrently from OAI-PMH repositories, with custom, user-provided logic for digesting the records.

Installation

Install the gem and add to the application's Gemfile by executing:

bundle add oai_schedules

If bundler is not being used to manage dependencies, install the gem by executing:

gem install oai_schedules

Usage

require 'oai_schedules/manager'

f_show = lambda do |name, content, records, done, error, state, logger|
  if error.nil?
    # ... do your stuff with records ...
  else
    puts error.message
  end
  puts state
  if done
    puts "done full harvesting"
  end
end

manager = OAISchedules::Manager.new(
  path_dir_state: "./dir_state",
  f_digest: f_show
)
content_schedule = {
  "interval" => "PT2S",
  "active" => true,
  "repository" => {
    "uri" => "https://eudml.org/oai/OAIHandler"
  },
  "format" => "oai_dc",
  "set" => "CEDRAM"
}
manager.add_schedule("my_sample_schedule", content_schedule)
sleep

The demo app above instantiates a schedules manager. The manager saves each schedule's internal state (e.g. the OAI resumption token) in the folder ./dir_state. Each schedule's state file is a JSON file named state_<name-schedule>.json. If the schedules manager crashes, the state file is used to resume harvesting from the last saved point.

A schedule is then added to the manager. The schedule named my_sample_schedule fetches partial records from the set CEDRAM in oai_dc (Dublin Core) format, from the repository whose URI is https://eudml.org/oai/OAIHandler, every 2 seconds. At every iteration, the manager queries https://eudml.org/oai/OAIHandler?verb=ListRecords&... with the necessary query parameters, using the OAI resumption token from the previous iteration (for every iteration after the first). It then writes the new token to the state file, and repeats until no token is provided (end of the harvesting).

The custom function provided as f_digest is called at each iteration. It receives the schedule name and content, the partial list of records as a hash, a done flag (true when the full harvesting is complete), the error raised during the iteration (if any), the harvesting state, and the same logger used internally by the schedules manager.

A schedule is executed as soon as it is added. You can add all schedules in advance, then call sleep to keep the event loop running indefinitely. Alternatively, you can do the following:

# add schedule
manager.add_schedule("my_sample_schedule", content_schedule)

# ... do your own things here ...

# modify schedule
content_schedule["active"] = false  # e.g. pause schedule
manager.modify_schedule("my_sample_schedule", content_schedule)

# .... do other things ...

# modify schedule
content_schedule["active"] = true  # e.g. resume schedule ...
content_schedule["interval"] = "PT5S" # ... but slower
manager.modify_schedule("my_sample_schedule", content_schedule)

# finally remove schedule
manager.remove_schedule("my_sample_schedule")
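
For illustration, the ListRecords request made at each iteration can be sketched as follows. This is a minimal sketch and not the gem's actual code: list_records_url and its parameter names are hypothetical. It relies on the OAI-PMH rule that resumptionToken is an exclusive argument, so a follow-up request carries only the verb and the token.

```ruby
require "uri"

# Hypothetical helper for illustration; not part of the oai_schedules API.
# Builds the URL of an OAI-PMH ListRecords request.
def list_records_url(base_uri, format:, set: nil, from: nil, until_date: nil, resumption_token: nil)
  params = { "verb" => "ListRecords" }
  if resumption_token
    # In OAI-PMH, resumptionToken is an exclusive argument:
    # it is sent with the verb only, without metadataPrefix, set, from or until.
    params["resumptionToken"] = resumption_token
  else
    params["metadataPrefix"] = format
    params["set"] = set if set
    params["from"] = from if from
    params["until"] = until_date if until_date
  end
  uri = URI.parse(base_uri)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# First request of a harvesting run: no token yet.
puts list_records_url("https://eudml.org/oai/OAIHandler", format: "oai_dc", set: "CEDRAM")
# => https://eudml.org/oai/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=CEDRAM

# Follow-up request: only the verb and the token from the previous response.
puts list_records_url("https://eudml.org/oai/OAIHandler", format: "oai_dc", resumption_token: "abc123")
# => https://eudml.org/oai/OAIHandler?verb=ListRecords&resumptionToken=abc123
```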

It is also possible to listen to a folder containing schedule JSON files. These files must be named schedule_<name-schedule>.json; the manager extracts the schedule name from the file name. See the following:

manager = OAISchedules::Manager.new(
  # ...
  path_dir_schedules: "./dir_schedules",
  # ...
)
# # alternative:
# manager = OAISchedules::Manager.new()
# manager.set_listener_dir_schedules("./dir_schedules")
manager.run_listener_dir_schedules(block: true)

The app above listens for file additions, modifications and deletions in the folder ./dir_schedules. The listener is started in blocking mode.
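
The two file naming conventions (schedule_<name-schedule>.json for schedule files, state_<name-schedule>.json for state files) can be sketched with a pair of small helpers. Both helpers are hypothetical and only illustrate the naming scheme; how the gem handles these names internally is not shown in this README.

```ruby
# Hypothetical helpers for illustration; not part of the oai_schedules API.

# Returns the schedule name encoded in a schedule file name,
# or nil if the file does not follow schedule_<name-schedule>.json.
def schedule_name_from_path(path)
  File.basename(path)[/\Aschedule_(.+)\.json\z/, 1]
end

# Returns the path of the state file for a given schedule name.
def state_file_path(dir_state, name)
  File.join(dir_state, "state_#{name}.json")
end

name = schedule_name_from_path("./dir_schedules/schedule_my_sample_schedule.json")
puts name                                  # => my_sample_schedule
puts state_file_path("./dir_state", name)  # => ./dir_state/state_my_sample_schedule.json
```

A file such as ./dir_schedules/notes.txt yields nil here, so a caller could skip it.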

You can add as many schedules as you want (with different names). They will all run concurrently.

A schedule definition, either provided programmatically or from file, must have this structure:

{
  "interval": "PT2S",     // (required) schedule interval as an ISO 8601 duration
  "active": true,         // (required) whether the schedule is active (resumed) or not (paused)
  "repository": {
    "uri": "https://eudml.org/oai/OAIHandler"     // (required) OAI-PMH repository URL
  },
  "format": "oai_dc",     // (required) metadata prefix to use
  "set": "CEDRAM",        // (optional) set to collect. NOTE: some repositories require it
  "from": "2025-03-23T00:00:00Z",   // (optional) start datetime to collect from
  "until": "2025-03-23T00:00:00Z"   // (optional) end datetime to collect until
}
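
Before handing a schedule to the manager, it can help to check it against the structure above. The following is a minimal sketch under stated assumptions: schedule_errors, interval_seconds and iso8601_datetime? are hypothetical helpers, not part of the gem, and the duration pattern is deliberately simplified to the time part of an ISO 8601 duration (hours, minutes, seconds). The gem itself may accept more forms or validate differently.

```ruby
require "time"

# Hypothetical helpers for illustration; not part of the oai_schedules API.
REQUIRED_KEYS = %w[interval active repository format].freeze

# Simplified pattern: only durations such as PT2S, PT5M or PT1H30M are accepted.
DURATION = /\APT(?=\d)(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?\z/

# Returns the interval in seconds, or nil if the text does not match DURATION.
def interval_seconds(text)
  match = text.is_a?(String) ? DURATION.match(text) : nil
  return nil unless match

  hours, minutes, seconds = match.captures.map(&:to_i)
  hours * 3600 + minutes * 60 + seconds
end

# Returns true if the value is a string holding an ISO 8601 datetime.
def iso8601_datetime?(value)
  return false unless value.is_a?(String)

  Time.iso8601(value)
  true
rescue ArgumentError
  false
end

# Returns a list of problems found in a schedule hash (empty if none).
def schedule_errors(content)
  errors = REQUIRED_KEYS.reject { |key| content.key?(key) }
                        .map { |key| "missing required key: #{key}" }
  if content.key?("interval") && interval_seconds(content["interval"]).nil?
    errors << "interval must be an ISO 8601 duration such as PT2S"
  end
  if content.key?("active") && ![true, false].include?(content["active"])
    errors << "active must be true or false"
  end
  if content.key?("repository")
    repository = content["repository"]
    unless repository.is_a?(Hash) && repository["uri"].is_a?(String)
      errors << "repository.uri must be a string"
    end
  end
  %w[from until].each do |key|
    next unless content.key?(key)

    errors << "#{key} must be an ISO 8601 datetime" unless iso8601_datetime?(content[key])
  end
  errors
end

content_schedule = {
  "interval" => "PT2S",
  "active" => true,
  "repository" => { "uri" => "https://eudml.org/oai/OAIHandler" },
  "format" => "oai_dc",
  "set" => "CEDRAM"
}
p schedule_errors(content_schedule)  # => []
p interval_seconds("PT2S")           # => 2
```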

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/oai_schedules. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the OaiSchedules project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.