Description
This gem lets you harvest records concurrently from OAI-PMH repositories, and process them with your own custom digestion logic.
Installation
Install the gem and add it to the application's Gemfile by executing:
bundle add oai_schedules
If bundler is not being used to manage dependencies, install the gem by executing:
gem install oai_schedules
Usage
require 'oai_schedules/manager'
f_show = lambda do |name, content, records, done, error, state, logger|
  if error.nil?
    # ... do your stuff with records ...
  else
    puts error.message
  end
  puts state
  if done
    puts "done full harvesting"
  end
end
manager = OAISchedules::Manager.new(
  path_dir_state: "./dir_state",
  f_digest: f_show
)
content_schedule = {
  "interval" => "PT2S",
  "active" => true,
  "repository" => {
    "uri" => "https://eudml.org/oai/OAIHandler"
  },
  "format" => "oai_dc",
  "set" => "CEDRAM"
}
manager.add_schedule("my_sample_schedule", content_schedule)
sleep
The demo app above instantiates a schedules manager.
It saves each schedule's internal state (e.g. the OAI resumption token) to a file in the folder ./dir_state.
Each schedule's state file is a JSON file named state_<name-schedule>.json.
If the schedules manager crashes, the state file is used to resume harvesting from the last saved point.
A schedule is then added to the manager.
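To inspect a saved state by hand, something like the following sketch may help. It only relies on the state_<name-schedule>.json naming convention described above. The helper names path_state_file and read_state are hypothetical (they are not part of the gem), and the resumption_token key in the demo is invented, since the real layout of the state file may differ.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: builds the state file path for a schedule,
# following the state_<name-schedule>.json convention.
def path_state_file(path_dir_state, name_schedule)
  File.join(path_dir_state, "state_#{name_schedule}.json")
end

# Hypothetical helper: returns the parsed state, or nil when no state
# file has been saved yet for this schedule.
def read_state(path_dir_state, name_schedule)
  path = path_state_file(path_dir_state, name_schedule)
  return nil unless File.exist?(path)

  JSON.parse(File.read(path))
end

# Demo in a throwaway directory, so a real ./dir_state is not touched.
Dir.mktmpdir do |dir|
  # The key below is invented for the demo; the gem's real state layout may differ.
  File.write(path_state_file(dir, "my_sample_schedule"),
             JSON.generate({ "resumption_token" => "abc123" }))
  p read_state(dir, "my_sample_schedule")
  p read_state(dir, "other_schedule")
end
```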
The schedule named my_sample_schedule gets partial records from the set CEDRAM, in oai_dc (Dublin Core) format, from the repository whose URI is https://eudml.org/oai/OAIHandler, every 2 seconds.
At every iteration (every 2 seconds), it gets the next partial list of records, using the OAI resumption token from the previous iteration (for every iteration after the very first one).
It does this by querying https://eudml.org/oai/OAIHandler?verb=ListRecords&... with the necessary query parameters.
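For illustration, the ListRecords query could be assembled roughly as follows. This is a sketch of the OAI-PMH request format, not the gem's internal code: build_list_records_uri is a hypothetical helper, and it assumes the schedule hash layout shown in the usage example. In OAI-PMH, resumptionToken is an exclusive argument, so follow-up requests carry only verb and resumptionToken.

```ruby
require "uri"

# Hypothetical helper: builds a ListRecords request URI from a schedule hash.
# When a resumption token is given, only verb and resumptionToken are sent,
# because OAI-PMH treats resumptionToken as an exclusive argument.
def build_list_records_uri(content_schedule, resumption_token = nil)
  params =
    if resumption_token
      { "verb" => "ListRecords", "resumptionToken" => resumption_token }
    else
      {
        "verb" => "ListRecords",
        "metadataPrefix" => content_schedule["format"],
        "set" => content_schedule["set"],
        "from" => content_schedule["from"],
        "until" => content_schedule["until"]
      }.compact # drop the optional arguments that are not set
    end
  uri = URI.parse(content_schedule["repository"]["uri"])
  uri.query = URI.encode_www_form(params)
  uri
end

content_schedule = {
  "repository" => { "uri" => "https://eudml.org/oai/OAIHandler" },
  "format" => "oai_dc",
  "set" => "CEDRAM"
}
puts build_list_records_uri(content_schedule)           # first request
puts build_list_records_uri(content_schedule, "abc123") # follow-up request
```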
The custom function provided as f_digest is then called at each iteration.
It receives the schedule name and content, the partial list of records as a hash, a done flag (true once the full harvesting is complete), an error exception (if one occurred), the harvesting state, and the same logger used internally by the schedules manager.
At each iteration the manager also writes the new resumption token to the state file, until no token is provided (end of the harvesting).
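As an example of a slightly more useful digest function, the sketch below appends every batch of records to a JSON Lines file. The argument order mirrors the usage example above. build_digest is a hypothetical helper, and the records hash used in the demo call is invented, since the real record layout depends on the repository and the metadata format.

```ruby
require "json"
require "logger"
require "tmpdir"

# Hypothetical factory: returns a digest lambda that appends each batch of
# records, as one JSON document per line, to the file at path_output.
def build_digest(path_output)
  lambda do |name, _content, records, done, error, _state, logger|
    unless error.nil?
      # On error there is nothing to store; log and skip this iteration.
      logger.error("#{name}: #{error.message}")
      next
    end
    File.open(path_output, "a") { |file| file.puts(JSON.generate(records)) }
    logger.info("#{name}: harvesting complete") if done
  end
end

# Demo: call the lambda by hand, the way the manager would at each iteration.
Dir.mktmpdir do |dir|
  path = File.join(dir, "records.jsonl")
  f_digest = build_digest(path)
  logger = Logger.new($stdout)
  # The records hash below is invented for the demo.
  f_digest.call("my_sample_schedule", {}, { "record" => [] }, false, nil, {}, logger)
  f_digest.call("my_sample_schedule", {}, nil, false, StandardError.new("timeout"), {}, logger)
  puts File.read(path)
end
```

If the same lambda is shared by several schedules running concurrently, it is probably worth guarding the file write with a Mutex.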
As soon as a schedule is added, it is executed.
It is possible to add all schedules in advance, then call sleep, which blocks the main thread indefinitely while the schedules run.
Alternatively, schedules can be added, modified and removed at any time:
# add schedule
manager.add_schedule("my_sample_schedule", content_schedule)
# ... do your own things here ...
# modify schedule
content_schedule["active"] = false # e.g. pause schedule
manager.modify_schedule("my_sample_schedule", content_schedule)
# .... do other things ...
# modify schedule
content_schedule["active"] = true # e.g. resume schedule ...
content_schedule["interval"] = "PT5S" # ... but slower
manager.modify_schedule("my_sample_schedule", content_schedule)
# finally remove schedule
manager.remove_schedule("my_sample_schedule")
It is also possible to listen to a folder containing schedule JSON files. These files must have this format: schedule_<name-schedule>.json. The manager will extract the schedule name from the file name. See the following:
manager = OAISchedules::Manager.new(
  # ...
  path_dir_schedules: "./dir_schedules",
  # ...
)
# # alternative:
# manager = OAISchedules::Manager.new()
# manager.set_listener_dir_schedules("./dir_schedules")
manager.run_listener_dir_schedules(block: true)
The app above listens for the addition, modification and deletion of files in the folder ./dir_schedules.
The listener is started in blocking mode.
You can add as many schedules as you want (with different names); they all run concurrently.
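The file naming convention can be illustrated with a small sketch. schedule_name_from_path is a hypothetical helper, not the gem's own code; it only shows how a schedule name can be recovered from a schedule_<name-schedule>.json file name.

```ruby
# Hypothetical helper: extracts the schedule name from a file path that
# follows the schedule_<name-schedule>.json convention; returns nil for
# any other file name.
def schedule_name_from_path(path)
  match = File.basename(path).match(/\Aschedule_(.+)\.json\z/)
  match && match[1]
end

p schedule_name_from_path("./dir_schedules/schedule_my_sample_schedule.json")
# => "my_sample_schedule"
p schedule_name_from_path("./dir_schedules/notes.txt")
# => nil
```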
A schedule definition, either provided programmatically or from file, must have this structure:
{
  "interval": "PT2S", // (required) schedule interval in ISO8601 format
  "active": true, // (required) is active (resumed) or not (paused)
  "repository": {
    "uri": "https://eudml.org/oai/OAIHandler" // (required) OAI-PMH repository url
  },
  "format": "oai_dc", // (required) metadata prefix to use
  "set": "CEDRAM", // (optional) set to collect. NOTE: on some repositories, this is necessary
  "from": "2025-03-23T00:00:00Z", // (optional) start datetime to collect from
  "until": "2025-03-23T00:00:00Z" // (optional) end datetime to collect until
}
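Before handing a schedule to the manager, it can be useful to check it against the structure above. The following is a minimal sketch under stated assumptions: duration_seconds and schedule_errors are hypothetical helpers, and the duration pattern is deliberately simplified (days, hours, minutes and seconds only; no years, months, weeks or fractions). The gem's own validation, if any, may behave differently.

```ruby
require "time"

# Simplified ISO 8601 duration pattern: days plus time parts only.
PATTERN_DURATION = /\AP(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?\z/

# Hypothetical helper: converts a duration such as "PT2S" to seconds,
# or returns nil when the text does not match the simplified pattern.
def duration_seconds(text)
  match = PATTERN_DURATION.match(text.to_s)
  return nil if match.nil? || match.captures.all?(&:nil?)

  days, hours, minutes, seconds = match.captures.map(&:to_i)
  days * 86_400 + hours * 3_600 + minutes * 60 + seconds
end

# Hypothetical helper: returns a list of problems found in a schedule hash
# (empty when the required fields are present and well formed).
def schedule_errors(content)
  errors = []
  errors << "interval must be an ISO 8601 duration" if duration_seconds(content["interval"]).nil?
  errors << "active must be true or false" unless [true, false].include?(content["active"])
  errors << "repository.uri is required" if content.dig("repository", "uri").to_s.empty?
  errors << "format is required" if content["format"].to_s.empty?
  %w[from until].each do |key|
    next if content[key].nil?

    begin
      Time.iso8601(content[key].to_s)
    rescue ArgumentError
      errors << "#{key} must be an ISO 8601 datetime"
    end
  end
  errors
end

content_schedule = {
  "interval" => "PT2S",
  "active" => true,
  "repository" => { "uri" => "https://eudml.org/oai/OAIHandler" },
  "format" => "oai_dc",
  "set" => "CEDRAM"
}
p schedule_errors(content_schedule) # => []
```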
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/oai_schedules. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
License
The gem is available as open source under the terms of the MIT License.
Code of Conduct
Everyone interacting in the OaiSchedules project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.