OpeNER Daemons
This Gem makes it possible for OpeNER components to be used as a daemon using Amazon SQS and Amazon S3. SQS is used for job input while S3 is used for storing results. Daemons only take URLs as input; they don't allow text to be specified directly because of the size restrictions of SQS (a maximum of 256 KB per message).
Usage
Create an executable file bin/<component>-daemon, for example
bin/language-identifier-daemon, with the following content:
#!/usr/bin/env ruby
require 'opener/daemons'
controller = Opener::Daemons::Controller.new(
:name => 'opener-<component>',
:exec_path => File.expand_path('../../exec/<component>.rb', __FILE__)
)
controller.run
Replace <component> with the name of the component. For example, for the
language identifier this would result in the following:
#!/usr/bin/env ruby
require 'opener/daemons'
controller = Opener::Daemons::Controller.new(
:name => 'opener-language-identifier',
:exec_path => File.expand_path('../../exec/language-identifier.rb', __FILE__)
)
controller.run
Next, create an executable file exec/<component>.rb, for example
exec/language-identifier.rb, with the following content:
#!/usr/bin/env ruby
require 'opener/daemons'
require_relative '../lib/opener/<component>'
daemon = Opener::Daemons::Daemon.new(Opener::<constant>)
daemon.start
Replace <component> with the component name and <constant> with the
corresponding constant. For example, for the language identifier:
#!/usr/bin/env ruby
require 'opener/daemons'
require_relative '../lib/opener/language_identifier'
daemon = Opener::Daemons::Daemon.new(Opener::LanguageIdentifier)
daemon.start
Extra arguments for the component can be specified as a Hash in the second
argument of the Daemon.new method:
daemon = Opener::Daemons::Daemon.new(Opener::LanguageIdentifier, :kaf => false)
These options will be passed to every individual instance of the component.
Requirements
- A supported Ruby version (see below)
- Amazon SQS
- Amazon S3
- libarchive (for running the tests and such). On Debian/Ubuntu based systems
this can be installed using:
sudo apt-get install libarchive-dev
The following Ruby versions are supported:
| Ruby | Required | Recommended |
|---|---|---|
| MRI | >= 1.9.3 | >= 2.1.4 |
| Rubinius | >= 2.2 | >= 2.3.0 |
| JRuby | >= 1.7 | >= 1.7.16 |
Installation
Install it from RubyGems:
gem install opener-daemons
Or using Bundler:
# add this to your Gemfile
gem 'opener-daemons'
# then run this
bundle install
Job Format
Jobs should be serialized as JSON and should adhere to the JSON schema definition schema/sqs_input.json. In short, a job is a JSON object with the following fields:
- input_url: the input URL
- callbacks: an array of URLs
- identifier: a unique identifier to use for the file stored in S3, if no value is given an identifier will be generated automatically
- metadata: an object containing arbitrary metadata, will be passed to every callback URL
An example:
{
"input_url": "http://example.com/my-kaf.xml",
"callbacks": ["http://example.com/my-callback"],
"identifier": "foo123",
"metadata": {
"customer_id": 123
}
}
For more specific details see the schema.
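As a rough illustration of the format above, the following sketch builds a job body using only Ruby's standard library. The `build_job` helper and the `MAX_JOB_BYTES` constant are hypothetical names and are not part of this Gem. The sketch does not validate against schema/sqs_input.json, and sending the result to SQS is left out.

```ruby
require 'json'

# SQS message size limit mentioned in the introduction (256 KB).
MAX_JOB_BYTES = 256 * 1024

# Hypothetical helper (not part of opener-daemons): builds the JSON body of a
# single job from the fields described above.
def build_job(input_url, callbacks, options = {})
  job = { 'input_url' => input_url, 'callbacks' => callbacks }

  # Optional fields are only included when a value is given. When the
  # identifier is left out the daemon generates one automatically.
  job['identifier'] = options[:identifier] if options[:identifier]
  job['metadata']   = options[:metadata] if options[:metadata]

  body = JSON.generate(job)

  if body.bytesize > MAX_JOB_BYTES
    raise ArgumentError, 'job exceeds the SQS size limit'
  end

  body
end

payload = build_job(
  'http://example.com/my-kaf.xml',
  ['http://example.com/my-callback'],
  :identifier => 'foo123',
  :metadata   => { 'customer_id' => 123 }
)

puts payload
```

Checking the size before submitting a job keeps oversized messages from being rejected later on.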
Output
Daemon output is stored in an Amazon S3 bucket. Output files are named
<identifier>.xml, where <identifier> is the unique identifier of the
document. The content type of these documents is set to application/xml.
Metadata associated with the job (as specified in the metadata field) is saved
as metadata of the S3 object.
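The naming rule above can be sketched as follows. The `output_object` helper is a hypothetical name used for illustration only. It mirrors the description in this section and is not the Gem's actual upload code, and it assumes the object name has no additional prefix.

```ruby
# Hypothetical helper (not part of opener-daemons): describes the S3 object
# that would be stored for a job, following the rules described above.
def output_object(identifier, metadata = {})
  {
    :key          => "#{identifier}.xml",  # <identifier>.xml
    :content_type => 'application/xml',    # content type of the document
    :metadata     => metadata              # taken from the job's metadata field
  }
end

object = output_object('foo123', { 'customer_id' => 123 })

puts object[:key]
puts object[:content_type]
```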
Callback URLs will receive the URL of an uploaded document, not the actual content itself. The S3 URLs are only valid for a limited time (currently 1 hour) so callbacks must ensure they can process the input within that time limit.
Monitoring
Components using this Gem can measure performance using New Relic and report errors using Rollbar. To support this the following two environment variables must be set:
- NEWRELIC_TOKEN
- ROLLBAR_TOKEN
For New Relic the application names will be opener-<component> where
<component> is the component name, as defined by a component itself. If one of
these environment variables is not set the corresponding feature is disabled.
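To check up front which monitoring features would be active, a small sketch such as the one below can help. The `monitoring_features` and `token_set?` helpers are hypothetical names and do not reflect how the Gem decides this internally. Treating an empty value as "not set" is an assumption of this sketch.

```ruby
# Hypothetical helper: true when a token has a non-blank value.
# Assumption: an empty or whitespace-only value counts as "not set".
def token_set?(value)
  !value.nil? && !value.strip.empty?
end

# Hypothetical helper (not part of opener-daemons): reports which monitoring
# features would be enabled for the given environment.
def monitoring_features(env = ENV)
  {
    :new_relic => token_set?(env['NEWRELIC_TOKEN']),
    :rollbar   => token_set?(env['ROLLBAR_TOKEN'])
  }
end

monitoring_features.each do |feature, enabled|
  puts "#{feature}: #{enabled ? 'enabled' : 'disabled'}"
end
```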
CLI Options
Each daemon takes a set of options that can be used to configure the input
queue, the S3 bucket and so forth. For an up to date list of these options and
their descriptions run a daemon using the --help option.
Some of these options set environment variables that can be used by components. These are as follows:
- input: sets the input queue in the INPUT_QUEUE variable
- threads: sets the number of threads to use in the DAEMON_THREADS variable
- bucket: sets the S3 bucket to use for output documents in the OUTPUT_BUCKET variable
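A component could read these variables along the lines of the sketch below. The `daemon_settings` helper is a hypothetical name. The sketch returns nil for variables that are not set instead of assuming defaults, since the actual defaults are best taken from the --help output.

```ruby
# Hypothetical helper (not part of opener-daemons): collects the settings that
# a daemon exposes to components through environment variables.
def daemon_settings(env = ENV)
  threads = env['DAEMON_THREADS']

  {
    :input_queue   => env['INPUT_QUEUE'],
    # Environment variables are always strings, so convert the thread count
    # explicitly (base 10). Unset variables are left as nil.
    :threads       => threads ? Integer(threads, 10) : nil,
    :output_bucket => env['OUTPUT_BUCKET']
  }
end

settings = daemon_settings({
  'INPUT_QUEUE'    => 'opener-language-identifier',
  'DAEMON_THREADS' => '10',
  'OUTPUT_BUCKET'  => 'opener-output'
})

puts settings[:input_queue]
puts settings[:threads]
puts settings[:output_bucket]
```

The queue and bucket names in the usage example are made up for illustration.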
Amazon Environment Variables
To properly configure the daemons for Amazon you should set the following environment variables:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION
If you're running this daemon on an EC2 instance then the first two environment
variables will be set automatically if the instance has an associated IAM
profile. The AWS_REGION variable must always be set.
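The rules above can be turned into a simple startup check, sketched below. The `missing_aws_variables` helper and the `iam_profile` flag are hypothetical. The flag has to be supplied by the caller, as the sketch does not detect whether it runs on EC2.

```ruby
# The two credential variables that an IAM profile makes unnecessary to set by hand.
AWS_CREDENTIAL_VARIABLES = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY]

# Hypothetical helper (not part of opener-daemons): returns the names of the
# AWS variables that still need to be set.
#
# iam_profile - set to true when running on an EC2 instance with an associated
#               IAM profile, in which case only AWS_REGION is checked.
def missing_aws_variables(env = ENV, iam_profile = false)
  names = iam_profile ? [] : AWS_CREDENTIAL_VARIABLES.dup

  # AWS_REGION must always be set.
  names << 'AWS_REGION'

  names.select { |name| env[name].nil? || env[name].empty? }
end

missing = missing_aws_variables

unless missing.empty?
  warn "Missing environment variables: #{missing.join(', ')}"
end
```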