turbot-runner

Getting started

git submodule update --init
cd schema && git checkout master && cd ..

Updating the schema

cd schema && git pull --rebase && cd ..
git commit schema -m 'Pull in new schema'

Releasing a new version

Bump the version in lib/turbot_runner/version.rb according to the Semantic Versioning convention, then:

git commit lib/turbot_runner/version.rb -m 'Release new version'
rake release # requires Rubygems credentials

Finally, rebuild the Docker image.

Rough outline of how it works

TurbotRunner is responsible for running a scraper, transforming its data, and then validating and processing any output.

Work is coordinated by an instance of Runner. Most of the interesting work is done in Runner#run_script, which constructs a command like:

python transformer.py >transformer.out 2>transformer.err <scraper.out

This command is then passed to an instance of ScriptRunner which runs the command via system in a new thread. The main thread then monitors the output file, and processes each complete line of output.
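The run-and-monitor loop described above can be sketched as follows. This is a minimal illustration, not the gem's ScriptRunner: the helper name `each_complete_line`, the polling interval, and the use of a temporary file are all assumptions made for the example.

```ruby
require 'tempfile'

# Hypothetical helper (not part of turbot-runner): runs `command` in a
# background thread with stdout redirected to a file, and yields each
# complete (newline-terminated) line as it appears in that file.
def each_complete_line(command, poll_interval: 0.05)
  Tempfile.create('script.out') do |out|
    # The command runs via Kernel#system in a new thread.
    worker = Thread.new { system(command, out: out.path) }

    buffer = String.new
    offset = 0
    loop do
      # Check liveness before reading, so the final read after the command
      # exits still picks up everything it wrote.
      finished = !worker.alive?
      chunk = File.open(out.path, 'rb') { |f| f.seek(offset); f.read }
      offset += chunk.bytesize
      buffer << chunk

      # Only newline-terminated lines are passed on; a trailing partial
      # line stays in the buffer until it is completed.
      while (newline = buffer.index("\n"))
        yield buffer.slice!(0..newline).chomp.force_encoding(Encoding::UTF_8)
      end

      break if finished
      sleep poll_interval
    end
    worker.join
  end
end

each_complete_line('echo one; echo two') { |line| puts line }
```

Polling the file keeps the main thread free to stop reading at any point, which is what makes interrupting the run possible.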

A line is processed by an instance of Processor, which checks that the line is valid JSON, and then passes it on to the instance of a subclass of BaseHandler that was passed to the Runner when it was created.

The subclass of BaseHandler can implement any of the following methods:

handle_valid_record
handle_invalid_record
handle_invalid_json
handle_snapshot_ended
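A handler that overrides some of these methods might look like the sketch below. `SketchBaseHandler` is a hypothetical stand-in for the gem's BaseHandler so that the example runs on its own, and the argument lists are assumptions; check the BaseHandler source for the real signatures.

```ruby
# Hypothetical stand-in for the gem's BaseHandler: every hook is a no-op by
# default. The argument lists here are assumptions, not the gem's API.
class SketchBaseHandler
  def handle_valid_record(record); end
  def handle_invalid_record(record, error_message); end
  def handle_invalid_json(line); end
  def handle_snapshot_ended; end
end

# Overrides only the hooks it cares about; handle_snapshot_ended keeps the
# inherited no-op behaviour.
class CountingHandler < SketchBaseHandler
  attr_reader :valid_count, :problems

  def initialize
    @valid_count = 0
    @problems = []
  end

  def handle_valid_record(record)
    @valid_count += 1
  end

  def handle_invalid_record(record, error_message)
    @problems << "invalid record: #{error_message}"
  end

  def handle_invalid_json(line)
    @problems << "invalid JSON: #{line}"
  end
end
```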

If the Processor finds an invalid record, it interrupts the ScriptRunner and marks the run as having failed.

If handler.handle_valid_record raises an InterruptRun, the Processor catches it and interrupts the ScriptRunner, but does not mark the run as having failed.
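The two interruption paths can be illustrated with a small sketch. All class names below (`SketchInterruptRun`, `FakeScriptRunner`, `SketchProcessor`, `StopAfterFirstHandler`) are hypothetical stand-ins, and validation is reduced to a lambda; the gem's real Processor validates against the schema.

```ruby
require 'json'

# Hypothetical stand-ins; none of these are the gem's real classes.
class SketchInterruptRun < StandardError; end

class FakeScriptRunner
  attr_reader :interrupted

  def interrupt
    @interrupted = true
  end
end

class SketchProcessor
  attr_reader :failed

  # `validator` is any callable returning true for a valid record
  # (an assumption standing in for schema validation).
  def initialize(script_runner, handler, validator)
    @script_runner = script_runner
    @handler = handler
    @validator = validator
    @failed = false
  end

  def process(line)
    record = JSON.parse(line)
    if @validator.call(record)
      @handler.handle_valid_record(record)
    else
      # Invalid record: stop the script and fail the run.
      @handler.handle_invalid_record(record)
      @script_runner.interrupt
      @failed = true
    end
  rescue JSON::ParserError
    # The text does not say what else happens here, so the sketch only
    # notifies the handler.
    @handler.handle_invalid_json(line)
  rescue SketchInterruptRun
    # The handler asked to stop: interrupt the script without failing.
    @script_runner.interrupt
  end
end

# Example handler that stops the run after the first valid record.
class StopAfterFirstHandler
  def handle_valid_record(_record)
    raise SketchInterruptRun
  end

  def handle_invalid_record(_record); end
  def handle_invalid_json(_line); end
end
```

Raising from the handler lets the caller end a run early, for example once enough records have been collected, without that being reported as a failure.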

When the ScriptRunner is interrupted, it kills the running process by sending SIGINT to all the processes in the current process group. The current process is set up (via trap('INT') {}) to ignore this signal.

If the ScriptRunner reads no output from the command within a timeout (24 hours by default), it interrupts itself and marks the run as having failed.
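One way to track such a no-output timeout is sketched below. `OutputWatchdog` is a hypothetical name and the injectable clock is an assumption added so the behaviour can be exercised without waiting; the gem may implement this differently.

```ruby
# Hypothetical helper (not part of turbot-runner): remembers when output was
# last seen and reports whether the timeout has elapsed since then.
class OutputWatchdog
  DEFAULT_TIMEOUT = 24 * 60 * 60 # 24 hours, in seconds

  # `clock` is any callable returning the current time in seconds; a
  # monotonic clock is used by default so wall-clock changes are ignored.
  def initialize(timeout = DEFAULT_TIMEOUT,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @timeout = timeout
    @clock = clock
    @last_output_at = @clock.call
  end

  # Call whenever the monitored command produces output.
  def saw_output
    @last_output_at = @clock.call
  end

  # True once more than `timeout` seconds have passed without output.
  def timed_out?
    @clock.call - @last_output_at > @timeout
  end
end
```

A monitoring loop would call `saw_output` each time it reads new data and check `timed_out?` on every poll, interrupting the script and marking the run as failed when it returns true.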

Running the tests

The first two specs to run require some manual input.