Why?

There is some interest in the bioinformatics community in using rake as a workflow tool (see e.g. this blog post from BioinformaticsZen).

Rake could be ideal for this type of work: a typical workflow will take data and perform a first set of conversions on it (i.e. a task), followed by a second set of conversions (that is dependent on the first task), and so on. And obviously, bioinformaticians want to keep their data in databases rather than files…

A typical Rakefile could look like this:


 # Task names are strings here because a Ruby symbol cannot start with a digit.
 task "001_load_data" do
   # load data in database
 end

 task "002_calculate_averages" => ["001_load_data"] do
   # update data in database
 end

 task "003_make_histogram_of_averages" => ["002_calculate_averages"] do
   # do stuff
 end

The trouble is that plain rake has no way to check whether such a task has to be rerun, because these tasks have no timestamps. Regular rake will rerun all three tasks from the example above, regardless of whether some of them have already been completed.

BioRake adds this task timestamp functionality to rake for working with databases. The functionality needed is very similar to the one available for FileTasks.

So if we had reloaded the data (001), the timestamp for that task in the metadata would be later than the one for task 002. As a result, task 002 would automatically have to be rerun if we were to run task 003.
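The comparison can be sketched in plain Ruby. This is only an illustration, not BioRake's code: the `out_of_date?` helper, the two hashes and the timestamps are made up for the example.

```ruby
# Hypothetical completion times, as they might be recorded in the metadata.
# Task 001 was reloaded after task 002 last ran.
completed_at = {
  "001_load_data"          => Time.at(300),
  "002_calculate_averages" => Time.at(200),
}

# Hypothetical prerequisite table for the two tasks.
prerequisites = {
  "001_load_data"          => [],
  "002_calculate_averages" => ["001_load_data"],
}

# A task is out of date if it never ran, or if any prerequisite
# finished later than the task itself.
def out_of_date?(name, completed_at, prerequisites)
  own = completed_at[name]
  return true if own.nil?

  prerequisites.fetch(name, []).any? do |pre|
    pre_time = completed_at[pre]
    pre_time.nil? || pre_time > own
  end
end

puts out_of_date?("001_load_data", completed_at, prerequisites)          # false
puts out_of_date?("002_calculate_averages", completed_at, prerequisites) # true
```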

Implementation

I’ve started to implement an additional type of task, called event. The above snippet from a Rakefile would actually contain


 # Event names are strings here because a Ruby symbol cannot start with a digit.
 event "001_load_data" do
   ...
 end
 
 event "002_calculate_averages" => ["001_load_data"] do
   ...
 end

instead of using the task keyword.

Similar to a FileTask, timestamps are used to check whether certain tasks have to be re-run. FileTasks have the advantage that every file already has a timestamp; events do not. To make up for this, the completion time of each event is stored as metadata in the .rake directory inside the current directory.

An event task automatically:

checks the metadata to see if the task has already been run
if so: are there any prerequisites with timestamps that are newer than the task itself?
(re)run the task if necessary
update the metadata
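The four steps above could look roughly like the sketch below. It is not BioRake's implementation: the `SketchEvent` class, its registry, and the one-file-per-event metadata layout holding a nanosecond timestamp are all assumptions made for illustration, and a temporary directory stands in for the real .rake directory.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical stand-in for an event task; not BioRake's actual class.
class SketchEvent
  REGISTRY = {}

  def initialize(name, prerequisites, metadata_dir, &action)
    @name = name
    @prerequisites = prerequisites
    @metadata_dir = metadata_dir
    @action = action
    REGISTRY[name] = self
  end

  # Step 1: look up the recorded completion time (nil if never run).
  def timestamp
    File.exist?(stamp_file) ? Integer(File.read(stamp_file)) : nil
  end

  # Returns true if the action was (re)run, false if it was skipped.
  def invoke
    pres = @prerequisites.map { |n| REGISTRY.fetch(n) }
    pres.each(&:invoke) # bring prerequisites up to date first
    own = timestamp
    # Step 2: is any prerequisite newer than this event?
    return false unless own.nil? || pres.any? { |pre| pre.timestamp > own }

    @action.call                     # Step 3: (re)run the task
    FileUtils.mkdir_p(@metadata_dir) # Step 4: update the metadata
    now = Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond)
    File.write(stamp_file, now.to_s)
    true
  end

  private

  def stamp_file
    File.join(@metadata_dir, @name)
  end
end

# Runs the three-step workflow three times and returns the order of executed actions.
def demo
  log = []
  Dir.mktmpdir do |dir|
    meta = File.join(dir, ".rake")
    SketchEvent.new("001_load_data", [], meta) { log << "001" }
    SketchEvent.new("002_calculate_averages", ["001_load_data"], meta) { log << "002" }
    last = SketchEvent.new("003_make_histogram_of_averages",
                           ["002_calculate_averages"], meta) { log << "003" }

    last.invoke                                   # first run: everything runs
    last.invoke                                   # second run: nothing is out of date
    File.delete(File.join(meta, "001_load_data")) # forget that 001 was ever completed
    last.invoke                                   # 001 reruns, so 002 and 003 follow
  end
  log
end

p demo
```

Storing an explicit timestamp in each metadata file, rather than relying on the file's modification time, avoids depending on the filesystem's timestamp resolution in this sketch.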

To re-run all tasks from scratch, call Rake::EventTask.clean or simply run


rm -rf .rake

to reset the metadata to the state before any events occurred.

Status

Even though the tests seem to run and I’ve tried some things out, I can’t guarantee production-level stability (well: call it beta). Use at your own risk.

Sample

The sample/ directory contains an example Rakefile. Suppose a researcher has intensities for a group of individuals on a number of probes. This information should be loaded into a database with the tables individuals, probes and intensities.

As the intensities table contains foreign keys for individual and probe, the individuals and probes tables have to be loaded before the intensities can be loaded.

In rake-speak, this would look like:


event :load_probes do
  # load the actual data
end

event :load_individuals do
  # load the actual data
end

event :load_intensities => [:load_probes, :load_individuals] do
  # load the actual data
end

In a later step, the researcher might want to calculate the average intensity per probe. This would be a new task that depends on the intensities being loaded:


event :calculate_averages => [:load_intensities] do
  # calculate averages and store in probes table
end

Here, we call the database that will contain the data sample.sqlite3. The metadata about completed events is stored in the .rake directory.

Try a rake -T…