Module: IMW::Workflow

Includes:
Rake::TaskManager
Included in:
Dataset
Defined in:
lib/imw/dataset/workflow.rb

Overview

IMW encourages you to view a data transformation as a series of interdependent steps.

By default, IMW defines four main steps in such a transformation: rip, parse, fix, and package.

Each step is associated with a directory on disk in which it keeps its files: ripd, prsd, fixd, and pkgd.

The steps are:

rip

Obtain data via HTTP, FTP, SCP, RSYNC, database query, &c., and store the results in ripd.

parse

Parse data into a structured form using a library (JSON, YAML, &c.) or using your own parser (XML, flat files, &c.) and store the results in prsd.

fix

Combine, filter, reconcile, and transform already structured data into a desired form and store the results in fixd.

package

Archive, compress, and deliver data in its final form to some location (HTTP, FTP, SCP, RSYNC, S3, EBS, &c.), optionally storing the output in pkgd.

Each step depends upon the one before it. The steps are blank by default so there’s no need to write code for steps you don’t need to use. You can also define your own steps (using task just like in Rake) and hook them into these pre-defined steps (or not…).
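The dependency chain described above can be sketched with plain Rake. This is only an illustration: it does not load IMW, and the `ran` array and the step bodies are placeholders invented for the example. It assumes the `rake` gem is available.

```ruby
require 'rake'

# Use a fresh Rake application so the sketch does not touch any global tasks.
Rake.application = Rake::Application.new

ran = [] # records the order in which the steps execute

# Each step declares the previous step as its prerequisite.
Rake::Task.define_task(:rip)                 { ran << :rip }
Rake::Task.define_task(:parse   => [:rip])   { ran << :parse }
Rake::Task.define_task(:fix     => [:parse]) { ran << :fix }
Rake::Task.define_task(:package => [:fix])   { ran << :package }

# Invoking the last step runs every step it depends on first.
Rake::Task[:package].invoke

p ran # => [:rip, :parse, :fix, :package]
```

A step with no block attached would still take part in the chain, which is why unused steps can simply be left blank.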

A dataset also has an :initialize task, which by default just creates the directories for these steps. You can hook in your own initialization tasks by making :initialize depend on them.
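As a rough sketch of hooking into :initialize, plain Rake lets you add a prerequisite to a task that already exists by declaring it again. The `:fetch_credentials` task below is hypothetical, and the stand-in `:initialize` body only records that it ran; IMW itself is not loaded here.

```ruby
require 'rake'

Rake.application = Rake::Application.new

ran = [] # records execution order

# Stand-in for the dataset's default :initialize task.
Rake::Task.define_task(:initialize) { ran << :initialize }

# Hypothetical custom initialization task.
Rake::Task.define_task(:fetch_credentials) { ran << :fetch_credentials }

# Declaring :initialize again adds a prerequisite to the existing task
# instead of replacing it.
Rake::Task.define_task(:initialize => [:fetch_credentials])

Rake::Task[:initialize].invoke

p ran # => [:fetch_credentials, :initialize]
```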

A subclass of IMW::Dataset can customize how tasks are defined by overriding define_workflow_tasks, among other methods, and introduce new tasks by overriding define_tasks.

Constant Summary

DEFAULT_OPTIONS =

Default options passed to Rake. Any class including the Rake::TaskManager module must define a constant by this name.

{
  :dry_run => false,
  :trace   => false,
  :verbose => false
}

Instance Method Details

#define_tasks ⇒ Object

Override this method to define default tasks for a subclass of IMW::Dataset.



# File 'lib/imw/dataset/workflow.rb', line 106

def define_tasks
end

#file(path, &block) ⇒ IMW::FileTask

Return a new (or existing) IMW::FileTask with the given path. Dependencies can be declared and a block passed in just as in Rake.

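Parameters:

  • path (String, #path)

    the path of the file, or an object that responds to path

Returns:

  • (IMW::FileTask)

    the new or existing task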



# File 'lib/imw/dataset/workflow.rb', line 88

def file path, &block
  path = path.respond_to?(:path) ? path.path : path
  self.define_task IMW::FileTask, path, &block
end

#file_create(path, &block) ⇒ IMW::FileCreationTask

Return a new (or existing) IMW::FileCreationTask with the given path. Dependencies can be declared and a block passed in just as in Rake.

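Parameters:

  • path (String, #path)

    the path of the file, or an object that responds to path

Returns:

  • (IMW::FileCreationTask)

    the new or existing task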



# File 'lib/imw/dataset/workflow.rb', line 99

def file_create path, &block
  path = path.respond_to?(:path) ? path.path : path
  self.define_task IMW::FileCreationTask, path, &block
end
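Both file and file_create begin by normalizing their argument: anything that responds to path is replaced by the value of that method, and anything else is used as given. The sketch below isolates that one line; `normalize_path` and `Resource` are hypothetical names used only for illustration.

```ruby
# Hypothetical helper mirroring the first line of #file and #file_create.
def normalize_path(path)
  path.respond_to?(:path) ? path.path : path
end

# Hypothetical stand-in for any object that exposes a #path method.
Resource = Struct.new(:path)

from_string = normalize_path('/tmp/ripd/data.csv')
from_object = normalize_path(Resource.new('/tmp/ripd/data.csv'))

p from_string # => "/tmp/ripd/data.csv"
p from_object # => "/tmp/ripd/data.csv"
```

Because the check is duck-typed, callers can pass either a plain path or a richer object without converting it first.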

#task(deps, &block) ⇒ IMW::Task

Return a new (or existing) IMW::Task with the given name. Dependencies can be declared and a block passed in just as in Rake.

Parameters:

  • deps (Symbol, String, Hash)

    the name of the task (if a Symbol or String) or the name of the task mapped to an Array of dependencies (if a Hash)

Returns:

  • (IMW::Task)

    the new or existing task



# File 'lib/imw/dataset/workflow.rb', line 78

def task deps, &block
  self.define_task IMW::Task, deps, &block
end
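The accepted forms of the deps argument follow Rake's own conventions, so they can be illustrated with plain Rake. This sketch uses `Rake::Task.define_task` rather than IMW's `task`, on the assumption that the two treat their arguments alike; the variable names are illustrative.

```ruby
require 'rake'

Rake.application = Rake::Application.new

# A Symbol or a String names a task with no dependencies.
by_symbol = Rake::Task.define_task(:rip)
by_string = Rake::Task.define_task('parse')

# A Hash maps the task name to an Array of dependencies.
by_hash = Rake::Task.define_task(:fix => [:rip, 'parse'])

p by_hash.name          # => "fix"
p by_hash.prerequisites # => ["rip", "parse"]

# Defining a task under an existing name returns the existing task object.
again = Rake::Task.define_task(:rip)
p again.equal?(by_symbol) # => true
```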

#workflow_dirs ⇒ Array

Each step of the IMW workflow corresponds to a directory in which, by convention, it deposits its files once it has finished processing: ripped files end up in the ripd directory, packaged files in the pkgd directory, and so on.

Returns:

  • (Array)

    the workflow directory names



# File 'lib/imw/dataset/workflow.rb', line 123

def workflow_dirs
  [:ripd, :rawd,  :fixd, :pkgd]
end

#workflow_steps ⇒ Array

The standard IMW workflow steps.

Returns:

  • (Array)

    the workflow step names



# File 'lib/imw/dataset/workflow.rb', line 112

def workflow_steps
  [:rip,  :parse, :fix, :package]
end
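If the two arrays are assumed to line up position by position, a step-to-directory lookup can be built by zipping them. The values below are copied from the workflow_steps and workflow_dirs listings (which name rawd as the second directory); the `step_dirs` variable is illustrative and not part of IMW.

```ruby
# Values copied from the workflow_steps and workflow_dirs listings.
steps = [:rip, :parse, :fix, :package]
dirs  = [:ripd, :rawd, :fixd, :pkgd]

# Pair each step with the directory at the same position (an assumption).
step_dirs = steps.zip(dirs).to_h

p step_dirs[:rip]     # => :ripd
p step_dirs[:package] # => :pkgd
```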