Rodimus


ETL stands for Extract-Transform-Load. Sometimes you have data in Source A that needs to be moved to Destination B, and it needs to be manipulated along the way. This is a common scenario when working with a data warehouse. There are lots of ETL solutions in the wild, but very few of them are open source, and none of them (that I know of) is written in Ruby. So, I started hacking on one for my own use.

Why the name? Rodimus Prime is one of the leaders of the Autobots, and he has a cool name. Naming a data transformation library after a Transformer increases the coolness factor. It's science.

Installation

Add this line to your application's Gemfile:

gem 'rodimus'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rodimus

Usage

tl;dr: See the examples directory for the quickest path to success.

require 'rodimus'
require 'csv'
require 'json'

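class CsvInput < Rodimus::Step
  # Runs before the step starts: read from the CSV file instead of a pipe.
  def before_run_set_incoming
    @incoming = CSV.open('examples/worldbank-sample.csv')
    @incoming.readline # skip the headers
  end

  # Serialize each CSV row (an array of fields) as JSON for the next step.
  def process_row(row)
    row.to_json
  end
end

class FormattedText < Rodimus::Step
  # Runs before the step starts: write to standard out instead of a pipe.
  def before_run_set_stdout
    @outgoing = STDOUT.dup
  end

  # Parse the JSON array from the previous step and format it as a sentence.
  def process_row(row)
    data = JSON.parse(row)
    "In #{data.first} during #{data[1]}, CO2 emissions were #{data[2]} metric tons per capita."
  end
end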

t = Rodimus::Transformation.new
s1 = CsvInput.new
s2 = FormattedText.new
t.steps << s1
t.steps << s2
t.run
puts "Transformation complete!"

A transformation is an operation that consists of many steps. Each step may manipulate the data in some way. Typically, the first step is reserved for reading from your data source, and the last step is used to write to the new destination.

In Rodimus, you create a transformation object, and then you add one or more steps to its array of steps. You typically create steps by writing your own classes that inherit from Rodimus::Step. When the transformation is subsequently run, a new process is forked for each step. On platforms that support native threads (JRuby, Rubinius), threads are used instead of forking processes. Adjacent steps are connected to each other by pipes. The exceptions are the two ends of the chain: the first step reads from the data source rather than from a pipe, and the last step writes to the destination rather than to a pipe. Each step consumes rows of data from its incoming side, performs some operation on each row, and writes the result to its outgoing side.
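To make the pipe wiring concrete, here is a minimal sketch of the same idea in plain Ruby. It is not Rodimus's actual implementation: run_pipeline is a hypothetical helper, each step is just a lambda rather than a Rodimus::Step, and it uses threads everywhere instead of forking so that it runs on any Ruby.

```ruby
require 'stringio'

# Illustrative sketch only; run_pipeline is a made-up helper, not part of Rodimus.
# input:  any object that responds to each_line (the data source)
# steps:  lambdas that turn one input line into one output line
# output: any object that responds to puts (the destination)
def run_pipeline(input, steps, output)
  workers = []
  reader = input
  steps.each_with_index do |step, index|
    last = index == steps.size - 1
    # Every step except the last writes into a fresh pipe;
    # the last step writes straight to the destination.
    next_reader, writer = last ? [nil, output] : IO.pipe
    workers << Thread.new(reader, writer, last) do |incoming, outgoing, is_last|
      incoming.each_line { |line| outgoing.puts(step.call(line.chomp)) }
      # Closing the write end signals end-of-input to the next step.
      outgoing.close unless is_last
    end
    reader = next_reader
  end
  workers.each(&:join)
end

source = StringIO.new("a\nb\n")
sink = StringIO.new
upcase = ->(line) { line.upcase }
label = ->(line) { "row: #{line}" }
run_pipeline(source, [upcase, label], sink)
print sink.string
```

The first step reads from the source, the last writes to the sink, and everything in between only ever sees a pipe, which is why steps can be reordered or swapped without knowing about each other.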

There are several methods on the Rodimus::Step class that can be overridden for custom processing behavior before, during, or after each row is handled. If those aren't enough, you're also free to manipulate the input/output objects (e.g. to redirect to standard out).

The Rodimus approach is to provide a minimal, flexible framework upon which custom ETL solutions can be built. ETL is complex, and there tend to be many subtle differences between projects which can make things like establishing conventions and encouraging code reuse difficult. Rodimus is an attempt to codify those things which are probably useful to a majority of ETL projects with as little overhead as possible.

If you'd like to know the thought process behind Rodimus, check out this blog post.

Contributing

  1. Fork it ( http://github.com/nevern02/rodimus/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request