What is the Infinite Monkeywrench?

The Infinite Monkeywrench (IMW) is a Ruby framework that simplifies the tasks of acquiring, extracting, transforming, loading, and packaging data. It has the following goals:

  • Minimize programmer time even at the expense of increasing run time.

  • Take data through a full transformation from raw source to packaged purity in as few lines of code as possible.

  • Treat data records as objects as much as possible.

  • Reuse, rather than rewrite, better code that already exists in other libraries (FasterCSV, I’m talkin’ to you).

  • Make what’s common easy without making what’s uncommon impossible.

  • Work with messy data as well as clean data.

  • Let you incorporate your own tools wherever you choose to.

The Infinite Monkeywrench is a powerful tool, but it is not always the right tool. IMW is not designed for:

  • Scraping vast amounts of data (use Wuclan and Monkeyshines)

  • Really, really big datasets (use Wukong and Hadoop)

  • Data mining or statistical analysis

  • Visualization

Installation

IMW is hosted on Gemcutter so it’s easy to install.

You’ll have to add http://gemcutter.org to your gem sources if it isn’t there already:

$ gem sources -a http://gemcutter.org

and then install IMW

$ sudo gem install imw

In all the examples that follow it is assumed that you’ve installed IMW and required it in a script via

require 'rubygems'
require 'imw'

Resources

IMW is centered around processing resources. A resource can be anything with a URI and you create one using IMW.open.

csv     = IMW.open('/path/to/my_data.csv')
html    = IMW.open('http://www.example.com/history/march_2007')

IMW dynamically extends a resource with modules appropriate to it when you open it. In the above case, csv would be automatically extended by the IMW::Formats::Csv module, among others:

csv.resource_modules
=> [IMW::Schemes::Local::Base, IMW::Schemes::Local::LocalFile, IMW::CompressedFiles::Compressible, IMW::Formats::Csv]

while html will use a different set:

html.resource_modules
=> [IMW::Schemes::Remote::Base, IMW::Schemes::Remote::RemoteFile, IMW::Schemes::HTTP, IMW::Formats::Html]

Consult the documentation for the modules a resource uses to learn what it can do.

Including/Excluding Resource Modules

You can exercise finer control over which resource modules IMW extends a given resource with by passing the :as and :without options.

IMW.open('http://www.infochimps.com/some_raw_data', :without => [IMW::Formats::Html]).resource_modules
=> [IMW::Schemes::Remote::Base, IMW::Schemes::Remote::RemoteFile, IMW::Schemes::HTTP]

IMW.open('http://www.infochimps.com', :as => [IMW::Formats::Json]).resource_modules
=> [IMW::Schemes::Remote::Base, IMW::Schemes::Remote::RemoteFile, IMW::Schemes::HTTP, IMW::Formats::Json]

You can also pass :no_modules to not use any resource modules.
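One way to picture how these options could combine with automatic detection is as simple list arithmetic. The sketch below is plain Ruby and not IMW's implementation; the stand-in modules and the select_modules helper are hypothetical names made up for illustration.

```ruby
# Hypothetical stand-ins for resource modules; these are NOT IMW's own modules.
module HttpScheme; end
module HtmlFormat; end
module JsonFormat; end

# Combine auto-detected modules with caller overrides (assumed semantics):
# :as adds modules, :without removes them, :no_modules disables everything.
def select_modules(detected, options = {})
  return [] if options[:no_modules]
  (detected + Array(options[:as])).uniq - Array(options[:without])
end

p select_modules([HttpScheme, HtmlFormat], :without => [HtmlFormat])  # [HttpScheme]
p select_modules([HttpScheme], :as => [JsonFormat])                   # [HttpScheme, JsonFormat]
p select_modules([HttpScheme, HtmlFormat], :no_modules => true)       # []
```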

Handlers and Custom Resource Modules

IMW chooses which resource modules to extend an IMW::Resource with by iterating through an array of handlers. It passes the resource to each handler, and the handler’s response (true/false) determines whether the resource is extended with the module accompanying that handler.

You can hook into this process by defining your own handlers. To define a handler that extends any resource whose URI ends in .xxx with MyModule:

IMW::Resource.register_handler MyModule, /\.xxx$/

You can also use a Proc instead of a Regexp for more control. If the Proc, called with the resource, returns a true value, then the resource will be extended by MyModule.

IMW::Resource.register_handler MyModule, Proc.new { |resource| resource.is_local? && resource.path =~ /\.xxx$/ }
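To make the handler loop concrete, here is a toy version in plain Ruby. It is a sketch of the general technique (a list of module/matcher pairs applied with Object#extend), not IMW's internals; ToyResource, XxxFormat, and LocalOnly are hypothetical names.

```ruby
# Hypothetical modules a handler might attach to a resource.
module XxxFormat
  def format_name
    "xxx"
  end
end

module LocalOnly; end

# A toy resource that extends itself according to registered handlers.
class ToyResource
  HANDLERS = []

  # matcher is a Regexp tested against the URI, or a Proc called with the resource.
  def self.register_handler(mod, matcher)
    HANDLERS << [mod, matcher]
  end

  attr_reader :uri

  def initialize(uri)
    @uri = uri
    HANDLERS.each do |mod, matcher|
      matched = matcher.is_a?(Regexp) ? (uri =~ matcher) : matcher.call(self)
      extend(mod) if matched
    end
  end
end

ToyResource.register_handler XxxFormat, /\.xxx$/
ToyResource.register_handler LocalOnly, Proc.new { |resource| !resource.uri.include?("://") }

puts ToyResource.new("/data/file.xxx").respond_to?(:format_name)  # true
puts ToyResource.new("/data/file.csv").respond_to?(:format_name)  # false
puts ToyResource.new("http://example.com/a.xxx").is_a?(LocalOnly) # false
```

Because the modules are attached per object, two resources of the same class can end up with different methods.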

Manipulating Paths

IMW holds a registry of paths that you can define on the fly or store in a configuration file. Defining paths once in the registry and then referring to them forever after by name helps keep your code flexible as well as portable.

IMW.add_path(:dropbox, "/var/www/public")
IMW.path_to(:dropbox)
=> "/var/www/public"

You can combine named references together dynamically.

IMW.add_path(:raw, :dropbox, "raw")
IMW.path_to(:raw)
=> "/var/www/public/raw"
IMW.path_to(:raw, "my/dataset")
=> "/var/www/public/raw/my/dataset"

Altering one path will update the others that refer to it:

IMW.add_path(:dropbox, "/data") # redefines :raw
IMW.path_to(:raw, "my/dataset")
=> "/data/raw/my/dataset" # not /var/www/public/raw/my/dataset
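The reason a redefinition propagates can be illustrated with a small registry that stores path segments unexpanded and resolves Symbols only when a path is requested. This is a sketch in plain Ruby under that assumption, not IMW's implementation; ToyPathRegistry is a hypothetical name.

```ruby
# A toy path registry. Symbols inside a path are looked up at read time,
# so redefining one name changes every path built on top of it.
class ToyPathRegistry
  def initialize
    @paths = {}
  end

  # Store the segments as given, without expanding Symbols yet.
  def add_path(name, *segments)
    @paths[name] = segments
  end

  # Expand Symbols recursively, then join everything into one path.
  def path_to(*segments)
    expanded = segments.map do |segment|
      segment.is_a?(Symbol) ? path_to(*@paths.fetch(segment)) : segment
    end
    File.join(*expanded)
  end
end

registry = ToyPathRegistry.new
registry.add_path(:dropbox, "/var/www/public")
registry.add_path(:raw, :dropbox, "raw")
puts registry.path_to(:raw, "my/dataset")  # /var/www/public/raw/my/dataset

registry.add_path(:dropbox, "/data")
puts registry.path_to(:raw, "my/dataset")  # /data/raw/my/dataset
```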

Files & Directories

Use IMW.open to open files. The object returned by IMW.open obeys the usual semantics of a File object but it has new methods to manipulate and parse the file.

f1 = IMW.open("/path/to/file")
f1.read() # does what you think

# methods familiar from File are available
f1.size
f1.writable?

# use a bang or a 'w' to write
writable_file = IMW.open!('/some/path') # similar to open('/some/path', 'w')

# as well as methods to manipulate the file on the filesystem
f2 = f1.cp("/new/path/to/file") # also try cp_to_dir
f1.exist? # true
f3 = f1.mv("/yet/another/path") # also try mv_to_dir
f1.exist? # false

IMW also knows about directories

d = IMW.open('/tmp')
d.directory? # true
d['*'] # Dir['/tmp/*']
d.mv('/parent/dir')

Remote Files

Many operations defined for files are also defined for arbitrary URIs through the open-uri library.

Files can readily be opened, read, and downloaded from the Internet

site = IMW.open('http://infochimps.org') #=> Recognized as an HTML document
site.read() # does what you think
site.cp('/some/local/path')
site.exist? # will work in many cases

(Writing to remote sources isn’t enabled yet.)

Archives & Compressed Files

IMW works with a variety of archiving and compression programs to make packaging/unpackaging data easy.

bz2   = IMW.open('/path/to/big_file.bz2')
zip   = IMW.open('/path/to/archive.zip')
targz = IMW.open('/path/to/archive.tar.gz')

IMW recognizes file properties by extension

bz2.is_archive?      # false
bz2.is_compressed?   # true
zip.is_archive?      # true
zip.is_compressed?   # false
targz.is_archive?    # true
targz.is_compressed? # true

# decompress or compress files
big_file = bz2.decompress! # skip the ! to preserve the original
new_bz2  = big_file.compress!

# extract and package archives
zip.extract    # files show up in working directory
targz.extract  # no need to decompress first
new_tarbz2 = IMW.open!('/new/archive.tar').create(['/path1', '/path/2']).compress!
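Recognizing archives and compressed files by extension can be sketched with two regular expressions. The patterns and helper names below are assumptions chosen to reproduce the answers shown above; IMW's own extension lists may differ.

```ruby
# Hypothetical extension patterns (not taken from IMW's source).
ARCHIVE_PATTERN    = /\.(tar|zip)(\.(gz|bz2))?$|\.(tgz|tbz2)$/
COMPRESSED_PATTERN = /\.(gz|bz2|tgz|tbz2)$/

# True when the path looks like a multi-file archive.
def archive?(path)
  !!(path =~ ARCHIVE_PATTERN)
end

# True when the path looks like it was run through a compressor.
def compressed?(path)
  !!(path =~ COMPRESSED_PATTERN)
end

%w[/path/to/big_file.bz2 /path/to/archive.zip /path/to/archive.tar.gz].each do |path|
  puts "#{path}: archive=#{archive?(path)} compressed=#{compressed?(path)}"
end
```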

Parsing and Emitting Data

IMW encourages you to work with native Ruby data structures as much as possible by providing methods to parse common data formats directly into Arrays, Hashes and Strings.

Some data formats (CSV, JSON, YAML) have a structure which trivially maps to Arrays, Hashes, and Strings and so these formats can immediately be parsed.

Other formats (XML, HTML, flat files, &c.) use data structures which do not map as readily to Arrays, Hashes, and Strings and so these will have to be parsed first.

Ruby-like Data Formats

These include delimited formats such as CSV and TSV as well as “restricted tree-like” formats like JSON and YAML.

For the case of delimited data, consider the following CSV file:

ID,Name,Genus,Species
001,Gray-bellied Night Monkey,Aotus,lemurinus
002,Panamanian Night Monkey,Aotus,zonalis
003,Hernández-Camacho's Night Monkey,Aotus,jorgehernandezi
004,Gray-handed Night Monkey,Aotus,griseimembra
005,Hershkovitz's Night Monkey,Aotus,hershkovitzi
006,Brumback's Night Monkey,Aotus,brumbacki
007,Three-striped Night Monkey,Aotus,trivirgatus
008,Spix's Night Monkey,Aotus,vociferans
009,Malaysian Lar Gibbon,Hylobates,lar lar
010,Carpenter's Lar Gibbon,Hylobates,lar carpenteri

It trivially maps to an Array of Arrays:

data = IMW.open('/path/to/monkeys.csv').load
puts data.class
=> Array
puts data.first.class
=> Array
data.each { |row| puts row.inspect }
=> ["ID", "Name", "Genus", "Species"]
   ["001", "Gray-bellied Night Monkey", "Aotus", "lemurinus"]
   ["002", "Panamanian Night Monkey", "Aotus", "zonalis"]
   ...
   ["010", "Carpenter's Lar Gibbon", "Hylobates", "lar carpenteri"]

Conversely, any array of arrays trivially maps to a delimited file. Here we write out all rows where the genus is Hylobates to a TSV file:

hylobates = data.find_all { |row| row[2] == 'Hylobates' }
hylobates.dump('/path/to/monkeys.tsv')

IMW automatically formats the output as TSV and writes it to the specified path.
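For comparison, the same round trip can be written with nothing but Ruby's standard csv library. This sketch does not use IMW at all; genus_rows_as_tsv is a hypothetical helper, and it returns the TSV text rather than writing a file.

```ruby
require 'csv'

# Parse CSV text, keep rows whose third column (Genus) matches, and
# return the surviving rows as tab-separated text.
def genus_rows_as_tsv(csv_text, genus)
  rows = CSV.parse(csv_text)                      # Array of Arrays of Strings
  kept = rows.find_all { |row| row[2] == genus }
  CSV.generate(col_sep: "\t") { |out| kept.each { |row| out << row } }
end

csv_text = <<~CSV
  ID,Name,Genus,Species
  008,Spix's Night Monkey,Aotus,vociferans
  009,Malaysian Lar Gibbon,Hylobates,lar lar
  010,Carpenter's Lar Gibbon,Hylobates,lar carpenteri
CSV

puts genus_rows_as_tsv(csv_text, 'Hylobates')
```

Note that the header row is dropped by the filter, just as it is in the IMW example, because its third field is "Genus" rather than "Hylobates".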

Similarly, restricted tree-like formats like JSON and YAML, which map cleanly onto Hashes, Arrays, and Strings, can also be automatically parsed and emitted by IMW.

Consider a YAML version of the above CSV data:

- id: "001"
  name: Gray-bellied Night Monkey
  genus: Aotus
  species: lemurinus
- id: "002"
  name: Panamanian Night Monkey
  genus: Aotus
  species: zonalis
- id: "003"
  name: Hernández-Camacho's Night Monkey
  genus: Aotus
  species: jorgehernandezi
...
- id: "010"
  name: Carpenter's Lar Gibbon
  genus: Hylobates
  species: lar carpenteri

This trivially maps to an Array of Hashes and so we can perform the exact same filtration for YAML and JSON as we did for CSV and TSV:

data      = IMW.open('/path/to/monkeys.yaml').load
hylobates = data.find_all { |monkey| monkey['genus'] == 'Hylobates' }
hylobates.dump('/path/to/monkeys.json')

Resources in these Ruby-like data formats also extend themselves with Enumerable so goodies like map, find_all, &c. are available. This enables converting YAML to JSON with a one-liner:

IMW.open('/path/to/monkeys.yaml').find_all { |monkey| monkey['genus'] == 'Hylobates' }.dump('/path/to/monkeys.json')
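If you want to see the mapping without IMW in the way, the standard yaml and json libraries are enough. The sketch below is plain Ruby, with hylobates_json as a hypothetical helper name; the IDs are quoted in the sample so that they load as Strings rather than numbers.

```ruby
require 'yaml'
require 'json'

# Load YAML text into an Array of Hashes, keep the Hylobates records,
# and return them as a JSON string.
def hylobates_json(yaml_text)
  monkeys  = YAML.safe_load(yaml_text)  # Hashes have String keys
  selected = monkeys.find_all { |monkey| monkey['genus'] == 'Hylobates' }
  JSON.generate(selected)
end

yaml_text = <<~YAML
  - id: "008"
    name: Spix's Night Monkey
    genus: Aotus
    species: vociferans
  - id: "010"
    name: Carpenter's Lar Gibbon
    genus: Hylobates
    species: lar carpenteri
YAML

puts hylobates_json(yaml_text)
```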

Parsing More General Data Formats

Some data formats are structured but do not map readily to Hashes, Arrays, and Strings (XML, HTML, &c.) while other data formats lack structure or have a peculiar structure (flat files in arbitrary syntax).

In both these cases the data needs to be parsed before it’s usable. For XML, HTML, and similar formats, IMW uses Hpricot and the IMW::Parsers::HtmlParser for parsing. For flat files, IMW provides the IMW::Parsers::LineParser and the IMW::Parsers::RegexpParser.

HTML files are more complex and typically have to be parsed before being converted to plain Ruby objects:

# Grab a tiny link from the bottom of Google's homepage
doc = IMW.open('http://www.google.com') # extended by IMW::Formats::Html
doc.parse('p a') # 'Privacy'

More complex parsers can also be built

# Grab each row from an HTML table
doc = IMW.open('/path/to/data.html')
doc.parse :employees => ["tr", { :name => "td.name", :address => "td.address" } ]
#=> [{:name => "John Chimpo", :address => "123 Fake St."}, {...}, ... ]

See IMW::Parsers::HtmlParser for details on parsing HTML (and similar) files. Examine the other parsers in IMW::Parsers for details on parsing other data formats.
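For flat files, the underlying idea of regexp-based parsing is to turn each matching line into a Hash. The sketch below shows that idea with plain Ruby named captures; it is not the IMW::Parsers::RegexpParser API, and the line format, LINE_PATTERN, and parse_flat are all hypothetical.

```ruby
# Each data line is assumed to look like: "<id>  <genus>  <species>".
LINE_PATTERN = /\A(?<id>\d+)\s+(?<genus>\w+)\s+(?<species>.+)\z/

# Return one Hash per line that matches the pattern; skip everything else.
def parse_flat(text)
  text.each_line.map do |line|
    match = LINE_PATTERN.match(line.chomp)
    match && match.names.zip(match.captures).to_h
  end.compact
end

flat = <<~TEXT
  001  Aotus      lemurinus
  # a comment line that the pattern ignores
  009  Hylobates  lar lar
TEXT

parse_flat(flat).each { |record| puts record.inspect }
```

Lines that fail to match are silently dropped here; for messy data you may prefer to collect them separately so they can be inspected.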

The IMW Workflow

The workflow of IMW can be roughly summarized as follows:

rip

Data is obtained from a source. IMW allows you to download data from the web, obtain it by querying databases, or use other services like rsync, ftp, &c. to pull it in from another computer.

parse

Data is parsed into Ruby objects and stored.

fix

All the parsed data is combined, reconciled, and further processed into a final form.

package

The data is archived and compressed as necessary and moved to an outbox, staging server, S3 bucket, &c.
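The four stages can be pictured as four small functions chained together. The sketch below uses only Ruby's standard library and is not IMW's task system; the function names mirror the stage names, and the inline string stands in for data that would normally be downloaded.

```ruby
require 'csv'
require 'json'
require 'zlib'
require 'tmpdir'

# rip: obtain raw data (a literal string stands in for a download).
def rip
  "id,genus\n001,Aotus\n009,Hylobates\n009,Hylobates\n"
end

# parse: turn raw text into Ruby objects (an Array of Hashes).
def parse(raw)
  CSV.parse(raw, headers: true).map(&:to_h)
end

# fix: reconcile the records into their final form (here: drop duplicates).
def fix(records)
  records.uniq
end

# package: serialize to JSON and gzip the result into the given path.
def package(records, path)
  Zlib::GzipWriter.open(path) { |gz| gz.write(JSON.generate(records)) }
  path
end

Dir.mktmpdir do |dir|
  out = package(fix(parse(rip)), File.join(dir, "monkeys.json.gz"))
  puts Zlib::GzipReader.open(out) { |gz| gz.read }
end
```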

Not all datasets

Datasets

Tasks & Dependencies

Directory Structure

Records

IMW on the Command Line

Repositories

Running Tasks