data_miner

Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Real-world usage

We use data_miner for data science at Brighter Planet and in production at

The killer combination for us is:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata - apply corrections in a transparent way
  4. data_miner (this library!) - import data idempotently

Documentation

Check out the extensive documentation.

Quick start

You define data_miner blocks in your ActiveRecord models. For example, in app/models/country.rb:

class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  # the "col" class method is provided by a different library - active_record_inline_schema
  col :iso_3166_code                            # alpha-2 2-letter like GB
  col :iso_3166_numeric_code, :type => :integer # numeric like 826; aka UN M49 code
  col :iso_3166_alpha_3_code                    # 3-letter like GBR
  col :name

  data_miner do
    # auto_upgrade! is provided by active_record_inline_schema
    process :auto_upgrade!

    import("OpenGeoCode.org's Country Codes to Country Names list",
           :url => 'http://opengeocode.org/download/countrynames.txt',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key   :iso_3166_code, :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name, :field_number => 5
    end
  end
end
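
To make the import options above more concrete, here is a minimal plain-Ruby sketch of what `:delimiter`, `:skip`, and `:field_number` roughly amount to. It is an illustration under stated assumptions, not the code data_miner or remote_table actually runs: `parse_rows` is a hypothetical helper, the sample text is made up (fields 3 and 4 are placeholders), and type casting such as `:type => :integer` is left out.

```ruby
# Illustrative sketch only: plain Ruby, not data_miner's or remote_table's code.
# parse_rows is a hypothetical helper introduced for this example.
def parse_rows(text, delimiter, skip, field_map)
  # Drop the leading lines (like :skip), then split each line (like :delimiter).
  text.lines.drop(skip).map do |line|
    fields = line.chomp.split(delimiter)
    # Pick fields by zero-based position (like :field_number).
    field_map.each_with_object({}) do |(attr, idx), row|
      row[attr] = fields[idx]
    end
  end
end

# Made-up sample input; only the first three fields and the name are meaningful.
sample_text = [
  "# header line to skip",
  "# another header line to skip",
  "GB; GBR; 826; placeholder; placeholder; United Kingdom",
  "FR; FRA; 250; placeholder; placeholder; France"
].join("\n")

# Mirrors the key/store lines in the Country model.
field_map = {
  :iso_3166_code         => 0,
  :iso_3166_alpha_3_code => 1,
  :iso_3166_numeric_code => 2,
  :name                  => 5
}

rows = parse_rows(sample_text, '; ', 2, field_map)
rows.each { |row| puts row.inspect }
```

Values stay strings in this sketch; in the real model the column type declared with `col` decides how they are stored.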

Now you can run:

>> Country.run_data_miner!
=> nil
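
Since the import is described above as idempotent, running it a second time should leave the table unchanged. The following plain-Ruby sketch shows the general idea of a keyed find-or-create followed by an overwrite. It makes no claim about data_miner's internals: a Hash stands in for the database table and `upsert_rows` is a hypothetical helper.

```ruby
# Illustrative sketch only: a Hash stands in for the database table.
# upsert_rows is a hypothetical helper, not part of data_miner.
def upsert_rows(table, rows, key)
  rows.each do |row|
    # Find-or-create the record by its key, then overwrite its attributes.
    record = (table[row[key]] ||= {})
    record.update(row)
  end
  table
end

rows = [
  { :iso_3166_code => 'GB', :name => 'United Kingdom' },
  { :iso_3166_code => 'FR', :name => 'France' }
]

table = {}
upsert_rows(table, rows, :iso_3166_code)
first_run = Marshal.load(Marshal.dump(table)) # deep copy for comparison
upsert_rows(table, rows, :iso_3166_code)

puts table.size          # record count after the second run
puts table == first_run  # whether the second run changed anything
```

Because each row is matched on its key rather than appended, repeating the run does not create duplicates.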

More advanced usage

The earth library has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

Model | Highlights | Reference
Aircraft | parsing Microsoft Frontpage HTML (!) | data_miner.rb
Airports | forcing column names and use of :select block (Proc) | data_miner.rb
Automobile model variants | super advanced usage of "custom parser" and errata | data_miner.rb
Country | parsing CSV and a few other tricks | data_miner.rb
EGRID regions | parsing XLS | data_miner.rb
Flight segment (stage) | super advanced usage of POSTing form data | data_miner.rb
Zip codes | downloading a ZIP file and pulling an XLSX out of it | data_miner.rb

And many more: look for the data_miner.rb file that corresponds to each model. Note that you would normally put the data_miner declaration right inside the ActiveRecord model file; it's kept separate in earth so that loading it is optional.

Authors

Wishlist

  • Make the tests real unit tests
  • sql steps shouldn't shell out if binaries are missing

Copyright

Copyright (c) 2013 Seamus Abshere