data_miner

Mine remote data into your ActiveRecord models.

Quick start

Put this in config/environment.rb:

config.gem 'data_miner'

You need to define data_miner blocks in your ActiveRecord models. For example, in app/models/country.rb:

class Country < ActiveRecord::Base
  set_primary_key :iso_3166

  data_miner do
    import 'The official ISO country list', :url => 'http://www.iso.org/iso/list-en1-semic-3.txt', :skip => 2, :headers => false, :delimiter => ';' do
      key 'iso_3166'
      store 'iso_3166', :field_number => 1
      store 'name', :field_number => 0
    end

    import 'A Princeton dataset with better capitalization for some countries', :url => 'http://www.cs.princeton.edu/introcs/data/iso3166.csv' do
      key 'iso_3166'
      store 'iso_3166', :field_name => 'country code'
      store 'name', :field_name => 'country'
    end
  end
end

…and in app/models/airport.rb:

class Airport < ActiveRecord::Base
  set_primary_key :iata_code

  data_miner do
    import :url => 'http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat', :headers => false, :select => lambda { |row| row[4].present? } do
      key 'iata_code'
      store 'name', :field_number => 1
      store 'city', :field_number => 2
      store 'country_name', :field_number => 3
      store 'iata_code', :field_number => 4
      store 'latitude', :field_number => 6
      store 'longitude', :field_number => 7
    end
  end
end
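To make the import options in the two models above more concrete, here is a minimal plain-Ruby sketch of one way to read them. It is not data_miner's implementation: the sample data, the helper names (parse_headerless, upsert, keep_row) and the in-memory hash standing in for the countries table are all hypothetical. It also assumes that key names the column used to find or create a record, which is how the second Country import can overwrite names set by the first.

```ruby
# Plain-Ruby reading of the import options. NOT data_miner's own code;
# every name and every sample value below is made up for illustration.

# Shaped like the ISO file: two preamble lines, then NAME;CODE rows.
ISO_SAMPLE = <<TEXT
Country names and code elements (made-up header line)

AFGHANISTAN;AF
COTE D'IVOIRE;CI
TEXT

# Stand-in for the second source, already parsed into header => value rows.
BETTER_NAMES = [
  { 'country code' => 'CI', 'country' => "Cote d'Ivoire" }
]

# :skip => 2, :headers => false, :delimiter => ';'
def parse_headerless(text, skip, delimiter)
  text.lines.drop(skip).map { |line| line.chomp.split(delimiter) }
end

# key 'iso_3166' (assumed meaning): find the record with that value or
# create it, then set the stored columns on it.
def upsert(table, key_value, attributes)
  record = (table[key_value] ||= { 'iso_3166' => key_value })
  record.merge!(attributes)
end

countries = {}

parse_headerless(ISO_SAMPLE, 2, ';').each do |fields|
  # store 'iso_3166', :field_number => 1 and store 'name', :field_number => 0
  upsert(countries, fields[1], 'name' => fields[0])
end

BETTER_NAMES.each do |row|
  # store 'iso_3166', :field_name => 'country code'
  # store 'name', :field_name => 'country'
  upsert(countries, row['country code'], 'name' => row['country'])
end

# The Airport import's :select lambda, written without ActiveSupport's
# present? so it runs outside Rails. Rows without an IATA code are dropped.
keep_row = lambda { |row| !row[4].to_s.strip.empty? }
airport_rows = [
  [1, 'Sample Airport', 'Sample City', 'Papua New Guinea', 'GKA'],
  [2, 'Nameless strip', 'Nowhere', 'Nowhere', '']
]
kept = airport_rows.select(&keep_row)
```

After this runs, countries holds one entry per code, the CI entry carries the name from the second source, and kept contains only the row with an IATA code.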

Put this in lib/tasks/data_miner_tasks.rake. (I don’t know of a way to include gem tasks automatically, so for now you have to add this file by hand.)

namespace :data_miner do
  task :run => :environment do
    resource_names = %w{R RESOURCES RESOURCE RESOURCE_NAMES}.map { |possible_key| ENV[possible_key].to_s }.join.split(/\s*,\s*/).flatten.compact
    DataMiner.run :resource_names => resource_names
  end
end
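The task reads the model list from any of four environment variables (R, RESOURCES, RESOURCE, RESOURCE_NAMES). As a minimal sketch of that parsing, the same expression is pulled into a helper below that takes a plain Hash instead of ENV, so it can be tried in irb. The helper name resource_names_from and the constant POSSIBLE_KEYS are hypothetical and not part of data_miner.

```ruby
# Hypothetical helper mirroring the one-liner in the rake task.
POSSIBLE_KEYS = %w{R RESOURCES RESOURCE RESOURCE_NAMES}

def resource_names_from(env)
  # Missing variables become "", the values are concatenated, and the
  # result is split on commas (surrounding whitespace is ignored).
  POSSIBLE_KEYS.map { |key| env[key].to_s }.join.split(/\s*,\s*/)
end

one  = resource_names_from('RESOURCES' => 'Airport, Country')
none = resource_names_from({})
both = resource_names_from('R' => 'Airport', 'RESOURCES' => 'Country')

# one  is ["Airport", "Country"]
# none is []
# both is ["AirportCountry"], because the values are joined with no separator
```

Since the values are concatenated without a separator, set only one of the four variables at a time.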

Once you have (1) defined data_miner blocks in your models and (2) added the rake task, you can run:

$ rake data_miner:run RESOURCES=Airport,Country

Complete example

~ $ rails testapp
~ $ cd testapp/
~/testapp $ ./script/generate model Airport iata_code:string name:string city:string country_name:string latitude:float longitude:float
[...edit migration to make iata_code the primary key...]
~/testapp $ ./script/generate model Country iso_3166:string name:string
[...edit migration to make iso_3166 the primary key...]
~/testapp $ rake db:migrate
~/testapp $ touch lib/tasks/data_miner_tasks.rake
[...edit per quick start...]
~/testapp $ rake data_miner:run RESOURCES=Airport,Country

Now you should have:

~/testapp $ ./script/console 
Loading development environment (Rails 2.3.3)
>> Airport.first.iata_code
=> "GKA"
>> Airport.first.country_name
=> "Papua New Guinea"

Authors

Copyright © 2010 Brighter Planet. See LICENSE for details.