Lines of Code Code Status Dependency Status Build Status Coverage Status Downloads

Goldmine

Extract a wealth of information from lists.

Goldmine is especially helpful when working with source data that is difficult to query. e.g. CSV files, API results, etc...

Uses

  • Data mining
  • Data transformation
  • Data blending
  • Data visualization prep
  • CSV report generation

Quick Start

gem install goldmine
require "goldmine"
list = [1,2,3,4,5,6,7,8,9]

Goldmine(list)
  .pivot("< 5") { |i| i < 5 }
  .to_h
{
  [["< 5", true]]  => [1, 2, 3, 4],
  [["< 5", false]] => [5, 6, 7, 8, 9]
}

Array Value Pivots

users = [
  { :name => "Sally",   :favorite_colors => [:blue] },
  { :name => "John",    :favorite_colors => [:blue, :green] },
  { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
  { :name => "Emily",   :favorite_colors => [:orange, :green] },
  { :name => "Joe",     :favorite_colors => [:red] }
]

Goldmine(users)
  .pivot(:favorite_color) { |record| record[:favorite_colors] }
  .to_h
{
  [:favorite_color, :blue]   => [{:name=>"Sally", :favorite_colors=>[:blue]}, {:name=>"John", :favorite_colors=>[:blue, :green]}],
  [:favorite_color, :green]  => [{:name=>"John", :favorite_colors=>[:blue, :green]}, {:name=>"Emily", :favorite_colors=>[:orange, :green]}],
  [:favorite_color, :red]    => [{:name=>"Stephen", :favorite_colors=>[:red, :pink, :purple]}, {:name=>"Joe", :favorite_colors=>[:red]}],
  [:favorite_color, :pink]   => [{:name=>"Stephen", :favorite_colors=>[:red, :pink, :purple]}],
  [:favorite_color, :purple] => [{:name=>"Stephen", :favorite_colors=>[:red, :pink, :purple]}],
  [:favorite_color, :orange] => [{:name=>"Emily", :favorite_colors=>[:orange, :green]}]
}

Chained pivots

users = [
  { :name => "Sally",   :age => 21 },
  { :name => "John",    :age => 28 },
  { :name => "Stephen", :age => 37 },
  { :name => "Emily",   :age => 32 },
  { :name => "Joe",     :age => 18 }
]

Goldmine(users)
  .pivot("'e' in name") { |user| !!user[:name].match(/e/i) }
  .pivot("21 or over") { |user| user[:age] >= 21 }
  .to_h
{
  [["'e' in name", false], ["21 or over", true]]  => [{:name=>"Sally", :age=>21}, {:name=>"John", :age=>28}],
  [["'e' in name", true],  ["21 or over", true]]  => [{:name=>"Stephen", :age=>37}, {:name=>"Emily", :age=>32}],
  [["'e' in name", true],  ["21 or over", false]] => [{:name=>"Joe", :age=>18}]
}

Rollups

Rollups provide an intuitive way to aggregate pivoted data into a report friendly format. Think computed columns.

Rollups are blocks that get executed once for each pivot entry. They can be also be chained.

list = [1,2,3,4,5,6,7,8,9]

Goldmine(list)
  .pivot("< 5") { |i| i < 5 }
  .pivot("even") { |i| i % 2 == 0 }
  .rollup("count", &:count)
  .to_h
{
  [["< 5", true],  ["even", false]] => [["count", 2]],
  [["< 5", true],  ["even", true]]  => [["count", 2]],
  [["< 5", false], ["even", false]] => [["count", 3]],
  [["< 5", false], ["even", true]]  => [["count", 2]]
}

Rollup Caching

Rollups can be computationally expensive. Optional caching can be used to reduce this computational overhead.

list = [1,2,3,4,5,6,7,8,9]

Goldmine(list, cache: true)
  .pivot(:less_than_5) { |i| i < 5 }
  .rollup(:count, &:count)
  .rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.count }
  .rollup(:even_percentage) { |list| cache[:evens] / cache[:count].to_f }
  .to_h
{
  [[:less_than_5, true]]  => [[:count, 4], [:evens, 2], [:even_percentage, 0.5]],
  [[:less_than_5, false]] => [[:count, 5], [:evens, 2], [:even_percentage, 0.4]]
}

Rows

It's often helpful to flatten rollups into rows.

list = [1,2,3,4,5,6,7,8,9]

result = Goldmine(list, cache: true)
  .pivot(:less_than_5) { |i| i < 5 }
  .rollup(:count, &:count)
  .rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.count }
  .rollup(:even_percentage) { |list| cache[:evens] / cache[:count].to_f }
  .result
result.to_rows
[
  [[:less_than_5, true], [:count, 4], [:evens, 2], [:even_percentage, 0.5]],
  [[:less_than_5, false], [:count, 5], [:evens, 2], [:even_percentage, 0.4]]
]
result.to_hash_rows
[
  {:less_than_5=>true, :count=>4, :evens=>2, :even_percentage=>0.5},
  {:less_than_5=>false, :count=>5, :evens=>2, :even_percentage=>0.4}
]

Tabular

Rollups can also be converted into tabular format.

list = [1,2,3,4,5,6,7,8,9]

Goldmine(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count, &:count)
  .to_tabular
[
  [:less_than_5, :even, :count],
  [true, false, 2],
  [true, true, 2],
  [false, false, 3],
  [false, true, 2]
]

CSV

Rollups can also be converted into CSV format.

list = [1,2,3,4,5,6,7,8,9]

Goldmine(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count, &:count)
  .to_csv
"less_than_5,even,count\ntrue,false,2\ntrue,true,2\nfalse,false,3\nfalse,true,2\n"

Example Apps

All examples are small Sinatra apps. They are designed to help communicate Goldmine use-cases.

Setup

git clone [email protected]:hopsoft/goldmine.git
cd /path/to/goldmine
bundle

New York Wifi Hotspots

Uses data from https://github.com/hopsoft/goldmine/blob/master/examples/new_york_wifi_hotspots/DOITT_WIFI_HOTSPOT_01_13SEPT2010.csv

In this example, we mine out the following information.

  • Total hotspots by city, zip, & area code
  • Free hotspots by city, zip, & area code
  • Paid hotspots by city, zip, & area code
  • Library hotspots by city, zip, & area code
  • Starbucks hotspots by city, zip, & area code
  • McDonalds hotspots by city, zip, & area code
ruby examples/new_york_wifi_hotspots/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv

Medicare Physician Comparison

Uses data from http://dev.socrata.com/foundry/#/data.medicare.gov/aeay-dfax

In this example, we mine out the following information.

  • Total doctors by state & specialty
  • Preferred doctors by state & specialty
  • Female doctors by state & specialty
  • Male doctors by state & specialty
  • Preferred female doctors by state & specialty
  • Preferred male doctors by state & specialty
ruby examples/medicare_physician_compare/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv

Performance

The Medicare dataset is large & works well for performance testing.

My Macbook Pro yields the following benchmarks.

  • 3.1 GHz Intel Core i7
  • 16 GB 1867 MHz DDR3
100,000 Records
                      user     system      total        real
pivoted           0.630000   0.030000   0.660000 (  0.670409)
rolled_up         0.570000   0.030000   0.600000 (  0.626413)
rows              0.010000   0.000000   0.010000 (  0.003258)
tabular           0.010000   0.000000   0.010000 (  0.010110)
csv               0.050000   0.000000   0.050000 (  0.057677)
1,000,000 Records
                      user     system      total        real
pivoted           7.270000   0.300000   7.570000 (  8.053166)
rolled_up         6.800000   0.830000   7.630000 (  8.051707)
rows              0.000000   0.000000   0.000000 (  0.003934)
tabular           0.010000   0.000000   0.010000 (  0.011825)
csv               0.210000   0.010000   0.220000 (  0.222752)

Summary

Goldmine makes data highly malleable. It allows you to combine the power of pivots, rollups, tabular data, & csv to construct deep insights with minimal effort.

Real world use cases include:

  • Build a better understanding of database data before canonizing reports in SQL
  • Create source data for building user interfaces & data visualizations
  • Transform CSV data from one format to another