Goldmine

Extract a wealth of information from Arrays & Hashes.

Goldmine is especially helpful when working with source data that is difficult to query. e.g. CSV files, API results, etc...

Uses

Data mining
Data transformation
Data blending
Data visualization prep
CSV report generation

Quick Start

gem install goldmine

require "goldmine"

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot { |i| i < 5 }
# result:
{
  true  => [1, 2, 3, 4],
  false => [5, 6, 7, 8, 9]
}

Chained Pivots

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot { |i| i < 5 }
  .pivot { |i| i % 2 == 0 }
# result:
{
  [true, false]  => [1, 3],
  [true, true]   => [2, 4],
  [false, false] => [5, 7, 9],
  [false, true]  => [6, 8]
}

Named Pivots

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
# result:
{
  { :less_than_5 => true }  => [1, 2, 3, 4],
  { :less_than_5 => false } => [5, 6, 7, 8, 9]
}

Value Pivots

list = [
  { :name => "Sally",   :favorite_colors => [:blue] },
  { :name => "John",    :favorite_colors => [:blue, :green] },
  { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
  { :name => "Emily",   :favorite_colors => [:orange, :green] },
  { :name => "Joe",     :favorite_colors => [:red] }
]
list = Goldmine::ArrayMiner.new(list)
list.pivot { |record| record[:favorite_colors] }
# result:
{
  :blue => [
    { :name => "Sally", :favorite_colors => [:blue] },
    { :name => "John",  :favorite_colors => [:blue, :green] }
  ],
  :green => [
    { :name => "John",  :favorite_colors => [:blue, :green] },
    { :name => "Emily", :favorite_colors => [:orange, :green] }
  ],
  :red => [
    { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
    { :name => "Joe",     :favorite_colors => [:red] }
  ],
  :pink => [
    { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] }
  ],
  :purple => [
    { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] }
  ],
  :orange => [
    { :name => "Emily", :favorite_colors => [:orange, :green] }
  ]
}

Stacked pivots

list = [
  { :name => "Sally",   :age => 21 },
  { :name => "John",    :age => 28 },
  { :name => "Stephen", :age => 37 },
  { :name => "Emily",   :age => 32 },
  { :name => "Joe",     :age => 18 }
]
list = Goldmine::ArrayMiner.new(list)
mined = list.pivot("Name has an 'e'") do |record|
  !!record[:name].match(/e/i)
end
mined = mined.pivot(">= 21 years old") do |record|
  record[:age] >= 21
end
# result:
{
  { "Name has an 'e'" => false, ">= 21 years old" => true } => [
    { :name => "Sally", :age => 21 },
    { :name => "John",  :age => 28 }
  ],
  { "Name has an 'e'" => true, ">= 21 years old" => true } => [
    { :name => "Stephen", :age => 37 },
    { :name => "Emily",   :age => 32 }
  ],
  { "Name has an 'e'" => true, ">= 21 years old" => false } => [
    { :name => "Joe", :age => 18 }
  ]
}

Rollups

Rollups provide a clean way to aggregate pivoted data... think computed columns.

Rollup blocks are executed once for each pivot. Like pivots, rollups can be chained.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count) { |matched| matched.size }
# result:
{
  { :less_than_5 => true, :even => false }  => { :count => 2 },
  { :less_than_5 => true, :even => true }   => { :count => 2 },
  { :less_than_5 => false, :even => false } => { :count => 3 },
  { :less_than_5 => false, :even => true }  => { :count => 2 }
}

Pre-Computed Results

Rollups can be computationally expensive (depending upon how much logic you stuff into the block). Goldmine caches rollup results & makes them available to subsequent rollups.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .rollup(:count, &:size)
  .rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.size }
  .rollup(:even_percentage) { |list|
    computed(:evens).for(list) / computed(:count).for(list).to_f
  }
# result:
{
  { :less_than_5 => true } => { :count => 4, :evens => 2, :even_percentage => 0.5 },
  { :less_than_5 => false } => { :count => 5, :evens => 2, :even_percentage => 0.4 }
}

Rows

It's often helpful to flatten rollups into rows.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .rollup(:count, &:size)
  .rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.size }
  .rollup(:even_percentage) { |list|
    computed(:evens).for(list) / computed(:count).for(list).to_f
  }
  .to_rows
# result:
[
  { "less_than_5" => true, "count" => 4, "evens" => 2, "even_percentage" => 0.5 },
  { "less_than_5" => false, "count" => 5, "evens" => 2, "even_percentage" => 0.4 }
]

Tabular

Rollups can also be converted into tabular format.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count) { |matched| matched.size }
  .to_tabular
# result:
[
  ["less_than_5", "even", "count"],
  [true, false, 2],
  [true, true, 2],
  [false, false, 3],
  [false, true, 2]
]

CSV

Goldmine makes producing CSV output simple.

csv_table = Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count) { |matched| matched.size }
  .to_csv_table
# result:
#<CSV::Table mode:col_or_row row_count:5>

csv_table.to_csv
# result:
"less_than_5,even,count\ntrue,false,2\ntrue,true,2\nfalse,false,3\nfalse,true,2\n"

Examples

All examples are simple Sinatra apps. They are designed to help communicate Goldmine use-cases.

Setup

git clone [email protected]:hopsoft/goldmine.git
cd /path/to/goldmine
bundle

New York Wifi Hotspots

In this example, we mine the following data.

Total hotspots by city, zip, & area code
Free hotspots by city, zip, & area code
Paid hotspots by city, zip, & area code
Library hotspots by city, zip, & area code
Starbucks hotspots by city, zip, & area code
McDonalds hotspots by city, zip, & area code

ruby examples/new_york_wifi_hotspots/app.rb

curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv

Medicare Physician Comparison

Uses data from http://dev.socrata.com/foundry/#/data.medicare.gov/aeay-dfax

In this example, we mine the following data.

Total doctors by state & specialty
Preferred doctors by state & specialty
Female doctors by state & specialty
Male doctors by state & specialty
Preferred female doctors by state & specialty
Preferred male doctors by state & specialty

ruby examples/medicare_physician_compare/app.rb

curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv

Performance

The Medicare dataset is large & works well for performance testing.

My Macbook Pro yields the following benchmarks.

3.1 GHz Intel Core i7
16 GB 1867 MHz DDR3

100,000 Records

                      user     system      total        real
pivoted           1.000000   0.020000   1.020000 (  1.027810)
rolled_up         1.090000   0.020000   1.110000 (  1.101082)
rows              0.020000   0.000000   0.020000 (  0.022978)
tabular           0.010000   0.000000   0.010000 (  0.005423)
csv               0.030000   0.000000   0.030000 (  0.037245)

1,000,000 Records

                      user     system      total        real
pivoted          15.700000   0.490000  16.190000 ( 16.886677)
rolled_up         7.070000   0.350000   7.420000 (  7.544060)
rows              0.020000   0.000000   0.020000 (  0.028432)
tabular           0.010000   0.010000   0.020000 (  0.007663)
csv               0.050000   0.000000   0.050000 (  0.058925)

Summary

Goldmine makes data highly malleable. It allows you to combine the power of pivots, rollups, tabular data, & csv to construct deep insights with minimal effort.

Real world use cases include:

Build a better understanding of database data before canonizing reports in SQL
Create source data for building user interfaces & data visualizations
Transform CSV data from one format to another