Lines of Code Code Status Dependency Status Build Status Coverage Status Downloads

Goldmine

Extract a wealth of information from Arrays & Hashes.

Goldmine is especially helpful when working with source data that is difficult to query. e.g. CSV files, API results, etc...

Uses

  • Data mining
  • Data transformation
  • Data blending
  • Data visualization prep
  • CSV report generation

Quick Start

gem install goldmine
require "goldmine"

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot { |i| i < 5 }
# result:
{
  true  => [1, 2, 3, 4],
  false => [5, 6, 7, 8, 9]
}

Chained Pivots

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot { |i| i < 5 }
  .pivot { |i| i % 2 == 0 }
# result:
{
  [true, false]  => [1, 3],
  [true, true]   => [2, 4],
  [false, false] => [5, 7, 9],
  [false, true]  => [6, 8]
}

Named Pivots

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
# result:
{
  { :less_than_5 => true }  => [1, 2, 3, 4],
  { :less_than_5 => false } => [5, 6, 7, 8, 9]
}

Value Pivots

list = [
  { :name => "Sally",   :favorite_colors => [:blue] },
  { :name => "John",    :favorite_colors => [:blue, :green] },
  { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
  { :name => "Emily",   :favorite_colors => [:orange, :green] },
  { :name => "Joe",     :favorite_colors => [:red] }
]
list = Goldmine::ArrayMiner.new(list)
list.pivot { |record| record[:favorite_colors] }
# result:
{
  :blue => [
    { :name => "Sally", :favorite_colors => [:blue] },
    { :name => "John",  :favorite_colors => [:blue, :green] }
  ],
  :green => [
    { :name => "John",  :favorite_colors => [:blue, :green] },
    { :name => "Emily", :favorite_colors => [:orange, :green] }
  ],
  :red => [
    { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
    { :name => "Joe",     :favorite_colors => [:red] }
  ],
  :pink => [
    { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] }
  ],
  :purple => [
    { :name => "Stephen", :favorite_colors => [:red, :pink, :purple] }
  ],
  :orange => [
    { :name => "Emily", :favorite_colors => [:orange, :green] }
  ]
}

Stacked pivots

list = [
  { :name => "Sally",   :age => 21 },
  { :name => "John",    :age => 28 },
  { :name => "Stephen", :age => 37 },
  { :name => "Emily",   :age => 32 },
  { :name => "Joe",     :age => 18 }
]
list = Goldmine::ArrayMiner.new(list)
mined = list.pivot("Name has an 'e'") do |record|
  !!record[:name].match(/e/i)
end
mined = mined.pivot(">= 21 years old") do |record|
  record[:age] >= 21
end
# result:
{
  { "Name has an 'e'" => false, ">= 21 years old" => true } => [
    { :name => "Sally", :age => 21 },
    { :name => "John",  :age => 28 }
  ],
  { "Name has an 'e'" => true, ">= 21 years old" => true } => [
    { :name => "Stephen", :age => 37 },
    { :name => "Emily",   :age => 32 }
  ],
  { "Name has an 'e'" => true, ">= 21 years old" => false } => [
    { :name => "Joe", :age => 18 }
  ]
}

Rollups

Rollups provide a clean way to aggregate pivoted data... think computed columns.

Rollup blocks are executed once for each pivot. Like pivots, rollups can be chained.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count) { |matched| matched.size }
# result:
{
  { :less_than_5 => true, :even => false }  => { :count => 2 },
  { :less_than_5 => true, :even => true }   => { :count => 2 },
  { :less_than_5 => false, :even => false } => { :count => 3 },
  { :less_than_5 => false, :even => true }  => { :count => 2 }
}

Pre-Computed Results

Rollups can be computationally expensive (depending upon how much logic you stuff into the block). Goldmine caches rollup results & makes them available to subsequent rollups.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .rollup(:count, &:size)
  .rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.size }
  .rollup(:even_percentage) { |list|
    computed(:evens).for(list) / computed(:count).for(list).to_f
  }
# result:
{
  { :less_than_5 => true } => { :count => 4, :evens => 2, :even_percentage => 0.5 },
  { :less_than_5 => false } => { :count => 5, :evens => 2, :even_percentage => 0.4 }
}

Rows

It's often helpful to flatten rollups into rows.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .rollup(:count, &:size)
  .rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.size }
  .rollup(:even_percentage) { |list|
    computed(:evens).for(list) / computed(:count).for(list).to_f
  }
  .to_rows
# result:
[
  { "less_than_5" => true, "count" => 4, "evens" => 2, "even_percentage" => 0.5 },
  { "less_than_5" => false, "count" => 5, "evens" => 2, "even_percentage" => 0.4 }
]

Tabular

Rollups can also be converted into tabular format.

list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count) { |matched| matched.size }
  .to_tabular
# result:
[
  ["less_than_5", "even", "count"],
  [true, false, 2],
  [true, true, 2],
  [false, false, 3],
  [false, true, 2]
]

CSV

Goldmine makes producing CSV output simple.

csv_table = Goldmine::ArrayMiner.new(list)
  .pivot(:less_than_5) { |i| i < 5 }
  .pivot(:even) { |i| i % 2 == 0 }
  .rollup(:count) { |matched| matched.size }
  .to_csv_table
# result:
#<CSV::Table mode:col_or_row row_count:5>

csv_table.to_csv
# result:
"less_than_5,even,count\ntrue,false,2\ntrue,true,2\nfalse,false,3\nfalse,true,2\n"

Examples

All examples are simple Sinatra apps. They are designed to help communicate Goldmine use-cases.

Setup

git clone [email protected]:hopsoft/goldmine.git
cd /path/to/goldmine
bundle

New York Wifi Hotspots

In this example, we mine the following data.

  • Total hotspots by city, zip, & area code
  • Free hotspots by city, zip, & area code
  • Paid hotspots by city, zip, & area code
  • Library hotspots by city, zip, & area code
  • Starbucks hotspots by city, zip, & area code
  • McDonalds hotspots by city, zip, & area code
ruby examples/new_york_wifi_hotspots/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv

Medicare Physician Comparison

Uses data from http://dev.socrata.com/foundry/#/data.medicare.gov/aeay-dfax

In this example, we mine the following data.

  • Total doctors by state & specialty
  • Preferred doctors by state & specialty
  • Female doctors by state & specialty
  • Male doctors by state & specialty
  • Preferred female doctors by state & specialty
  • Preferred male doctors by state & specialty
ruby examples/medicare_physician_compare/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv

Performance

The Medicare dataset is large & works well for performance testing.

My Macbook Pro yields the following benchmarks.

  • 3.1 GHz Intel Core i7
  • 16 GB 1867 MHz DDR3
100,000 Records
                      user     system      total        real
pivoted           1.000000   0.020000   1.020000 (  1.027810)
rolled_up         1.090000   0.020000   1.110000 (  1.101082)
rows              0.020000   0.000000   0.020000 (  0.022978)
tabular           0.010000   0.000000   0.010000 (  0.005423)
csv               0.030000   0.000000   0.030000 (  0.037245)
1,000,000 Records
                      user     system      total        real
pivoted          15.700000   0.490000  16.190000 ( 16.886677)
rolled_up         7.070000   0.350000   7.420000 (  7.544060)
rows              0.020000   0.000000   0.020000 (  0.028432)
tabular           0.010000   0.010000   0.020000 (  0.007663)
csv               0.050000   0.000000   0.050000 (  0.058925)

Summary

Goldmine makes data highly malleable. It allows you to combine the power of pivots, rollups, tabular data, & csv to construct deep insights with minimal effort.

Real world use cases include:

  • Build a better understanding of database data before canonizing reports in SQL
  • Create source data for building user interfaces & data visualizations
  • Transform CSV data from one format to another