Goldmine
Extract a wealth of information from Arrays & Hashes.
Goldmine is especially helpful when working with source data that is difficult to query. e.g. CSV files, API results, etc...
Uses
- Data mining
- Data transformation
- Data blending
- Data visualization prep
- CSV report generation
Quick Start
gem install goldmine
require "goldmine"
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot { |i| i < 5 }
# result:
{
true => [1, 2, 3, 4],
false => [5, 6, 7, 8, 9]
}
Chained Pivots
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot { |i| i < 5 }
.pivot { |i| i % 2 == 0 }
# result:
{
[true, false] => [1, 3],
[true, true] => [2, 4],
[false, false] => [5, 7, 9],
[false, true] => [6, 8]
}
Named Pivots
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot(:less_than_5) { |i| i < 5 }
# result:
{
{ :less_than_5 => true } => [1, 2, 3, 4],
{ :less_than_5 => false } => [5, 6, 7, 8, 9]
}
Value Pivots
list = [
{ :name => "Sally", :favorite_colors => [:blue] },
{ :name => "John", :favorite_colors => [:blue, :green] },
{ :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
{ :name => "Emily", :favorite_colors => [:orange, :green] },
{ :name => "Joe", :favorite_colors => [:red] }
]
list = Goldmine::ArrayMiner.new(list)
list.pivot { |record| record[:favorite_colors] }
# result:
{
:blue => [
{ :name => "Sally", :favorite_colors => [:blue] },
{ :name => "John", :favorite_colors => [:blue, :green] }
],
:green => [
{ :name => "John", :favorite_colors => [:blue, :green] },
{ :name => "Emily", :favorite_colors => [:orange, :green] }
],
:red => [
{ :name => "Stephen", :favorite_colors => [:red, :pink, :purple] },
{ :name => "Joe", :favorite_colors => [:red] }
],
:pink => [
{ :name => "Stephen", :favorite_colors => [:red, :pink, :purple] }
],
:purple => [
{ :name => "Stephen", :favorite_colors => [:red, :pink, :purple] }
],
:orange => [
{ :name => "Emily", :favorite_colors => [:orange, :green] }
]
}
Stacked pivots
list = [
{ :name => "Sally", :age => 21 },
{ :name => "John", :age => 28 },
{ :name => "Stephen", :age => 37 },
{ :name => "Emily", :age => 32 },
{ :name => "Joe", :age => 18 }
]
list = Goldmine::ArrayMiner.new(list)
mined = list.pivot("Name has an 'e'") do |record|
!!record[:name].match(/e/i)
end
mined = mined.pivot(">= 21 years old") do |record|
record[:age] >= 21
end
# result:
{
{ "Name has an 'e'" => false, ">= 21 years old" => true } => [
{ :name => "Sally", :age => 21 },
{ :name => "John", :age => 28 }
],
{ "Name has an 'e'" => true, ">= 21 years old" => true } => [
{ :name => "Stephen", :age => 37 },
{ :name => "Emily", :age => 32 }
],
{ "Name has an 'e'" => true, ">= 21 years old" => false } => [
{ :name => "Joe", :age => 18 }
]
}
Rollups
Rollups provide a clean way to aggregate pivoted data... think computed columns.
Rollup blocks
are executed once for each pivot.
Like pivots, rollups can be chained.
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot(:less_than_5) { |i| i < 5 }
.pivot(:even) { |i| i % 2 == 0 }
.rollup(:count) { |matched| matched.size }
# result:
{
{ :less_than_5 => true, :even => false } => { :count => 2 },
{ :less_than_5 => true, :even => true } => { :count => 2 },
{ :less_than_5 => false, :even => false } => { :count => 3 },
{ :less_than_5 => false, :even => true } => { :count => 2 }
}
Pre-Computed Results
Rollups can be computationally expensive (depending upon how much logic you stuff into the block
).
Goldmine caches rollup results & makes them available to subsequent rollups.
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot(:less_than_5) { |i| i < 5 }
.rollup(:count, &:size)
.rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.size }
.rollup(:even_percentage) { |list|
computed(:evens).for(list) / computed(:count).for(list).to_f
}
# result:
{
{ :less_than_5 => true } => { :count => 4, :evens => 2, :even_percentage => 0.5 },
{ :less_than_5 => false } => { :count => 5, :evens => 2, :even_percentage => 0.4 }
}
Rows
It's often helpful to flatten rollups into rows.
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot(:less_than_5) { |i| i < 5 }
.rollup(:count, &:size)
.rollup(:evens) { |list| list.select { |i| i % 2 == 0 }.size }
.rollup(:even_percentage) { |list|
computed(:evens).for(list) / computed(:count).for(list).to_f
}
.to_rows
# result:
[
{ "less_than_5" => true, "count" => 4, "evens" => 2, "even_percentage" => 0.5 },
{ "less_than_5" => false, "count" => 5, "evens" => 2, "even_percentage" => 0.4 }
]
Tabular
Rollups can also be converted into tabular format.
list = [1,2,3,4,5,6,7,8,9]
Goldmine::ArrayMiner.new(list)
.pivot(:less_than_5) { |i| i < 5 }
.pivot(:even) { |i| i % 2 == 0 }
.rollup(:count) { |matched| matched.size }
.to_tabular
# result:
[
["less_than_5", "even", "count"],
[true, false, 2],
[true, true, 2],
[false, false, 3],
[false, true, 2]
]
CSV
Goldmine makes producing CSV output simple.
csv_table = Goldmine::ArrayMiner.new(list)
.pivot(:less_than_5) { |i| i < 5 }
.pivot(:even) { |i| i % 2 == 0 }
.rollup(:count) { |matched| matched.size }
.to_csv_table
# result:
#<CSV::Table mode:col_or_row row_count:5>
csv_table.to_csv
# result:
"less_than_5,even,count\ntrue,false,2\ntrue,true,2\nfalse,false,3\nfalse,true,2\n"
Examples
All examples are simple Sinatra apps. They are designed to help communicate Goldmine use-cases.
Setup
git clone [email protected]:hopsoft/goldmine.git
cd /path/to/goldmine
bundle
New York Wifi Hotspots
In this example, we mine the following data.
- Total hotspots by city, zip, & area code
- Free hotspots by city, zip, & area code
- Paid hotspots by city, zip, & area code
- Library hotspots by city, zip, & area code
- Starbucks hotspots by city, zip, & area code
- McDonalds hotspots by city, zip, & area code
ruby examples/new_york_wifi_hotspots/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv
Medicare Physician Comparison
Uses data from http://dev.socrata.com/foundry/#/data.medicare.gov/aeay-dfax
In this example, we mine the following data.
- Total doctors by state & specialty
- Preferred doctors by state & specialty
- Female doctors by state & specialty
- Male doctors by state & specialty
- Preferred female doctors by state & specialty
- Preferred male doctors by state & specialty
ruby examples/medicare_physician_compare/app.rb
curl http://localhost:3000/raw
curl http://localhost:3000/pivoted
curl http://localhost:3000/rolled_up
curl http://localhost:3000/rows
curl http://localhost:3000/tabular
curl http://localhost:3000/csv
Performance
The Medicare dataset is large & works well for performance testing.
My Macbook Pro yields the following benchmarks.
- 3.1 GHz Intel Core i7
- 16 GB 1867 MHz DDR3
100,000 Records
user system total real
pivoted 1.000000 0.020000 1.020000 ( 1.027810)
rolled_up 1.090000 0.020000 1.110000 ( 1.101082)
rows 0.020000 0.000000 0.020000 ( 0.022978)
tabular 0.010000 0.000000 0.010000 ( 0.005423)
csv 0.030000 0.000000 0.030000 ( 0.037245)
1,000,000 Records
user system total real
pivoted 15.700000 0.490000 16.190000 ( 16.886677)
rolled_up 7.070000 0.350000 7.420000 ( 7.544060)
rows 0.020000 0.000000 0.020000 ( 0.028432)
tabular 0.010000 0.010000 0.020000 ( 0.007663)
csv 0.050000 0.000000 0.050000 ( 0.058925)
Summary
Goldmine makes data highly malleable. It allows you to combine the power of pivots, rollups, tabular data, & csv to construct deep insights with minimal effort.
Real world use cases include:
- Build a better understanding of database data before canonizing reports in SQL
- Create source data for building user interfaces & data visualizations
- Transform CSV data from one format to another