Purview
A framework designed to simplify data warehousing
Installation
Add this line to your application's Gemfile:
gem 'purview'
And then execute:
$ bundle
Or install it yourself as:
$ gem install purview
Usage
Load the MySQL client (for PostgreSQL simply change 'mysql2' to 'pg' -- when
using this gem in a JRuby environment the 'jdbc/mysql' and/or 'jdbc/postgres'
library must be installed/available)
require 'mysql2'
Set the table-name (this can be anything, but it must exist)
table_name = :users
Define the Column(s) (available column-types: Boolean, CreatedTimestamp,
Date, Float, Id, Integer, Money, String, Text, Time, Timestamp,
UpdatedTimestamp & UUID -- the Id, CreatedTimestamp & UpdatedTimestamp
columns are required for all tables)
id_column = Purview::Columns::Id.new(:id),
name_column = Purview::Columns::String.new(:name, :nullable => false),
email_column = Purview::Columns::String.new(:email, :nullable => false, :limit => 100),
created_at_column = Purview::Columns::CreatedTimestamp.new(:created_at),
updated_at_column = Purview::Columns::UpdatedTimestamp.new(:updated_at),
columns = [
id_column,
name_column,
email_column,
created_at_column,
updated_at_column,
]
Define the Indices (availble index-types: Composite & Simple). By default
Indices will be added for the required column-types (CreatedTimestamp &
UpdatedTimestamp)
indices = [
Purview::Indices::Simple.new(email_column),
]
Configure the Puller (available puller-types: MySQL, PostgreSQL & URI)
puller_opts = {
:type => Purview::Pullers::URI,
:uri => 'http://feed.test.com/users',
}
Configure the Parser (available parser-types: CSV, SQL & TSV)
parser_opts = {
:type => Purview::Parsers::TSV,
}
Configure the Loader (for PostgreSQL simply change MySQL to PostgreSQL)
loader_opts = {
:type => Purview::Loaders::MySQL,
}
Combine all the configuration options and instantiate the Table
table_opts = {
:columns => columns,
:indices => indices,
:loader => loader_opts,
:parser => parser_opts,
:puller => puller_opts,
}
table = Purview::Tables::Raw.new(
table_name,
table_opts
)
Set the database-name (this can be anything, but it must exist)
database_name = :data_warehouse
Combine all the configuration options and instantiate the Database (for
PostgreSQL simply change MySQL to PostgreSQL)
database_opts = {
:tables => [table],
}
database = Purview::Databases::MySQL.new(
database_name,
database_opts
)
Add the Table to the Database (schema). In order for [the] Table to be
sync[ed] it must be added to [the] Database
database.add_table(table)
Create the Table (in the DB). Recommended for testing purposes only. For
production environments you will likely want an external process to manage the
schema (for PostgreSQL simply change Mysql2::Error to PG::DuplicateTable)
begin
database.create_table(table)
rescue Mysql2::Error
# Swallow
end
Initialize the Table (in the DB). This process sets the max_timestamp_pulled
value in the table_metadata table and is used by the candidate Table
selection algorithm to determine which Table should be synchronized next (the
least recently synchronized Table will be selected). This value is also used
as the high-water mark for records pulled from its source
database.initialize_table(table, )
Baseline the Table. This process will quickly get the state of the Table as
close to the current state as possible. This is generally useful when adding a
new Table to an existing schema (ideally this should be done while the Table
is disabled)
database.baseline_table(table)
Enable the Table (in the DB). This process sets the enabled_at value in the
table_metadata table and is used by the candidate Table selection algorithm
to determine the pool of Table(s) available for synchronization (to remove a
Table from the pool simply execute disable_table)
database.enable_table(table)
Sync the Database. This process will select a [candidate] Table, pull data
from its [remote-]source and reconcile the new data against the main-table (e.g.
perform INSERTs, UPDATEs and DELETEs). When multiple Table(s) are
configured the least recently pulled and available (enabled and not locked)
Table will be selected (you will likely want to configure some process to load
the schema run the sync at regularly scheduled intervals)
database.sync
Contributing
- Fork it ( http://github.com/jzaleski/purview/fork )
- Create your feature branch (
git checkout -b my-new-feature) - Commit your changes (
git commit -am 'Add some feature') - Push to the branch (
git push origin my-new-feature) - Create new Pull Request

