Infobright Ruby Loader
Overview
Infobright Ruby Loader (IRL) is a data loader for Infobright Community Edition (ICE) and Enterprise Edition (IEE), built in Ruby.
IRL was inspired by ParaFlex, a Bash script from the Infobright team to perform parallel loading of ICE and IEE.
IRL can be used in two ways:
- As a command-line tool - i.e. as a direct alternative to ParaFlex. No Ruby expertise required
- As part of another application - IRL is a Ruby gem with a Ruby API, so it can be integrated into larger Ruby ETL processes (such as SnowPlow's)
Main differences from ParaFlex
The main differences between IRL and ParaFlex are as follows:
- IRL can be integrated into Ruby apps (see above)
- IRL lets you specify the Infobright username and/or password
- IRL lets you specify the data delimiter and encloser
- IRL allows loads of multiple files into the same table - it just runs them in series, not in parallel
- IRL can be fed a directory of files, as well as a file list
Installation
Overview
To add.
For command-line use
You can install IRL like so:
$ gem install infobright-loader
For use in your own application
Add this line to your application's Gemfile:
gem 'infobright-loader'
And then execute:
$ bundle
Usage
Operation modes
IRL has two main ways of operating:
- Loading all the files from a specific directory into a specific table
- Loading a set of tables from a set of files (where each table can have multiple files loaded into it)
Both modes are available whether you run IRL from the command line or from another Ruby application.
From the command-line
You can use IRL from the command-line:
$ bundle exec infobright-loader -v
infobright-loader 0.0.1
Usage options
The usage options look like this:
Usage: infobright-loader [options]

Specify a control file:
    -c, --control FILE               control file
    -x, --processes INT              optional number of parallel processes to run *

Or load a table from a folder of data files:
    -d, --db NAME                    database name *
    -u, --username NAME              database username *
    -p, --password NAME              database password *
    -t, --table NAME                 table to load data files into
    -f, --folder DIR                 directory containing data files to load
    -s, --separator CHAR             optional field separator, defaults to pipe bar (|) *
    -e, --encloser CHAR              optional field encloser, defaults to none *

* overrides the same setting in the control file if control file also specified

Common options:
    -h, --help                       Show this message
    -v, --version                    Show version
In other words, you can run IRL from the command-line in two ways:
- With --control, specifying a control file containing a list of tables to load, each with a list of files
- With --db, --table and --folder, to load all the files from a specific directory into a single database table
As an added bonus, if you are using a control file you can still specify the asterisked parameters at the command-line, to override the settings found in your control file.
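The override rule can be pictured as a merge of two nested hashes in which a command-line value wins only where it was actually supplied. The sketch below is illustrative only: `apply_overrides` is a hypothetical helper, not part of IRL's API, and the hash layout is an assumption loosely modelled on the control file sections.

```ruby
# Hypothetical helper (not part of IRL): merge command-line overrides into
# settings read from a control file. A nil override means "flag not given",
# so the control file value is kept.
def apply_overrides(control, overrides)
  control.merge(overrides) do |_key, base, override|
    if base.is_a?(Hash) && override.is_a?(Hash)
      apply_overrides(base, override)   # recurse into nested sections
    else
      override.nil? ? base : override   # only explicit values win
    end
  end
end

# Settings as they might be read from a control file (assumed layout).
control = {
  load:        { processes: 2 },
  database:    { name: 'warehouse', username: 'etl', password: nil },
  data_format: { separator: '|', encloser: nil }
}

# Values as they might arrive from -x and -s; -d, -u, -p and -e were not given.
overrides = {
  load:        { processes: 4 },
  database:    { name: nil, username: nil, password: nil },
  data_format: { separator: ',', encloser: nil }
}

settings = apply_overrides(control, overrides)
```

Here `settings` carries the overridden process count and separator, while the database section keeps its control file values and `control` itself is left unmodified.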
Control file format
You can find a template control file in the repository at control-file/template.yml. Its contents are as follows:
# Example control file for Infobright Ruby Loader

# Can be overridden at the command line...
:load:
  :processes: ADD HERE
:database:
  :name: ADD HERE
  :username: ADD HERE # Or leave blank to default to the user running the script
  :password: ADD HERE # Or leave blank if no password
:data_format:
  :separator: ADD HERE
  :encloser: ADD HERE # Or leave blank if no encloser
# ... end of variables overridable at command line.

# Map of tables to populate, along with files to load for each table
:data_loads:
  # For each table, list the data files to load
  TABLE_NAME_1:
    - PATH/TO/FILE-1
    - PATH/TO/FILE-2
  TABLE_NAME_2:
    - PATH/TO/FILE-3
    - PATH/TO/FILE-4
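Because the section keys in the template start with a colon, Ruby's YAML parser reads them as symbols, while the table names come back as plain strings. The following is a minimal sketch of reading such a file from your own code. The nesting and all values are assumptions made for illustration, and this is not necessarily how IRL itself parses the file.

```ruby
require 'yaml'

# Example control file contents (all values are made up for illustration).
# The pipe separator is quoted because a bare | starts a YAML block scalar.
control_yaml = <<~YAML
  :load:
    :processes: 3
  :database:
    :name: my-db
    :username: my-db-user
    :password:
  :data_format:
    :separator: '|'
    :encloser:
  :data_loads:
    snowplow_events:
      - /tmp/events-1
      - /tmp/events-2
    snowplow_clicks:
      - /tmp/clicks-1
YAML

# Symbol keys must be explicitly permitted when using safe_load.
control = YAML.safe_load(control_yaml, permitted_classes: [Symbol])

db_name    = control[:database][:name]      # a String
password   = control[:database][:password]  # nil, because the value was left blank
processes  = control[:load][:processes]     # an Integer
data_loads = control[:data_loads]           # table name (String) => array of file paths
```

Values left blank in the file, such as the password and encloser here, come back as `nil`, which matches the "leave blank" comments in the template.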
From your own application
Using IRL from your own Ruby application (e.g. an ETL process) is quite straightforward.
First require the necessary file:
require 'infobright-loader/loader'
Now populate a DbConfig struct:
db = InfobrightLoader::Db::DbConfig.new('my-db', 'my-db-user', nil) # No password
And now you're ready to load either a single table or a hash of tables:
Load a single table
Loading a single table from a folder is quite straightforward:
InfobrightLoader::Loader::load_from_folder(
  '/data/snowplow/etl-fla/latest', # folder
  'snowplow_events',               # table
  db,                              # database config
  "\t",                            # field separator (tab)
  ''                               # field encloser
)
Note that the last two arguments are optional: they default to the pipe bar (|) and the empty string ('') respectively.
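Conceptually, loading from a folder is the single-table case of loading a hash of tables: every regular file in the folder becomes one file to load into the named table. The sketch below shows that expansion using only Ruby's standard library. `files_in_folder` is a hypothetical helper, and both the sorted ordering and the skipping of subdirectories are assumptions; IRL's own file discovery may differ.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of IRL): list the regular files directly
# inside a folder, skipping subdirectories and dotfiles, in sorted order.
def files_in_folder(folder)
  Dir.glob(File.join(folder, '*')).select { |path| File.file?(path) }.sort
end

# Build a throwaway directory so the example is self-contained.
folder = Dir.mktmpdir
FileUtils.touch(File.join(folder, 'events-2.psv'))
FileUtils.touch(File.join(folder, 'events-1.psv'))
Dir.mkdir(File.join(folder, 'archive')) # ignored: not a regular file

# The equivalent one-table load hash for this folder.
load_hash = { 'snowplow_events' => files_in_folder(folder) }
basenames = load_hash['snowplow_events'].map { |path| File.basename(path) }

FileUtils.remove_entry(folder) # clean up the temporary directory
```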
Load a hash of tables
To load a hash of tables, let's first create the hash:
load_hash = {}
load_hash['impressions'] = ['/tmp/imps-1', '/tmp/imps-2', '/tmp/imps-3']
load_hash['clicks'] = ['/tmp/clicks-1', '/tmp/clicks-2']
load_hash['conversions'] = ['/tmp/convs-1', '/tmp/convs-2', '/tmp/convs-3', '/tmp/convs-4']
load_hash['bids'] = ['/tmp/bids-1', '/tmp/bids-2', '/tmp/bids-3', '/tmp/bids-4']
Now we can run a parallel load of the tables:
InfobrightLoader::Loader::load_from_hash(
  load_hash, # hash of tables to load into
  db,        # database config
  3,         # number of table loads to run in parallel (using Ruby threads)
  "\t",      # field separator (tab)
  ''         # field encloser
)
The last three arguments are optional:
- The number of processes defaults to 10 or the number of tables to populate, whichever is lower
- The field separator defaults to the pipe bar (|)
- The field encloser defaults to the empty string ('')
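The behaviour described in this README (tables loaded in parallel using Ruby threads, files for one table loaded in series, and a worker count that defaults to 10 or the number of tables) can be sketched with a small thread pool. This is not IRL's implementation: `default_process_count` and `run_table_loads` are hypothetical names, and the block passed in stands in for whatever actually loads a file.

```ruby
MAX_DEFAULT_PROCESSES = 10

# Default worker count: 10 or the number of tables, whichever is lower.
def default_process_count(load_hash)
  [MAX_DEFAULT_PROCESSES, load_hash.size].min
end

# Start one worker thread per "process". Each worker repeatedly takes a whole
# table off the queue and loads that table's files one after another, so
# files for the same table never load concurrently.
def run_table_loads(load_hash, processes = default_process_count(load_hash), &load_file)
  queue = Queue.new
  load_hash.each { |table, files| queue << [table, files] }

  workers = Array.new(processes) do
    Thread.new do
      loop do
        begin
          table, files = queue.pop(true) # non-blocking pop
        rescue ThreadError
          break                          # queue drained: worker exits
        end
        files.each { |file| load_file.call(table, file) }
      end
    end
  end
  workers.each(&:join)
end

load_hash = {
  'impressions' => ['/tmp/imps-1', '/tmp/imps-2'],
  'clicks'      => ['/tmp/clicks-1', '/tmp/clicks-2']
}

# Stand-in for a real file load: record what would have been loaded.
loaded = Queue.new
run_table_loads(load_hash, 2) { |table, file| loaded << [table, file] }

# Group the recorded loads by table to inspect the per-table order.
by_table = Hash.new { |hash, table| hash[table] = [] }
until loaded.empty?
  table, file = loaded.pop
  by_table[table] << file
end
```

Handing out work a table at a time, rather than a file at a time, is what keeps loads into the same table serial while different tables proceed in parallel.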
Roadmap
- Add error handling for when individual file loads fail
- Move tests into Rspec or Cucumber from Bash
- Add metrics so load time can be reported
Hacking and contributing
Hacking locally
- Build the gem (gem build infobright-loader.gemspec)
- Install the gem (sudo gem install ./infobright-loader-0.0.1.gem)
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Added some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
Copyright and license
Infobright Ruby Loader is copyright 2012 SnowPlow Analytics Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.