Redshifter
Provides a resque job and rake tasks to facilitate ETL (Extract Transform Load) processing of Postgres tables for export to a Redshift cluster.
Specifically provides: 1) Create/Replace job to replace all model data in Redshift 2) Update job updates all records created or updated since the last update ran
Limitations:
- deleted records are NOT synced to Redshift by the update job
Feature Roadmap:
- store last runtime locally instead of using potentially costly redshift query
Installation
Add this line to your application's Gemfile:
gem 'redshifter'
And then execute:
$ bundle
Or install it yourself as:
$ gem install redshifter
Usage
Setup Redshifter in a Rails initializer
# config/initializers/redshifter.rb
Redshifter.setup do |config|
# path in your app available for writing temp files
config.temp_directory_path = File.('../../tmp', __FILE__)
# path to redshifter table config
config.table_config_path = File.('../redshfiter.rb', __FILE__)
# redshift user should have access to create tables in the specified schema
config.redshift_username = 'your_app_user'
config.redshift_password = 'p@ssw0rd'
config.redshift_host = 'app.host.without.protocol.com'
config.redshift_port = 5439
config.redshift_database = 'database_name'
config.redshift_schema = 'a_schema'
# AWS user should be allowed full access to the specified bucket
config.aws_access_key_id = '<AWS user access key ID>'
config.aws_secret_access_key = '<AWS user secret access key>'
config.s3_bucket = 'a_redshifter_bucket'
config.s3_bucket_region = 'us-east-1'
# Heroku user must be a member of the app and have privileges to
# start new dynos
config.heroku_api_key = '<Heroku user api key>'
config.heroku_app_name = 'name of the app on heroku'
config.logger = Rails.logger
end
Require Redshifter tasks in your Rakefile
# Rakefile
# ...
require 'redshifter/tasks'
# ...
Create a config file describing the tables, columns and transforms for export to redshift
# config/redshifter.rb
Redshifter.config.tables = {
'books_with_export_at' => {
# [required] Source *table* name, not the Rails model name
source_table_name: 'books',
# [required] Prefixing your redshift table with its source is recommended
redshift_table_name: 'app_name_books',
# [optional] Provide a conditional to specify which rows get exported to Redshift
source_table_filter: 'title IS NOT NULL',
# [required] Columns with Redshift datatypes to create; may differ from source DB
redshift_columns: {
'id' => 'INTEGER',
'title' => 'VARCHAR(128)',
'json_value' => 'BOOLEAN',
'published_at' => 'TIMESTAMP',
'updated_at' => 'TIMESTAMP',
'exported_at' => 'TIMESTAMP'
},
# [optional] SQL statements to transform or populate redshift columns from
# source DB. By default, redshift columns will be populated from source
# column with the same name. Column key must exist in redshift_columns.
# If a matching source column does not exist you MUST specify it here.
#
# Extract json values as text (using ->>), then cast non-strings to appropriate type
source_column_transforms: {
'title' => "lower(title)",
'json_value' => "cast(json_column->>'json_key' as BOOLEAN)",
'published_at' => 'first_edition_published_at',
'exported_at' => 'now()'
},
# [required] valid values: KEY, EVEN, ALL
redshift_distribution_style: 'KEY',
# [required, if redshift_distribution_style: 'KEY'] distribution key column
# name MUST be present in redshift_columns.keys
redshift_distribution_key: 'id',
# [optional] valid values: COMPOUND, INTERLEAVED; If omitted the Redshift
# table DDL statement will not specify the sort style and Redshift will
# implicitly default to COMPOUND style.
redshift_sort_style: 'INTERLEAVED',
# [required] Column names MUST be present in redshift_columns.keys; Max
# length of 8 when using INTERLEAVED sort style, and 400 when using
# COMPOUND sort style.
redshift_sort_keys: ['published_at'],
# [optional] Used for query planning in Redshift
redshift_primary_key: 'id'
}
}
Run redshifter:replace rake task for each table you want to export
$ rake redshifter:replace[books_with_export_at]
Schedule a Redshifter::Job::UpdateRedshiftTableJob resque job per each table you want to export updates for
Then schedule this meta job to run in resque_schedule.yml
to run once at 10:00pm
# config/resque_schedule.yml
etl_books_to_redshift:
cron: "0 22 * * *"
class: 'Redshifter::Job::UpdateRedshiftTableJob'
args: 'books_with_export_at'
description: 'Export the books table to Redshift'
Monitoring Rake tasks with New Relic (optional)
New Relic offers Rake task instrumentation as of version 3.13.0
of their newrelic_rpm
agent. Redshifter does not directly use or depend on newrelic_rpm
. You must use New Relic's rake instrumentation and explicitly identify the tasks in your app that you want to monitor. See New Relic docs.
In addition, to setting attributes.include
and rake.tasks
as defined in their docs, it also seems to be necessary to manually start the agent synchonously in your Rakefile to assure that fast running tasks are reported.
In summary, your newrelic.yml
should include keys and values like this to monitor redshifter rake tasks:
common: &default_settings
#...
attributes:
include: job.rake.* # allows rake args reporting
rake:
# rake task monitoring must be white listed here AND not blacklisted via
# autostart.blacklisted_* config values
tasks: ['redshifter:update', 'redshifter:replace']
#...
and your Rakefile
should end up looking something like this:
# Rakefile
# ...
require 'redshifter/tasks'
# Force agent start assuring rake.tasks listed in newrelic.yml are instrumented
NewRelic::Agent.manual_start(sync_startup: true) if Rails.env.production?
# ...
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
to create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Contributing
- Fork it ( https://github.com/[my-github-username]/redshifter/fork )
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create a new Pull Request