Redshifter

Provides a resque job and rake tasks to facilitate ETL (Extract Transform Load) processing of Postgres tables for export to a Redshift cluster.

Specifically provides: 1) Create/Replace job to replace all model data in Redshift 2) Update job updates all records created or updated since the last update ran

Limitations:

  • deleted records are NOT synced to Redshift by the update job

Feature Roadmap:

  • store last runtime locally instead of using potentially costly redshift query

Installation

Add this line to your application's Gemfile:

gem 'redshifter'

And then execute:

$ bundle

Or install it yourself as:

$ gem install redshifter

Usage

Setup Redshifter in a Rails initializer

# config/initializers/redshifter.rb
Redshifter.setup do |config|
  # path in your app available for writing temp files
  config.temp_directory_path = File.expand_path('../../tmp', __FILE__)
  # path to redshifter table config
  config.table_config_path = File.expand_path('../redshfiter.rb', __FILE__)
  # redshift user should have access to create tables in the specified schema
  config.redshift_username = 'your_app_user'
  config.redshift_password = 'p@ssw0rd'
  config.redshift_host = 'app.host.without.protocol.com'
  config.redshift_port = 5439
  config.redshift_database = 'database_name'
  config.redshift_schema = 'a_schema'
  # AWS user should be allowed full access to the specified bucket
  config.aws_access_key_id = '<AWS user access key ID>'
  config.aws_secret_access_key = '<AWS user secret access key>'
  config.s3_bucket = 'a_redshifter_bucket'
  config.s3_bucket_region = 'us-east-1'
  # Heroku user must be a member of the app and have privileges to
  # start new dynos
  config.heroku_api_key = '<Heroku user api key>'
  config.heroku_app_name = 'name of the app on heroku'
  config.logger = Rails.logger
end

Require Redshifter tasks in your Rakefile

# Rakefile
# ...
require 'redshifter/tasks'
# ...

Create a config file describing the tables, columns and transforms for export to redshift

# config/redshifter.rb
Redshifter.config.tables = {
  'books_with_export_at' => {
    # [required] Source *table* name, not the Rails model name
    source_table_name: 'books',
    # [required] Prefixing your redshift table with its source is recommended
    redshift_table_name: 'app_name_books',
    # [optional] Provide a conditional to specify which rows get exported to Redshift
    source_table_filter: 'title IS NOT NULL',
    # [required] Columns with Redshift datatypes to create; may differ from source DB
    redshift_columns: {
      'id' => 'INTEGER',
      'title' => 'VARCHAR(128)',
      'json_value' => 'BOOLEAN',
      'published_at' => 'TIMESTAMP',
      'updated_at' => 'TIMESTAMP',
      'exported_at' => 'TIMESTAMP'
    },
    # [optional] SQL statements to transform or populate redshift columns from
    # source DB. By default, redshift columns will be populated from source
    # column with the same name. Column key must exist in redshift_columns.
    # If a matching source column does not exist you MUST specify it here.
    #
    # Extract json values as text (using ->>), then cast non-strings to appropriate type
    source_column_transforms: {
      'title' => "lower(title)",
      'json_value' => "cast(json_column->>'json_key' as BOOLEAN)",
      'published_at' => 'first_edition_published_at',
      'exported_at' => 'now()'
    },
    # [required] valid values: KEY, EVEN, ALL
    redshift_distribution_style: 'KEY',
    # [required, if redshift_distribution_style: 'KEY'] distribution key column
    # name MUST be present in redshift_columns.keys
    redshift_distribution_key: 'id',
    # [optional] valid values: COMPOUND, INTERLEAVED; If omitted the Redshift
    # table DDL statement will not specify the sort style and Redshift will
    # implicitly default to COMPOUND style.
    redshift_sort_style: 'INTERLEAVED',
    # [required] Column names MUST be present in redshift_columns.keys; Max
    # length of 8 when using INTERLEAVED sort style, and 400 when using
    # COMPOUND sort style.
    redshift_sort_keys: ['published_at'],
    # [optional] Used for query planning in Redshift
    redshift_primary_key: 'id'
  }
}

Run redshifter:replace rake task for each table you want to export

$ rake redshifter:replace[books_with_export_at]

Schedule a Redshifter::Job::UpdateRedshiftTableJob resque job per each table you want to export updates for

Then schedule this meta job to run in resque_schedule.yml to run once at 10:00pm

# config/resque_schedule.yml

etl_books_to_redshift:
  cron: "0 22 * * *"
  class: 'Redshifter::Job::UpdateRedshiftTableJob'
  args: 'books_with_export_at'
  description: 'Export the books table to Redshift'

Monitoring Rake tasks with New Relic (optional)

New Relic offers Rake task instrumentation as of version 3.13.0 of their newrelic_rpm agent. Redshifter does not directly use or depend on newrelic_rpm. You must use New Relic's rake instrumentation and explicitly identify the tasks in your app that you want to monitor. See New Relic docs.

In addition, to setting attributes.include and rake.tasks as defined in their docs, it also seems to be necessary to manually start the agent synchonously in your Rakefile to assure that fast running tasks are reported.

In summary, your newrelic.yml should include keys and values like this to monitor redshifter rake tasks:

common: &default_settings
  #...
  attributes:
    include: job.rake.*  # allows rake args reporting
  rake:
    # rake task monitoring must be white listed here AND not blacklisted via
    #   autostart.blacklisted_* config values
    tasks: ['redshifter:update', 'redshifter:replace']
#...

and your Rakefile should end up looking something like this:

# Rakefile
# ...
require 'redshifter/tasks'

# Force agent start assuring rake.tasks listed in newrelic.yml are instrumented
NewRelic::Agent.manual_start(sync_startup: true) if Rails.env.production?
# ...

Development

After checking out the repo, run bin/setup to install dependencies. Then, run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release to create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

  1. Fork it ( https://github.com/[my-github-username]/redshifter/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request