distyll

Suppose that you’re writing code for a project that’s been in production for a long time, and the production database has grown to something on the order of 100 GB. When you’ve finished writing your new feature, you test it on your seeds. All is well. However, how do you know that it will work on production data? How do you know that your seeds accurately reflect the variance (and oddities) present in the production data set?

Distyll attempts to solve this by creating a “recent” subset of the production data set. This could be done naively by taking every record in the database whose created_at is later than a certain time. However, a record created today may have an associated record (via a foreign key) which was created five years ago. If you slice the entire database by created_at timestamps, you’ll have foreign keys that point nowhere. That isn’t very helpful for ensuring that your new feature works on production data.
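To make the problem concrete, here is a minimal sketch in plain Ruby. It does not use distyll or ActiveRecord; the CUSTOMERS and ORDERS tables, their rows, and the slice_by_date helper are all hypothetical stand-ins for production data.

```ruby
require "date"

# Hypothetical in-memory tables standing in for production data.
CUSTOMERS = [
  { id: 1, created_at: Date.new(2009, 3, 1) },
  { id: 2, created_at: Date.new(2014, 5, 20) }
]
ORDERS = [
  { id: 10, customer_id: 1, created_at: Date.new(2014, 6, 1) },
  { id: 11, customer_id: 2, created_at: Date.new(2014, 6, 2) }
]

# Naive slice: keep only rows created on or after the threshold,
# one table at a time, ignoring foreign keys entirely.
def slice_by_date(rows, threshold)
  rows.select { |row| row[:created_at] >= threshold }
end

threshold = Date.new(2014, 1, 1)
recent_customers = slice_by_date(CUSTOMERS, threshold)
recent_orders = slice_by_date(ORDERS, threshold)

# Orders whose customer_id no longer matches any customer in the slice.
kept_customer_ids = recent_customers.map { |customer| customer[:id] }
dangling_orders = recent_orders.reject do |order|
  kept_customer_ids.include?(order[:customer_id])
end

puts dangling_orders.map { |order| order[:id] }.inspect # prints [10]
```

Order 10 is recent, but the customer it belongs to was created in 2009 and falls outside the slice, so its customer_id points nowhere.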

Distyll’s solution is to start from a set of “core” ActiveRecord models supplied at initialization time (plus a date threshold for these models), and only pull the records of those models that have been created since the date threshold. It then traverses all belongs_to relationships from those core models and pulls in all of the related records, regardless of when they were created.

Consequently, you end up with a data set that is representative of production, is internally consistent, and is smaller.
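The approach can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not distyll's actual implementation: the BELONGS_TO map, the PRODUCTION rows, and the distill method are hypothetical, and the sketch assumes the belongs_to traversal is repeated for every newly pulled record so that parents of parents are included too.

```ruby
require "date"
require "set"

# Hypothetical schema: for each table, its belongs_to foreign keys
# as { foreign_key_column => parent_table }.
BELONGS_TO = {
  orders:    { customer_id: :customers },
  customers: { region_id: :regions },
  regions:   {}
}

# Hypothetical production rows, keyed by table name.
PRODUCTION = {
  regions:   [{ id: 100, created_at: Date.new(2008, 1, 1) },
              { id: 101, created_at: Date.new(2008, 1, 1) }],
  customers: [{ id: 1, region_id: 100, created_at: Date.new(2009, 3, 1) },
              { id: 2, region_id: 101, created_at: Date.new(2010, 7, 4) }],
  orders:    [{ id: 10, customer_id: 1, created_at: Date.new(2014, 6, 1) },
              { id: 11, customer_id: 2, created_at: Date.new(2012, 2, 2) }]
}

# Returns { table => [ids] } for the core rows created on or after
# created_since, plus every row reachable through belongs_to keys.
def distill(core_tables, created_since)
  selected = Hash.new { |hash, table| hash[table] = Set.new }
  queue = []

  # Seed with the recent rows of the core tables only.
  core_tables.each do |table|
    PRODUCTION[table].each do |row|
      next if row[:created_at] < created_since
      selected[table] << row[:id]
      queue << [table, row]
    end
  end

  # Follow belongs_to keys; parents are pulled regardless of their age.
  until queue.empty?
    table, row = queue.shift
    BELONGS_TO[table].each do |foreign_key, parent_table|
      parent = PRODUCTION[parent_table].find { |r| r[:id] == row[foreign_key] }
      next if parent.nil?
      # Set#add? returns nil when the id was already selected.
      next unless selected[parent_table].add?(parent[:id])
      queue << [parent_table, parent]
    end
  end

  selected.transform_values(&:to_a)
end

result = distill([:orders], Date.new(2014, 1, 1))
puts result.inspect
```

With orders as the only core table, the result contains order 10, customer 1 (created in 2009), and region 100 (created in 2008). Order 11 and the rows only it depends on are left out.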

Using distyll in your project

  1. Add gem 'distyll' to your Gemfile

  2. Run bundle install

  3. Add a distyll: entry to your database.yml, pointing at the database that will hold the subset

  4. Run rake db:create RAILS_ENV=distyll

  5. Run rake db:schema:load RAILS_ENV=distyll

  6. Run rails console

  7. Call Distyll.new(model_names, created_since), passing it an array of the core model names (as strings) and a date after which core records will be copied.

If you need to clear out the distyll database and try again with different parameters, just go back to step 5 (rake db:schema:load) and continue from there.

Contributing to distyll

  • Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet.

  • Check out the issue tracker to make sure someone hasn’t already requested it and/or contributed it.

  • Fork the project.

  • Start a feature/bugfix branch.

  • Commit and push until you are happy with your contribution.

  • Make sure to add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or a change there is otherwise necessary, that is fine, but please isolate it in its own commit so I can cherry-pick around it.

Next Steps for distyll

  • Distyll only traverses belongs_to associations for now. Need to consider other association types.

  • Distyll is likely to cause problems with single table inheritance. It could probably refer to table names rather than model names when traversing relationships… but this would still be an issue for the base models.

  • Distyll currently performs “IN” queries. In Oracle, an IN list is limited to 1000 values, so I would need to chunk the values for that DBMS.

  • Tests. I know. I just don’t yet have my head around how to test something that’s SO model- and database-centric, when those models and databases aren’t present in the gem. Any advice would be appreciated.
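One possible way to handle the Oracle “IN” limit mentioned above is to split the id list into slices of at most 1000 and issue one query per slice. The sketch below is plain Ruby and is not part of distyll; the ORACLE_IN_LIMIT constant and the in_clauses helper are hypothetical, and it assumes integer ids.

```ruby
# Oracle rejects IN lists with more than 1000 expressions.
ORACLE_IN_LIMIT = 1000

# Builds one "column IN (...)" fragment per slice of at most `limit` ids.
# Integer() raises on anything that is not a valid integer, so arbitrary
# strings never reach the SQL text.
def in_clauses(column, ids, limit = ORACLE_IN_LIMIT)
  ids.each_slice(limit).map do |chunk|
    "#{column} IN (#{chunk.map { |id| Integer(id) }.join(', ')})"
  end
end

clauses = in_clauses("id", (1..2500).to_a)
puts clauses.length # prints 3
```

In a Rails project the same slicing could be combined with ActiveRecord, for example by calling where(id: chunk) once per slice instead of building SQL strings by hand.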

Copyright © 2014 Mason F. Matthews. See LICENSE.txt for further details.