distyll

Suppose that you’re writing code for a project that’s been in production for a long time. All said and done, the production database size is on the order of 100 GB. When you’ve finished writing your new feature, you test it on your seeds. All is well. However, how do you know that it will work on production data? How do you know that your seeds accurately reflect the variance (and oddities) present in the production data set?

Distyll attempts to solve this by creating a “recent” subset of the production data set. This could be done naively by taking all of the records across the whole database with a created_at above a certain time. However, a record created today may have an associated record (via a foreign key) which was created five years ago. If you slice the entire database by created_at timestamps, you’ll have foreign keys which point nowhere. Not very helpful for ensuring that your new feature works on production data.

Distyll’s solution is to start from a set of “core” ActiveRecord models supplied at runtime (plus a date threshold for these models), and only pull those that have been created since the date threshold. It then traverses associations from the core models and pulls in associated records (details in the next section).

Consequently, you end up with a data set that is representative of production, is internally consistent, and is smaller.

Associations and associated records

Distyll starts from the core models and stores the ids of all core records with created_at timestamps after the date threshold. It also marks these core models as include_has_many. This means that has_many and has_and_belongs_to_many relationships SHOULD be traversed out of the model. Distyll then counts all the pertinent ids across all pertinent models and starts a loop.

At each step of the loop, for each model in distyll’s list of pertinent models, it reaches out via all belongs_to associations (and other associations if the model is marked as include_has_many). Using these associations, it expands both (a) the set of pertinent models to be copied and (b) the pertinent ids for each of the pertinient models. In each step, it is only reaching out to associated models for pertinent ids which are new since the previous loop. This cuts down on query time.

There is a trick with has_many association traversal. The core models do allow for traversal “down” (into has_many and habtm associations), and any direct descendants of the core models continue traversal down. However, if a descendant’s associations are then followed “up” (through a belongs_to), then that parent cannot subsequently traverse down. In essence, “include_has_many” is contagious down, but not up.

Distyll also ignores “through” relationships. It assumes that those associations will be covered via the multi-step non-through traversal.

The looping stops when the total number of ids has not increased since the previous loop. Iterating with a while was necessary due to self-referrential joins. They prevent the use of a clean recursive algorithm.

Using distyll in your project

  1. Add gem 'distyll' to your gemfile

  2. Run bundle install

  3. Add a distyll: database to your database.yml

  4. Run rake db:create RAILS_ENV=distyll

  5. Run rake db:schema:load RAILS_ENV=distyll

  6. Run rails console

  7. Call Distyll.run(model_names, created_since), passing it an array of the core models and a

date after which core records will be copied.

If you need to clear out the distyll database and try again with different parameters, just go back to the schema:load step and continue from there.

Contributing to distyll

  • Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet.

  • Check out the issue tracker to make sure someone already hasn’t requested it and/or contributed it.

  • Fork the project.

  • Start a feature/bugfix branch.

  • Commit and push until you are happy with your contribution.

  • Make sure to add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or

is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.

Next Steps for distyll

  • When finding pertinent ids across polymorphic has_many and habtm associations, Distyll currently ignores

the _type field, which may cause it to save more ids from the associated table than needed (if the foreign key matches some other record in a different table but the _type doesn’t match the current model.)

  • Currently performs “IN” queries. In Oracle, this is limited to 1000 values, so I would need to chunk

them for that DBMS.

  • Tests. I know. I just don’t yet have my head around how to test something that’s SO model- and

database-centric, when those models and databases aren’t present in the gem. Any advice would be appreciated.

Copyright © 2014 Mason F. Matthews. See LICENSE.txt for further details.