DataJanitor

DataJanitor allows you to run your in-application Active Record validations, as well as additional data audit validations, across all records in a table or database at will. This is particularly helpful when evolving validations and finding which records will no longer pass validation, as well as for periodically performing more extensive audit validations without the real-time cost. Additional validations can be written to run only during the audit, or can also be run during create. This allows time to migrate existing data to the new validation requirements while ensuring new data meets the current validation standards.

DataJanitor also augments your ActiveRecord models (at rake-task runtime) to allow for running project-wide common validations on the data. Thus, it will look at all models in your repository and tell you whether any data was stored that potentially violates the common project data formats. This can happen if a model does not have enough validations of its own or if its validations are not strict enough.

Installation

Add this line to your application's Gemfile:

gem 'data_janitor'

And then execute:

$ bundle

Or install it yourself as:

$ gem install data_janitor

Usage

Custom Model Level Validations

ActiveRecord validations for Audit Only

class SomeModel < ActiveRecord::Base
  extend DataJanitor::AuditValidatable

  dj_audit_validations do
    # Desired validations
    validates :country, inclusion: { in: ['US', 'AU', 'NZ'] }
  end
end

These validations run only when the ActiveRecord validation context :dj_audit is given, as in rec.invalid?(:dj_audit), so they will normally be run only by the DJ rake tasks.

ActiveRecord validations for Audit and Newly Created Records

class SomeModel < ActiveRecord::Base
  extend DataJanitor::AuditValidatable

  dj_validations do
    validates :name, length: { maximum: 25 }
  end
end

These validations run when a record is created and when the ActiveRecord validation context :dj_audit is given, as in rec.invalid?(:dj_audit), so they are run by the application as well as by the DJ rake tasks.

Rake Tasks

To audit the data defined by the ActiveRecord models in your repository:

rake data_janitor:audit
rake data_janitor:audit['some/file/path.json']
rake data_janitor:audit['some/file/path.json',true]

This will audit your DB for errors and write them to tmp/data_janitor_results.json (by default) or to the specified path. The report will contain a list of errors with IDs of invalid records for each model. Passing true as the second argument also prints the output to the console.
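Once a report exists, a short script can summarize it. The sketch below is only illustrative: the exact JSON layout is not documented here, so it ASSUMES each model name maps to a hash of error message to an array of invalid record IDs. The invalid_counts helper and the sample data are hypothetical; adjust them to the layout your report actually has.

```ruby
require "json"

# ASSUMPTION: report layout is
#   { "ModelName" => { "error message" => [record ids, ...] } }
# The real layout produced by the rake task may differ.
def invalid_counts(report)
  report.to_h do |model, errors|
    # A record can fail several validations, so count unique IDs.
    [model, errors.values.flatten.uniq.size]
  end
end

# Hypothetical sample standing in for tmp/data_janitor_results.json.
sample = JSON.parse(<<~JSON)
  {
    "SomeModel": {
      "Country is not included in the list": [3, 7],
      "Name is too long (maximum is 25 characters)": [7]
    }
  }
JSON

invalid_counts(sample).each do |model, count|
  puts "#{model}: #{count} invalid record(s)"
end
```

To run this against a real report, replace the sample with JSON.parse(File.read("tmp/data_janitor_results.json")).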

You can also audit a specific model rather than all models found in the repository:

rake data_janitor:audit_model[SomeModel]

To apply common fixes to all models in your repository:

rake data_janitor:cleanse

To apply common fixes to just one model:

rake data_janitor:cleanse[SomeModel]

This will apply all the fixes that do not require semantic analysis of the data (e.g., replacing nil values with "" for strings).

DataJanitor has an experimental ability to perform some built-in type checks. Only a small part of this is implemented, and it currently tends to be noisy when enabled, so it defaults to off. Each audit command takes an option string that defaults to 'no-type-checks' and can be set to other colon-separated values to turn off specific checks. The values are: type-checks (or any other string not listed here), no-type-checks, no-boolean, no-decimal, no-float, no-integer, no-string, no-text, no-array.

For example:

rake data_janitor:audit_model[SomeModel,no-string:no-boolean]
rake data_janitor:audit[tmp/out.json,false,false,no-string:no-boolean]
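One way to read the option string is sketched below. This is a hypothetical helper, not DataJanitor's actual parser: it assumes that no-type-checks disables everything, that any other string turns type checks on, and that each no-<type> value switches off just that type. The type names are taken from the values listed above.

```ruby
require "set"

# Type names taken from the documented no-<type> values.
CHECKABLE_TYPES = %w[boolean decimal float integer string text array].freeze

# Hypothetical helper (an assumption, not the gem's implementation):
# returns the set of column types that would still be type-checked.
def enabled_type_checks(option_string = "no-type-checks")
  tokens = option_string.to_s.split(":")
  return Set.new if tokens.include?("no-type-checks")

  # Collect the types named by no-<type> tokens; other tokens are ignored.
  disabled = tokens.filter_map { |t| t.delete_prefix("no-") if t.start_with?("no-") }
  Set.new(CHECKABLE_TYPES) - disabled
end

p enabled_type_checks.to_a                         # default: nothing checked
p enabled_type_checks("type-checks").to_a          # every type checked
p enabled_type_checks("no-string:no-boolean").to_a # all but string and boolean
```

Under this reading, the audit_model example above would run type checks for every listed type except string and boolean.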

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/westfield/data_janitor.