HSQL

"Hash (of data) and SQL" is a library that parses .sql files with YAML front matter. This allows analysts and other non-developers to write and develop ETLs without having to write source code but still giving them the power of specifying variables to interpolate into the SQL and other metadata that the program executing the SQL can use.

How to use this

Rather than specifying variables and metadata for a set of database queries in a .rb,.py or other programming language source file the queries should be written to a .sql file directly.

# filename: daily_summary.sql
owner: jackdanger
schedule: hourly
data:
  production:
    output_table: summaries
    update_condition:
  development:
    output_table: jackdanger_summaries
    update_condition: WHERE 1 <> 1
---
INSERT INTO {{{output_table}}} SELECT * FROM interesting_information;
UPDATE summaries_performed SET complete = 1 {{{update_condition}}};

The above is a SQL file and any text editor will allow analysts to use code completion and syntax highlighting for their queries.

The data hash in the YAML front matter lists a set of variables, by environment, that can be interpolated into the SQL queries. To render the queries an environment must be provided.

$ hsql daily_summary.sql development
USE some_database;
INSERT INTO jackdanger_summaries SELECT * FROM interesting_information;
UPDATE summaries_performed SET complete = 1 WHERE 1 <> 1;

The hsql command-line utility allows these SQL source files to be easily run in the context of some other application that understands when and where to execute the queries.

To access the metadata directly there is a simple programmatic API:

>> file = HSQL.parse_file('./daily_summary.sql', 'development')
>> file.queries
=> [
  "USE some_database;",
  "INSERT INTO jackdanger_summaries SELECT * FROM interesting_information;",
  "UPDATE summaries_performed SET complete = 1 WHERE 1 <> 1;"
]

The object returned from HSQL.parse_file provides you access to both the rendered queries and the data specified in the front matter. You can use this to schedule the queries, to run them and send failure notices to a list of watchers. It's a general-purpose store of data for the query author to configure whatever system you use to run ETL queries.

>> file = HSQL.parse_file('./daily_summary.sql', 'development')
>> file.yaml
=> {
  'owner' => 'jackdanger',
  'schedule' => 'hourly',
  'data' => {
    'production' => {
      'output_table' => 'summaries',
      'update_condition' => nil,
    },
    'development' => {
      'output_table' => 'jackdanger_summaries',
      'update_condition' => 'WHERE 1 <> 1',
    },
  }
}