rETL
Ruby Extract Transform Load
rETL is a gem with a rich DSL for ETL (extract, transform, load) projects in Ruby.
Transforming Data
The core construct for transforming data is called a Path. A path describes how data should be transformed. rETL has a rich DSL that is intended to work on Hash objects, but it is not limited strictly to Hashes.
Defining a Path
A path is composed of Steps. Each step is intended to act on a single unit of data that is passed into a block (like a row from a database query). The steps are executed in the order they are defined. In other words, the data passes through each step until it is completely transformed. In order to keep the Path easy to understand and maintain, each step should be responsible for a single action on a row.
This path will filter
data by :first_name
, and then transform
it by adding
a :full_name
key.
my_path = Retl::Path.new do
filter do |row|
row[:first_name] == "David"
end
transform do |row|
row[:full_name] = row[:first_name] + " " + row[:last_name]
end
end
Once a path is defined, it can be used to transform data with the
#transform(enumerable, options={})
method. A Transformation
object is
returned, which is Enumerable
.
data = [
{ first_name: "David" , last_name: "Biehl" },
{ first_name: "Indiana", last_name: "Jones" }
]
result = my_path.transform(data)
result.to_a
#=> [ { full_name: "David", last_name: "Biehl", full_name: "David Biehl" } ]
Available Steps
There are several steps available in the rETL DSL that will transform data in different ways.
Transform
The transform
step will mutate the data passed into the block. The return
value of a transform
step is ignored. See the :full_name
example above.
Replace
The replace
step will replace the data with the return value of the block. A
common use is to use replace
as the first step to convert incoming objects
into hashes.
users = [
["David", "Biehl", 33],
["Indiana", "Jones", 50]
]
my_path = Retl::Path.new do
replace do |row|
{ first_name: row[0], last_name: row[1], age: row[2] }
end
# perform other steps with the hash
end
result = my_path.transform(users)
result.to_a
#=> [
#=> {:first_name=>"David", :last_name=>"Biehl", :age=>33},
#=> {:first_name=>"Indiana", :last_name=>"Jones", :age=>50}
#=> ]
Filter & Reject
The filter
step uses a predicate block to determine if the data should proceed
to the next step. Conversely, reject
will discard data if the predicate is
truthy. select
is an alias for filter
.
data = [
{name: "David" , age: 33},
{name: "Indiana", age: 50},
{name: "Sully" , age: 7},
{name: "Boo" , age: 3}
]
my_path = Retl::Path.new do
transform do |row|
row[:adult_or_child] = row[:age] >= 18 ? "adult" : "child"
end
reject do |row|
row[:adult_or_child] == "child"
end
select do |row| # `select` is an alias of `filter`
row[:age].odd?
end
end
result = my_path.transform(data)
result.to_a
#=> [{:name=>"David", :age=>33, :adult_or_child=>"adult"}]
Calculate
The calculate
step will calculate a single key on a Hash with the return value
of the block. For example, calculate
can be used instead of transform
from
the first example. calc
is a short-hand alias for calculate
.
my_path = Retl::Path.new do
calculate(:full_name) do |row|
row[:first_name] + " " + row[:last_name]
end
end
Inspect
inspect
steps cannot change the incoming data, and the return value of the
block is ignored. These steps are intended for debugging, testing or logging
purposes.
# rspec example
it "adds a full name" do
rspec = self
my_path = Retl::Path.new do
calc(:full_name) do |row|
row[:first_name] + " " + row[:last_name]
end
inspect do |row|
rspec.expect(row).to rspec.include(full_name: "Indiana Jones")
end
end
my_path.transform([ { first_name: "Indiana", last_name: "Jones"} ]).to_a
end
Fork
Paths can be forked for alternate results with the fork(name)
step. The forked
data can then be accessed on the result with the #forks(name)
method. Forks
are unaffected by any steps that take place after the fork is defined.
data = [
{ name: "David" , age: 33 },
{ name: "Indiana", age: 50 },
{ name: "Sully" , age: 7 },
{ name: "Boo" , age: 3 }
]
my_path = Retl::Path.new do
transform do |row|
row[:adult_or_child] = row[:age] >= 18 ? "adult" : "child"
end
fork(:adults) do
filter do |row|
row[:adult_or_child] == "adult"
end
end
fork(:children) do
filter do |row|
row[:adult_or_child] == "child"
end
end
reject do |row|
true # oops, rejecting everything after the forks.
end
end
result = my_path.transform(data)
result.forks(:adults).to_a
#=> [ { name: "David", ... }, { name: "Indiana", ... } ]
result.forks(:children).to_a
#=> [ { name: "Sully", ... }, { name: "Boo", ... } ]
result.to_a
#> [ ]
Explode
The explode
step adds additional data to the Path. The return value of the
block should respond to #each
, like an Array.
my_path = Retl::Path.new do
explode do |number|
number.times.map { |x| x + x + x }
end
filter do |number|
number.odd?
end
end
result = my_path.transform([6])
result.to_a
#=> [3, 9, 15]
expect(result.to_a).to eq([3, 9, 15])
Path Reuse
In rETL, paths can be re-used with the path
step. Common Paths can be
defined to ensure that calculations yield consistent results throughout the
entire ETL project. Consistent data and meanings will make the data warehouse
easier to understand for data consumers.
AdultOrChild = Retl::Path.new do
calculate(:adult_or_child) do |row|
row[:age] >= 18 ? "adult" : "child"
end
end
my_path = Retl::Path.new do
path AdultOrChild
### perform other steps, if necessary
end
result = my_path.transform([{age: 3}])
result.to_a
#=> [ { age: 3, adult_or_child: "child" } ]
Dependencies
Dependencies can be defined with the depends_on(name)
. The value of the
dependency is accessible inside of each step by its name. In this example, we'll
define an age_lookup
dependency.
my_path = Retl::Path.new do
depends_on(:age_lookup) do # hint: the block returns the default value
{
"adult" => "Adults are 18 or older",
"child" => "Children are younger than 18"
}
end
path AdultOrChild # see previous example
calculate(:age_description) do |row|
age_lookup[row[:adult_or_child]]
end
end
result = my_path.transform([{age: 19}])
result.to_a
Dependency Injection
Dependencies can also be injected when the transformation takes place. This is useful for testing by passing in mocks or stubs. Also, concrete results from other paths can be merged merged into a single path making data integration possible.
my_path = Retl::Path.new do
depends_on(:age_lookup) # hint: without a block, the dependency must
# be provided when #transform is called
path AdultOrChild
calculate(:age_description) do |row|
age_lookup[row[:adult_or_child]]
end
end
age_lookup_hash = {
"adult" => "Adults are 18 or older",
"child" => "Children are younger than 18"
}
result = my_path.transform([{age: 4}], age_lookup: age_lookup_hash)
result.to_a
#=> [ { age: 4, adult_or_child: "child", age_description: "Children are younger than 18" } ]
Path Reuse with Dependencies
The AdultOrChild
path above isn't very flexible. It depends on an :age
key
to be present in the hash. What if our data uses a different key, like
:years_since_birth
? rETL can make this more flexible by adding dependencies.
FlexibleAdultOrChild = Retl::Path.new do
depends_on(:from)
depends_on(:to) do ||
[:to] || :adult_or_child # use a default value
end
transform do |row|
row[to] = row[from] >= 18 ? "adult" : "child"
end
end
path_with_age = Retl::Path.new do
path FlexibleAdultOrChild, from: :age
end
path_with_age_result = path_with_age.transform([{age: 33}])
path_with_age_result.to_a
#=> [{age: 33, adult_or_child: "adult"}]
path_with_years_since_birth = Retl::Path.new do
path FlexibleAdultOrChild do
{ from: :years_since_birth, to: :age_classification } # blocks work too
end
end
result = path_with_years_since_birth.transform([{years_since_birth: 7}])
result.to_a
#=> [{years_since_birth: 7, age_classification: "child"}]
Roadmap
Currently the rETL gem's strengths are transforming data and code reuse. However this is only one part of an ETL project. I haven't even started on extracting or loading. Fortunately, the contract for transformation is very simple.
path.transform(Enumerable)
#=> Enumerable
Enumerales in, Enumerales out. This makes the application of the gem pretty much universal for any type of data transformation requirement in Ruby.
Next Steps
- Error handling
- Tracing and logging
- Extract patterns
- Load patterns
Installation
Add this line to your application's Gemfile:
gem 'retl'
And then execute:
$ bundle
Or install it yourself as:
$ gem install retl
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/davidbiehl/retl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the MIT License.