Dreader
A simple DSL built on top of Roo to read and process tabular data (CSV, LibreOffice, Excel).
This gem allows you to:
- specify the structure of some tabular data you want to process
- debug and check correctness of the data you read
- read a file and process it, that is, execute code for each cell and each row of the file
If your data require elaborations which cannot be performed line by line, you can also access all the data read by the gem and manipulate it as you need.
The input data can be in CSV (comma or tab separated), LibreOffice, and Excel.
We use it to import data into Rails application, but the gem can used in any Ruby application.
The gem should be relatively easy to use, despite its name: dread stands for *d*ata *r*eader.
The gem depends on roo, from which we leverage all data
reading/parsing facilities and which allows us to achieve what we want
in about 250 lines of code.
Installation
Add this line to your application's Gemfile:
gem 'dreader'
And then execute:
$ bundle
Or install it yourself as:
$ gem install dreader
Usage
Declare what file you want to read
Require dreader and declare an instance of the Dreader::Engine class:
require 'dreader'
i = Dreader::Engine.new
Specify parsing option, using the following syntax:
i. do
filename 'example.ods'
sheet 'Sheet 1'
first_row 1
last_row 20
end
where:
- (optional)
filenameis the file to read. If not specified, you will have to supply a filename when loading the file (seeread, below). The extension determines the file type. Usetsvfor tab-separated files. - (optional)
first_rowis the first line to read (use2if your file has a header) - (optional)
last_rowis the last line to read. If not specified, we will rely onrooto determine the last row - (optional)
sheetis the sheet name or number to read from. If not specified, the first (default) sheet is used
Declare the columns you want to read
Declare the columns you want to read by assigning them a name and a column reference:
# we will access column A in Ruby code using :name
i.column :name do
colref 'A'
end
You can also specify two ruby blocks, process and check to
preprocess data and to check for errors.
For instance, given the following file:
| Name | Date of birth |
|---|---|
| Forest Whitaker | July 15, 1961 |
| Daniel Day-Lewis | April 29, 1957 |
| Sean Penn | August 17, 1960 |
we could use the following declaration to specify the data to read:
# we want to access column 1 using :name
# :name should be non nil and of length greater than 0
i.column :name do
colref 1
check do |x|
x and x.length > 0
end
end
# we want to access column 2 (Date of birth) using :birthdate
i.column :birthdate do
colref 2
# make sure the column is transformed into an integer
process do |x|
Date.parse(x)
end
# check age is a date (check is invoked on the value returned
# by process)
check do |x|
x.class == Date
end
end
# we don't care about any other column (and, therefore,
# we are done with our declarations)
If there are different columns that you want to read and process in
the same way, you can use the method bulk_declare, which accepts a
hash as input.
For instance:
i.bulk_declare {a: 'A', b: 'B'}
is equivalent to:
i.column :a do
colref 'A'
end
i.column :b do
colref 'B'
end
The method also accepts a code block, which allows to define a common
process function for all columns. In case, don't forget to put
the hash in parentheses, or the Ruby parser won't be able to
distinguish the hash from the code block. For instance:
i.bulk_declare({a: 'A', b: 'B'}) do
process do |cell|
...
end
end
There is an example of bulk_declare in the examples directory:
(us_cities_bulk_declare.rb).
Remarks:
the column name can be anything ruby can use as a Hash key. You can use symbols, strings, and even object instances, if you wish to do so.
colrefcan be a string (e.g.,'A') or an integer, in which case the first column is oneyou need to declare only the columns you want to import. For instance, we could skip the declaration for column 1, if 'Date of Birth' is the only data we want to import
If
processandcheckare specified, thencheckwill receive the result of invokingprocesson the cell value. This makes sense if process is used to make the cell value more accessible to ruby code (e.g., transforming a string into an integer).
Add virtual columns, if you want
Sometimes it is convenient to aggregate or otherwise manipulate the data read from each row before doing the actual processing.
For instance, we might have a table with dates of birth, while we are really interested in the age of people.
In such cases, we can use virtual column. A virtual column allows one to add a column to the data read. The value of the column for each row is computed using the values of other cells.
Virtual columns are declared similar to columns. Thus, for instance,
the following declaration adds an age column to each row of the data
we read from the previous example:
i.virtual_column :age do
process do |row|
# `compute_birthday` has to be defined
compute_birthday(row[:birthdate])
end
end
Virtual columns are, of course, available to the mapping directive
(see below).
Specify how to process data
Finally we can specify how we process lines, using the mapping
directive. Mapping takes an arbitrary piece of ruby code, which can
reference the fields of a row.
For instance:
i.mapping do |row|
puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
end
Notice that the data read from each row of our input data is stored in
a hash. The hash uses column names as the primary key and stores
the values in the :value key.
Start working with the data
We are now all set and we can start working with the data.
First use read or load (synonyms), to read all data and put it
into a @table instance variable.
i.read
After reading the file we can use errors to see whether any of the
check functions failed:
array_of_strings = i.errors
array_of_strings ech do |error_line|
puts error_line
end
We can then use virtual_columns to process data and generate the
virtual columns:
i.virtual_columns
Finally we can use the process function to execute the mapping
directive to each line read from the file.
i.process
Look in the examples directory for further details and a couple of working examples.
Remark. You can override some of the defaults by passing a hash as argument to read. For instance:
i.read filename: another_filepath
will read data from another_filepath, rather than from the filename
specified in the options. This might be useful, for instance, if the
same specification has to be used for different files.
Digging deeper
If you need to perform more elaborations on the data which cannot be
captured with process (that is, by processing the data row by row),
you can also directly access all data read, using the table method:
i.read
i.table
# an array of hashes (one hash per row)
More in details, the read method fills a @table instance variable
with an array of hashes. Each hash represents a line of the file.
Each hash contains one key per column, following your specification. Its value is, in turn, a hash with the following structure:
{
value: ..., # the result of calling process on the cell
row_number: ... # the row number
col_number: ... # the column number
error: ... # the result of calling check on the cell processed value
}
(Note that virtual columns only store value and a Boolean virtual,
which is always true.)
Thus, for instance, given the example above:
i.table
[ { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
age: { value: 30, row_number: 1, col_number: 2, errors: nil } },
{ name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
age: { value: 31, row_number: 2, col_number: 2, errors: nil } } ]
Simplifying the hash with the data read
The Dreader::Util class provides some functions to simplify and
restructure the hashes built by dreader.
Dreader::Util.simplify hash simplifies the hash passed as input by
removing all information but the value and making the value
accessible directly from the name of the column.
Dreader::Util.simplify i.table[0]
{name: "John", age: 30}
Dreader::Util.slice hash, keys and Dreader::Util.slice hash,
keys, where keys is an arrays of keys, are respectively used to
select or remove some keys from hash.
i.table[0]
{ name: { value: "John", row_number: 1, col_number: 1, errors: nil },
age: { value: 30, row_number: 1, col_number: 2, errors: nil }}
Dreader::Util.slice i.table[0], :name
{name: { value: "John", row_number: 1, col_number: 1, errors: nil}
Dreader::Util.clean i.table[0], :name
{age: { value: 30, row_number: 1, col_number: 2, errors: nil }
The methods slice and clean are more useful when used in
conjuction with simplify:
hash = Dreader::Util.simplify i.table[0]
{name: "John", age: 30}
Dreader::Util.slice hash, [:age]
{age: 30}
Dreader::Util.clean hash, [:age]
{name: "John"}
Notice that the output produced by slice and simplify is a has
which can be used to create an ActiveRecord object.
Finally, the Dreader::Util.restructure method helps building hashes
to create
ActiveModel
objects with nested attributes:
hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }
Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
{name: "John", surname: "Doe", address_attributes: {address: "Unknonw", city: "NY"}}
Debugging your specification
If you are not sure about what is going on (like I often am when
reading tabular data), you can use the debug function, which prints
the current configuration, reads some records from your files, and
shows them to standard output:
i.debug
i.debug n: 40 # read 40 lines (from first_row, if the option is declared)
i.debug n: 40, filename: filepath # like above, but read from filepath
Another possibility is getting the value of the @table variable,
which contains all the data read.
By default debug invokes the process and check directives. Pass
the following options, if you want to disable this behavior; this
might be useful, for instance, if you intend to check only what data
is read:
i.debug process: false, debug: false
Notice that check implies process, since check is invoked on the
output of the process directive.`
Changelog
See [[Changelog]].
Known Limitations
At the moment:
- it is not possible to specify column references using header names (like Roo does).
- it is not possible to pass options to the file readers. As a
consequence tab-separated files must have the
.tsvextension to be correctly parsed. - some testing wouldn't hurt.
Known Bugs
No known bugs and an unknown number of unknown bugs.
(See the open issues for the known bugs.)
Development
After checking out the repo, run bin/setup to install dependencies. You can
also run bin/console for an interactive prompt that will allow you to
experiment.
To install this gem onto your local machine, run bundle exec rake
install. To release a new version, update the version number in version.rb,
and then run bundle exec rake release, which will create a git tag for the
version, push git commits and tags, and push the .gem file to
rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/avillafiorita/dreader.
License
The gem is available as open source under the terms of the MIT License.