Dreader

A simple DSL built on top of Roo to read and process tabular data (CSV, LibreOffice, Excel).

This gem allows you to:

  1. specify the structure of some tabular data you want to process
  2. debug and check correctness of the data you read
  3. read a file and process it, that is, execute code for each cell and each row of the file

If your data require elaborations which cannot be performed line by line, you can also access all the data read by the gem and manipulate it as you need.

The input data can be in CSV (comma or tab separated), LibreOffice, and Excel.

We use it to import data into Rails application, but the gem can used in any Ruby application.

The gem should be relatively easy to use, despite its name: dread stands for *d*ata *r*eader.

The gem depends on roo, from which we leverage all data reading/parsing facilities and which allows us to achieve what we want in about 250 lines of code.

Installation

Add this line to your application's Gemfile:

gem 'dreader'

And then execute:

$ bundle

Or install it yourself as:

$ gem install dreader

Usage

Declare what file you want to read

Require dreader and declare an instance of the Dreader::Engine class:

require 'dreader'

i = Dreader::Engine.new

Specify parsing option, using the following syntax:

i.options do
  filename 'example.ods'

  sheet 'Sheet 1'

  first_row 1
  last_row 20
end

where:

  • (optional) filename is the file to read. If not specified, you will have to supply a filename when loading the file (see read, below). The extension determines the file type. Use tsv for tab-separated files.
  • (optional) first_row is the first line to read (use 2 if your file has a header)
  • (optional) last_row is the last line to read. If not specified, we will rely on roo to determine the last row
  • (optional) sheet is the sheet name or number to read from. If not specified, the first (default) sheet is used

Declare the columns you want to read

Declare the columns you want to read by assigning them a name and a column reference:

# we will access column A in Ruby code using :name
i.column :name do
  colref 'A'
end

You can also specify two ruby blocks, process and check to preprocess data and to check for errors.

For instance, given the following file:

Name Date of birth
Forest Whitaker July 15, 1961
Daniel Day-Lewis April 29, 1957
Sean Penn August 17, 1960

we could use the following declaration to specify the data to read:

# we want to access column 1 using :name
# :name should be non nil and of length greater than 0
i.column :name do
  colref 1
  check do |x|
    x and x.length > 0
  end
end

# we want to access column 2 (Date of birth) using :birthdate
i.column :birthdate do
  colref 2

  # make sure the column is transformed into an integer
  process do |x|
    Date.parse(x)
  end

  # check age is a date (check is invoked on the value returned
  # by process)
  check do |x|
    x.class == Date
  end
end 

# we don't care about any other column (and, therefore, 
# we are done with our declarations)

If there are different columns that you want to read and process in the same way, you can use the method bulk_declare, which accepts a hash as input.

For instance:

i.bulk_declare {a: 'A', b: 'B'}

is equivalent to:

i.column :a do
  colref 'A'
end

i.column :b do
  colref 'B'
end

The method also accepts a code block, which allows to define a common process function for all columns. In case, don't forget to put the hash in parentheses, or the Ruby parser won't be able to distinguish the hash from the code block. For instance:

i.bulk_declare({a: 'A', b: 'B'}) do
  process do |cell|
    ...
  end
end

There is an example of bulk_declare in the examples directory: (us_cities_bulk_declare.rb).

Remarks:

  1. the column name can be anything ruby can use as a Hash key. You can use symbols, strings, and even object instances, if you wish to do so.

  2. colref can be a string (e.g., 'A') or an integer, in which case the first column is one

  3. you need to declare only the columns you want to import. For instance, we could skip the declaration for column 1, if 'Date of Birth' is the only data we want to import

  4. If process and check are specified, then check will receive the result of invoking process on the cell value. This makes sense if process is used to make the cell value more accessible to ruby code (e.g., transforming a string into an integer).

Add virtual columns, if you want

Sometimes it is convenient to aggregate or otherwise manipulate the data read from each row before doing the actual processing.

For instance, we might have a table with dates of birth, while we are really interested in the age of people.

In such cases, we can use virtual column. A virtual column allows one to add a column to the data read. The value of the column for each row is computed using the values of other cells.

Virtual columns are declared similar to columns. Thus, for instance, the following declaration adds an age column to each row of the data we read from the previous example:

i.virtual_column :age do
  process do |row|
    # `compute_birthday` has to be defined
    compute_birthday(row[:birthdate])
  end
end

Virtual columns are, of course, available to the mapping directive (see below).

Specify how to process data

Finally we can specify how we process lines, using the mapping directive. Mapping takes an arbitrary piece of ruby code, which can reference the fields of a row.

For instance:

i.mapping do |row|
  puts "#{row[:name][:value]} is #{row[:age][:value]} years old"
end

Notice that the data read from each row of our input data is stored in a hash. The hash uses column names as the primary key and stores the values in the :value key.

Start working with the data

We are now all set and we can start working with the data.

First use read or load (synonyms), to read all data and put it into a @table instance variable.

i.read

After reading the file we can use errors to see whether any of the check functions failed:

array_of_strings = i.errors
array_of_strings ech do |error_line|
  puts error_line
end

We can then use virtual_columns to process data and generate the virtual columns:

i.virtual_columns

Finally we can use the process function to execute the mapping directive to each line read from the file.

i.process

Look in the examples directory for further details and a couple of working examples.

Remark. You can override some of the defaults by passing a hash as argument to read. For instance:

i.read filename: another_filepath

will read data from another_filepath, rather than from the filename specified in the options. This might be useful, for instance, if the same specification has to be used for different files.

Digging deeper

If you need to perform more elaborations on the data which cannot be captured with process (that is, by processing the data row by row), you can also directly access all data read, using the table method:

i.read
i.table
# an array of hashes (one hash per row)

More in details, the read method fills a @table instance variable with an array of hashes. Each hash represents a line of the file.

Each hash contains one key per column, following your specification. Its value is, in turn, a hash with the following structure:

{ 
  value: ...,     # the result of calling process on the cell
  row_number: ... # the row number
  col_number: ... # the column number
  error: ...      # the result of calling check on the cell processed value
}

(Note that virtual columns only store value and a Boolean virtual, which is always true.)

Thus, for instance, given the example above:

i.table
[ { name: { value: "John", row_number: 1, col_number: 1, errors: nil },
    age:  { value: 30, row_number: 1, col_number: 2, errors: nil } },
  { name: { value: "Jane", row_number: 2, col_number: 1, errors: nil },
    age:  { value: 31, row_number: 2, col_number: 2, errors: nil } } ]

Simplifying the hash with the data read

The Dreader::Util class provides some functions to simplify and restructure the hashes built by dreader.

Dreader::Util.simplify hash simplifies the hash passed as input by removing all information but the value and making the value accessible directly from the name of the column.

Dreader::Util.simplify i.table[0]
{name: "John", age: 30}

Dreader::Util.slice hash, keys and Dreader::Util.slice hash, keys, where keys is an arrays of keys, are respectively used to select or remove some keys from hash.

i.table[0]
{ name: { value: "John", row_number: 1, col_number: 1, errors: nil },
  age:  { value: 30, row_number: 1, col_number: 2, errors: nil }}

Dreader::Util.slice i.table[0], :name
{name: { value: "John", row_number: 1, col_number: 1, errors: nil}

Dreader::Util.clean i.table[0], :name
{age:  { value: 30, row_number: 1, col_number: 2, errors: nil }

The methods slice and clean are more useful when used in conjuction with simplify:

hash = Dreader::Util.simplify i.table[0]
{name: "John", age: 30}

Dreader::Util.slice hash, [:age]
{age: 30}

Dreader::Util.clean hash, [:age]
{name: "John"}

Notice that the output produced by slice and simplify is a has which can be used to create an ActiveRecord object.

Finally, the Dreader::Util.restructure method helps building hashes to create ActiveModel objects with nested attributes:

hash = {name: "John", surname: "Doe", address: "Unknown", city: "NY" }

Dreader::Util.restructure hash, [:name, :surname], :address_attributes, [:address, :city]
{name: "John", surname: "Doe", address_attributes: {address: "Unknonw", city: "NY"}}

Debugging your specification

If you are not sure about what is going on (like I often am when reading tabular data), you can use the debug function, which prints the current configuration, reads some records from your files, and shows them to standard output:

i.debug
i.debug n: 40 # read 40 lines (from first_row, if the option is declared)
i.debug n: 40, filename: filepath # like above, but read from filepath

Another possibility is getting the value of the @table variable, which contains all the data read.

By default debug invokes the process and check directives. Pass the following options, if you want to disable this behavior; this might be useful, for instance, if you intend to check only what data is read:

i.debug process: false, debug: false

Notice that check implies process, since check is invoked on the output of the process directive.`

Changelog

See [[Changelog]].

Known Limitations

At the moment:

  • it is not possible to specify column references using header names (like Roo does).
  • it is not possible to pass options to the file readers. As a consequence tab-separated files must have the .tsv extension to be correctly parsed.
  • some testing wouldn't hurt.

Known Bugs

No known bugs and an unknown number of unknown bugs.

(See the open issues for the known bugs.)

Development

After checking out the repo, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/avillafiorita/dreader.

License

The gem is available as open source under the terms of the MIT License.