Pikelet
A pikelet is a small, delicious pancake popular in Australia and New Zealand. Also, the stage name of Australian musician Evelyn Morris. Also, a simple flat-file database parser capable of dealing with files containing heterogeneous records. Somehow you've wound up at the GitHub page for the last one.
The reason I built Pikelet was to handle "HOT" files as described in the IATA BSP Data Interchange Specifications handbook. These are essentially flat-file databases consisting of a number of different fixed-width record types. Each record type has a different structure, though some types share common fields, and all types have a type signature.
However, Pikelet will also handle more typical flat-file databases consisting of homogeneous records, and it works just as well with CSV files as with fixed-width records.
Installation
Add this line to your application's Gemfile:
gem 'pikelet'
And then execute:
$ bundle
Or install it yourself as:
$ gem install pikelet
Usage
The simple case: homogeneous records
Let's say our file is a simple list of first and last names with each field being 10 characters in width, padded with spaces (vertical pipes used to indicate field boundaries).
|Nicolaus |Copernicus|
|Tycho |Brahe |
We can describe this format using Pikelet as follows:
definition = Pikelet.define do
first_name 0...10
last_name 10...20
end
Each field is described with a field name and a range describing the field
boundaries. You can use either the end-inclusive (..) or end-exclusive
(...) form of range literals. I prefer the exclusive form for this.
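For example, these two field definitions cover exactly the same 10 columns:

first_name 0..9     # end-inclusive
first_name 0...10   # end-exclusive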
Parsing the data is as simple as this:
definition.parse(data)
data is assumed to be an enumerable object yielding successive lines from
your file. For instance, you could do something like this:
records = definition.parse(IO.readlines(filepath))
or this:
records = File.open(filepath, 'r') do |f|
definition.parse(f)
end
parse returns an enumerator, which you can either iterate over, or convert
to an array, or whatever else you people do with enumerators. In any case,
what you'll end up with is a series of Structs like this:
#<struct first_name="Nicolaus", last_name="Copernicus">,
#<struct first_name="Tycho", last_name="Brahe">
A more complex case: heterogeneous records
Now let's say we're given a file consisting of names and addresses. Each record contains a 4-character type signature: 'NAME' for names, 'ADDR' for addresses:
|NAME|Nicolaus |Copernicus|
|ADDR|123 South Street |Nowhereville |45678Y |Someplace |
We can describe it as follows:
Pikelet.define do
type_signature 0...4
record "NAME" do
first_name 4...14
last_name 14...24
end
record "ADDR" do
street_address 4...24
city 24...44
postal_code 44...54
state 54...74
end
end
Note that the type signature is described as a field like any other, but it
must have the name type_signature.
Each record type is described using record statements, which take the
record's type signature as a parameter and a block describing its fields.
When we parse the data, we end up with this:
#<struct
type_signature="NAME",
first_name="Nicolaus",
last_name="Copernicus">,
#<struct
type_signature="ADDR",
street_address="123 South Street",
city="Nowhereville",
postal_code="45678Y",
state="Someplace">
Handling CSV files
What if we were given the data from the previous example in CSV form?
NAME,Nicolaus,Copernicus
ADDR,123 South Street,Nowhereville,45678Y,Someplace
In this case, instead of describing each field with a boundary range, we just give it a simple (zero-based) index, like so:
Pikelet.define do
type_signature 0
record "NAME" do
first_name 1
last_name 2
end
record "ADDR" do
street_address 1
city 2
postal_code 3
state 4
end
end
This yields the same results as above.
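Parsing looks the same as before too. A sketch, assuming parse accepts the raw CSV lines just as it does fixed-width lines, that the definition above is assigned to definition, and that the data lives in a hypothetical addresses.csv:

records = File.open('addresses.csv', 'r') do |f|
  definition.parse(f).to_a
end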
Note that this ability to handle CSV was not planned - it just sprang fully-formed from the implementation. One of those pleasant little surprises that happens sometimes. If only I had a use for it.
Inheritance
Now we go back to our original example, starting with a simple list of names, but this time some of the records include a nickname:
|PLAIN|Nicolaus |Copernicus|
|FANCY|Tycho |Brahe |Tykester |
The first and last name fields have the same boundaries in each case, but the "FANCY" records have an additional field. We can describe this by nesting the definition for FANCY records inside the definition for the PLAIN records:
Pikelet.define do
type_signature 0...5
record "PLAIN" do
first_name 5...15
last_name 15...25
record "FANCY" do
nickname 25...35
end
end
end
Note that the outer definition is really just a record definition in disguise; you might have already figured this out if you were paying attention.
Anyway, this is what we get when we parse it:
#<struct
type_signature="SIMPLE",
first_name="Nicolaus",
last_name="Copernicus">,
#<struct
type_signature="FANCY",
first_name="Tycho",
last_name="Brahe",
nickname="Tykester">
Custom field parsing
Field definitions can accept a block. If provided, the raw field value is yielded to the block, and the result becomes the field's value. This is useful for, say, parsing numeric fields.
Pikelet.define do
a_number(0...4) { |value| value.to_i }
end
You can also use shorthand syntax:
Pikelet.define do
a_number 0...4, &:to_i
end
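The block isn't limited to numeric conversion. For instance, here's a hypothetical definition (field names and date format invented for illustration) that strips padding from one field and parses another as a date:

require 'date'

Pikelet.define do
  first_name(0...10) { |value| value.strip }
  birth_date(10...18) { |value| Date.strptime(value, "%Y%m%d") }
end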
Thoughts/plans
- With some work, Pikelet could produce flat file records as easily as it consumes them.
- I had a crack at supporting lazy enumeration, and it kinda works. Sometimes. If the moon is in the right quarter. I'd like to get it working properly.
Contributing
- Fork it (http://github.com/johncarney/pikelet/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request