PDF::Reader::Turtletext

PDF::Reader::Turtletext is an extension for the most excellent PDF::Reader gem.

The aim of Turtletext is to provide simple and convenient methods for extracting PDF text content and converting it to structured data - even when there is no explicit structure in the original PDF source.

A typical use is to extract details from utility bills that are provided in PDF format, to open up the data for analysis and other secondary uses.

For an example of how this works in practice, see the sps_bill gem (which is in fact the project where the original ideas for Turtletext gestated).

Requirements and Known Limitations

  • currently only tested with Ruby 1.9

  • fixed dependency on PDF::Reader v 1.1.1

Installation

gem install pdf-reader-turtletext

Usage

PDF::Reader::Turtletext

Provides a range of methods to extract structured text from a PDF file, such as text_position and text_in_region.

A typical usage:

reader = PDF::Reader::Turtletext.new(pdf_filename)
page = 1
# locate the headings that bracket the rows we want
heading_position = reader.text_position(/transaction table/i)
next_section = reader.text_position(/transaction summary/i)
# positions are indexed with the symbol keys :x and :y
transaction_rows = reader.text_in_region(
  heading_position[:x], 900,
  heading_position[:y] + 1, next_section[:y] - 1
)
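
Once the rows are extracted, a small amount of plain Ruby is usually enough to turn them into structured records. The following is only a sketch: it assumes that text_in_region returns an array of rows, each row being an array of cell strings, and it uses made-up sample data in place of a real PDF. The helper rows_to_records and the column names are hypothetical and are not part of Turtletext.

```ruby
# Hypothetical helper (not part of Turtletext): pairs each cell in a row
# with a column key, producing one hash per row.
def rows_to_records(rows, keys)
  rows.map do |row|
    cells = row.map { |cell| cell.to_s.strip } # tidy stray whitespace
    Hash[keys.zip(cells)]                      # missing cells become nil
  end
end

# Made-up sample data standing in for the assumed shape of
# text_in_region output (array of rows, each an array of strings).
sample_rows = [
  ['01 Jan', ' Electricity ', '45.10'],
  ['01 Jan', 'Water', '12.30']
]

records = rows_to_records(sample_rows, [:date, :item, :amount])
records.each { |record| puts record.inspect }
```

Keeping this step separate from the extraction means the column mapping can be adjusted per document layout without touching the region lookup.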

Contributing to PDF::Reader::Turtletext

  • Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet

  • Check out the issue tracker to make sure no one has already requested it and/or contributed it

  • Fork the project

  • Start a feature/bugfix branch

  • Commit and push until you are happy with your contribution

  • Make sure to add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or a change there is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.

Copyright © 2012 Paul Gallagher. See LICENSE for further details.