The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are currently supported. I welcome submissions of PDF files that exhibit unsupported aspects of the spec to assist with improving our support.

Development Status

I adopted this library in 2007 when I was learning the fundamentals of the PDF spec. I do not currently use it in my day-to-day work and I just don’t have the spare time to dedicate to adding new features.

The code as it is works fairly well, and I offer it “as is”. All patches, bug reports and sample PDFs are welcome - I will work on them when I can. If anyone is interested in adding features to PDF::Reader in their own effort to learn the PDF file format, I’ll happily offer help and support.

Installation

The recommended installation method is via Rubygems.

gem install pdf-reader

Usage

PDF::Reader is designed with a callback-style architecture. The basic concept is to build a receiver class and pass that into PDF::Reader along with the PDF to process.

As PDF::Reader walks the file and encounters various objects (pages, text, images, shapes, etc) it will call methods on the receiver class. What those methods do is entirely up to you - save the text, extract images, count pages, read metadata, whatever.
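As a rough illustration of the receiver idea, here is a minimal sketch that collects text page by page. The callback names begin_page and show_text, and the PDF::Reader.file entry point shown in the trailing comment, are assumptions about the version of the library this README describes; check PDF::Reader::Content for the authoritative list. The class name PageTextReceiver is purely illustrative.

```ruby
# A receiver is a plain Ruby object; no base class is assumed here.
class PageTextReceiver
  attr_reader :pages

  def initialize
    @pages = []
  end

  # Assumed callback: invoked when the parser starts a new page.
  def begin_page(*_args)
    @pages << "".dup
  end

  # Assumed callback: invoked for each run of text drawn on the page.
  def show_text(string, *_params)
    @pages << "".dup if @pages.empty?
    @pages.last << string
  end
end

# Hypothetical usage, assuming the pdf-reader gem is installed and
# "somefile.pdf" exists:
#
#   require "pdf/reader"
#   receiver = PageTextReceiver.new
#   PDF::Reader.file("somefile.pdf", receiver)
#   puts receiver.pages.inspect
```

Only the callbacks the receiver defines matter to it, so a receiver that cares about nothing but text can stay this small.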

For a full list of the supported callback methods and a description of when they will be called, refer to PDF::Reader::Content. See the code examples below for a way to print a list of all the callbacks generated by a file to STDOUT.
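One way to see which callbacks a file generates is a receiver that accepts every method call and logs it. The sketch below is generic Ruby rather than part of the library: CallbackLogger is a hypothetical name, and it assumes the parser checks respond_to? before dispatching a callback, which is why respond_to_missing? is defined.

```ruby
# Records and prints every callback it is sent.
class CallbackLogger
  attr_reader :callbacks

  def initialize(io = $stdout)
    @io = io
    @callbacks = []
  end

  # Claim to respond to any callback name. Implicit conversion hooks
  # (to_s, to_ary, ...) are excluded so the object still behaves
  # normally when printed or compared.
  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?("to_") || super
  end

  def method_missing(name, *args)
    return super if name.to_s.start_with?("to_")

    @callbacks << { name: name, args: args }
    @io.puts "#{name}: #{args.inspect}"
    nil
  end
end

# Hypothetical usage, assuming the pdf-reader gem is installed:
#
#   require "pdf/reader"
#   PDF::Reader.file("somefile.pdf", CallbackLogger.new)
```

Passing a different IO object to the constructor redirects the log, which is handy when the output for a large file is too long for a terminal.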

There is also a class called PDF::Hash. This provides direct access to the objects in a PDF file using a Ruby hash-like API. Check out the documentation for the class for further information.

Text Encoding

Internally, text can be stored inside a PDF in various encodings, including ZapfDingbats, Windows-1252, Mac Roman and a form of Unicode. To avoid confusion, all text will be converted to UTF-8 before it is passed back from PDF::Reader.
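To show why a single output encoding helps, the following sketch uses only Ruby's built-in String#encode. It is not how PDF::Reader performs the conversion internally; the helper to_utf8 is a hypothetical name used for illustration. The same word is stored with different bytes in Windows-1252 and Mac Roman, yet both come out identical once converted to UTF-8.

```ruby
# Bytes for "café" as they would appear in two single-byte encodings.
win_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*") # é is 0xE9 in Windows-1252
mac_bytes = [0x63, 0x61, 0x66, 0x8E].pack("C*") # é is 0x8E in Mac Roman

# Tag the raw bytes with their real encoding, then transcode to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

from_win = to_utf8(win_bytes, "Windows-1252")
from_mac = to_utf8(mac_bytes, "macRoman")

puts from_win
puts from_mac
puts from_win == from_mac
```

Because the receiver only ever sees UTF-8, it can compare or concatenate strings without knowing which encoding the PDF used.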

Exceptions

There are two key exceptions that you will need to watch out for when processing a PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the file should be valid, or that a corrupt file didn’t raise an exception, please forward a copy of the file to the maintainers (preferably via the Google group) and we can attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn’t currently support. Again, we welcome submissions of PDF files that exhibit these features to help us with future code improvements.

MalformedPDFError has some subclasses if you want to detect finer grained issues. If you don’t, ‘rescue MalformedPDFError’ will catch all the subclassed errors as well.

Any other exceptions should be considered bugs in either PDF::Reader (please report it!) or your receiver (please don’t report it!).
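A sketch of how the two exceptions might be handled is shown below. It assumes both classes live under the PDF::Reader namespace; the stand-in definitions exist only so the sketch runs when the gem is not installed, and classify_pdf_failure is a hypothetical helper name.

```ruby
begin
  require "pdf/reader"
rescue LoadError
  # Stand-ins (assumption) so this sketch runs without the gem.
  module PDF
    class Reader
      class MalformedPDFError < RuntimeError; end
      class UnsupportedFeatureError < RuntimeError; end
    end
  end
end

# Runs the given block and reports which kind of failure occurred.
# Subclasses of MalformedPDFError are caught by the first rescue.
# Any other exception propagates, since it indicates a bug.
def classify_pdf_failure
  yield
  :ok
rescue PDF::Reader::MalformedPDFError
  :malformed
rescue PDF::Reader::UnsupportedFeatureError
  :unsupported
end

# Hypothetical usage, assuming a receiver object has been built:
#
#   result = classify_pdf_failure do
#     PDF::Reader.file("somefile.pdf", receiver)
#   end
#   warn "could not process file: #{result}" unless result == :ok
```

Letting unexpected exceptions propagate keeps genuine bugs visible instead of hiding them behind a generic rescue.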

Maintainers

Licensing

This library is distributed under the terms of the MIT License. See the included file for more detail.

Mailing List

Any questions or feedback should be sent to the PDF::Reader Google group. It’s better that any answers be available for others instead of hiding in someone’s inbox.

groups.google.com/group/pdf-reader

Examples

The easiest way to explain how this works in practice is to show some examples. Check out the examples/ directory for a few files.

Known Limitations

The order of the callbacks is unpredictable: it depends on the internal layout of the file, not on the order in which objects are displayed to the user. As a consequence, it is highly unlikely that extracted text will be completely in reading order.

Occasionally some text cannot be extracted properly due to the way it has been stored, or the use of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate an unrecognisable character.

Resources