The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
It provides programmatic access to the contents of a PDF file with a high degree of flexibility.
The PDF 1.7 specification is a weighty document and not all aspects are currently supported. We welcome submissions of PDF files that exhibit unsupported aspects of the spec, as they help us improve our support.
Installation
The recommended installation method is via Rubygems.
gem install pdf-reader
Usage
PDF::Reader is designed with a callback-style architecture. The basic concept is to build a receiver class and pass that into PDF::Reader along with the PDF to process.
As PDF::Reader walks the file and encounters various objects (pages, text, images, shapes, etc.) it will call methods on the receiver class. What those methods do is entirely up to you: save the text, extract images, count pages, read metadata, whatever.
For a full list of the supported callback methods and a description of when they will be called, refer to PDF::Reader::Content. See the code examples below for a way to print a list of all the callbacks generated by a file to STDOUT.
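The callback flow described above can be sketched without the library. The FakeWalker class below is a hypothetical stand-in for PDF::Reader, not its real implementation: it replays a fixed list of events and calls the matching method on the receiver only when the receiver defines it. The event names begin_page, show_text and end_page are borrowed from the examples in this document; everything else is an assumption made for illustration.

```ruby
# Hypothetical stand-in for PDF::Reader: replays a fixed event list instead
# of parsing a real file.
class FakeWalker
  EVENTS = [
    [:begin_page, []],
    [:show_text, ["Hello"]],
    [:end_page, []],
    [:begin_page, []],
    [:show_text, ["World"]],
    [:end_page, []]
  ].freeze

  def self.walk(receiver)
    EVENTS.each do |name, args|
      # Only call the callbacks the receiver has chosen to implement.
      receiver.send(name, *args) if receiver.respond_to?(name)
    end
    receiver
  end
end

# A receiver that implements only the callbacks it cares about.
# end_page is deliberately left out and is silently skipped.
class TextCollector
  attr_reader :pages

  def initialize
    @pages = []
  end

  def begin_page
    @pages << String.new
  end

  def show_text(string)
    @pages.last << string
  end
end

collector = FakeWalker.walk(TextCollector.new)
puts collector.pages.inspect
```

The point of the design is that a receiver stays small: it ignores every event it has no method for.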
Text Encoding
Internally, text can be stored inside a PDF in various encodings, including ZapfDingbats, win-1252, Mac Roman and a form of Unicode. To avoid confusion, all text will be converted to UTF-8 before it is passed back from PDF::Reader.
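To make the conversion concrete, here is a minimal sketch of turning win-1252 bytes into UTF-8 using Ruby's own String#encode. It illustrates the idea only and says nothing about how PDF::Reader does it internally; the to_utf8 helper name is hypothetical.

```ruby
# Hypothetical helper: tag raw bytes with their source encoding, then
# transcode them to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Bytes for "café" in Windows-1252 (win-1252), where 0xE9 is "é".
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")

text = to_utf8(raw, "Windows-1252")
puts text           # café
puts text.encoding  # UTF-8
```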
Exceptions
There are two key exceptions that you will need to watch out for when processing a PDF file:
MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the file should be valid, or that a corrupt file didn’t raise an exception, please forward a copy of the file to the maintainers and we can attempt to improve the code.
UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn’t currently support. Again, we welcome submissions of PDF files that exhibit these features to help us with future code improvements.
Any other exceptions should be considered bugs and should be reported (unless they originate inside your receiver, in which case you’re on your own).
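One way to handle the two exceptions is to rescue each separately and let anything else propagate as a bug. The sketch below assumes the classes live under the PDF::Reader namespace and defines stand-ins when the gem is not loaded, so it runs on its own. The process_pdf helper is hypothetical.

```ruby
# Stand-in definitions so this sketch runs without the gem installed.
# With pdf-reader loaded (require 'pdf/reader') the library's own classes
# are used and this block is skipped.
unless defined?(PDF::Reader)
  module PDF
    class Reader
      class MalformedPDFError < RuntimeError; end
      class UnsupportedFeatureError < RuntimeError; end
    end
  end
end

# Hypothetical helper: runs the given block and maps the two documented
# exceptions to a status symbol. Any other exception is not rescued.
def process_pdf
  yield
  :ok
rescue PDF::Reader::MalformedPDFError => e
  warn "Corrupt PDF: #{e.message}"
  :malformed
rescue PDF::Reader::UnsupportedFeatureError => e
  warn "Unsupported feature: #{e.message}"
  :unsupported
end

# In real use the block would call PDF::Reader.file(path, receiver).
status = process_pdf { raise PDF::Reader::MalformedPDFError, "simulated failure" }
puts status
```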
Maintainers
- Peter Jones <[email protected]>
- James Healy <[email protected]>
Mailing List
Any questions or feedback should be sent to the PDF::Reader Google group.
groups.google.com/group/pdf-reader
Examples
The easiest way to explain how this works in practice is to show some examples.
Page Counter
A simple app to count the number of pages in a PDF file.
require 'rubygems'
require 'pdf/reader'

class PageReceiver
  attr_accessor :page_count

  def initialize
    @page_count = 0
  end

  # Called when page parsing ends
  def end_page
    @page_count += 1
  end
end

receiver = PageReceiver.new
pdf = PDF::Reader.file("somefile.pdf", receiver)
puts "#{receiver.page_count} pages"
List all callbacks generated by a single PDF
WARNING: this will generate a lot of output, so you probably want to pipe it through less or to a text file.
require 'rubygems'
require 'pdf/reader'

receiver = PDF::Reader::RegisterReceiver.new
pdf = PDF::Reader.file("somefile.pdf", receiver)
receiver.callbacks.each do |cb|
  puts cb
end
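Because the full dump is long, a per-callback tally can be a more readable first look. The sketch below assumes each recorded entry is a hash with :name and :args keys. That shape is an assumption here, so check it against your installed version. SAMPLE_CALLBACKS is stand-in data rather than output from a real file, and tally_callbacks is a hypothetical helper.

```ruby
# Stand-in data shaped like the entries assumed above.
SAMPLE_CALLBACKS = [
  { :name => :begin_page, :args => [] },
  { :name => :show_text,  :args => ["Chunky"] },
  { :name => :show_text,  :args => ["Bacon"] },
  { :name => :end_page,   :args => [] }
].freeze

# Hypothetical helper: count how many times each callback name occurs.
def tally_callbacks(callbacks)
  counts = Hash.new(0)
  callbacks.each { |cb| counts[cb[:name]] += 1 }
  counts
end

# Print the most frequent callbacks first, breaking ties by name.
sorted = tally_callbacks(SAMPLE_CALLBACKS).sort_by { |name, count| [-count, name.to_s] }
sorted.each { |name, count| puts "#{name}: #{count}" }
```

With a real file you would pass receiver.callbacks in place of the sample data.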
Extract metadata only
require 'rubygems'
require 'pdf/reader'

class MetaDataReceiver
  attr_accessor :regular
  attr_accessor :xml

  # Receives the regular (non-XML) metadata
  def metadata(data)
    @regular = data
  end

  # Receives the XML metadata
  def xml_metadata(data)
    @xml = data
  end
end

receiver = MetaDataReceiver.new
pdf = PDF::Reader.file(ARGV.shift, receiver, :pages => false, :metadata => true)
puts receiver.regular.inspect
puts receiver.xml.inspect
Basic RSpec of a generated PDF
require 'rubygems'
require 'pdf/reader'
require 'pdf/writer'
require 'spec'

class PageTextReceiver
  attr_accessor :content

  def initialize
    @content = []
  end

  # Called when page parsing starts
  def begin_page(arg = nil)
    @content << ""
  end

  def show_text(string, *params)
    @content.last << string.strip
  end

  # there are a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text

  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end

context "My generated PDF" do
  specify "should have the correct text on 2 pages" do
    # generate our PDF
    pdf = PDF::Writer.new
    pdf.text "Chunky", :font_size => 32, :justification => :center
    pdf.start_new_page
    pdf.text "Bacon", :font_size => 32, :justification => :center
    pdf.save_as("chunkybacon.pdf")

    # process the PDF
    receiver = PageTextReceiver.new
    PDF::Reader.file("chunkybacon.pdf", receiver)

    # confirm the text appears on the correct pages
    receiver.content.size.should eql(2)
    receiver.content[0].should eql("Chunky")
    receiver.content[1].should eql("Bacon")
  end
end
Extract ISBNs
Parse all text in the requested PDF file and print out any valid book ISBNs. Requires the rbook-isbn gem.
require 'rubygems'
require 'pdf/reader'
require 'rbook/isbn'

class ISBNReceiver
  # there are a few text callbacks, so make sure we process them all
  def show_text(string, *params)
    process_words(string.split(/\W+/))
  end

  def super_show_text(string, *params)
    process_words(string.split(/\W+/))
  end

  def move_to_next_line_and_show_text(string)
    process_words(string.split(/\W+/))
  end

  def set_spacing_next_line_show_text(aw, ac, string)
    process_words(string.split(/\W+/))
  end

  private

  # check if any items in the supplied array are a valid ISBN, and print any
  # that are to console
  def process_words(words)
    words.each do |word|
      word.strip!
      puts "#{RBook::ISBN.convert_to_isbn13(word)}" if RBook::ISBN.valid_isbn?(word)
    end
  end
end

receiver = ISBNReceiver.new
PDF::Reader.file("somefile.pdf", receiver)
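For readers curious what "valid" means in the example above, an ISBN carries a check digit that can be verified with simple arithmetic. The sketch below implements the standard ISBN-13 and ISBN-10 check digit rules in plain Ruby. It is not the rbook-isbn implementation, and both method names are hypothetical.

```ruby
# ISBN-13: 13 digits whose weighted sum (weights alternate 1, 3, 1, 3, ...)
# is a multiple of 10.
def isbn13_checksum_ok?(word)
  return false unless word.match?(/\A\d{13}\z/)
  sum = word.chars.each_with_index.sum { |char, index| char.to_i * (index.even? ? 1 : 3) }
  (sum % 10).zero?
end

# ISBN-10: 9 digits plus a check character (a digit or X, where X counts as
# 10). The weighted sum (weights 10 down to 1) must be a multiple of 11.
def isbn10_checksum_ok?(word)
  return false unless word.match?(/\A\d{9}[\dX]\z/)
  sum = word.chars.each_with_index.sum do |char, index|
    value = char == "X" ? 10 : char.to_i
    value * (10 - index)
  end
  (sum % 11).zero?
end

%w[9780306406157 0306406152 1234567890].each do |word|
  valid = isbn13_checksum_ok?(word) || isbn10_checksum_ok?(word)
  puts "#{word}: #{valid}"
end
```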
Known Limitations
The order of the callbacks is unpredictable, and depends on the internal layout of the file, not the order in which objects are displayed to the user. As a consequence, it is highly unlikely that text will be completely in order.
Occasionally some text cannot be extracted properly due to the way it has been stored, or the use of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate an unrecognisable character.
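The general technique of swapping unreadable bytes for a visible placeholder can be sketched with Ruby's String#scrub. This is an illustration only and not a description of PDF::Reader's internals. The choice of U+25AF as the box character and the replace_invalid_bytes helper name are both assumptions.

```ruby
# U+25AF (white vertical rectangle) is an illustrative placeholder choice.
PLACEHOLDER = "\u25AF"

# Hypothetical helper: treat the bytes as UTF-8 and replace any byte
# sequence that is not valid UTF-8 with the placeholder.
def replace_invalid_bytes(text)
  text.dup.force_encoding("UTF-8").scrub(PLACEHOLDER)
end

# 0xFF can never appear in valid UTF-8, so it stands in for a damaged byte.
damaged = "Chunky\xFFBacon".b
puts replace_invalid_bytes(damaged)
```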
Resources
- PDF::Reader Homepage: software.pmade.com/pdfreader
- PDF::Reader Rubyforge Page: rubyforge.org/projects/pdf-reader/
- PDF Specification: www.adobe.com/devnet/pdf/pdf_reference.html
- PDF Tutorial Slide Presentations: home.comcast.net/~jk05/presentations/PDFTutorials.html