Class: PDF::Reader::Page

Inherits: Object

Includes: ResourceMethods

Defined in: lib/pdf/reader/page.rb

Overview

A high-level representation of a single PDF page. It ties together the various low-level classes in PDF::Reader and provides access to the components of the page (text, images, fonts, etc.) in convenient formats.

If you require access to the raw PDF objects for this page, you can access the Page dictionary via the page_object accessor. You will need to use the objects accessor to help walk the page dictionary in any useful way.

Instance Attribute Summary

#cache ⇒ Object readonly

a Hash-like object for storing cached data.
#objects ⇒ Object readonly

low-level hash-like access to all objects in the underlying PDF.
#page_object ⇒ Object readonly

the raw PDF object that defines this page.

Instance Method Summary

#attributes ⇒ Object

Returns the attributes that accompany this page, including attributes inherited from parents.
#initialize(objects, pagenum, options = {}) ⇒ Page constructor

creates a new page wrapper.
#inspect ⇒ Object

return a friendly string representation of this page.
#number ⇒ Object

return the number of this page within the full document.
#raw_content ⇒ Object

returns the raw content stream for this page.
#text ⇒ Object (also: #to_s)

returns the plain text content of this page encoded as UTF-8.
#walk(*receivers) ⇒ Object

processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

Methods included from ResourceMethods

#color_spaces, #fonts, #graphic_states, #patterns, #procedure_sets, #properties, #shadings, #xobjects

Constructor Details

#initialize(objects, pagenum, options = {}) ⇒ `Page`

creates a new page wrapper.

objects - an ObjectHash instance that wraps a PDF file
pagenum - an int specifying the page number to expose. 1 indexed.

# File 'lib/pdf/reader/page.rb', line 33

def initialize(objects, pagenum, options = {})
  @objects, @pagenum = objects, pagenum
  @page_object = objects.deref(objects.page_references[pagenum - 1])
  @cache       = options[:cache] || {}

  unless @page_object.is_a?(::Hash)
    raise ArgumentError, "invalid page: #{pagenum}"
  end
end
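
The 1-indexed lookup and the validity check in the constructor can be tried in isolation. The following is a minimal sketch, not the library's own code path: `FakeObjects` and `lookup_page` are hypothetical stand-ins that model only `page_references` and `deref`, and the `:Label` key is made up for illustration.

```ruby
# Stand-in for the object store (assumption): only the two calls used by the
# constructor are modelled, and references are plain array indexes.
class FakeObjects
  def initialize(pages)
    @pages = pages
  end

  # One reference per page, in document order.
  def page_references
    (0...@pages.size).to_a
  end

  # Resolve a reference to the object it points at; nil stays nil.
  def deref(ref)
    ref.nil? ? nil : @pages[ref]
  end
end

# Hypothetical helper mirroring the constructor's lookup and check.
def lookup_page(objects, pagenum)
  page_object = objects.deref(objects.page_references[pagenum - 1])
  raise ArgumentError, "invalid page: #{pagenum}" unless page_object.is_a?(::Hash)
  page_object
end

objects = FakeObjects.new([
  { :Type => :Page, :Label => "first" },
  { :Type => :Page, :Label => "second" }
])

first = lookup_page(objects, 1) # page numbers start at 1, array indexes at 0
```

Asking this sketch for page 3 raises ArgumentError because no reference exists at that position. A pagenum of 0 would index -1, which Ruby resolves to the last array element, so callers should pass values from 1 upward.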

Instance Attribute Details

#cache ⇒ `Object` (readonly)

a Hash-like object for storing cached data. Generally this is scoped to the current document and is used to avoid repeating expensive operations.

# File 'lib/pdf/reader/page.rb', line 26

def cache
  @cache
end
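
As a rough illustration of how a Hash-like cache avoids repeated work, the sketch below memoizes a result under a key. The key `:font_1` and the `expensive` lambda are hypothetical; nothing here reflects which keys the library itself stores.

```ruby
# Hypothetical expensive operation, counted so repeated work is visible.
calls = 0
expensive = lambda do |key|
  calls += 1
  "parsed #{key}"
end

# A plain Hash plays the role of the cache that could be passed in via
# options[:cache] and shared across pages of the same document.
shared_cache = {}
fetch = lambda { |key| shared_cache[key] ||= expensive.call(key) }

first  = fetch.call(:font_1) # computes and stores the value
second = fetch.call(:font_1) # served from the cache
```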

#objects ⇒ `Object` (readonly)

low-level hash-like access to all objects in the underlying PDF

# File 'lib/pdf/reader/page.rb', line 18

def objects
  @objects
end

#page_object ⇒ `Object` (readonly)

the raw PDF object that defines this page

# File 'lib/pdf/reader/page.rb', line 21

def page_object
  @page_object
end

Instance Method Details

#attributes ⇒ `Object`

Returns the attributes that accompany this page, including attributes inherited from parents.

# File 'lib/pdf/reader/page.rb', line 58

def attributes
  @attributes ||= {}.tap { |hash|
    page_with_ancestors.reverse.each do |obj|
      hash.merge!(@objects.deref(obj))
    end
  }
  # This shouldn't be necessary, but some non-compliant PDFs leave MediaBox
  # out. Assuming 8.5" x 11" is what Acrobat does, so we do it too.
  @attributes[:MediaBox] ||= [0,0,612,792]
  @attributes
end
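
The inheritance behaviour can be sketched with plain hashes. This is an illustration under assumptions: the page tree is modelled as hashes linked by `:Parent`, and `chain_with_ancestors` and `merged_attributes` are hypothetical helpers rather than library methods.

```ruby
# Stand-in page tree: plain hashes linked by :Parent (illustration only).
root = { :Type => :Pages, :MediaBox => [0, 0, 595, 842], :Rotate => 0 }
node = { :Type => :Pages, :Parent => root, :Rotate => 90 }
page = { :Type => :Page, :Parent => node, :Contents => "stream-ref" }

# Hypothetical helper: the page first, then each ancestor up to the root.
def chain_with_ancestors(obj)
  chain = []
  while obj
    chain << obj
    obj = obj[:Parent]
  end
  chain
end

# Merge from the root down so values set nearer the page override inherited ones.
def merged_attributes(page)
  attrs = {}
  chain_with_ancestors(page).reverse.each { |obj| attrs.merge!(obj) }
  attrs[:MediaBox] ||= [0, 0, 612, 792] # same US Letter fallback as above
  attrs
end

attrs = merged_attributes(page)
bare  = merged_attributes({ :Type => :Page }) # no MediaBox anywhere in the chain
```

Here the page inherits `:MediaBox` from the root, while `:Rotate` comes from the nearer intermediate node. The second call shows the fallback applying only when no object in the chain supplies a MediaBox.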

#inspect ⇒ `Object`

return a friendly string representation of this page

# File 'lib/pdf/reader/page.rb', line 51

def inspect
  "<PDF::Reader::Page page: #{@pagenum}>"
end

#number ⇒ `Object`

return the number of this page within the full document

# File 'lib/pdf/reader/page.rb', line 45

def number
  @pagenum
end

#raw_content ⇒ `Object`

returns the raw content stream for this page. This is plumbing, nothing to see here unless you’re a PDF nerd like me.

# File 'lib/pdf/reader/page.rb', line 111

def raw_content
  contents = objects.deref(@page_object[:Contents])
  [contents].flatten.compact.map { |obj|
    objects.deref(obj)
  }.map { |obj|
    obj.unfiltered_data
  }.join(" ")
end
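
The `[contents].flatten.compact` step exists because a page's Contents entry may be a single stream, an array of streams, or absent. A minimal sketch of that normalisation, using a hypothetical `FakeStream` stand-in that models only `unfiltered_data` and skipping the dereferencing step:

```ruby
# Stand-in for a PDF stream object (assumption): only unfiltered_data exists.
FakeStream = Struct.new(:unfiltered_data)

# Hypothetical helper: wrap, flatten and compact so that one stream, many
# streams and nil are all handled by the same pipeline.
def join_content(contents)
  [contents].flatten.compact.map { |stream| stream.unfiltered_data }.join(" ")
end

single   = join_content(FakeStream.new("BT /F1 12 Tf ET"))
multiple = join_content([FakeStream.new("q"), FakeStream.new("Q")])
missing  = join_content(nil)
```

Joining with a space keeps the last token of one stream from running into the first token of the next.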

#text ⇒ `Object` Also known as: to_s

returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯

# File 'lib/pdf/reader/page.rb', line 73

def text
  receiver = PageTextReceiver.new
  walk(receiver)
  receiver.content
end

#walk(*receivers) ⇒ `Object`

processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

This is mostly low-level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher-level functionality, see the text() method.

This method is intended to provide all the data required to faithfully render the entire page, should someone be motivated enough to do so. If you find some required data isn’t available it’s a bug - let me know.

Many operators that generate callbacks will reference resources stored in the page header - think images, fonts, etc. To facilitate these operators, the first available callback is page=. If your receiver accepts that callback it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.

It may help to think of each page as a self-contained program made up of a set of instructions and associated resources. Calling walk() executes the program in the correct order and calls out to your implementation.

# File 'lib/pdf/reader/page.rb', line 103

def walk(*receivers)
  callback(receivers, :page=, [self])
  content_stream(receivers, raw_content)
end
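
A receiver is just an object that implements whichever callbacks it cares about. The sketch below shows the shape of one, driven by a stand-in dispatcher so that it runs without a PDF. `TextCounter` and `dispatch` are hypothetical, the callback names `show_text` and `begin_text_object` are assumed here for illustration, and the rule that callbacks a receiver does not implement are skipped is likewise an assumption about how delivery works.

```ruby
# Hypothetical receiver: remembers the page it is given and counts
# text-showing callbacks.
class TextCounter
  attr_reader :page, :count

  def initialize
    @count = 0
  end

  # The first callback sent by walk, per the description above.
  def page=(page)
    @page = page
  end

  # Assumed callback name for a text-showing operator.
  def show_text(_string)
    @count += 1
  end
end

# Stand-in dispatcher (assumption): deliver a callback only to receivers
# that implement it.
def dispatch(receivers, name, params = [])
  receivers.each do |receiver|
    receiver.send(name, *params) if receiver.respond_to?(name)
  end
end

counter = TextCounter.new
dispatch([counter], :page=, [:fake_page])   # a real walk would pass the Page
dispatch([counter], :show_text, ["Hello"])
dispatch([counter], :begin_text_object)     # skipped: not implemented
```

With a real document, an instance of such a class would be handed to walk in place of the manual dispatch calls.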
