Class: PDF::Reader::Page

Inherits:
Object
Defined in:
lib/pdf/reader/page.rb

Overview

A high-level representation of a single PDF page. It ties together the various low-level classes in PDF::Reader and provides access to the components of the page (text, images, fonts, etc.) in convenient formats.

If you need the raw PDF objects for this page, the Page dictionary is available via the page_object accessor. You will also need the objects accessor to walk the page dictionary in any useful way.

Constructor Details

#initialize(objects, pagenum) ⇒ Page

Creates a new page wrapper.

  • objects - an ObjectHash instance that wraps a PDF file

  • pagenum - an integer specifying the page number to expose (1-indexed)



# File 'lib/pdf/reader/page.rb', line 27

def initialize(objects, pagenum)
  @objects, @pagenum = objects, pagenum
  @page_object = objects.deref(objects.page_references[pagenum - 1])

  unless @page_object.is_a?(::Hash)
    raise ArgumentError, "invalid page: #{pagenum}"
  end
end
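
The lookup above can be tried without a PDF file. The following is a minimal sketch: StubObjectHash is a hypothetical stand-in for ObjectHash that models only page_references and deref, and lookup_page is an illustrative helper that repeats the constructor's logic rather than being part of the library.

```ruby
# Hypothetical stand-in for PDF::Reader::ObjectHash. Only the two calls used
# by the constructor are modelled; references are plain symbols.
class StubObjectHash
  def initialize(pages)
    @pages = pages
  end

  # One reference per page, in document order.
  def page_references
    @pages.keys
  end

  # Resolve a reference to its object; unknown references (and nil) give nil.
  def deref(ref)
    @pages[ref]
  end
end

# Repeats the constructor's lookup: convert the 1-indexed page number to a
# 0-indexed array position, dereference it, and reject anything but a Hash.
def lookup_page(objects, pagenum)
  page_object = objects.deref(objects.page_references[pagenum - 1])
  raise ArgumentError, "invalid page: #{pagenum}" unless page_object.is_a?(::Hash)
  page_object
end

STUB_OBJECTS = StubObjectHash.new({
  ref_a: { Type: :Page, Rotate: 0 },
  ref_b: { Type: :Page, Rotate: 90 }
})

first_page = lookup_page(STUB_OBJECTS, 1)   # the page dictionary behind :ref_a
second_page = lookup_page(STUB_OBJECTS, 2)  # the page dictionary behind :ref_b
```

In this sketch, asking for page 3 raises ArgumentError because there is no third reference to dereference.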

Instance Attribute Details

#objects ⇒ Object (readonly)

Low-level, hash-like access to all objects in the underlying PDF.



# File 'lib/pdf/reader/page.rb', line 17

def objects
  @objects
end

#page_object ⇒ Object (readonly)

The raw PDF object that defines this page.



# File 'lib/pdf/reader/page.rb', line 20

def page_object
  @page_object
end

Instance Method Details

#attributes ⇒ Object

Returns the attributes that accompany this page. Includes attributes inherited from parents.



# File 'lib/pdf/reader/page.rb', line 51

def attributes
  hash = {}
  page_with_ancestors.reverse.each do |obj|
    hash.merge!(@objects.deref(obj))
  end
  hash
end
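
The effect of merging in reverse order can be seen with plain hashes. This is a sketch under one assumption: that page_with_ancestors lists the page first and the root of the page tree last, which is what the reverse call suggests. merged_attributes and CHAIN are illustrative names, not part of the library, and the dereferencing step is left out.

```ruby
# Merge dictionaries from the root of the page tree down to the page itself,
# so a value set on the page wins over the same key set on an ancestor.
def merged_attributes(page_with_ancestors)
  page_with_ancestors.reverse.each_with_object({}) do |dict, merged|
    merged.merge!(dict)
  end
end

# Assumed ordering: the page itself first, the root last.
CHAIN = [
  { Type: :Page, Rotate: 90 },                          # the page itself
  { Type: :Pages, MediaBox: [0, 0, 612, 792] },         # intermediate node
  { Type: :Pages, Rotate: 0, Resources: { Font: {} } }  # root of the tree
].freeze

MERGED = merged_attributes(CHAIN)
# MERGED[:Rotate] comes from the page, MERGED[:MediaBox] from the
# intermediate node, and MERGED[:Resources] from the root.
```

Because later merges overwrite earlier ones, the page's own Rotate value replaces the one set on the root, while MediaBox and Resources are picked up from the ancestors that define them.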

#fonts ⇒ Object

Returns a hash of fonts used on this page.

The keys are the font labels used within the page content stream.

The values are PDF::Reader::Font instances that provide access to most available metrics for each font.



# File 'lib/pdf/reader/page.rb', line 79

def fonts
  raw_fonts = objects.deref(resources[:Font] || {})
  ::Hash[raw_fonts.map { |label, font|
    [label, PDF::Reader::Font.new(objects, objects.deref(font))]
  }]
end
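
The label-to-font mapping can be sketched with stand-ins. Everything named here is hypothetical: RefTable plays the role of the objects accessor (references are plain strings), StubFont plays the role of PDF::Reader::Font, and build_fonts repeats the method's logic outside the class.

```ruby
# Hypothetical stand-in for PDF::Reader::Font: it just keeps its arguments.
StubFont = Struct.new(:objects, :dict)

# Hypothetical stand-in for the objects accessor.
class RefTable
  def initialize(table)
    @table = table
  end

  # Follow a reference if given one; pass any other value through unchanged.
  def deref(obj)
    @table.fetch(obj, obj)
  end
end

# Same shape as the method above: dereference the Font dictionary, then wrap
# each dereferenced font dictionary, keyed by its label.
def build_fonts(objects, resources)
  raw_fonts = objects.deref(resources[:Font] || {})
  raw_fonts.map { |label, font|
    [label, StubFont.new(objects, objects.deref(font))]
  }.to_h
end

REFS = RefTable.new({
  "5 0 R" => { Subtype: :Type1, BaseFont: :Helvetica },
  "6 0 R" => { Subtype: :TrueType, BaseFont: :Arial }
})

FONTS = build_fonts(REFS, { Font: { F1: "5 0 R", F2: "6 0 R" } })
```

A page with no Font entry in its resources yields an empty hash, since the fallback {} has nothing to map over.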

#inspect ⇒ Object

Returns a friendly string representation of this page.



# File 'lib/pdf/reader/page.rb', line 44

def inspect
  "<PDF::Reader::Page page: #{@pagenum}>"
end

#number ⇒ Object

Returns the number of this page within the full document.



# File 'lib/pdf/reader/page.rb', line 38

def number
  @pagenum
end

#raw_content ⇒ Object

Returns the raw content stream for this page. This is plumbing; there is nothing to see here unless you’re a PDF nerd like me.



# File 'lib/pdf/reader/page.rb', line 123

def raw_content
  contents = objects.deref(@page_object[:Contents])
  [contents].flatten.compact.map { |obj|
    objects.deref(obj)
  }.map { |obj|
    obj.unfiltered_data
  }.join
end
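
The wrap, flatten, and compact chain lets a missing Contents entry, a single stream, and an array of streams all follow the same path. A minimal sketch of that normalisation, with the dereferencing step left out: StubStream is a hypothetical stand-in for a stream object that responds to unfiltered_data, and join_streams is an illustrative helper.

```ruby
# Hypothetical stand-in for a PDF stream; only unfiltered_data is modelled.
StubStream = Struct.new(:unfiltered_data)

# Wrap in an array, flatten any nested array, drop nil, then concatenate the
# decoded data of every stream in order.
def join_streams(contents)
  [contents].flatten.compact.map(&:unfiltered_data).join
end

NO_CONTENT = join_streams(nil)
ONE_STREAM = join_streams(StubStream.new("BT (Hi) Tj ET"))
TWO_STREAMS = join_streams([StubStream.new("BT (Hi) Tj "), StubStream.new("ET")])
```

Here a nil input produces an empty string, and the two-stream case produces the same text as the single-stream case because the pieces are joined without a separator.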

#resources ⇒ Object

Returns the resources that accompany this page. Includes resources inherited from parents.



# File 'lib/pdf/reader/page.rb', line 62

def resources
  @resources ||= @objects.deref(attributes[:Resources]) || {}
end

#text ⇒ Object Also known as: to_s

Returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯.



# File 'lib/pdf/reader/page.rb', line 89

def text
  receiver = PageTextReceiver.new
  walk(receiver)
  receiver.content
end

#walk(*receivers) ⇒ Object

Processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

This is mostly low level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher level functionality, see the text() method.

This method is intended to provide all the data required to faithfully render the entire page, should someone be motivated enough to do so. If you find some required data isn’t available, it’s a bug; let me know.

Many operators that generate callbacks will reference resources stored in the page header (think images, fonts, etc.). To facilitate these operators, the first available callback is page=. If your receiver accepts that callback it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.



# File 'lib/pdf/reader/page.rb', line 115

def walk(*receivers)
  callback(receivers, :page=, [self])
  content_stream(receivers, raw_content)
end
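
A receiver is just an object that defines methods for the callbacks it cares about. The sketch below shows the shape of one, driven by a hypothetical dispatch helper so it runs without a PDF. Two assumptions are made: that callbacks are only sent to receivers that respond to them (as the description of page= implies), and that show_text and begin_text_object are the callback names used for the Tj and BT operators.

```ruby
# A receiver that remembers the page it was handed and collects shown text.
class TextCollector
  attr_reader :page, :strings

  def initialize
    @strings = []
  end

  # First callback issued by walk, per the description above.
  def page=(page)
    @page = page
  end

  # Assumed callback name for the PDF "Tj" operator.
  def show_text(string)
    @strings << string
  end
end

# Hypothetical dispatcher: send a callback only to receivers that accept it.
def dispatch(receivers, name, params = [])
  receivers.each do |receiver|
    receiver.public_send(name, *params) if receiver.respond_to?(name)
  end
end

# Drives two receivers through three callbacks and returns the collector.
def demo_walk
  collector = TextCollector.new
  receivers = [collector, Object.new] # the bare Object ignores every callback
  dispatch(receivers, :page=, [:stand_in_page])
  dispatch(receivers, :show_text, ["Hello"])
  dispatch(receivers, :begin_text_object) # nobody handles this; it is skipped
  collector
end
```

With the real library the equivalent call would be page.walk(collector), after which the collector can be inspected in the same way.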

#xobjects ⇒ Object

Returns the XObjects that are available to this page.



# File 'lib/pdf/reader/page.rb', line 68

def xobjects
  resources[:XObject] || {}
end