Class: PDF::Reader::Content

Inherits:
Object
  • Object
show all
Defined in:
lib/pdf/reader/content.rb

Overview

Walks the PDF file and calls the appropriate callback methods when something of interest is found.

The callback methods should exist on the receiver object passed into the constructor. Whenever some content is found that will trigger a callback, the receiver is checked to see if the callback is defined.

If it is defined it will be called. If not, processing will continue.

Available Callbacks

The following callbacks are available and should be methods defined on your receiver class. Only implement the ones you need - the rest will be ignored.

Some callbacks will include parameters which will be passed in as an array. For callbacks that supply no paramters, or where you don’t need them, the *params argument can be left off. Some example callback method definitions are:

def begin_document
def end_page
def show_text(string, *params)
def fill_stroke(*params)

You should be able to infer the basic command the callback is reporting based on the name. For further experimentation, define the callback with just a *params parameter, then print out the contents of the array using something like:

puts params.inspect

Text Callbacks

All text passed into these callbacks will be encoded as UTF-8. Depending on where (and when) the PDF was generated, there’s a good chance the text is NOT stored as UTF-8 internally so be careful when doing a comparison on strings returned from PDF::Reader (when doing unit tests for example). The string may not be byte-by-byte identical with the string that was originally written to the PDF.

  • end_text_object

  • move_to_start_of_next_line

  • set_character_spacing

  • move_text_position

  • move_text_position_and_set_leading

  • set_text_font_and_size

  • show_text

  • show_text_with_positioning

  • set_text_leading

  • set_text_matrix_and_text_line_matrix

  • set_text_rendering_mode

  • set_text_rise

  • set_word_spacing

  • set_horizontal_text_scaling

  • move_to_next_line_and_show_text

  • set_spacing_next_line_show_text

Graphics Callbacks

  • close_fill_stroke

  • fill_stroke

  • close_fill_stroke_with_even_odd

  • fill_stroke_with_even_odd

  • begin_marked_content_with_pl

  • begin_inline_image

  • begin_marked_content

  • begin_text_object

  • append_curved_segment

  • concatenate_matrix

  • set_stroke_color_space

  • set_nonstroke_color_space

  • set_line_dash

  • set_glyph_width

  • set_glyph_width_and_bounding_box

  • invoke_xobject

  • define_marked_content_with_pl

  • end_inline_image

  • end_marked_content

  • fill_path_with_nonzero

  • fill_path_with_nonzero

  • fill_path_with_even_odd

  • set_gray_for_stroking

  • set_gray_for_nonstroking

  • set_graphics_state_parameters

  • close_subpath

  • set_flatness_tolerance

  • begin_inline_image_data

  • set_line_join_style

  • set_line_cap_style

  • set_cmyk_color_for_stroking,

  • set_cmyk_color_for_nonstroking

  • append_line

  • begin_new_subpath

  • set_miter_limit

  • define_marked_content_point

  • end_path

  • save_graphics_state

  • restore_graphics_state

  • append_rectangle

  • set_rgb_color_for_stroking

  • set_rgb_color_for_nonstroking

  • set_color_rendering_intent

  • close_and_stroke_path

  • stroke_path

  • set_color_for_stroking

  • set_color_for_nonstroking

  • set_color_for_stroking_and_special

  • set_color_for_nonstroking_and_special

  • paint_area_with_shading_pattern

  • append_curved_segment_initial_point_replicated

  • set_line_width

  • set_clipping_path_with_nonzero

  • set_clipping_path_with_even_odd

  • append_curved_segment_final_point_replicated

Misc Callbacks

  • begin_compatibility_section

  • end_compatibility_section,

  • begin_document

  • end_document

  • begin_page_container

  • end_page_container

  • begin_page

  • end_page

  • metadata

  • xml_metadata

  • page_count

  • begin_form_xobject

  • end_form_xobject

Resource Callbacks

Each page can contain (or inherit) a range of resources required for the page, including things like fonts and images. The following callbacks may appear after begin_page if the relevant resources exist on a page:

  • resource_procset

  • resource_xobject

  • resource_extgstate

  • resource_colorspace

  • resource_pattern

  • resource_font

In most cases, these callbacks associate a name with each resource, allowing it to be referred to by name in the page content. For example, an XObject can hold an image. If it gets mapped to the name “IM1”, then it can be placed on the page using invoke_xobject “IM1”.

Constant Summary collapse

OPERATORS =
{
  'b'   => :close_fill_stroke,
  'B'   => :fill_stroke,
  'b*'  => :close_fill_stroke_with_even_odd,
  'B*'  => :fill_stroke_with_even_odd,
  'BDC' => :begin_marked_content_with_pl,
  'BI'  => :begin_inline_image,
  'BMC' => :begin_marked_content,
  'BT'  => :begin_text_object,
  'BX'  => :begin_compatibility_section,
  'c'   => :append_curved_segment,
  'cm'  => :concatenate_matrix,
  'CS'  => :set_stroke_color_space,
  'cs'  => :set_nonstroke_color_space,
  'd'   => :set_line_dash,
  'd0'  => :set_glyph_width,
  'd1'  => :set_glyph_width_and_bounding_box,
  'Do'  => :invoke_xobject,
  'DP'  => :define_marked_content_with_pl,
  'EI'  => :end_inline_image,
  'EMC' => :end_marked_content,
  'ET'  => :end_text_object,
  'EX'  => :end_compatibility_section,
  'f'   => :fill_path_with_nonzero,
  'F'   => :fill_path_with_nonzero,
  'f*'  => :fill_path_with_even_odd,
  'G'   => :set_gray_for_stroking,
  'g'   => :set_gray_for_nonstroking,
  'gs'  => :set_graphics_state_parameters,
  'h'   => :close_subpath,
  'i'   => :set_flatness_tolerance,
  'ID'  => :begin_inline_image_data,
  'j'   => :set_line_join_style,
  'J'   => :set_line_cap_style,
  'K'   => :set_cmyk_color_for_stroking,
  'k'   => :set_cmyk_color_for_nonstroking,
  'l'   => :append_line,
  'm'   => :begin_new_subpath,
  'M'   => :set_miter_limit,
  'MP'  => :define_marked_content_point,
  'n'   => :end_path,
  'q'   => :save_graphics_state,
  'Q'   => :restore_graphics_state,
  're'  => :append_rectangle,
  'RG'  => :set_rgb_color_for_stroking,
  'rg'  => :set_rgb_color_for_nonstroking,
  'ri'  => :set_color_rendering_intent,
  's'   => :close_and_stroke_path,
  'S'   => :stroke_path,
  'SC'  => :set_color_for_stroking,
  'sc'  => :set_color_for_nonstroking,
  'SCN' => :set_color_for_stroking_and_special,
  'scn' => :set_color_for_nonstroking_and_special,
  'sh'  => :paint_area_with_shading_pattern,
  'T*'  => :move_to_start_of_next_line,
  'Tc'  => :set_character_spacing,
  'Td'  => :move_text_position,
  'TD'  => :move_text_position_and_set_leading,
  'Tf'  => :set_text_font_and_size,
  'Tj'  => :show_text,
  'TJ'  => :show_text_with_positioning,
  'TL'  => :set_text_leading,
  'Tm'  => :set_text_matrix_and_text_line_matrix,
  'Tr'  => :set_text_rendering_mode,
  'Ts'  => :set_text_rise,
  'Tw'  => :set_word_spacing,
  'Tz'  => :set_horizontal_text_scaling,
  'v'   => :append_curved_segment_initial_point_replicated,
  'w'   => :set_line_width,
  'W'   => :set_clipping_path_with_nonzero,
  'W*'  => :set_clipping_path_with_even_odd,
  'y'   => :append_curved_segment_final_point_replicated,
  '\''  => :move_to_next_line_and_show_text,
  '"'   => :set_spacing_next_line_show_text,
}

Instance Method Summary collapse

Constructor Details

#initialize(receiver, xref) ⇒ Content

Create a new PDF::Reader::Content object to process the contents of PDF file

  • receiver - an object containing the required callback methods

  • xref - a PDF::Reader::Xref object that contains references to all the objects in a PDF file



251
252
253
254
255
# File 'lib/pdf/reader/content.rb', line 251

def initialize (receiver, xref)
  @receiver = receiver
  @xref     = xref
  @fonts ||= {}
end

Instance Method Details

#callback(name, params = []) ⇒ Object

calls the name callback method on the receiver class with params as the arguments



471
472
473
# File 'lib/pdf/reader/content.rb', line 471

def callback (name, params=[])
  @receiver.send(name, *params) if @receiver.respond_to?(name)
end

#content_stream(instructions) ⇒ Object

Reads a PDF content stream and calls all the appropriate callback methods for the operators it contains



351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
# File 'lib/pdf/reader/content.rb', line 351

def content_stream (instructions)
  instructions = instructions.unfiltered_data if instructions.kind_of?(PDF::Reader::Stream)
  @buffer =   Buffer.new(StringIO.new(instructions))
  @parser =   Parser.new(@buffer, @xref)
  @params ||= []

  while (token = @parser.parse_token(OPERATORS))
    if token.kind_of?(Token) and OPERATORS.has_key?(token)
      @current_font = @params.first if OPERATORS[token] == :set_text_font_and_size

      # handle special cases in response to certain operators
      if OPERATORS[token].to_s.include?("show_text") && @fonts[@current_font]
        # convert any text to utf-8
        @params = @fonts[@current_font].to_utf8(@params)
      elsif token == "ID"
        # inline image data, first convert the current params into a more familiar hash
        map = {}
        @params.each_slice(2) do |a|
          map[a.first] = a.last
        end
        @params = [map]
        # read the raw image data from the buffer without tokenising
        @params << @buffer.read_until("EI")
      end

      callback(OPERATORS[token], @params)

      if OPERATORS[token] == :invoke_xobject
        xobject_label = @params.first
        @params.clear
        walk_xobject_form(xobject_label)
      else
        @params.clear
      end
    else
      @params << token
    end
  end
rescue EOFError => e
  raise MalformedPDFError, "End Of File while processing a content stream"
end

#current_resourcesObject

Return a merged hash of all resources that are current. Pages, page and xobject



341
342
343
344
345
346
347
# File 'lib/pdf/reader/content.rb', line 341

def current_resources
  hash = {}
  resources.each do |res|
    hash.merge!(res)
  end
  hash
end

#document(root) ⇒ Object

Begin processing the document



282
283
284
285
286
# File 'lib/pdf/reader/content.rb', line 282

def document (root)
  callback(:begin_document, [root])
  walk_pages(@xref.object(root[:Pages]))
  callback(:end_document)
end

#metadata(root, info) ⇒ Object

Begin processing the document metadata



258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
# File 'lib/pdf/reader/content.rb', line 258

def  (root, info)
  info = decode_strings(info)

  # may be useful to some people
  callback(:pdf_version, @xref.pdf_version)

  # ye olde metadata
  callback(:metadata, [info]) if info

  # new style xml metadata
  if root[:Metadata]
    stream = @xref.object(root[:Metadata])
    callback(:xml_metadata,stream.unfiltered_data)
  end

  # page count
  if (pages = @xref.object(root[:Pages]))
    if (count = @xref.object(pages[:Count]))
      callback(:page_count, count.to_i)
    end
  end
end

#resolve_references(obj) ⇒ Object

Convert any PDF::Reader::Resource objects into a real object



456
457
458
459
460
461
462
463
464
465
466
467
468
# File 'lib/pdf/reader/content.rb', line 456

def resolve_references(obj)
  case obj
  when PDF::Reader::Stream then
    obj.hash = resolve_references(obj.hash)
    obj
  when PDF::Reader::Reference then
    resolve_references(@xref.object(obj))
  when Hash                   then obj.each { |key,val| obj[key] = resolve_references(val) }
  when Array                  then obj.collect { |item| resolve_references(item) }
  else
    obj
  end
end

#walk_pages(page) ⇒ Object

Walk over all pages in the PDF file, calling the appropriate callbacks for each page and all its content



290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
# File 'lib/pdf/reader/content.rb', line 290

def walk_pages (page)

  # extract page content
  if page[:Type] == :Pages
    callback(:begin_page_container, [page])
    res = @xref.object(page[:Resources])
    resources.push res if res
    @xref.object(page[:Kids]).each {|child| walk_pages(@xref.object(child))}
    resources.pop if res
    callback(:end_page_container)
  elsif page[:Type] == :Page
    callback(:begin_page, [page])
    res = @xref.object(page[:Resources])
    resources.push res if res
    walk_resources(current_resources)

    if @xref.object(page[:Contents]).kind_of?(Array)
      contents = @xref.object(page[:Contents])
    else
      contents = [page[:Contents]]
    end

    contents.each do |content|
      obj = @xref.object(content)
      content_stream(obj)
    end if page.has_key?(:Contents) and page[:Contents]

    resources.pop if res
    callback(:end_page)
  end
end

#walk_resources(resources) ⇒ Object



393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
# File 'lib/pdf/reader/content.rb', line 393

def walk_resources(resources)
  return unless resources.respond_to?(:[])

  resources = resolve_references(resources)

  # extract any procset information
  if resources[:ProcSet]
    callback(:resource_procset, resources[:ProcSet])
  end

  # extract any xobject information
  if resources[:XObject]
    @xref.object(resources[:XObject]).each do |name, val|
      callback(:resource_xobject, [name, @xref.object(val)])
    end
  end

  # extract any extgstate information
  if resources[:ExtGState]
    @xref.object(resources[:ExtGState]).each do |name, val|
      callback(:resource_extgstate, [name, @xref.object(val)])
    end
  end

  # extract any colorspace information
  if resources[:ColorSpace]
    @xref.object(resources[:ColorSpace]).each do |name, val|
      callback(:resource_colorspace, [name, @xref.object(val)])
    end
  end

  # extract any pattern information
  if resources[:Pattern]
    @xref.object(resources[:Pattern]).each do |name, val|
      callback(:resource_pattern, [name, @xref.object(val)])
    end
  end

  # extract any font information
  if resources[:Font]
    @xref.object(resources[:Font]).each do |label, desc|
      desc = @xref.object(desc)
      @fonts[label] = PDF::Reader::Font.new
      @fonts[label].label = label
      @fonts[label].subtype = desc[:Subtype] if desc[:Subtype]
      @fonts[label].basefont = desc[:BaseFont] if desc[:BaseFont]
      @fonts[label].encoding = PDF::Reader::Encoding.new(@xref.object(desc[:Encoding]))
      @fonts[label].descendantfonts = desc[:DescendantFonts] if desc[:DescendantFonts]
      if desc[:ToUnicode]
        # this stream is a cmap
        begin
          stream = desc[:ToUnicode]
          @fonts[label].tounicode = PDF::Reader::CMap.new(stream.unfiltered_data)
        rescue
          # if the CMap fails to parse, don't worry too much. Means we can't translate the text properly
        end
      end
      callback(:resource_font, [label, @fonts[label]])
    end
  end
end

#walk_xobject_form(label) ⇒ Object

Retreive the XObject for the supplied label and if it’s a Form, walk it like a regular page content stream.



325
326
327
328
329
330
331
332
333
334
335
336
# File 'lib/pdf/reader/content.rb', line 325

def walk_xobject_form(label)
  xobjects = @xref.object(current_resources[:XObject]) || {}
  xobject  = @xref.object(xobjects[label])

  if xobject && xobject.hash[:Subtype] == :Form
    callback(:begin_form_xobject)
    resources = @xref.object(xobject.hash[:Resources])
    walk_resources(resources) if resources
    content_stream(xobject)
    callback(:end_form_xobject)
  end
end