Class: HexaPDF::Document

Inherits:
Object
  • Object
show all
Defined in:
lib/hexapdf/document.rb

Overview

Represents one PDF document.

A PDF document consists of (indirect) objects, so the main job of this class is to provide methods for working with these objects. However, since a PDF document may also be incrementally updated and can therefore contain one or more revisions, there are also methods to work with these revisions.

Note: This class provides everything to work on PDF documents on a low-level basis. This means that there are no convenience methods for higher PDF functionality whatsoever.

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(io: nil, decryption_opts: {}, config: {}) ⇒ Document

Creates a new PDF document, either an empty one or one read from the provided io.

When an IO object is provided and it contains an encrypted PDF file, it is automatically decrypted behind the scenes. The decryption_opts argument has to be set appropriately in this case.

Options:

io

If an IO object is provided, then this document can read PDF objects from this IO object, otherwise it can only contain created PDF objects.

decryption_opts

A hash with options for decrypting the PDF objects loaded from the IO.

config

A hash with configuration options that is deep-merged into the default configuration (see DefaultDocumentConfiguration), meaning that direct sub-hashes are merged instead of overwritten.



123
124
125
126
127
128
129
130
131
132
133
134
135
# File 'lib/hexapdf/document.rb', line 123

def initialize(io: nil, decryption_opts: {}, config: {})
  @config = Configuration.with_defaults(config)
  @version = '1.2'

  @revisions = Revisions.from_io(self, io)
  if encrypted? && @config['document.auto_decrypt']
    @security_handler = Encryption::SecurityHandler.set_up_decryption(self, decryption_opts)
  else
    @security_handler = nil
  end

  @listeners = {}
end

Instance Attribute Details

#configObject (readonly)

The configuration for the document.



102
103
104
# File 'lib/hexapdf/document.rb', line 102

def config
  @config
end

#revisionsObject (readonly)

The revisions of the document.



105
106
107
# File 'lib/hexapdf/document.rb', line 105

def revisions
  @revisions
end

Class Method Details

.open(filename, **kwargs) ⇒ Object

:call-seq:

Document.open(filename, **docargs)                   -> doc
Document.open(filename, **docargs) {|doc| block}     -> obj

Creates a new PDF Document object for the given file.

Depending on whether a block is provided, the functionality is different:

  • If no block is provided, the whole file is instantly read into memory and the PDF Document created for it is returned.

  • If a block is provided, the file is opened and a PDF Document is created for it. The created document is passed as an argument to the block and when the block returns the associated file object is closed. The value of the block will be returned.

The block version is useful, for example, when you are dealing with a large file and you only need a small portion of it.

The provided keyword arguments (except io) are passed on unchanged to Document.new.



91
92
93
94
95
96
97
98
99
# File 'lib/hexapdf/document.rb', line 91

def self.open(filename, **kwargs)
  if block_given?
    File.open(filename, 'rb') do |file|
      yield(new(**kwargs, io: file))
    end
  else
    new(**kwargs, io: StringIO.new(File.binread(filename)))
  end
end

Instance Method Details

#add(obj, revision: :current, **wrap_opts) ⇒ Object

:call-seq:

doc.add(obj, revision: :current, **wrap_opts)     -> indirect_object

Adds the object to the specified revision of the document and returns the wrapped indirect object.

The object can either be a native Ruby object (Hash, Array, Integer, …) or a HexaPDF::Object. If it is not the latter, #wrap is called with the object and the additional keyword arguments.

If the revision option is :current, the current revision is used. Otherwise revision should be a revision index.



191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
# File 'lib/hexapdf/document.rb', line 191

def add(obj, revision: :current, **wrap_opts)
  obj = wrap(obj, wrap_opts) unless obj.kind_of?(HexaPDF::Object)

  revision = (revision == :current ? @revisions.current : @revisions.revision(revision))
  if revision.nil?
    raise ArgumentError, "Invalid revision index specified"
  end

  if obj.document? && obj.document != self
    raise HexaPDF::Error, "Can't add object that is already attached to another document"
  end
  obj.document = self

  if obj.indirect? && (rev_obj = revision.object(obj.oid))
    if rev_obj.equal?(obj)
      return obj
    else
      raise HexaPDF::Error, "Can't add object because the specified revision already has " \
        "an object with object number #{obj.oid}"
    end
  end

  obj.oid = @revisions.map(&:next_free_oid).max unless obj.indirect?

  revision.add(obj)
end

#catalogObject

Returns the document’s catalog, the root of the object tree.



448
449
450
# File 'lib/hexapdf/document.rb', line 448

def catalog
  trailer.catalog
end

#delete(ref, revision: :all, mark_as_free: true) ⇒ Object

:call-seq:

doc.delete(ref, revision: :all)
doc.delete(oid, revision: :all)

Deletes the indirect object specified by an exact reference or by an object number from the document.

Options:

revision

Specifies from which revisions the object should be deleted:

:all

Delete the object from all revisions.

:current

Delete the object only from the current revision.

mark_as_free

If true, objects are only marked as free objects instead of being actually deleted.



234
235
236
237
238
239
240
241
242
243
# File 'lib/hexapdf/document.rb', line 234

def delete(ref, revision: :all, mark_as_free: true)
  case revision
  when :current
    @revisions.current.delete(ref, mark_as_free: mark_as_free)
  when :all
    @revisions.each {|rev| rev.delete(ref, mark_as_free: mark_as_free)}
  else
    raise ArgumentError, "Unsupported option revision: #{revision}"
  end
end

#deref(obj) ⇒ Object

Dereferences the given object.

Return the object itself if it is not a reference, or the indirect object specified by the reference.



161
162
163
# File 'lib/hexapdf/document.rb', line 161

def deref(obj)
  obj.kind_of?(Reference) ? object(obj) : obj
end

#dispatch_message(name, *args) ⇒ Object

Dispatches the message name with the given arguments to all registered listeners.



414
415
416
# File 'lib/hexapdf/document.rb', line 414

def dispatch_message(name, *args)
  @listeners[name] && @listeners[name].each {|obj| obj.call(*args)}
end

#each(current: true, &block) ⇒ Object

:call-seq:

doc.each(current: true) {|obj| block }        -> doc
doc.each(current: true) {|obj, rev| block }   -> doc
doc.each(current: true)                       -> Enumerator

Calls the given block once for every object in the PDF document. The block may either accept only the object or the object and the revision it is in.

By default, only the current version of each object is returned which implies that each object number is yielded exactly once. If the current option is false, all stored objects from newest to oldest are returned, not only the current version of each object.

The current option can make a difference because the document can contain multiple revisions:

  • Multiple revisions may contain objects with the same object and generation numbers, e.g. two (different) objects with oid/gen [3,0].

  • Additionally, there may also be objects with the same object number but different generation numbers in different revisions, e.g. one object with oid/gen [3,0] and one with oid/gen [3,1].



387
388
389
390
391
392
393
394
395
396
397
398
399
400
# File 'lib/hexapdf/document.rb', line 387

def each(current: true, &block)
  return to_enum(__method__, current: current) unless block_given?

  yield_rev = (block.arity == 2)
  oids = {}
  @revisions.reverse_each do |rev|
    rev.each do |obj|
      next if current && oids.include?(obj.oid)
      (yield_rev ? yield(obj, rev) : yield(obj))
      oids[obj.oid] = true
    end
  end
  self
end

#encrypt(name: :Standard, **options) ⇒ Object

Encrypts the document.

This is done by setting up a security handler for this purpose and populating the trailer’s Encrypt dictionary accordingly. The actual encryption, however, is only done when writing the document.

The security handler used for encrypting is selected via the name argument. All other arguments are passed on the security handler.

If the document should not be encrypted, the name argument has to be set to nil. This removes the security handler and deletes the trailer’s Encrypt dictionary.

See: HexaPDF::Encryption::SecurityHandler#set_up_encryption and HexaPDF::Encryption::StandardSecurityHandler::EncryptionOptions for possible encryption options.



498
499
500
501
502
503
504
505
# File 'lib/hexapdf/document.rb', line 498

def encrypt(name: :Standard, **options)
  if name.nil?
    trailer.delete(:Encrypt)
    @security_handler = nil
  else
    @security_handler = Encryption::SecurityHandler.set_up_encryption(self, name, **options)
  end
end

#encrypted?Boolean

Returns true if the document is encrypted.

Returns:

  • (Boolean)


479
480
481
# File 'lib/hexapdf/document.rb', line 479

def encrypted?
  !trailer[:Encrypt].nil?
end

#fontsObject

Returns the FontUtils object that provides convenience methods for working with fonts.



425
426
427
# File 'lib/hexapdf/document.rb', line 425

def fonts
  @font_utils ||= FontUtils.new(self)
end

#import(obj) ⇒ Object

:call-seq:

doc.import(obj)     -> imported_object

Imports the given, with a different document associated PDF object and returns the imported object.

If the same argument is provided in multiple invocations, the import is done only once and the previously imoprted object is returned.

See: Importer



255
256
257
258
259
260
261
# File 'lib/hexapdf/document.rb', line 255

def import(obj)
  if !obj.kind_of?(HexaPDF::Object) || !obj.document? || obj.document == self
    raise ArgumentError, "Importing only works for PDF objects associated " \
      "with another document"
  end
  HexaPDF::Importer.for(source: obj.document, destination: self).import(obj)
end

#object(ref) ⇒ Object

:call-seq:

doc.object(ref)    -> obj or nil
doc.object(oid)    -> obj or nil

Returns the current version of the indirect object for the given exact reference or for the given object number.

For references to unknown objects, nil is returned but free objects are represented by a PDF Null object, not by nil!

See: PDF1.7 s7.3.9



148
149
150
151
152
153
154
155
# File 'lib/hexapdf/document.rb', line 148

def object(ref)
  i = @revisions.size - 1
  while i >= 0
    return @revisions[i].object(ref) if @revisions[i].object?(ref)
    i -= 1
  end
  nil
end

#object?(ref) ⇒ Boolean

:call-seq:

doc.object?(ref)    -> true or false
doc.object?(oid)    -> true or false

Returns true if the the document contains an indirect object for the given exact reference or for the given object number.

Even though this method might return true for some references, #object may return nil because this method takes all revisions into account. Also see the discussion on #each for more information.

Returns:

  • (Boolean)


175
176
177
# File 'lib/hexapdf/document.rb', line 175

def object?(ref)
  @revisions.any? {|rev| rev.object?(ref)}
end

#pagesObject

Returns the root node of the document’s page tree.

See: HexaPDF::Type::PageTreeNode



455
456
457
# File 'lib/hexapdf/document.rb', line 455

def pages
  catalog.pages
end

#register_listener(name, callable = nil, &block) ⇒ Object

:call-seq:

doc.register_listener(name, callable)             -> callable
doc.register_listener(name) {|*args| block}       -> block

Registers the given listener for the message name.



407
408
409
410
411
# File 'lib/hexapdf/document.rb', line 407

def register_listener(name, callable = nil, &block)
  callable ||= block
  (@listeners[name] ||= []) << callable
  callable
end

#security_handlerObject

Returns the security handler that is used for decrypting or encrypting the document, or nil if none is set.

  • If the document was created by reading an existing file and the document was automatically decrypted, then this method returns the handler for decrypting.

  • Once the #encrypt method is called, the specified security handler for encrypting is returned.



515
516
517
# File 'lib/hexapdf/document.rb', line 515

def security_handler
  @security_handler
end

#task(name, **opts, &block) ⇒ Object

Executes the given task and returns its result.

Tasks provide an extensible way for performing operations on a PDF document without cluttering the Document interface.

See Task for more information.



435
436
437
438
439
440
# File 'lib/hexapdf/document.rb', line 435

def task(name, **opts, &block)
  task = GlobalConfiguration.constantize('task.map'.freeze, name) do
    raise HexaPDF::Error, "No task named '#{name}' is available"
  end
  task.call(self, **opts, &block)
end

#trailerObject

Returns the trailer dictionary for the document.



443
444
445
# File 'lib/hexapdf/document.rb', line 443

def trailer
  @revisions.current.trailer
end

#unwrap(object, seen = {}) ⇒ Object

:call-seq:

document.unwrap(obj)   -> unwrapped_obj

Recursively unwraps the object to get native Ruby objects (i.e. Hash, Array, Integer, … instead of HexaPDF::Reference and HexaPDF::Object).



344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
# File 'lib/hexapdf/document.rb', line 344

def unwrap(object, seen = {})
  object = deref(object)
  object = object.data if object.kind_of?(HexaPDF::Object)
  if seen.key?(object)
    raise HexaPDF::Error, "Can't unwrap a recursive structure"
  end

  case object
  when Hash
    seen[object] = true
    object.each_with_object({}) {|(key, val), memo| memo[key] = unwrap(val, seen.dup)}
  when Array
    seen[object] = true
    object.map {|inner_o| unwrap(inner_o, seen.dup)}
  when HexaPDF::PDFData
    seen[object] = true
    unwrap(object.value, seen.dup)
  else
    object
  end
end

#utilsObject

Returns a DocumentUtils object that provides convenience methods for often used functionality like adding images.



420
421
422
# File 'lib/hexapdf/document.rb', line 420

def utils
  @utils ||= DocumentUtils.new(self)
end

#validate(auto_correct: true, &block) ⇒ Object

:call-seq:

doc.validate(auto_correct: true)                               -> true or false
doc.validate(auto_correct: true) {|msg, correctable| block }   -> true or false

Validates all objects of the document, with optional auto-correction, and returns true if everything is fine.

If a block is given, it is called on validation problems.

See HexaPDF::Object#validate for more information.



529
530
531
532
533
534
535
# File 'lib/hexapdf/document.rb', line 529

def validate(auto_correct: true, &block)
  result = trailer.validate(auto_correct: auto_correct, &block)
  each(current: false) do |obj|
    result &&= obj.validate(auto_correct: auto_correct, &block)
  end
  result
end

#versionObject

Returns the PDF document’s version as string (e.g. ‘1.4’).

This method takes the file header version and the catalog’s /Version key into account. If a version has been set manually and the catalog’s /Version key refers to a later version, the later version is used.

See: PDF1.7 s7.2.2



466
467
468
469
# File 'lib/hexapdf/document.rb', line 466

def version
  catalog_version = (catalog[:Version] || '1.0'.freeze).to_s
  (@version < catalog_version ? catalog_version : @version)
end

#version=(value) ⇒ Object

Sets the version of the PDF document. The argument must be a string in the format ‘M.N’ where M is the major version and N the minor version (e.g. ‘1.4’ or ‘2.0’).

Raises:

  • (ArgumentError)


473
474
475
476
# File 'lib/hexapdf/document.rb', line 473

def version=(value)
  raise ArgumentError, "PDF version must follow format M.N" unless value.to_s =~ /\A\d\.\d\z/
  @version = value.to_s
end

#wrap(obj, type: nil, subtype: nil, oid: nil, gen: nil, stream: nil) ⇒ Object

Wraps the given object inside a HexaPDF::Object class which allows one to use convenience functions to work with the object.

The obj argument can also be a HexaPDF::Object object so that it can be re-wrapped if needed.

The class of the returned object is always a subclass of HexaPDF::Object (or of HexaPDF::Stream if a stream is given). Which subclass is used, depends on the values of the type and subtype options as well as on the ‘object.type_map’ and ‘object.subtype_map’ global configuration options:

  • If only type or subtype is provided and a mapping is found, the resulting class is used.

  • If both type and subtype are provided and and a mapping for subtype is found, the resulting class is used. If no mapping is found but there is a mapping for type, the mapped class is used.

  • If there is no valid class after the above steps, HexaPDF::Stream is used if a stream is given, HexaPDF::Dictionary is used if the given objecct is a hash or else HexaPDF::Object is used.

Options:

:type

(Symbol or Class) The type of a PDF object that should be used for wrapping. This could be, for example, :Pages. If a class object is provided, it is used directly instead of the type detection system.

:subtype

(Symbol) The subtype of a PDF object which further qualifies a type. For example, image objects in PDF have a type of :XObject and a subtype of :Image.

:oid

(Integer) The object number that should be set on the wrapped object. Defaults to 0 or the value of the given object’s object number.

:gen

(Integer) The generation number that should be set on the wrapped object. Defaults to 0 or the value of the given object’s generation number.

:stream

(String or StreamData) The stream object which should be set on the wrapped object.



302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
# File 'lib/hexapdf/document.rb', line 302

def wrap(obj, type: nil, subtype: nil, oid: nil, gen: nil, stream: nil)
  data = if obj.kind_of?(HexaPDF::Object)
           obj.data
         else
           HexaPDF::PDFData.new(obj)
         end
  data.oid = oid if oid
  data.gen = gen if gen
  data.stream = stream if stream

  if type.kind_of?(Class)
    klass = type
  else
    default = if data.stream
                HexaPDF::Stream
              elsif data.value.kind_of?(Hash)
                HexaPDF::Dictionary
              else
                HexaPDF::Object
              end
    if data.value.kind_of?(Hash)
      type ||= deref(data.value[:Type])
      subtype ||= deref(data.value[:Subtype])
    end

    if subtype
      klass = GlobalConfiguration.constantize('object.subtype_map'.freeze, subtype)
    end
    if type && !klass
      klass = GlobalConfiguration.constantize('object.type_map'.freeze, type)
    end
    klass ||= default
  end

  klass.new(data, document: self)
end

#write(file_or_io, validate: true, update_fields: true, optimize: false) ⇒ Object

:call-seq:

doc.write(filename, validate: true, update_fields: true, optimize: false)
doc.write(io, validate: true, update_fields: true, optimize: false)

Writes the document to the given file (in case io is a String) or IO stream.

Before the document is written, it is validated using #validate and an error is raised if the document is not valid. However, this step can be skipped if needed.

Options:

validate

Validates the document and raises an error if an uncorrectable problem is found.

update_fields

Updates the /ID field in the trailer dictionary as well as the /ModDate field in the trailer’s /Info dictionary so that it is clear that the document has been updated.

optimize

Optimize the file size by using object and cross-reference streams. This will raise the PDF version to at least 1.5.



558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
# File 'lib/hexapdf/document.rb', line 558

def write(file_or_io, validate: true, update_fields: true, optimize: false)
  dispatch_message(:complete_objects)

  if update_fields
    trailer.update_id
    trailer.info[:ModDate] = Time.now
  end

  if validate
    self.validate(auto_correct: true) do |msg, correctable|
      next if correctable
      raise HexaPDF::Error, "Validation error: #{msg}"
    end
  end

  if optimize
    task(:optimize, object_streams: :generate)
    self.version = '1.5' if version < '1.5'
  end

  dispatch_message(:before_write)

  if file_or_io.kind_of?(String)
    File.open(file_or_io, 'w+') {|file| Writer.write(self, file)}
  else
    Writer.write(self, file_or_io)
  end
end