Class: PDF::Reader::Buffer

Inherits: Object

Defined in: lib/pdf/reader/buffer.rb

Overview

A string tokeniser that recognises PDF grammar. When passed an IO stream or a string, repeated calls to token() will return the next token from the source.

This is very low level, and getting the raw tokens is not very useful in itself.

This will usually be used in conjunction with PDF::Reader::Parser, which converts the raw tokens into objects we can work with (strings, ints, arrays, etc.).

Instance Attribute Summary

#pos ⇒ Object readonly

Returns the value of attribute pos.

Instance Method Summary

#empty? ⇒ Boolean

return true if there are no more tokens left.
#find_first_xref_offset ⇒ Object

return the byte offset where the first XRef table in the source can be found.
#initialize(io, opts = {}) ⇒ Buffer constructor

Creates a new buffer.
#read(bytes, opts = {}) ⇒ Object

return raw bytes from the underlying IO stream.
#read_until(needle) ⇒ Object

return raw bytes from the underlying IO stream, up to the first occurrence of needle.
#token ⇒ Object

return the next token from the source.

Constructor Details

#initialize(io, opts = {}) ⇒ `Buffer`

Creates a new buffer.

Params:

io - an IO stream or string with the raw data to tokenise

options:

:seek - a byte offset to seek to before starting to tokenise

# File 'lib/pdf/reader/buffer.rb', line 52

def initialize(io, opts = {})
  @io = io
  @tokens = []
  @options = opts

  @io.seek(opts[:seek]) if opts[:seek]
  @pos = @io.pos
end
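
The effect of the :seek option can be seen in isolation with a small stand-in. This is a minimal sketch, not the library class: SeekSketch is a hypothetical name that copies only the seek and pos lines from the listing above. Because that listing calls seek and pos on its argument, the sketch assumes string data is wrapped in a StringIO first.

```ruby
require "stringio"

# Hypothetical stand-in that copies only the seek/pos logic of the
# constructor listed above. It is not PDF::Reader::Buffer itself.
class SeekSketch
  attr_reader :pos

  def initialize(io, opts = {})
    @io = io
    @io.seek(opts[:seek]) if opts[:seek]  # jump to the requested byte offset
    @pos = @io.pos                        # remember where tokenising would begin
  end
end

data = "%PDF-1.4\n1 0 obj\n"

from_start  = SeekSketch.new(StringIO.new(data))
from_offset = SeekSketch.new(StringIO.new(data), :seek => 9)

puts from_start.pos
puts from_offset.pos
```

Without :seek the recorded position is wherever the stream already was; with :seek it is the requested offset.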

Instance Attribute Details

#pos ⇒ `Object` (readonly)

Returns the value of attribute pos.



# File 'lib/pdf/reader/buffer.rb', line 40

def pos
  @pos
end

Instance Method Details

#empty? ⇒ `Boolean`

return true if there are no more tokens left.

Returns:

(Boolean)

# File 'lib/pdf/reader/buffer.rb', line 63

def empty?
  prepare_tokens if @tokens.size < 3

  @tokens.empty?
end

#find_first_xref_offset ⇒ `Object`

return the byte offset where the first XRef table in the source can be found.

Raises:

(MalformedPDFError)

# File 'lib/pdf/reader/buffer.rb', line 139

def find_first_xref_offset
  @io.seek(-1024, IO::SEEK_END) rescue @io.seek(0)
  data = @io.read(1024)

  # the PDF 1.7 spec (section #3.4) says that EOL markers can be either \r, \n, or both.
  # To ensure we find the xref offset correctly, change all possible options to a
  # standard format
  data = data.gsub("\r\n","\n").gsub("\n\r","\n").gsub("\r","\n")
  lines = data.split(/\n/).reverse

  eof_index = nil

  lines.each_with_index do |line, index|
    if line =~ /^%%EOF\r?$/
      eof_index = index
      break
    end
  end

  raise MalformedPDFError, "PDF does not contain EOF marker" if eof_index.nil?
  raise MalformedPDFError, "PDF EOF marker does not follow offset" if eof_index >= lines.size-1
  lines[eof_index+1].to_i
end
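
The end-of-line normalisation and reverse scan above can be tried on a plain string. The following is a minimal sketch under stated assumptions: xref_offset_from_tail and SketchMalformedError are hypothetical names introduced here so the example runs without the library, and the input stands in for the last 1024 bytes of a file.

```ruby
# Hypothetical error class so the sketch runs without the pdf-reader gem.
class SketchMalformedError < StandardError; end

# Hypothetical helper: takes the tail of a PDF as a string and returns the
# number found on the line just before the %%EOF marker.
def xref_offset_from_tail(tail)
  # Collapse \r\n, \n\r and bare \r into \n so one split handles every style.
  normalised = tail.gsub("\r\n", "\n").gsub("\n\r", "\n").gsub("\r", "\n")
  lines = normalised.split("\n").reverse

  # Scanning the reversed lines finds the marker closest to the end of the data.
  eof_index = lines.index { |line| line =~ /^%%EOF$/ }

  raise SketchMalformedError, "no EOF marker" if eof_index.nil?
  raise SketchMalformedError, "no offset before EOF marker" if eof_index >= lines.size - 1

  # In the reversed list, the line after the marker is the one that precedes
  # it in the file.
  lines[eof_index + 1].to_i
end

crlf_tail = "trailer\r\n<< /Size 1 >>\r\nstartxref\r\n1234\r\n%%EOF\r\n"
cr_tail   = "startxref\r99\r%%EOF\r"

puts xref_offset_from_tail(crlf_tail)
puts xref_offset_from_tail(cr_tail)
```

Both inputs yield the same kind of result regardless of which EOL convention the file uses, which is the point of the normalisation step.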

#read(bytes, opts = {}) ⇒ `Object`

return raw bytes from the underlying IO stream.

bytes - the number of bytes to read

options:

:skip_eol - if true, the IO stream is advanced past any LF or CR
            bytes before it reads any data. This is to handle
            content streams, which have a CRLF or LF after the stream
            token.

# File 'lib/pdf/reader/buffer.rb', line 80

def read(bytes, opts = {})
  reset_pos

  if opts[:skip_eol]
    done = false
    while !done
      chr = @io.read(1)
      if chr.nil?
        return nil
      elsif chr != "\n" && chr != "\r"
        @io.seek(-1, IO::SEEK_CUR)
        done = true
      end
    end
  end

  bytes = @io.read(bytes)
  save_pos
  bytes
end
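
To illustrate what :skip_eol does, here is a minimal sketch that applies the same skipping loop to a bare StringIO. The helper name read_after_eol is hypothetical, and the sketch leaves out the reset_pos and save_pos bookkeeping that the real method performs.

```ruby
require "stringio"

# Hypothetical helper mirroring the :skip_eol branch of the listing above.
def read_after_eol(io, count)
  loop do
    chr = io.read(1)
    return nil if chr.nil?                # stream ended while skipping
    next if chr == "\n" || chr == "\r"    # keep skipping EOL bytes
    io.seek(-1, IO::SEEK_CUR)             # step back onto the first data byte
    break
  end
  io.read(count)
end

io = StringIO.new("stream\r\nBT /F1 12 Tf ET\nendstream")
io.read(6)                                # consume the "stream" keyword
payload = read_after_eol(io, 15)

puts payload
```

The CRLF that follows the stream keyword is discarded, so the returned bytes start at the first byte of the stream content.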

#read_until(needle) ⇒ `Object`

return raw bytes from the underlying IO stream. All bytes up to the first occurrence of needle will be returned. The match (if any) is not returned. The IO stream cursor is left on the first byte of the match.

needle - a string to search the IO stream for

# File 'lib/pdf/reader/buffer.rb', line 107

def read_until(needle)
  reset_pos
  out = ""
  size = needle.size

  while out[size * -1, size] != needle && !@io.eof?
    out << @io.read(1)
  end

  if out[size * -1, size] == needle
    out = out[0, out.size - size]
    @io.seek(size * -1, IO::SEEK_CUR)
  end

  save_pos
  out
end
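
The cursor behaviour described above (match excluded from the result, stream left on the first byte of the match) can be checked with a minimal sketch on a StringIO. The helper name read_until_sketch is hypothetical, and the position bookkeeping done by reset_pos and save_pos is omitted.

```ruby
require "stringio"

# Hypothetical helper that mirrors the read_until listing on a bare IO.
def read_until_sketch(io, needle)
  out = String.new                        # mutable buffer for the bytes read so far
  size = needle.size

  # Read one byte at a time until the buffer ends with the needle or the
  # stream runs out.
  out << io.read(1) while out[-size, size] != needle && !io.eof?

  if out[-size, size] == needle
    out = out[0, out.size - size]         # drop the needle from the result
    io.seek(-size, IO::SEEK_CUR)          # rewind so the needle is read next
  end
  out
end

io = StringIO.new("q 1 0 0 1 0 0 cm Q\nendstream\nendobj")
body = read_until_sketch(io, "endstream")
next_bytes = io.read(9)

puts body.inspect
puts next_bytes.inspect
```

When the needle never appears, the loop stops at the end of the stream and everything that was read is returned unchanged.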

#token ⇒ `Object`

return the next token from the source. Returns a string if a token is found, nil if there are no tokens left.

# File 'lib/pdf/reader/buffer.rb', line 128

def token
  reset_pos
  prepare_tokens if @tokens.size < 3
  merge_indirect_reference
  prepare_tokens if @tokens.size < 3

  @tokens.shift
end
