Class: HexaPDF::Tokenizer

Inherits: Object

Defined in: lib/hexapdf/tokenizer.rb

Overview

Tokenizes the content of an IO object following the PDF rules.

See: PDF1.7 s7.2

Direct Known Subclasses

Content::Tokenizer

Defined Under Namespace

Classes: Token

Constant Summary

TOKEN_DICT_START = Token.new('<<'.b)

TOKEN_DICT_END = Token.new('>>'.b)

TOKEN_ARRAY_START = Token.new('['.b)

TOKEN_ARRAY_END = Token.new(']'.b)

NO_MORE_TOKENS = ::Object.new

This object is returned when there are no more tokens to read.

WHITESPACE = " \n\r\0\t\f"

Characters defined as whitespace. See: PDF1.7 s7.2.2

DELIMITER = "()<>{}/[]%"

Characters defined as delimiters. See: PDF1.7 s7.2.2

WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/

WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/
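The two regular expressions are built by interpolating the whitespace and delimiter strings into character classes. The following is a minimal stand-alone sketch of how such patterns behave on a StringScanner. It copies the constant values into local variables rather than loading HexaPDF, and `first_name_sketch` is a hypothetical helper that does not exist in the library.

```ruby
require 'strscan'

# Hypothetical helper: skips leading whitespace, steps over the '/' of a PDF
# name and reads up to (but not including) the next whitespace or delimiter.
# Returns [number_of_skipped_bytes, name_string].
def first_name_sketch(data)
  whitespace = " \n\r\0\t\f"   # same characters as WHITESPACE
  delimiter = "()<>{}/[]%"     # same characters as DELIMITER
  whitespace_multi_re = /[#{whitespace}]+/
  ws_or_delim_re = /(?=[#{Regexp.escape(whitespace + delimiter)}])/

  ss = StringScanner.new(data.b)
  skipped = ss.skip(whitespace_multi_re)
  ss.pos += 1                  # step over the leading '/'
  [skipped, ss.scan_until(ws_or_delim_re)]
end

first_name_sketch(" \t\n/Type /Page") # => [3, "Type"]
```

Because the second pattern is a zero-width lookahead, the scan stops in front of the delimiter and leaves it in place for the next read.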

Instance Attribute Summary

#io ⇒ Object readonly

The IO object from which the tokens are read.

Instance Method Summary

#initialize(io) ⇒ Tokenizer constructor

Creates a new tokenizer.

#next_byte ⇒ Object

Reads the byte (an integer) at the current position and advances the scan pointer.

#next_integer_or_keyword ⇒ Object

Returns a single integer or keyword token read from the current position and advances the scan pointer.

#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object

Returns the PDF object at the current position.

#next_token ⇒ Object

Returns a single token read from the current position and advances the scan pointer.

#next_xref_entry ⇒ Object

Reads the cross-reference subsection entry at the current position and advances the scan pointer.

#peek_token ⇒ Object

Returns the next token but does not advance the scan pointer.

#pos ⇒ Object

Returns the current position of the tokenizer inside the IO object.

#pos=(pos) ⇒ Object

Sets the position at which the next token should be read.

#scan_until(re) ⇒ Object

Utility method for scanning until the given regular expression matches.

#skip_whitespace ⇒ Object

Skips all whitespace at the current position.

Constructor Details

#initialize(io) ⇒ `Tokenizer`

Creates a new tokenizer.

# File 'lib/hexapdf/tokenizer.rb', line 77

def initialize(io)
  @io = io
  @ss = StringScanner.new(''.b)
  @original_pos = -1
  self.pos = 0
end

Instance Attribute Details

#io ⇒ `Object` (readonly)

The IO object from which the tokens are read.

# File 'lib/hexapdf/tokenizer.rb', line 74

def io
  @io
end

Instance Method Details

#next_byte ⇒ `Object`

Reads the byte (an integer) at the current position and advances the scan pointer.

# File 'lib/hexapdf/tokenizer.rb', line 214

def next_byte
  prepare_string_scanner(1)
  @ss.pos += 1
  @ss.string.getbyte(@ss.pos - 1)
end

#next_integer_or_keyword ⇒ `Object`

Returns a single integer or keyword token read from the current position and advances the scan pointer. If the current position doesn’t contain such a token, nil is returned without advancing the scan pointer. The value NO_MORE_TOKENS is returned if there are no more tokens available.

Initial runs of whitespace characters are ignored.

Note: This is a special method meant for use with reconstructing the cross-reference table!

# File 'lib/hexapdf/tokenizer.rb', line 199

def next_integer_or_keyword
  skip_whitespace
  byte = @ss.string.getbyte(@ss.pos) || -1
  if 48 <= byte && byte <= 57
    parse_number
  elsif (97 <= byte && byte <= 122) || (65 <= byte && byte <= 90)
    parse_keyword
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else
    nil
  end
end
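The dispatch on the first byte can be tried in isolation. The sketch below is a stand-alone approximation and not HexaPDF code: `classify_byte` is a hypothetical helper, and it returns symbols in place of the parsed tokens that the real method produces.

```ruby
# Hypothetical helper mirroring the byte ranges checked above.
def classify_byte(byte)
  byte ||= -1                    # String#getbyte returns nil past the end of the data
  if byte.between?(48, 57)       # '0'..'9'
    :integer
  elsif byte.between?(97, 122) || byte.between?(65, 90) # 'a'..'z' or 'A'..'Z'
    :keyword
  elsif byte == -1
    :no_more_tokens
  end                            # any other byte falls through to nil
end

classify_byte("42 0 obj".getbyte(0)) # => :integer
classify_byte("xref".getbyte(0))     # => :keyword
classify_byte("/Name".getbyte(0))    # => nil
classify_byte("".getbyte(0))         # => :no_more_tokens
```

Note that a sign or a decimal point does not start an integer here, which is what distinguishes this method from #next_token.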

#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ `Object`

Returns the PDF object at the current position. This is different from #next_token because references, arrays and dictionaries consist of multiple tokens.

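If the allow_end_array_token argument is true, the ‘]’ token is permitted to facilitate the use of this method during array parsing. If the allow_keyword argument is true, any other bare token, such as a keyword, is returned as is instead of raising a MalformedPDFError.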

See: PDF1.7 s7.3

# File 'lib/hexapdf/tokenizer.rb', line 168

def next_object(allow_end_array_token: false, allow_keyword: false)
  token = next_token

  if token.kind_of?(Token)
    case token
    when TOKEN_DICT_START
      token = parse_dictionary
    when TOKEN_ARRAY_START
      token = parse_array
    when TOKEN_ARRAY_END
      unless allow_end_array_token
        raise HexaPDF::MalformedPDFError.new("Found invalid end array token ']'", pos: pos)
      end
    else
      unless allow_keyword
        raise HexaPDF::MalformedPDFError.new("Invalid object, got token #{token}", pos: pos)
      end
    end
  end

  token
end

#next_token ⇒ `Object`

Returns a single token read from the current position and advances the scan pointer.

Comments and runs of whitespace characters are skipped. The value NO_MORE_TOKENS is returned if there are no more tokens available.

# File 'lib/hexapdf/tokenizer.rb', line 110

def next_token
  prepare_string_scanner(20)
  prepare_string_scanner(20) while @ss.skip(WHITESPACE_MULTI_RE)
  byte = @ss.string.getbyte(@ss.pos) || -1
  if (48 <= byte && byte <= 57) || byte == 45 || byte == 43 || byte == 46 # 0..9 - + .
    parse_number
  elsif byte == 47 # /
    parse_name
  elsif byte == 40 # (
    parse_literal_string
  elsif byte == 60 # <
    if @ss.string.getbyte(@ss.pos + 1) != 60
      parse_hex_string
    else
      @ss.pos += 2
      TOKEN_DICT_START
    end
  elsif byte == 62 # >
    unless @ss.string.getbyte(@ss.pos + 1) == 62
      raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos)
    end
    @ss.pos += 2
    TOKEN_DICT_END
  elsif byte == 91 # [
    @ss.pos += 1
    TOKEN_ARRAY_START
  elsif byte == 93 # ]
    @ss.pos += 1
    TOKEN_ARRAY_END
  elsif byte == 123 || byte == 125 # { }
    Token.new(@ss.get_byte)
  elsif byte == 37 # %
    until @ss.skip_until(/(?=[\r\n])/)
      return NO_MORE_TOKENS unless prepare_string_scanner
    end
    next_token
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else # everything else consisting of regular characters
    parse_keyword
  end
end
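The comment branch above skips to the next end-of-line marker and then restarts tokenizing. A rough stand-alone illustration of that behaviour follows. It assumes the whole input is already in the scanner (the real method refills its buffer from the IO object), and `skip_comments_and_whitespace` is a hypothetical helper, not part of HexaPDF.

```ruby
require 'strscan'

# Hypothetical helper: advances the scanner past whitespace and '%' comments.
def skip_comments_and_whitespace(ss)
  loop do
    ss.skip(/[ \n\r\x00\t\f]+/)        # PDF whitespace characters
    break unless ss.peek(1) == '%'
    # Stop in front of the end-of-line marker. Without one, the comment
    # runs to the end of the data.
    ss.skip_until(/(?=[\r\n])/) || ss.terminate
  end
  ss
end

ss = skip_comments_and_whitespace(StringScanner.new("% header\n  42 0 obj"))
ss.rest # => "42 0 obj"
```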

#next_xref_entry ⇒ `Object`

Reads the cross-reference subsection entry at the current position and advances the scan pointer.

If the entry does not match the expected 20-byte format, the method yields the matched size to the caller (nil if nothing matched).

See: PDF1.7 s7.5.4

# File 'lib/hexapdf/tokenizer.rb', line 226

def next_xref_entry #:yield: matched_size
  prepare_string_scanner(20)
  unless @ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/) && @ss.matched_size == 20
    yield(@ss.matched_size)
  end
  [@ss[1].to_i, @ss[2].to_i, @ss[3]]
end
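To see what the pattern accepts, it can be applied to sample entries outside of the tokenizer. This is a sketch only: `parse_xref_entry_sketch` is a hypothetical helper that reuses the regular expression shown above and additionally returns the matched size, so that well-formed 20-byte entries can be told apart from shorter ones.

```ruby
require 'strscan'

# Hypothetical helper: returns [offset, generation, type, matched_size],
# or nil if the data does not look like a cross-reference entry at all.
def parse_xref_entry_sketch(data)
  ss = StringScanner.new(data.b)
  re = /(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/
  return nil unless ss.skip(re)
  [ss[1].to_i, ss[2].to_i, ss[3], ss.matched_size]
end

parse_xref_entry_sketch("0000000017 00000 n \n") # => [17, 0, "n", 20]
parse_xref_entry_sketch("0000000000 65535 f\n")  # => [0, 65535, "f", 19]
```

The second entry uses a one-byte end-of-line marker, so it matches with a size of 19. In #next_xref_entry this is the kind of case that is reported to the caller through the block.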

#peek_token ⇒ `Object`

Returns the next token but does not advance the scan pointer.

# File 'lib/hexapdf/tokenizer.rb', line 154

def peek_token
  pos = self.pos
  tok = next_token
  self.pos = pos
  tok
end
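Peeking is implemented as read, then rewind. The same save-and-restore pattern can be shown on a plain StringScanner. In this sketch `peek_word` is a hypothetical helper, and a simple word pattern stands in for full PDF tokenizing.

```ruby
require 'strscan'

# Hypothetical helper: returns the next word without moving the scan pointer.
def peek_word(ss)
  saved = ss.pos        # remember where we are
  word = ss.scan(/\w+/) # read ahead
  ss.pos = saved        # rewind so the next read sees the same data
  word
end

ss = StringScanner.new("endobj 5 0 obj")
peek_word(ss) # => "endobj"
ss.pos        # => 0
```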

#pos ⇒ `Object`

Returns the current position of the tokenizer inside the IO object.

Note that this position might be different from io.pos since the latter could have been changed somewhere else.

# File 'lib/hexapdf/tokenizer.rb', line 88

def pos
  @original_pos + @ss.pos
end

#pos=(pos) ⇒ `Object`

Sets the position at which the next token should be read.

Note that this does not set io.pos directly (at the moment of invocation)!

# File 'lib/hexapdf/tokenizer.rb', line 95

def pos=(pos)
  if pos >= @original_pos && pos <= @original_pos + @ss.string.size
    @ss.pos = pos - @original_pos
  else
    @original_pos = pos
    @next_read_pos = pos
    @ss.string.clear
    @ss.reset
  end
end
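The setter distinguishes two cases: the target position lies inside the data that is already buffered, or it lies outside of it. The arithmetic can be sketched as a pure function. `reposition` is a hypothetical helper and the buffer is reduced to its start offset and size, so this only illustrates the decision and not the actual buffer handling.

```ruby
# Hypothetical helper: decides what a position change means for a buffer
# that starts at original_pos in the IO object and holds buffer_size bytes.
def reposition(original_pos, buffer_size, new_pos)
  if new_pos >= original_pos && new_pos <= original_pos + buffer_size
    # Target is inside the buffered window: only the scanner offset changes.
    [:reuse_buffer, new_pos - original_pos]
  else
    # Target is outside: the buffer is dropped and refilled from new_pos later.
    [:discard_buffer, 0]
  end
end

reposition(100, 50, 120) # => [:reuse_buffer, 20]
reposition(100, 50, 10)  # => [:discard_buffer, 0]
```

In neither case is the IO object touched at the time of the call, which matches the note above that io.pos is not set directly.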

#scan_until(re) ⇒ `Object`

Utility method for scanning until the given regular expression matches.

If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.

# File 'lib/hexapdf/tokenizer.rb', line 246

def scan_until(re)
  until (data = @ss.scan_until(re))
    return nil unless prepare_string_scanner
  end
  data
end
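The loop retries the scan each time more data has been loaded. Since the private buffer-refill method is not shown here, the sketch below substitutes an assumed refill step that appends fixed-size chunks from a StringIO. `scan_until_sketch` and its `chunk_size` parameter are hypothetical.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical helper: scans for re, pulling more data from io as needed.
# Returns the text up to and including the match, or nil at end of data.
def scan_until_sketch(io, ss, re, chunk_size = 8)
  until (data = ss.scan_until(re))
    chunk = io.read(chunk_size) # nil once the IO object is exhausted
    return nil unless chunk
    ss << chunk                 # grow the buffer, then retry from the same position
  end
  data
end

io = StringIO.new("stream data endstream rest".b)
ss = StringScanner.new(''.b)
scan_until_sketch(io, ss, /endstream/) # => "stream data endstream"
```

A failed scan leaves the scan pointer where it was, so a match that straddles two chunks is still found once the second chunk has been appended.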

#skip_whitespace ⇒ `Object`

Skips all whitespace at the current position.

See: PDF1.7 s7.2.2

# File 'lib/hexapdf/tokenizer.rb', line 237

def skip_whitespace
  prepare_string_scanner
  prepare_string_scanner while @ss.skip(WHITESPACE_MULTI_RE)
end
