Class: HexaPDF::Tokenizer
- Inherits: Object
- Defined in: lib/hexapdf/tokenizer.rb
Overview
Tokenizes the content of an IO object following the PDF rules.
See: PDF1.7 s7.2
Direct Known Subclasses
Defined Under Namespace
Classes: Token
Constant Summary
- TOKEN_DICT_START = Token.new('<<'.b)
- TOKEN_DICT_END = Token.new('>>'.b)
- TOKEN_ARRAY_START = Token.new('['.b)
- TOKEN_ARRAY_END = Token.new(']'.b)
- NO_MORE_TOKENS = ::Object.new
  This object is returned when there are no more tokens to read.
- WHITESPACE = " \n\r\0\t\f"
  Characters defined as whitespace.
  See: PDF1.7 s7.2.2
- DELIMITER = "()<>{}/[]%"
  Characters defined as delimiters.
  See: PDF1.7 s7.2.2
- WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/
- WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/
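As a rough illustration of how the two regular expressions might be applied, the sketch below rebuilds them from the documented constant values and uses them with a plain StringScanner. The helper name read_regular_run is hypothetical and not part of HexaPDF. The real tokenizer additionally refills its buffer from the IO object, which is left out here.

```ruby
require 'strscan'

# Hypothetical helper, not part of HexaPDF: skips leading whitespace and
# returns the run of regular characters that follows.
def read_regular_run(data)
  whitespace = " \n\r\0\t\f"   # value documented for WHITESPACE
  delimiter = "()<>{}/[]%"     # value documented for DELIMITER
  whitespace_multi_re = /[#{whitespace}]+/
  whitespace_or_delimiter_re = /(?=[#{Regexp.escape(whitespace + delimiter)}])/

  ss = StringScanner.new(data.b)
  ss.skip(whitespace_multi_re)
  # The lookahead is zero-width, so the terminating whitespace or delimiter
  # character is not consumed. Without a terminator, take the remainder.
  ss.scan_until(whitespace_or_delimiter_re) || ss.rest
end
```

Because the second expression is a lookahead, a delimiter such as '<' stays in the input and can start the next token.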
Instance Attribute Summary
- #io ⇒ Object (readonly)
  The IO object from which the tokens are read.
Instance Method Summary
- #initialize(io) ⇒ Tokenizer (constructor)
  Creates a new tokenizer.
- #next_byte ⇒ Object
  Reads the byte (an integer) at the current position and advances the scan pointer.
- #next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
  Returns the PDF object at the current position.
- #next_token ⇒ Object
  Returns a single token read from the current position and advances the scan pointer.
- #next_xref_entry ⇒ Object
  Reads the cross-reference subsection entry at the current position and advances the scan pointer.
- #peek_token ⇒ Object
  Returns the next token but does not advance the scan pointer.
- #pos ⇒ Object
  Returns the current position of the tokenizer inside the IO object.
- #pos=(pos) ⇒ Object
  Sets the position at which the next token should be read.
- #scan_until(re) ⇒ Object
  Utility method for scanning until the given regular expression matches.
- #skip_whitespace ⇒ Object
  Skips all whitespace at the current position.
Constructor Details
#initialize(io) ⇒ Tokenizer
Creates a new tokenizer.
# File 'lib/hexapdf/tokenizer.rb', line 74
def initialize(io)
  @io = io
  @ss = StringScanner.new(''.b)
  @original_pos = -1
  self.pos = 0
end
Instance Attribute Details
#io ⇒ Object (readonly)
The IO object from which the tokens are read.
# File 'lib/hexapdf/tokenizer.rb', line 71
def io
  @io
end
Instance Method Details
#next_byte ⇒ Object
Reads the byte (an integer) at the current position and advances the scan pointer.
# File 'lib/hexapdf/tokenizer.rb', line 189
def next_byte
  prepare_string_scanner(1)
  @ss.pos += 1
  @ss.string.getbyte(@ss.pos - 1)
end
#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
Returns the PDF object at the current position. This is different from #next_token because references, arrays and dictionaries consist of multiple tokens.
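If the allow_end_array_token argument is true, the ‘]’ token is permitted to facilitate the use of this method during array parsing. If the allow_keyword argument is true, any other token (for example, a keyword) is returned as is instead of raising HexaPDF::MalformedPDFError.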
See: PDF1.7 s7.3
# File 'lib/hexapdf/tokenizer.rb', line 165
def next_object(allow_end_array_token: false, allow_keyword: false)
  token = next_token

  if token.kind_of?(Token)
    case token
    when TOKEN_DICT_START
      token = parse_dictionary
    when TOKEN_ARRAY_START
      token = parse_array
    when TOKEN_ARRAY_END
      unless allow_end_array_token
        raise HexaPDF::MalformedPDFError.new("Found invalid end array token ']'", pos: pos)
      end
    else
      unless allow_keyword
        raise HexaPDF::MalformedPDFError.new("Invalid object, got token #{token}", pos: pos)
      end
    end
  end

  token
end
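To show why the ‘]’ token has to be tolerated during array parsing, here is a minimal sketch of a recursive object reader. It is a simplified stand-in, not HexaPDF's code: the class MiniParser, its symbol tokens and its parse_array body are assumptions made for illustration (the real parse_array is not shown in this page), and tokens come from a prepared Ruby array instead of an IO object.

```ruby
# Simplified stand-in for the tokenizer (hypothetical, not HexaPDF's API):
# tokens are taken from a prepared array instead of being read from an IO.
class MiniParser
  ARRAY_START = :'['
  ARRAY_END = :']'

  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Returns one complete object; arrays are assembled from several tokens.
  def next_object(allow_end_array_token: false)
    token = @tokens.shift
    case token
    when ARRAY_START
      parse_array
    when ARRAY_END
      raise ArgumentError, "Found invalid end array token ']'" unless allow_end_array_token
      token
    else
      token
    end
  end

  private

  # Assumed shape of array parsing: read objects until the end token shows
  # up, which is only possible because the end token is allowed here.
  def parse_array
    result = []
    loop do
      obj = next_object(allow_end_array_token: true)
      break if obj.equal?(ARRAY_END)
      raise ArgumentError, 'Unclosed array' if obj.nil?
      result << obj
    end
    result
  end
end
```

Outside of array parsing the end token stays an error, which matches the default of allow_end_array_token: false.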
#next_token ⇒ Object
Returns a single token read from the current position and advances the scan pointer.
Comments and a run of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.
# File 'lib/hexapdf/tokenizer.rb', line 107
def next_token
  prepare_string_scanner(20)
  prepare_string_scanner(20) while @ss.skip(WHITESPACE_MULTI_RE)
  byte = @ss.string.getbyte(@ss.pos) || -1
  if (48 <= byte && byte <= 57) || byte == 45 || byte == 43 || byte == 46 # 0..9 - + .
    parse_number
  elsif byte == 47 # /
    parse_name
  elsif byte == 40 # (
    parse_literal_string
  elsif byte == 60 # <
    if @ss.string.getbyte(@ss.pos + 1) != 60
      parse_hex_string
    else
      @ss.pos += 2
      TOKEN_DICT_START
    end
  elsif byte == 62 # >
    unless @ss.string.getbyte(@ss.pos + 1) == 62
      raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos)
    end
    @ss.pos += 2
    TOKEN_DICT_END
  elsif byte == 91 # [
    @ss.pos += 1
    TOKEN_ARRAY_START
  elsif byte == 93 # ]
    @ss.pos += 1
    TOKEN_ARRAY_END
  elsif byte == 123 || byte == 125 # { }
    Token.new(@ss.get_byte)
  elsif byte == 37 # %
    until @ss.skip_until(/(?=[\r\n])/)
      return NO_MORE_TOKENS unless prepare_string_scanner
    end
    next_token
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else # everything else consisting of regular characters
    parse_keyword
  end
end
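The dispatch on the first byte can be condensed into a small classification table. The following sketch only names the branch that would be taken. The helper token_kind and its symbolic return values are hypothetical, and unlike the real method it neither skips whitespace nor parses anything.

```ruby
# Hypothetical helper: maps the first byte (and, for < and >, the second
# byte) of the input to the branch that the quoted #next_token source takes.
def token_kind(data)
  bytes = data.b
  byte = bytes.getbyte(0) || -1
  case byte
  when 48..57, 45, 43, 46 then :number                               # 0..9 - + .
  when 47 then :name                                                 # /
  when 40 then :literal_string                                       # (
  when 60 then bytes.getbyte(1) == 60 ? :dict_start : :hex_string    # << or <
  when 62 then bytes.getbyte(1) == 62 ? :dict_end : :error           # >> or stray >
  when 91 then :array_start                                          # [
  when 93 then :array_end                                            # ]
  when 123, 125 then :brace                                          # { }
  when 37 then :comment                                              # %
  when -1 then :no_more_tokens                                       # end of input
  else :keyword                                                      # regular characters
  end
end
```

Deciding on a single byte keeps the common cases cheap; only the '<' and '>' delimiters need one byte of lookahead to tell dictionaries from hex strings.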
#next_xref_entry ⇒ Object
Reads the cross-reference subsection entry at the current position and advances the scan pointer.
If the entry does not match the expected 20-byte format, the method yields the matched size (nil if nothing matched) to the caller.
See: PDF1.7 s7.5.4
# File 'lib/hexapdf/tokenizer.rb', line 201
def next_xref_entry #:yield: matched_size
  prepare_string_scanner(20)
  unless @ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/) && @ss.matched_size == 20
    yield(@ss.matched_size)
  end
  [@ss[1].to_i, @ss[2].to_i, @ss[3]]
end
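The regular expression above accepts a few sloppy line endings, which is why the matched size is compared against 20. A minimal sketch of that check, using a plain StringScanner on a single entry, could look as follows. The helper parse_xref_entry and its hash result are hypothetical; the real method yields to the caller instead of returning a flag.

```ruby
require 'strscan'

# Hypothetical helper that applies the regular expression from the quoted
# source to one cross-reference entry.
def parse_xref_entry(line)
  ss = StringScanner.new(line.b)
  matched = ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/)
  return nil unless matched

  # entry: [10-digit number, generation number, type ("n" or "f")]
  # well_formed: true only if the entry is exactly 20 bytes long
  { entry: [ss[1].to_i, ss[2].to_i, ss[3]], well_formed: matched == 20 }
end
```

An entry ending in a single "\n" is only 19 bytes long, so it still parses but is flagged as not well formed.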
#peek_token ⇒ Object
Returns the next token but does not advance the scan pointer.
# File 'lib/hexapdf/tokenizer.rb', line 151
def peek_token
  pos = self.pos
  tok = next_token
  self.pos = pos
  tok
end
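The save-and-restore pattern used here works on any scanner with a settable position. As a small sketch under that assumption, the hypothetical helper peek_word below applies it to a bare StringScanner, with whitespace-separated words standing in for PDF tokens.

```ruby
require 'strscan'

# Hypothetical helper: returns the next whitespace-separated word without
# moving the scanner, using the same save/restore pattern as #peek_token.
def peek_word(ss)
  saved = ss.pos      # remember where we started
  ss.skip(/\s+/)
  word = ss.scan(/\S+/)
  ss.pos = saved      # undo the advance caused by reading
  word
end
```

Peeking twice in a row returns the same word, because the scan pointer is back at its original place after each call.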
#pos ⇒ Object
Returns the current position of the tokenizer inside the IO object.
Note that this position might be different from io.pos since the latter could have been changed somewhere else.
# File 'lib/hexapdf/tokenizer.rb', line 85
def pos
  @original_pos + @ss.pos
end
#pos=(pos) ⇒ Object
Sets the position at which the next token should be read.
Note that this does not set io.pos directly (at the moment of invocation)!
# File 'lib/hexapdf/tokenizer.rb', line 92
def pos=(pos)
  if pos >= @original_pos && pos <= @original_pos + @ss.string.size
    @ss.pos = pos - @original_pos
  else
    @original_pos = pos
    @next_read_pos = pos
    @ss.string.clear
    @ss.reset
  end
end
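The position bookkeeping is easier to follow with a tiny buffer. The sketch below is a simplified stand-in, not HexaPDF's implementation: the class WindowedReader, the chunk_size parameter and the fill method are assumptions (the real refill logic lives in prepare_string_scanner, which is not shown in this page). Only pos and pos= are copied from the quoted source.

```ruby
require 'strscan'
require 'stringio'

# Simplified stand-in (hypothetical): a StringScanner holds a window of the
# IO, and @original_pos is the IO offset of the first byte of that window.
class WindowedReader
  def initialize(io, chunk_size: 8)
    @io = io
    @chunk_size = chunk_size   # assumption; tiny so that refills are visible
    @ss = StringScanner.new(''.b)
    @original_pos = -1
    self.pos = 0
  end

  def pos
    @original_pos + @ss.pos
  end

  def pos=(pos)
    if pos >= @original_pos && pos <= @original_pos + @ss.string.size
      @ss.pos = pos - @original_pos   # target lies inside the current window
    else
      @original_pos = pos             # start a fresh, empty window
      @next_read_pos = pos
      @ss.string.clear
      @ss.reset
    end
  end

  # Returns the next byte as an integer, or nil at the end of the IO.
  def next_byte
    fill if @ss.eos?
    byte = @ss.string.getbyte(@ss.pos)
    @ss.pos += 1 if byte
    byte
  end

  private

  # Assumed refill step: the IO is only touched here, never in #pos=.
  def fill
    @io.seek(@next_read_pos)
    data = @io.read(@chunk_size)
    return unless data
    @ss.string << data
    @next_read_pos += data.bytesize
  end
end
```

Setting the position outside the buffered window merely discards the buffer and records the offset; in this sketch the IO object is not repositioned until the next read.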
#scan_until(re) ⇒ Object
Utility method for scanning until the given regular expression matches.
If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.
# File 'lib/hexapdf/tokenizer.rb', line 221
def scan_until(re)
  until (data = @ss.scan_until(re))
    return nil unless prepare_string_scanner
  end
  data
end
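The loop above retries the scan each time more data has been buffered. A minimal sketch of the same idea, assuming a helper named scan_io_until and a deliberately small chunk_size (both hypothetical), reads from the IO directly instead of going through prepare_string_scanner:

```ruby
require 'strscan'
require 'stringio'

# Hypothetical helper: keeps appending chunks from +io+ to the scan buffer
# until +re+ matches, and returns everything up to the end of the match.
def scan_io_until(io, re, chunk_size: 4)
  ss = StringScanner.new(''.b)
  until (data = ss.scan_until(re))
    chunk = io.read(chunk_size)
    return nil unless chunk    # end of input reached without a match
    ss.string << chunk
  end
  data
end
```

A failed scan leaves the scan pointer untouched, so the retry sees the old and the new data together and can find matches that span chunk boundaries.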
#skip_whitespace ⇒ Object
Skips all whitespace at the current position.
See: PDF1.7 s7.2.2
# File 'lib/hexapdf/tokenizer.rb', line 212
def skip_whitespace
  prepare_string_scanner
  prepare_string_scanner while @ss.skip(WHITESPACE_MULTI_RE)
end