Class: HexaPDF::Tokenizer
- Inherits: Object
- Defined in: lib/hexapdf/tokenizer.rb
Overview
Tokenizes the content of an IO object following the PDF rules.
See: PDF1.7 s7.2
Direct Known Subclasses
Defined Under Namespace
Classes: Token
Constant Summary
- TOKEN_DICT_START = Token.new('<<'.b)
- TOKEN_DICT_END = Token.new('>>'.b)
- TOKEN_ARRAY_START = Token.new('['.b)
- TOKEN_ARRAY_END = Token.new(']'.b)
- NO_MORE_TOKENS = ::Object.new
  This object is returned when there are no more tokens to read.
- WHITESPACE = " \n\r\0\t\f"
  Characters defined as whitespace.
  See: PDF1.7 s7.2.2
- DELIMITER = "()<>{}/[]%"
  Characters defined as delimiters.
  See: PDF1.7 s7.2.2
- WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/
- WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/
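The two regular expressions can be tried out on their own. The following is a minimal sketch that copies the constant definitions listed above into a hypothetical module named LexicalSketch (not part of HexaPDF) and applies them with a plain StringScanner.

```ruby
require 'strscan'

# Hypothetical module; the constant values are copied from the summary above.
module LexicalSketch
  WHITESPACE = " \n\r\0\t\f"
  DELIMITER = "()<>{}/[]%"
  WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/
  WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/
end

ss = StringScanner.new(" \r\n\t obj<</Type".b)

# Skip the leading run of whitespace in a single step.
skipped = ss.skip(LexicalSketch::WHITESPACE_MULTI_RE)

# The lookahead matches right before the next whitespace or delimiter byte,
# so scan_until returns only the regular characters in front of it.
keyword = ss.scan_until(LexicalSketch::WHITESPACE_OR_DELIMITER_RE)
rest = ss.rest
```

Because WHITESPACE_OR_DELIMITER_RE is a zero-width lookahead, the delimiter itself is not consumed and stays available for the next read.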
Instance Attribute Summary
- #io ⇒ Object (readonly)
  The IO object from which the tokens are read.
Instance Method Summary
- #initialize(io) ⇒ Tokenizer (constructor)
  Creates a new tokenizer.
- #next_byte ⇒ Object
  Reads the byte (an integer) at the current position and advances the scan pointer.
- #next_integer_or_keyword ⇒ Object
  Returns a single integer or keyword token read from the current position and advances the scan pointer.
- #next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
  Returns the PDF object at the current position.
- #next_token ⇒ Object
  Returns a single token read from the current position and advances the scan pointer.
- #next_xref_entry ⇒ Object
  Reads the cross-reference subsection entry at the current position and advances the scan pointer.
- #peek_token ⇒ Object
  Returns the next token but does not advance the scan pointer.
- #pos ⇒ Object
  Returns the current position of the tokenizer inside the IO object.
- #pos=(pos) ⇒ Object
  Sets the position at which the next token should be read.
- #scan_until(re) ⇒ Object
  Utility method for scanning until the given regular expression matches.
- #skip_whitespace ⇒ Object
  Skips all whitespace at the current position.
Constructor Details
#initialize(io) ⇒ Tokenizer
Creates a new tokenizer.
# File 'lib/hexapdf/tokenizer.rb', line 77
def initialize(io)
  @io = io
  @ss = StringScanner.new(''.b)
  @original_pos = -1
  self.pos = 0
end
Instance Attribute Details
#io ⇒ Object (readonly)
The IO object from which the tokens are read.
# File 'lib/hexapdf/tokenizer.rb', line 74
def io
  @io
end
Instance Method Details
#next_byte ⇒ Object
Reads the byte (an integer) at the current position and advances the scan pointer.
# File 'lib/hexapdf/tokenizer.rb', line 214
def next_byte
  prepare_string_scanner(1)
  @ss.pos += 1
  @ss.string.getbyte(@ss.pos - 1)
end
#next_integer_or_keyword ⇒ Object
Returns a single integer or keyword token read from the current position and advances the scan pointer. If the current position doesn’t contain such a token, nil is returned without advancing the scan pointer. The value NO_MORE_TOKENS is returned if there are no more tokens available.
Initial runs of whitespace characters are ignored.
Note: This is a special method meant for use when reconstructing the cross-reference table!
# File 'lib/hexapdf/tokenizer.rb', line 199
def next_integer_or_keyword
  skip_whitespace
  byte = @ss.string.getbyte(@ss.pos) || -1
  if 48 <= byte && byte <= 57
    parse_number
  elsif (97 <= byte && byte <= 122) || (65 <= byte && byte <= 90)
    parse_keyword
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else
    nil
  end
end
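The byte ranges in the method above are easier to read with their ASCII meaning spelled out. The following sketch uses a hypothetical helper, classify_first_byte, that returns a symbol instead of parsing anything. It only illustrates the dispatch and is not part of HexaPDF.

```ruby
# Hypothetical helper mirroring the byte checks of next_integer_or_keyword.
# It reports what kind of token would be read, without reading it.
def classify_first_byte(str)
  byte = str.getbyte(0) || -1
  if byte.between?(48, 57) # '0'..'9'
    :integer
  elsif byte.between?(97, 122) || byte.between?(65, 90) # 'a'..'z' or 'A'..'Z'
    :keyword
  elsif byte == -1 # nothing left to read
    :no_more_tokens
  else
    nil
  end
end

kinds = ["12 0 obj", "xref", "", "/Name", "-5"].map { |s| classify_first_byte(s) }
```

Unlike next_token, a leading sign or decimal point is not accepted here, so "-5" is classified as nil.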
#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
Returns the PDF object at the current position. This is different from #next_token because references, arrays and dictionaries consist of multiple tokens.
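If the allow_end_array_token argument is true, the ‘]’ token is permitted to facilitate the use of this method during array parsing. If the allow_keyword argument is true, any other token (for example a keyword) is returned instead of raising a HexaPDF::MalformedPDFError.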
See: PDF1.7 s7.3
# File 'lib/hexapdf/tokenizer.rb', line 168
def next_object(allow_end_array_token: false, allow_keyword: false)
  token = next_token

  if token.kind_of?(Token)
    case token
    when TOKEN_DICT_START
      token = parse_dictionary
    when TOKEN_ARRAY_START
      token = parse_array
    when TOKEN_ARRAY_END
      unless allow_end_array_token
        raise HexaPDF::MalformedPDFError.new("Found invalid end array token ']'", pos: pos)
      end
    else
      unless allow_keyword
        raise HexaPDF::MalformedPDFError.new("Invalid object, got token #{token}", pos: pos)
      end
    end
  end

  token
end
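The role of allow_end_array_token is easiest to see in a reduced model. The sketch below is an assumption-laden illustration, not HexaPDF's private parse_array: tokens come from a flat Ruby array, the symbols :array_start and :array_end stand in for the Token constants, and the module name ArraySketch is hypothetical.

```ruby
# Hypothetical reduced model of next_object and array parsing.
module ArraySketch
  ARRAY_START = :array_start
  ARRAY_END = :array_end

  module_function

  # Returns the next object; ']' is only tolerated when explicitly allowed.
  def next_object(tokens, allow_end_array_token: false)
    token = tokens.shift
    case token
    when ARRAY_START
      parse_array(tokens)
    when ARRAY_END
      raise ArgumentError, "Found invalid end array token" unless allow_end_array_token
      token
    else
      token
    end
  end

  # Collects objects until the end marker is handed back by next_object.
  def parse_array(tokens)
    result = []
    loop do
      raise ArgumentError, "Unclosed array" if tokens.empty?
      obj = next_object(tokens, allow_end_array_token: true)
      break if obj.equal?(ARRAY_END)
      result << obj
    end
    result
  end
end

nested = ArraySketch.next_object([:array_start, 1, :array_start, 2, 3, :array_end, :array_end])
```

Only the array parser passes allow_end_array_token: true, so a stray end marker at the top level is still reported as an error.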
#next_token ⇒ Object
Returns a single token read from the current position and advances the scan pointer.
Comments and a run of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.
# File 'lib/hexapdf/tokenizer.rb', line 110
def next_token
  prepare_string_scanner(20)
  prepare_string_scanner(20) while @ss.skip(WHITESPACE_MULTI_RE)
  byte = @ss.string.getbyte(@ss.pos) || -1
  if (48 <= byte && byte <= 57) || byte == 45 || byte == 43 || byte == 46 # 0..9 - + .
    parse_number
  elsif byte == 47 # /
    parse_name
  elsif byte == 40 # (
    parse_literal_string
  elsif byte == 60 # <
    if @ss.string.getbyte(@ss.pos + 1) != 60
      parse_hex_string
    else
      @ss.pos += 2
      TOKEN_DICT_START
    end
  elsif byte == 62 # >
    unless @ss.string.getbyte(@ss.pos + 1) == 62
      raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos)
    end
    @ss.pos += 2
    TOKEN_DICT_END
  elsif byte == 91 # [
    @ss.pos += 1
    TOKEN_ARRAY_START
  elsif byte == 93 # ]
    @ss.pos += 1
    TOKEN_ARRAY_END
  elsif byte == 123 || byte == 125 # { }
    Token.new(@ss.get_byte)
  elsif byte == 37 # %
    until @ss.skip_until(/(?=[\r\n])/)
      return NO_MORE_TOKENS unless prepare_string_scanner
    end
    next_token
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else # everything else consisting of regular characters
    parse_keyword
  end
end
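The dispatch on the first byte can be reproduced in a much smaller form. The following sketch is a simplified stand-in under several assumptions: the whole input is held in memory, symbols replace Token objects, and strings, hex strings and real numbers are not handled. The method name simple_tokens and the constant REGULAR_CHARS_RE are hypothetical.

```ruby
require 'strscan'

# Hypothetical pattern: a run of bytes that are neither whitespace nor delimiters.
REGULAR_CHARS_RE = /[^ \n\r\0\t\f()<>{}\/\[\]%]+/

# Simplified tokenizer that dispatches on the first byte of each token.
def simple_tokens(data)
  ss = StringScanner.new(data.b)
  tokens = []
  loop do
    ss.skip(/[ \n\r\0\t\f]+/)
    byte = ss.string.getbyte(ss.pos) || -1
    case byte
    when -1 # end of input
      break
    when 37 # %: a comment runs up to the end of the line
      ss.skip_until(/(?=[\r\n])/) || ss.terminate
    when 47 # /: name
      ss.pos += 1
      tokens << (ss.scan(REGULAR_CHARS_RE) || "").to_sym
    when 91 # [
      ss.pos += 1
      tokens << :array_start
    when 93 # ]
      ss.pos += 1
      tokens << :array_end
    when 60 # <
      raise ArgumentError, "hex strings are not handled in this sketch" unless ss.skip(/<</)
      tokens << :dict_start
    when 62 # >
      raise ArgumentError, "delimiter '>' found at invalid position" unless ss.skip(/>>/)
      tokens << :dict_end
    when 48..57 # 0..9
      tokens << ss.scan(/\d+/).to_i
    else # keyword made of regular characters
      word = ss.scan(REGULAR_CHARS_RE)
      raise ArgumentError, "unsupported byte #{byte}" if word.nil?
      tokens << word
    end
  end
  tokens
end

result = simple_tokens("<< /Type /Page % comment\n/Kids [1 2] >> endobj")
```

As in next_token, a comment produces no token of its own; scanning simply continues with whatever follows the line end.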
#next_xref_entry ⇒ Object
Reads the cross-reference subsection entry at the current position and advances the scan pointer.
If the entry does not match the expected 20-byte format, yields the matched size (nil if nothing matched) to the caller.
See: PDF1.7 s7.5.4
# File 'lib/hexapdf/tokenizer.rb', line 226
def next_xref_entry #:yield: matched_size
  prepare_string_scanner(20)
  unless @ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/) && @ss.matched_size == 20
    yield(@ss.matched_size)
  end
  [@ss[1].to_i, @ss[2].to_i, @ss[3]]
end
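The entry format can be checked against sample data without a tokenizer. This sketch reuses the regular expression from the method above in a hypothetical stand-alone method, parse_xref_entry, which takes the entry as a string and makes the block optional.

```ruby
require 'strscan'

# Regular expression copied from next_xref_entry: offset, generation, type, line end.
XREF_ENTRY_RE = /(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/

# Hypothetical stand-alone variant working on a string instead of an IO.
def parse_xref_entry(data)
  ss = StringScanner.new(data.b)
  unless ss.skip(XREF_ENTRY_RE) && ss.matched_size == 20
    # The entry is missing or not exactly 20 bytes long; let the caller decide.
    yield(ss.matched_size) if block_given?
  end
  [ss[1].to_i, ss[2].to_i, ss[3]]
end

regular = parse_xref_entry("0000000017 00000 n \n")

reported_size = nil
short = parse_xref_entry("0000000000 65535 f\n") { |size| reported_size = size }
```

The second entry ends with a single line feed, so it is one byte short and the block receives the matched size of 19, while the parsed values are still returned.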
#peek_token ⇒ Object
Returns the next token but does not advance the scan pointer.
# File 'lib/hexapdf/tokenizer.rb', line 154
def peek_token
  pos = self.pos
  tok = next_token
  self.pos = pos
  tok
end
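The save, read, restore pattern used by peek_token works for any scanner that exposes its position. A minimal sketch, using a hypothetical WordScanner class whose tokens are just whitespace-separated words:

```ruby
require 'strscan'

# Hypothetical scanner over whitespace-separated words.
class WordScanner
  def initialize(data)
    @ss = StringScanner.new(data)
  end

  def pos
    @ss.pos
  end

  def pos=(value)
    @ss.pos = value
  end

  # Reads the next word and advances the scan pointer.
  def next_word
    @ss.skip(/\s+/)
    @ss.scan(/\S+/)
  end

  # Same pattern as peek_token: remember the position, read, restore.
  def peek_word
    saved = pos
    word = next_word
    self.pos = saved
    word
  end
end

scanner = WordScanner.new("1 0 obj")
peeked = scanner.peek_word
pos_after_peek = scanner.pos
first = scanner.next_word
second = scanner.peek_word
```

Peeking costs a second tokenization of the same bytes when the token is read for real, which is usually acceptable for short tokens.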
#pos ⇒ Object
Returns the current position of the tokenizer inside the IO object.
Note that this position might be different from io.pos since the latter could have been changed somewhere else.
# File 'lib/hexapdf/tokenizer.rb', line 88
def pos
  @original_pos + @ss.pos
end
#pos=(pos) ⇒ Object
Sets the position at which the next token should be read.
Note that this does not set io.pos directly (at the moment of invocation)!
# File 'lib/hexapdf/tokenizer.rb', line 95
def pos=(pos)
  if pos >= @original_pos && pos <= @original_pos + @ss.string.size
    @ss.pos = pos - @original_pos
  else
    @original_pos = pos
    @next_read_pos = pos
    @ss.string.clear
    @ss.reset
  end
end
#scan_until(re) ⇒ Object
Utility method for scanning until the given regular expression matches.
If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.
# File 'lib/hexapdf/tokenizer.rb', line 246
def scan_until(re)
  until (data = @ss.scan_until(re))
    return nil unless prepare_string_scanner
  end
  data
end
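The pos, pos= and scan_until methods shown above all rely on a buffer that covers only part of the IO. The sketch below shows how they can fit together. The class WindowedScanner, its tiny CHUNK size and the private refill method are assumptions made for illustration. In particular, refill is a guess at what the private prepare_string_scanner does, since that method is not shown on this page.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical scanner that keeps only a window of the IO in memory.
class WindowedScanner
  CHUNK = 4 # deliberately tiny so that refills happen in the example

  def initialize(io)
    @io = io
    @ss = StringScanner.new(''.b)
    @original_pos = -1
    self.pos = 0
  end

  # Absolute position: start of the window plus the offset inside it.
  def pos
    @original_pos + @ss.pos
  end

  # Moves inside the window if possible, otherwise drops the window.
  def pos=(pos)
    if pos >= @original_pos && pos <= @original_pos + @ss.string.size
      @ss.pos = pos - @original_pos
    else
      @original_pos = pos
      @next_read_pos = pos
      @ss.string.clear
      @ss.reset
    end
  end

  def scan_until(re)
    until (data = @ss.scan_until(re))
      return nil unless refill
    end
    data
  end

  private

  # Assumed behaviour: drop consumed bytes, append the next chunk from the IO.
  def refill
    @io.seek(@next_read_pos)
    return false if @io.eof?
    @ss.string.slice!(0, @ss.pos)
    @ss.string << @io.read(CHUNK)
    @original_pos += @ss.pos
    @ss.pos = 0
    @next_read_pos = @io.pos
    true
  end
end

scanner = WindowedScanner.new(StringIO.new("abc def ghi"))
first = scanner.scan_until(/ /)
second = scanner.scan_until(/ /)
pos_after_two = scanner.pos
missing = scanner.scan_until(/ /)
scanner.pos = 0
again = scanner.scan_until(/ /)
```

The position is only written to the IO inside refill, which matches the note that pos= does not set io.pos at the moment of invocation.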
#skip_whitespace ⇒ Object
Skips all whitespace at the current position.
See: PDF1.7 s7.2.2
# File 'lib/hexapdf/tokenizer.rb', line 237
def skip_whitespace
  prepare_string_scanner
  prepare_string_scanner while @ss.skip(WHITESPACE_MULTI_RE)
end