Class: Janeway::Lexer

Inherits:
Object
  • Object
Defined in:
lib/janeway/lexer.rb

Overview

Transforms a JSONPath query string into a list of tokens.

Defined Under Namespace

Classes: Error

Constructor Details

#initialize(source) ⇒ Lexer

Returns a new instance of Lexer.



# File 'lib/janeway/lexer.rb', line 67

def initialize(source)
  @source = source
  @tokens = []
  @next_p = 0
  @lexeme_start_p = 0
end

Instance Attribute Details

#lexeme_start_p ⇒ Object

Returns the value of attribute lexeme_start_p.



# File 'lib/janeway/lexer.rb', line 53

def lexeme_start_p
  @lexeme_start_p
end

#next_p ⇒ Object

Returns the value of attribute next_p.



# File 'lib/janeway/lexer.rb', line 53

def next_p
  @next_p
end

#source ⇒ Object (readonly)

Returns the value of attribute source.



# File 'lib/janeway/lexer.rb', line 52

def source
  @source
end

#tokens ⇒ Object (readonly)

Returns the value of attribute tokens.



# File 'lib/janeway/lexer.rb', line 52

def tokens
  @tokens
end

Class Method Details

.lex(query) ⇒ Array<Token>

Tokenize and return the token list.

Parameters:

  • query (String)

    jsonpath query

Returns:

  • (Array<Token>)

Raises:

  • (ArgumentError)


# File 'lib/janeway/lexer.rb', line 59

def self.lex(query)
  raise ArgumentError, "expect string, got #{query.inspect}" unless query.is_a?(String)

  lexer = new(query)
  lexer.start_tokenization
  lexer.tokens
end

Instance Method Details

#after_source_end_location ⇒ Object



# File 'lib/janeway/lexer.rb', line 509

def after_source_end_location
  Location.new(next_p, 1)
end

#alpha_numeric?(lexeme) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 113

def alpha_numeric?(lexeme)
  ALPHABET.include?(lexeme) || DIGITS.include?(lexeme)
end

#consume ⇒ Object



# File 'lib/janeway/lexer.rb', line 158

def consume
  c = lookahead
  @next_p += 1
  c
end

#consume_digits ⇒ Object



# File 'lib/janeway/lexer.rb', line 164

def consume_digits
  consume while digit?(lookahead)
end

#consume_escape_sequence ⇒ String

Read escape char literals, and transform them into the described character

Returns:

  • (String)

    single character (possibly multi-byte)



# File 'lib/janeway/lexer.rb', line 213

def consume_escape_sequence
  raise err('Expect escape sequence') unless consume == '\\'

  char = consume
  case char
  when 'b' then "\b"
  when 'f' then "\f"
  when 'n' then "\n"
  when 'r' then "\r"
  when 't' then "\t"
  when '/', '\\', '"', "'" then char
  when 'u' then consume_unicode_escape_sequence
  else
    raise err("Character #{char} must not be escaped") if unescaped?(char)

    # whatever this is, it is not allowed even when escaped
    raise err("Invalid character #{char.inspect}")
  end
end

#consume_four_hex_digits ⇒ String

Consume and return 4 hex digits from the source. Either upper or lower case is accepted. No judgment is made here on whether the resulting sequence is valid, as long as it is 4 hex digits.

Returns:

  • (String)


# File 'lib/janeway/lexer.rb', line 318

def consume_four_hex_digits
  hex_digits = []
  4.times do
    hex_digits << consume
    case hex_digits.last.ord
    when 0x30..0x39 then next # '0'..'9'
    when 0x41..0x46 then next # 'A'..'F' (0x40 is '@', which is not a hex digit)
    when 0x61..0x66 then next # 'a'..'f'
    else
      raise err("Invalid unicode escape sequence: \\u#{hex_digits.join}")
    end
  end
  raise err("Incomplete unicode escape sequence: \\u#{hex_digits.join}") if hex_digits.size < 4

  hex_digits.join
end

#consume_unicode_escape_sequence ⇒ String

Consume a unicode escape that matches this ABNF grammar: www.rfc-editor.org/rfc/rfc9535.html#section-2.3.1.1-2

hexchar             = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
non-surrogate       = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG) /
                      ("D" %x30-37 2HEXDIG )
high-surrogate      = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate       = "D" ("C"/"D"/"E"/"F") 2HEXDIG

HEXDIG              = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

Both lower and uppercase are allowed. The grammar does not show this here, but clarifies it in a following note.

The preceding `u` prefix has already been consumed.

Returns:

  • (String)

    single character (possibly multi-byte)



# File 'lib/janeway/lexer.rb', line 250

def consume_unicode_escape_sequence
  # return a non-surrogate sequence
  hex_str = consume_four_hex_digits
  return hex_str.hex.chr('UTF-8') unless hex_str.upcase.start_with?('D')

  # hex string starts with D, but is still non-surrogate
  return [hex_str.hex].pack('U') if '01234567'.include?(hex_str[1])

  # hex value is in the high-surrogate or low-surrogate range.

  if high_surrogate?(hex_str)
    # valid, as long as it is followed by \u low-surrogate
    prefix = [consume, consume].join
    hex_str2 = consume_four_hex_digits

    # This is a high-surrogate followed by a low-surrogate, which is valid.
    # This is the UTF-16 method of representing certain high unicode codepoints.
    # However this specific byte sequence is not a valid way to represent that same
    # unicode character in the UTF-8 encoding.
    # The surrogate pair must be converted into the correct UTF-8 code point.
    # This returns a UTF-8 string containing a single unicode character.
    return convert_surrogate_pair_to_codepoint(hex_str, hex_str2) if prefix == '\\u' && low_surrogate?(hex_str2)

    # Not allowed to have high surrogate that is not followed by low surrogate
    raise err("Invalid unicode escape sequence: \\u#{hex_str2}")

  end
  # Not allowed to have low surrogate that is not preceded by high surrogate
  raise err("Invalid unicode escape sequence: \\u#{hex_str}")
end
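
The three branches above correspond to three ranges of the 16-bit value. The following is a minimal standalone sketch of that classification, not code from Janeway: the helper name `classify_hex_escape` and its return symbols are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of Janeway: classify the 4 hex digits that
# follow a \u prefix according to the ranges in the ABNF grammar above.
def classify_hex_escape(hex_str)
  raise ArgumentError, "expect 4 hex digits, got #{hex_str.inspect}" unless hex_str.match?(/\A\h{4}\z/)

  case hex_str.hex
  when 0xD800..0xDBFF then :high_surrogate # must be followed by \u + low surrogate
  when 0xDC00..0xDFFF then :low_surrogate  # only valid after a high surrogate
  else :non_surrogate                      # stands alone as one code point
  end
end

classify_hex_escape('0041') # => :non_surrogate
classify_hex_escape('d83d') # => :high_surrogate
classify_hex_escape('DE09') # => :low_surrogate
```

Comparing the integer value against ranges gives the same result as the prefix checks in #high_surrogate? and #low_surrogate?, and it handles either letter case because `String#hex` is case-insensitive.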

#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ String

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.

Parameters:

  • high_surrogate_hex (String)

    string of hex digits, eg. “D83D”

  • low_surrogate_hex (String)

    string of hex digits, eg. “DE09”

Returns:

  • (String)

    UTF-8 string containing a single multi-byte unicode character, eg. “😉”



# File 'lib/janeway/lexer.rb', line 286

def convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex)
  [high_surrogate_hex, low_surrogate_hex].each do |hex_str|
    raise ArgumentError, "expect 4 hex digits, got #{hex_str.inspect}" unless hex_str.size == 4
  end

  # Calculate the code point from the surrogate pair values
  # algorithm from https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
  high = high_surrogate_hex.hex
  low = low_surrogate_hex.hex
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # convert integer codepoint to single character string
end
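
As a worked example of the arithmetic, here is a standalone sketch that repeats the same formula outside the class, using the pair from the parameter descriptions ("D83D", "DE09"). The function name `surrogate_pair_to_char` is hypothetical and exists only for this illustration.

```ruby
# Hypothetical standalone helper, not part of Janeway.
def surrogate_pair_to_char(high_hex, low_hex)
  high = high_hex.hex # "D83D" => 0xD83D
  low = low_hex.hex   # "DE09" => 0xDE09
  # Each surrogate contributes 10 bits; the result is offset by 0x10000.
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # integer code point to a UTF-8 string
end

# (0xD83D - 0xD800) * 0x400 = 0xF400
# (0xDE09 - 0xDC00)         = 0x0209
# 0xF400 + 0x0209 + 0x10000 = 0x1F609
char = surrogate_pair_to_char('D83D', 'DE09')
puts format('U+%X', char.ord) # prints "U+1F609"
```

The smallest pair (D800, DC00) maps to U+10000 and the largest (DBFF, DFFF) maps to U+10FFFF, which is why surrogate pairs cover exactly the code points above the 16-bit range.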

#current_location ⇒ Object



# File 'lib/janeway/lexer.rb', line 505

def current_location
  Location.new(lexeme_start_p, next_p - lexeme_start_p)
end

#digit?(lexeme) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 109

def digit?(lexeme)
  DIGITS.include?(lexeme)
end

#err(msg) ⇒ Lexer::Error

Return a Lexer::Error with the specified message, including the query and location.

Parameters:

  • msg (String)

    error message

Returns:

  • (Lexer::Error)



# File 'lib/janeway/lexer.rb', line 517

def err(msg)
  Error.new(msg, @source, current_location)
end

#escapable?(char) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 430

def escapable?(char)
  case char.ord
  when 0x62 then true # backspace
  when 0x66 then true # form feed
  when 0x6E then true # line feed
  when 0x72 then true # carriage return
  when 0x74 then true # horizontal tab
  when 0x2F then true # slash
  when 0x5C then true # backslash
  else false
  end
end

#high_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “high-surrogate”

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 300

def high_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[D8 D9 DA DB].include?(hex_digits[0..1].upcase)
end

#lex_delimited_string(delimiter) ⇒ Token

Returns string token.

Parameters:

  • delimiter (String)

    eg. ' or "

Returns:

  • (Token)

    string token



# File 'lib/janeway/lexer.rb', line 170

def lex_delimited_string(delimiter)
  allowed_delimiters = %w[' "]
  # the "other" delimiter char, which is not currently being treated as a delimiter
  non_delimiter = allowed_delimiters.reject { _1 == delimiter }.first

  literal_chars = []
  while lookahead != delimiter && source_uncompleted?
    # Transform escaped representation to literal chars
    next_char = lookahead
    literal_chars <<
      if next_char == '\\'
        if lookahead(2) == delimiter
          consume # \
          consume # delimiter
        elsif lookahead(2) == non_delimiter
          qtype = delimiter == '"' ? 'double' : 'single'
          raise err("Character #{non_delimiter} must not be escaped within #{qtype} quotes")
        else
          consume_escape_sequence # consumes multiple chars
        end
      elsif unescaped?(next_char)
        consume
      elsif allowed_delimiters.include?(next_char) && next_char != delimiter
        consume
      else
        raise err("invalid character #{next_char.inspect}")
      end
  end
  raise err("Unterminated string error: #{literal_chars.join.inspect}") if source_completed?

  consume # closing delimiter

  # literal value omits delimiters and includes un-escaped values
  literal = literal_chars.join

  # lexeme value includes delimiters and literal escape characters
  lexeme = source[lexeme_start_p..(next_p - 1)]

  Token.new(:string, lexeme, literal, current_location)
end
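
The distinction between the lexeme (raw query text, delimiters and backslashes included) and the literal (decoded value) can be illustrated with a much simplified sketch. It is not Janeway's implementation: the helper `literal_for` and the `SIMPLE_ESCAPES` table are assumptions for this example, and `\uXXXX` escapes and the character-range checks are deliberately left out.

```ruby
# Hypothetical table of single-character escapes, for illustration only.
SIMPLE_ESCAPES = {
  'b' => "\b", 'f' => "\f", 'n' => "\n", 'r' => "\r", 't' => "\t",
  '/' => '/', '\\' => '\\'
}.freeze

# Hypothetical helper: decode a quoted lexeme into its literal value.
# Handles only single-character escapes and the escaped delimiter.
def literal_for(lexeme)
  delimiter = lexeme[0]
  body = lexeme[1..-2] # drop the opening and closing delimiter
  body.gsub(/\\(.)/m) do
    char = Regexp.last_match(1)
    next char if char == delimiter # \" inside "..." or \' inside '...'

    SIMPLE_ESCAPES.fetch(char) { raise ArgumentError, "unsupported escape \\#{char}" }
  end
end

lexeme = '"a\\nb"'            # 6 characters: " a \ n b "
literal = literal_for(lexeme) # 3 characters: a, newline, b
puts lexeme.size              # prints 6
puts literal.size             # prints 3
```

Keeping both values on the token lets error messages quote the query exactly as written while later stages compare against the decoded string.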

#lex_identifier(ignore_keywords: false) ⇒ Object

Consume an alphanumeric string. If `ignore_keywords` is true, the result is always an :identifier token. Otherwise, keywords and function names will be recognized and tokenized as those types.

Parameters:

  • ignore_keywords (Boolean) (defaults to: false)


# File 'lib/janeway/lexer.rb', line 380

def lex_identifier(ignore_keywords: false)
  consume while alpha_numeric?(lookahead)

  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier) && !ignore_keywords
      identifier.to_sym
    elsif FUNCTIONS.include?(identifier) && !ignore_keywords
      :function
    else
      :identifier
    end

  Token.new(type, identifier, identifier, current_location)
end

#lex_member_name_shorthand(ignore_keywords: false) ⇒ Token

Lex a member name that is found within dot notation.

Recognize keywords and give them the correct type.

Parameters:

  • ignore_keywords (Boolean) (defaults to: false)

Returns:

  • (Token)

See Also:



# File 'lib/janeway/lexer.rb', line 479

def lex_member_name_shorthand(ignore_keywords: false)
  consume while name_char?(lookahead)
  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier) && !ignore_keywords
      identifier.to_sym
    elsif FUNCTIONS.include?(identifier) && !ignore_keywords
      :function
    else
      :identifier
    end
  if type == :function && WHITESPACE.include?(lookahead)
    raise err("Function name \"#{identifier}\" must not be followed by whitespace")
  end

  Token.new(type, identifier, identifier, current_location)
end

#lex_number ⇒ Object

Consume a numeric string. May be an integer, fractional, or exponent.

number = (int / "-0") [ frac ] [ exp ] ; decimal number
frac   = "." 1*DIGIT                   ; decimal fraction
exp    = "e" [ "-" / "+" ] 1*DIGIT     ; decimal exponent


# File 'lib/janeway/lexer.rb', line 339

def lex_number
  consume_digits

  # Look for a fractional part
  if lookahead == '.' && digit?(lookahead(2))
    consume # "."
    consume_digits
  end

  # Look for an exponent part
  if 'Ee'.include?(lookahead)
    consume # "e", "E"
    if %w[+ -].include?(lookahead)
      consume # "+" / "-"
    end
    unless digit?(lookahead)
      lexeme = source[lexeme_start_p..(next_p - 1)]
      raise err("Exponent 'e' must be followed by number: #{lexeme.inspect}")
    end
    consume_digits
  end

  lexeme = source[lexeme_start_p..(next_p - 1)]
  if lexeme.start_with?('0') && lexeme.size > 1
    raise err("Number may not start with leading zero: #{lexeme.inspect}")
  end

  literal =
    if lexeme.include?('.') || lexeme.downcase.include?('e')
      lexeme.to_f
    else
      lexeme.to_i
    end
  Token.new(:number, lexeme, literal, current_location)
end
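
For comparison, the quoted grammar can also be written as a single regular expression. This is an alternative technique shown only as a sketch, not how #lex_number works: `NUMBER_PATTERN` and `number_literal` are hypothetical names, and the pattern assumes the RFC 9535 definition of `int` ("0", or an optional minus sign followed by a non-zero digit and further digits), which is not quoted above. It follows the grammar rather than the method, so the two may accept or reject some inputs differently.

```ruby
# Hypothetical regex rendering of the number grammar, for illustration only.
#   -?(?:0|[1-9][0-9]*)   int, or "-0"
#   (?:\.[0-9]+)?         optional frac
#   (?:[eE][-+]?[0-9]+)?  optional exp
NUMBER_PATTERN = /\A-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?\z/

# Hypothetical helper: return the numeric literal, or nil if the lexeme
# does not match the grammar.
def number_literal(lexeme)
  return nil unless NUMBER_PATTERN.match?(lexeme)

  # Same conversion rule as lex_number: a fraction or exponent yields a Float.
  lexeme.match?(/[.eE]/) ? lexeme.to_f : lexeme.to_i
end

number_literal('42')    # => 42
number_literal('1.5e3') # => 1500.0
number_literal('01')    # => nil (leading zero)
number_literal('1.')    # => nil (frac needs at least one digit)
```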

#lex_unescaped_identifier ⇒ Token

Parse an identifier string which is not within delimiters. The standard set of unicode code points are allowed. No character escapes are allowed. Keywords and function names are ignored in this context.

Returns:

  • (Token)



# File 'lib/janeway/lexer.rb', line 401

def lex_unescaped_identifier
  consume while unescaped?(lookahead)
  identifier = source[lexeme_start_p..(next_p - 1)]
  Token.new(:identifier, identifier, identifier, current_location)
end

#lookahead(offset = 1) ⇒ Object



# File 'lib/janeway/lexer.rb', line 117

def lookahead(offset = 1)
  lookahead_p = (next_p - 1) + offset
  return "\0" if lookahead_p >= source.length

  source[lookahead_p]
end
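
The interplay between #lookahead and #consume is easier to see in isolation. The sketch below copies those two methods into a hypothetical `Cursor` class (the class name is an assumption; the real methods live on the lexer itself) and steps through a short query.

```ruby
# Hypothetical stand-in for the lexer's cursor; it mirrors only
# lookahead and consume as listed on this page.
class Cursor
  attr_reader :source, :next_p

  def initialize(source)
    @source = source
    @next_p = 0
  end

  # Peek at the character `offset` positions ahead without advancing.
  # Returns "\0" as a sentinel when the position is past the end.
  def lookahead(offset = 1)
    lookahead_p = (next_p - 1) + offset
    return "\0" if lookahead_p >= source.length

    source[lookahead_p]
  end

  # Return the next character and advance the pointer by one.
  def consume
    c = lookahead
    @next_p += 1
    c
  end
end

cursor = Cursor.new('$.a')
cursor.consume      # => "$"
cursor.lookahead    # => "."
cursor.lookahead(2) # => "a"
cursor.lookahead(3) # => "\0" (past the end of the source)
```

Because the sentinel is a string rather than nil, callers such as `digit?(lookahead)` can test the result without a nil check.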

#low_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “low-surrogate”

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 307

def low_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[DC DD DE DF].include?(hex_digits[0..1].upcase)
end

#name_char?(char) ⇒ Boolean

True if character is acceptable in a name selector using shorthand notation (ie. no bracket notation). This is the same set as #name_first_char?, except that it also allows digits.

Parameters:

  • char (String)

    single character, possibly multi-byte

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 465

def name_char?(char)
  NAME_FIRST.include?(char) \
    || DIGITS.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end

#name_first_char?(char) ⇒ Boolean

True if character is suitable as the first character in a name selector using shorthand notation (ie. no bracket notation.)

Defined in RFC9535 by this ABNF grammar:

name-first = ALPHA /
             "_"   /
             %x80-D7FF /
                ; skip surrogate code points
             %xE000-10FFFF

Parameters:

  • char (String)

    single character, possibly multi-byte

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 455

def name_first_char?(char)
  NAME_FIRST.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end

#source_completed? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 497

def source_completed?
  next_p >= source.length # our pointer starts at 0, so the last char is length - 1.
end

#source_uncompleted? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 501

def source_uncompleted?
  !source_completed?
end

#start_tokenization ⇒ Object



# File 'lib/janeway/lexer.rb', line 74

def start_tokenization
  if WHITESPACE.include?(@source[0]) || WHITESPACE.include?(@source[-1])
    raise err('JSONPath query may not start or end with whitespace')
  end

  tokenize while source_uncompleted?
  tokens << Token.new(:eof, '', nil, after_source_end_location)
end

#token_from_one_char_lex(lexeme) ⇒ Object



# File 'lib/janeway/lexer.rb', line 124

def token_from_one_char_lex(lexeme)
  if %w[. -].include?(lexeme) && WHITESPACE.include?(lookahead)
    raise err("Operator #{lexeme.inspect} must not be followed by whitespace")
  end

  Token.new(OPERATORS.key(lexeme), lexeme, nil, current_location)
end

#token_from_one_or_two_char_lex(lexeme) ⇒ Token

Consumes an operator that could be either 1 or 2 chars in length

Returns:

  • (Token)



# File 'lib/janeway/lexer.rb', line 134

def token_from_one_or_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  if TWO_CHAR_LEX.include?(next_two_chars)
    consume
    if next_two_chars == '..' && WHITESPACE.include?(lookahead)
      raise err("Operator #{next_two_chars.inspect} must not be followed by whitespace")
    end

    Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
  else
    token_from_one_char_lex(lexeme)
  end
end

#token_from_two_char_lex(lexeme) ⇒ Token

Consumes a 2 char operator

Returns:

  • (Token)



# File 'lib/janeway/lexer.rb', line 150

def token_from_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  raise err("Unknown operator \"#{lexeme}\"") unless TWO_CHAR_LEX.include?(next_two_chars)

  consume
  Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
end

#tokenize ⇒ Object

Read a token from the @source, increment the pointers.



# File 'lib/janeway/lexer.rb', line 84

def tokenize
  self.lexeme_start_p = next_p

  c = consume
  return if WHITESPACE.include?(c)

  token =
    if ONE_OR_TWO_CHAR_LEX.include?(c)
      token_from_one_or_two_char_lex(c)
    elsif ONE_CHAR_LEX.include?(c)
      token_from_one_char_lex(c)
    elsif TWO_CHAR_LEX_FIRST.include?(c)
      token_from_two_char_lex(c)
    elsif %w[" '].include?(c)
      lex_delimited_string(c)
    elsif digit?(c)
      lex_number
    elsif name_first_char?(c)
      lex_member_name_shorthand(ignore_keywords: tokens.last&.type == :dot)
    end
  raise err("Unknown character: #{c.inspect}") unless token

  tokens << token
end

#unescaped?(char) ⇒ Boolean

Return true if string matches the definition of “unescaped” from RFC9535:

unescaped = %x20-21 /   ; see RFC 8259
               ; omit 0x22 "
            %x23-26 /
               ; omit 0x27 '
            %x28-5B /
               ; omit 0x5C \
            %x5D-D7FF /
               ; skip surrogate code points
            %xE000-10FFFF

Parameters:

  • char (String)

    single character, possibly multi-byte

Returns:

  • (Boolean)


# File 'lib/janeway/lexer.rb', line 418

def unescaped?(char)
  case char.ord
  when 0x20..0x21 then true # space, "!"
  when 0x23..0x26 then true # "#", "$", "%", "&"
  when 0x28..0x5B then true # "(" ... "["
  when 0x5D..0xD7FF then true # remaining ascii and lots of unicode code points
    # omit surrogate code points
  when 0xE000..0x10FFFF then true # many more unicode code points
  else false
  end
end