Class: Janeway::Lexer

Inherits:
Object
Defined in:
lib/janeway/lexer.rb

Overview

Transforms source code into tokens

Defined Under Namespace

Classes: Error

Constructor Details

#initialize(source) ⇒ Lexer



# File 'lib/janeway/lexer.rb', line 67

def initialize(source)
  @source = source
  @tokens = []
  @next_p = 0
  @lexeme_start_p = 0
end

Instance Attribute Details

#lexeme_start_p ⇒ Object

Returns the value of attribute lexeme_start_p.

# File 'lib/janeway/lexer.rb', line 53

def lexeme_start_p
  @lexeme_start_p
end

#next_p ⇒ Object

Returns the value of attribute next_p.

# File 'lib/janeway/lexer.rb', line 53

def next_p
  @next_p
end

#source ⇒ Object (readonly)

Returns the value of attribute source.

# File 'lib/janeway/lexer.rb', line 52

def source
  @source
end

#tokens ⇒ Object (readonly)

Returns the value of attribute tokens.

# File 'lib/janeway/lexer.rb', line 52

def tokens
  @tokens
end

Class Method Details

.lex(query) ⇒ Array<Token>

Tokenize and return the token list.

Raises:

  • (ArgumentError)


# File 'lib/janeway/lexer.rb', line 59

def self.lex(query)
  raise ArgumentError, "expect string, got #{query.inspect}" unless query.is_a?(String)

  lexer = new(query)
  lexer.start_tokenization
  lexer.tokens
end

Instance Method Details

#after_source_end_location ⇒ Object

# File 'lib/janeway/lexer.rb', line 513

def after_source_end_location
  Location.new(next_p, 1)
end

#alpha_numeric?(lexeme) ⇒ Boolean



# File 'lib/janeway/lexer.rb', line 116

def alpha_numeric?(lexeme)
  ALPHABET.include?(lexeme) || DIGITS.include?(lexeme)
end

#consume ⇒ Object

# File 'lib/janeway/lexer.rb', line 162

def consume
  c = lookahead
  @next_p += 1
  c
end

#consume_digits ⇒ Object

# File 'lib/janeway/lexer.rb', line 168

def consume_digits
  consume while digit?(lookahead)
end

#consume_escape_sequence ⇒ String

Read an escape sequence from the source and return the character it denotes.

# File 'lib/janeway/lexer.rb', line 215

def consume_escape_sequence
  raise err('Expect escape sequence') unless consume == '\\'

  char = consume
  case char
  when 'b' then "\b"
  when 'f' then "\f"
  when 'n' then "\n"
  when 'r' then "\r"
  when 't' then "\t"
  when '/', '\\', '"', "'" then char
  when 'u' then consume_unicode_escape_sequence
  else
    if unescaped?(char)
      raise err("Character #{char} must not be escaped")
    else
      # whatever this is, it is not allowed even when escaped
      raise err("Invalid character #{char.inspect}")
    end
  end
end
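
The single-character escapes handled by the case statement above can also be read as a lookup table. The following is a minimal standalone sketch, not part of Janeway: `SIMPLE_ESCAPES` and `unescape_char` are hypothetical names, and the sketch leaves out the `\u` branch and the two distinct error messages the lexer produces.

```ruby
# Hypothetical lookup table (not from Janeway):
# character following the backslash => character it denotes.
SIMPLE_ESCAPES = {
  'b' => "\b", 'f' => "\f", 'n' => "\n", 'r' => "\r", 't' => "\t",
  '/' => '/', '\\' => '\\', '"' => '"', "'" => "'"
}.freeze

# Return the character denoted by a simple (non-unicode) escape.
# Raises ArgumentError for anything outside the table.
def unescape_char(char)
  SIMPLE_ESCAPES.fetch(char) do
    raise ArgumentError, "no simple escape for #{char.inspect}"
  end
end
```

A table keeps the mapping in one place, at the cost of the finer-grained error reporting that the case statement provides.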

#consume_four_hex_digits ⇒ String

Consume and return 4 hex digits from the source. Either upper or lower case is accepted. No judgment is made here on whether the resulting sequence is valid, as long as it is 4 hex digits.

# File 'lib/janeway/lexer.rb', line 322

def consume_four_hex_digits
  hex_digits = []
  4.times do
    hex_digits << consume
    case hex_digits.last.ord
    when 0x30..0x39 then next # '0'..'9'
    when 0x41..0x46 then next # 'A'..'F' (0x40 is '@', which is not a hex digit)
    when 0x61..0x66 then next # 'a'..'f'
    else
      raise err("Invalid unicode escape sequence: \\u#{hex_digits.join}")
    end
  end
  raise err("Incomplete unicode escape sequence: \\u#{hex_digits.join}") if hex_digits.size < 4

  hex_digits.join
end
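
The "exactly 4 hex digits, either case" property can be checked independently with a regular expression. This is a minimal sketch for illustration, not Janeway code: `four_hex_digits?` is a hypothetical helper, and it relies on Ruby's `\h` character class, which matches `0-9`, `a-f` and `A-F`.

```ruby
# Hypothetical helper (not part of Janeway):
# true when str consists of exactly four hex digits, upper or lower case.
def four_hex_digits?(str)
  str.match?(/\A\h{4}\z/)
end
```

For instance, `four_hex_digits?('00e9')` and `four_hex_digits?('D83d')` are true, while `'@123'`, `'12G4'` and `'123'` are rejected.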

#consume_unicode_escape_sequence ⇒ String

Consume a unicode escape that matches this ABNF grammar: www.rfc-editor.org/rfc/rfc9535.html#section-2.3.1.1-2

hexchar             = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
non-surrogate       = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG) /
                      ("D" %x30-37 2HEXDIG )
high-surrogate      = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate       = "D" ("C"/"D"/"E"/"F") 2HEXDIG

HEXDIG              = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

Both lowercase and uppercase are allowed. The grammar does not show this here, but clarifies it in a following note.

The preceding \u prefix has already been consumed.

# File 'lib/janeway/lexer.rb', line 254

def consume_unicode_escape_sequence
  # return a non-surrogate sequence
  hex_str = consume_four_hex_digits
  return hex_str.hex.chr('UTF-8') unless hex_str.upcase.start_with?('D')

  # hex string starts with D, but is still non-surrogate
  return [hex_str.hex].pack('U') if '01234567'.include?(hex_str[1])

  # hex value is in the high-surrogate or low-surrogate range.

  if high_surrogate?(hex_str)
    # valid, as long as it is followed by \u low-surrogate
    prefix = [consume, consume].join
    hex_str2 = consume_four_hex_digits

    # This is a high-surrogate followed by a low-surrogate, which is valid.
    # This is the UTF-16 method of representing certain high unicode codepoints.
    # However this specific byte sequence is not a valid way to represent that same
    # unicode character in the UTF-8 encoding.
    # The surrogate pair must be converted into the correct UTF-8 code point.
    # This returns a UTF-8 string containing a single unicode character.
    return convert_surrogate_pair_to_codepoint(hex_str, hex_str2) if prefix == '\\u' && low_surrogate?(hex_str2)

    # Not allowed to have high surrogate that is not followed by low surrogate
    raise err("Invalid unicode escape sequence: \\u#{hex_str2}")

  end
  # Not allowed to have low surrogate that is not preceded by high surrogate
  raise err("Invalid unicode escape sequence: \\u#{hex_str}")
end

#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ String

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.



# File 'lib/janeway/lexer.rb', line 290

def convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex)
  [high_surrogate_hex, low_surrogate_hex].each do |hex_str|
    raise ArgumentError, "expect 4 hex digits, got #{hex_str.inspect}" unless hex_str.size == 4
  end

  # Calculate the code point from the surrogate pair values
  # algorithm from https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
  high = high_surrogate_hex.hex
  low = low_surrogate_hex.hex
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # convert integer codepoint to single character string
end
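
As a worked example of the arithmetic above, the snippet below converts one surrogate pair by hand. It is a standalone sketch; the pair `D83D` / `DE00` is chosen here for illustration and does not come from the Janeway source.

```ruby
# Illustrative input (an assumption of this example): the escape \uD83D\uDE00
high = 'D83D'.hex # 0xD83D, a high surrogate
low  = 'DE00'.hex # 0xDE00, a low surrogate

# (0xD83D - 0xD800) * 0x400 = 0x3D * 0x400 = 0xF400
# (0xDE00 - 0xDC00)         = 0x200
# 0xF400 + 0x200 + 0x10000  = 0x1F600
codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000

# pack('U') turns the integer code point into a one-character UTF-8 string
char = [codepoint].pack('U')
```

Here `codepoint` is `0x1F600` and `char` equals `"\u{1F600}"`.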

#current_location ⇒ Object

# File 'lib/janeway/lexer.rb', line 509

def current_location
  Location.new(lexeme_start_p, next_p - lexeme_start_p)
end

#digit?(lexeme) ⇒ Boolean



# File 'lib/janeway/lexer.rb', line 112

def digit?(lexeme)
  DIGITS.include?(lexeme)
end

#err(msg) ⇒ Lexer::Error

Return a Lexer::Error with the specified message, including the query and location.

# File 'lib/janeway/lexer.rb', line 521

def err(msg)
  Error.new(msg, @source, current_location)
end

#escapable?(char) ⇒ Boolean



# File 'lib/janeway/lexer.rb', line 434

def escapable?(char)
  case char.ord
  when 0x62 then true # "b": backspace
  when 0x66 then true # "f": form feed
  when 0x6E then true # "n": line feed
  when 0x72 then true # "r": carriage return
  when 0x74 then true # "t": horizontal tab
  when 0x2F then true # "/": slash
  when 0x5C then true # "\": backslash
  else false
  end
end

#high_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “high-surrogate”



# File 'lib/janeway/lexer.rb', line 304

def high_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[D8 D9 DA DB].include?(hex_digits[0..1].upcase)
end

#lex_delimited_string(delimiter) ⇒ Token



# File 'lib/janeway/lexer.rb', line 174

def lex_delimited_string(delimiter)
  non_delimiter = %w[' "].reject { _1 == delimiter }.first

  literal_chars = []
  while lookahead != delimiter && source_uncompleted?
    # Transform escaped representation to literal chars
    next_char = lookahead
    literal_chars <<
      if next_char == '\\'
        if lookahead(2) == delimiter
          consume # \
          consume # delimiter
        elsif lookahead(2) == non_delimiter
          qtype = delimiter == '"' ? 'double' : 'single'
          raise err("Character #{non_delimiter} must not be escaped within #{qtype} quotes")
        else
          consume_escape_sequence # consumes multiple chars
        end
      elsif unescaped?(next_char)
        consume
      elsif %w[' "].include?(next_char) && next_char != delimiter
        consume
      else
        raise err("invalid character #{next_char.inspect}")
      end
  end
  raise err("Unterminated string error: #{literal_chars.join.inspect}") if source_completed?

  consume # closing delimiter

  # literal value omits delimiters and includes un-escaped values
  literal = literal_chars.join

  # lexeme value includes delimiters and literal escape characters
  lexeme = source[lexeme_start_p..(next_p - 1)]

  Token.new(:string, lexeme, literal, current_location)
end

#lex_identifier(ignore_keywords: false) ⇒ Object

Consume an alphanumeric string. If ignore_keywords, the result is always an :identifier token. Otherwise, keywords and function names will be recognized and tokenized as those types.



# File 'lib/janeway/lexer.rb', line 384

def lex_identifier(ignore_keywords: false)
  consume while alpha_numeric?(lookahead)

  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier) && !ignore_keywords
      identifier.to_sym
    elsif FUNCTIONS.include?(identifier) && !ignore_keywords
      :function
    else
      :identifier
    end

  Token.new(type, identifier, identifier, current_location)
end

#lex_member_name_shorthand(ignore_keywords: false) ⇒ Token

Lex a member name that is found within dot notation.

Recognize keywords and give them the correct type.

# File 'lib/janeway/lexer.rb', line 483

def lex_member_name_shorthand(ignore_keywords: false)
  consume while name_char?(lookahead)
  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier) && !ignore_keywords
      identifier.to_sym
    elsif FUNCTIONS.include?(identifier) && !ignore_keywords
      :function
    else
      :identifier
    end
  if type == :function && WHITESPACE.include?(lookahead)
    raise err("Function name \"#{identifier}\" must not be followed by whitespace")
  end

  Token.new(type, identifier, identifier, current_location)
end

#lex_number ⇒ Object

Consume a numeric string. May be an integer, a decimal fraction, or a number with an exponent.

number = (int / "-0") [ frac ] [ exp ] ; decimal number
frac   = "." 1*DIGIT                   ; decimal fraction
exp    = "e" [ "-" / "+" ] 1*DIGIT     ; decimal exponent

# File 'lib/janeway/lexer.rb', line 343

def lex_number
  consume_digits

  # Look for a fractional part
  if lookahead == '.' && digit?(lookahead(2))
    consume # "."
    consume_digits
  end

  # Look for an exponent part
  if 'Ee'.include?(lookahead)
    consume # "e", "E"
    if %w[+ -].include?(lookahead)
      consume # "+" / "-"
    end
    unless digit?(lookahead)
      lexeme = source[lexeme_start_p..(next_p - 1)]
      raise err("Exponent 'e' must be followed by number: #{lexeme.inspect}")
    end
    consume_digits
  end

  lexeme = source[lexeme_start_p..(next_p - 1)]
  if lexeme.start_with?('0') && lexeme.size > 1
    raise err("Number may not start with leading zero: #{lexeme.inspect}")
  end

  literal =
    if lexeme.include?('.') || lexeme.downcase.include?('e')
      lexeme.to_f
    else
      lexeme.to_i
    end
  Token.new(:number, lexeme, literal, current_location)
end
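
For comparison, the number grammar quoted above can be written as a single regular expression. This is a hedged sketch rather than the lexer's method: `NUMBER_PATTERN` and `number_literal` are hypothetical names, the pattern is one reading of the ABNF (which allows an optional leading minus sign, whereas lex_number starts from a digit), and it may not agree with lex_number on every input.

```ruby
# Assumed translation of the ABNF into a Ruby regexp (not from Janeway):
#   optional "-", then "0" or a non-zero digit followed by digits,
#   then an optional fraction, then an optional exponent.
NUMBER_PATTERN = /\A-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?\z/

# Hypothetical helper: validate a lexeme and convert it to a literal,
# using Float when a fraction or exponent is present, else Integer.
def number_literal(lexeme)
  raise ArgumentError, "not a number: #{lexeme.inspect}" unless NUMBER_PATTERN.match?(lexeme)

  lexeme.match?(/[.eE]/) ? lexeme.to_f : lexeme.to_i
end
```

With this sketch, `number_literal('42')` returns the Integer 42 and `number_literal('1.5e2')` returns the Float 150.0, while `'01'` and `'1.'` raise ArgumentError.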

#lex_unescaped_identifier ⇒ Token

Parse an identifier string which is not within delimiters. The standard set of unicode code points are allowed. No character escapes are allowed. Keywords and function names are ignored in this context.

# File 'lib/janeway/lexer.rb', line 405

def lex_unescaped_identifier
  consume while unescaped?(lookahead)
  identifier = source[lexeme_start_p..(next_p - 1)]
  Token.new(:identifier, identifier, identifier, current_location)
end

#lookahead(offset = 1) ⇒ Object



# File 'lib/janeway/lexer.rb', line 120

def lookahead(offset = 1)
  lookahead_p = (next_p - 1) + offset
  return "\0" if lookahead_p >= source.length

  source[lookahead_p]
end
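
The interplay between #lookahead and #consume is easier to see in isolation. The class below is a minimal standalone sketch that copies the pointer arithmetic of those two methods; the class name `Cursor` is hypothetical and everything else in the lexer is left out.

```ruby
# Hypothetical stand-in for the lexer's cursor handling (not Janeway code).
class Cursor
  attr_reader :source, :next_p

  def initialize(source)
    @source = source
    @next_p = 0
  end

  # Peek at a character without moving the pointer.
  # offset 1 is the next unread character, offset 2 the one after it.
  # Returns "\0" as a sentinel once the position is past the end.
  def lookahead(offset = 1)
    lookahead_p = (next_p - 1) + offset
    return "\0" if lookahead_p >= source.length

    source[lookahead_p]
  end

  # Return the next unread character and advance the pointer by one.
  def consume
    c = lookahead
    @next_p += 1
    c
  end
end
```

On `Cursor.new('$.a')`, `lookahead` is `"$"` and `lookahead(2)` is `"."` before anything is consumed; after three calls to `consume`, `lookahead` returns the `"\0"` sentinel.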

#low_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “low-surrogate”



# File 'lib/janeway/lexer.rb', line 311

def low_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[DC DD DE DF].include?(hex_digits[0..1].upcase)
end
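
The prefix tests in #high_surrogate? and #low_surrogate? correspond to the numeric ranges U+D800..U+DBFF and U+DC00..U+DFFF. The sketch below expresses the same classification with integer ranges; it is an alternative formulation for illustration, and `surrogate_kind` and the two constants are hypothetical names.

```ruby
# Hypothetical constants (not from Janeway): UTF-16 surrogate ranges.
HIGH_SURROGATES = (0xD800..0xDBFF).freeze
LOW_SURROGATES  = (0xDC00..0xDFFF).freeze

# Classify a 4-character hex string as :high, :low, or nil (not a surrogate,
# or not four hex digits at all).
def surrogate_kind(hex_digits)
  return nil unless hex_digits.match?(/\A\h{4}\z/)

  value = hex_digits.hex
  return :high if HIGH_SURROGATES.cover?(value)
  return :low if LOW_SURROGATES.cover?(value)

  nil
end
```

For example, `surrogate_kind('d83d')` is `:high`, `surrogate_kind('DE00')` is `:low`, and `surrogate_kind('D7FF')` is `nil`.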

#name_char?(char) ⇒ Boolean

True if character is acceptable in a name selector using shorthand notation (i.e. not bracket notation). This is the same set as #name_first_char? except that it also allows digits.

# File 'lib/janeway/lexer.rb', line 469

def name_char?(char)
  NAME_FIRST.include?(char) \
    || DIGITS.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end

#name_first_char?(char) ⇒ Boolean

True if character is suitable as the first character in a name selector using shorthand notation (i.e. not bracket notation).

Defined in RFC9535 by this ABNF grammar:

name-first = ALPHA /
             "_"   /
             %x80-D7FF /
                ; skip surrogate code points
             %xE000-10FFFF

# File 'lib/janeway/lexer.rb', line 459

def name_first_char?(char)
  NAME_FIRST.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end

#source_completed? ⇒ Boolean

# File 'lib/janeway/lexer.rb', line 501

def source_completed?
  next_p >= source.length # our pointer starts at 0, so the last char is length - 1.
end

#source_uncompleted? ⇒ Boolean

# File 'lib/janeway/lexer.rb', line 505

def source_uncompleted?
  !source_completed?
end

#start_tokenization ⇒ Object

# File 'lib/janeway/lexer.rb', line 74

def start_tokenization
  if WHITESPACE.include?(@source[0]) || WHITESPACE.include?(@source[-1])
    raise err('JSONPath query may not start or end with whitespace')
  end

  tokenize while source_uncompleted?
  tokens << Token.new(:eof, '', nil, after_source_end_location)
end

#token_from_one_char_lex(lexeme) ⇒ Object



# File 'lib/janeway/lexer.rb', line 127

def token_from_one_char_lex(lexeme)
  if %w[. -].include?(lexeme) && WHITESPACE.include?(lookahead)
    raise err("Operator #{lexeme.inspect} must not be followed by whitespace")
  end

  Token.new(OPERATORS.key(lexeme), lexeme, nil, current_location)
end

#token_from_one_or_two_char_lex(lexeme) ⇒ Token

Consumes an operator that could be either 1 or 2 chars in length



# File 'lib/janeway/lexer.rb', line 137

def token_from_one_or_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  if TWO_CHAR_LEX.include?(next_two_chars)
    consume
    if next_two_chars == '..' && WHITESPACE.include?(lookahead)
      raise err("Operator #{next_two_chars.inspect} must not be followed by whitespace")
    end
    Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
  else
    token_from_one_char_lex(lexeme)
  end
end

#token_from_two_char_lex(lexeme) ⇒ Token

Consumes a 2 char operator



# File 'lib/janeway/lexer.rb', line 152

def token_from_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  unless TWO_CHAR_LEX.include?(next_two_chars)
    raise err("Unknown operator \"#{lexeme}\"")
  end

  consume
  Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
end

#tokenize ⇒ Object

Read one token from @source and advance the pointers.

# File 'lib/janeway/lexer.rb', line 84

def tokenize
  self.lexeme_start_p = next_p

  c = consume
  return if WHITESPACE.include?(c)

  token =
    if ONE_OR_TWO_CHAR_LEX.include?(c)
      token_from_one_or_two_char_lex(c)
    elsif ONE_CHAR_LEX.include?(c)
      token_from_one_char_lex(c)
    elsif TWO_CHAR_LEX_FIRST.include?(c)
      token_from_two_char_lex(c)
    elsif %w[" '].include?(c)
      lex_delimited_string(c)
    elsif digit?(c)
      lex_number
    elsif name_first_char?(c)
      lex_member_name_shorthand(ignore_keywords: tokens.last&.type == :dot)
    end

  if token
    tokens << token
  else
    raise err("Unknown character: #{c.inspect}")
  end
end

#unescaped?(char) ⇒ Boolean

Return true if string matches the definition of “unescaped” from RFC9535:

unescaped = %x20-21 /    ; see RFC 8259
               ; omit 0x22 "
            %x23-26 /
               ; omit 0x27 '
            %x28-5B /
               ; omit 0x5C \
            %x5D-D7FF /
               ; skip surrogate code points
            %xE000-10FFFF

# File 'lib/janeway/lexer.rb', line 422

def unescaped?(char)
  case char.ord
  when 0x20..0x21 then true # space, "!"
  when 0x23..0x26 then true # "#", "$", "%", "&"
  when 0x28..0x5B then true # "(" ... "["
  when 0x5D..0xD7FF then true # "]" through the rest of ASCII, plus many unicode code points
    # omit surrogate code points
  when 0xE000..0x10FFFF then true # many more unicode code points
  else false
  end
end
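
The same "unescaped" definition can be expressed as data instead of a case statement. The following is a minimal sketch under that assumption, not Janeway's implementation; `UNESCAPED_RANGES` and `unescaped_char?` are hypothetical names, and the ranges are copied from the ABNF above.

```ruby
# Hypothetical table of the code point ranges listed in the ABNF
# (double quote 0x22, single quote 0x27, backslash 0x5C and the
# surrogate block 0xD800..0xDFFF are deliberately absent).
UNESCAPED_RANGES = [
  0x20..0x21,
  0x23..0x26,
  0x28..0x5B,
  0x5D..0xD7FF,
  0xE000..0x10FFFF
].freeze

# True when the single-character string falls in one of the ranges.
def unescaped_char?(char)
  UNESCAPED_RANGES.any? { |range| range.cover?(char.ord) }
end
```

Under this sketch an ordinary letter such as `"a"` or a non-ASCII character such as `"\u00e9"` passes, while both quote characters, the backslash, and control characters such as `"\n"` do not.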