Class: Janeway::Lexer

Inherits: Object

Defined in: lib/janeway/lexer.rb

Overview

Transforms source code into tokens

Defined Under Namespace

Classes: Error

Instance Attribute Summary

#lexeme_start_p ⇒ Object

Returns the value of attribute lexeme_start_p.
#next_p ⇒ Object

Returns the value of attribute next_p.
#source ⇒ Object readonly

Returns the value of attribute source.
#tokens ⇒ Object readonly

Returns the value of attribute tokens.

Class Method Summary

.lex(query) ⇒ Array<Token>

Tokenize and return the token list.

Instance Method Summary

#after_source_end_location ⇒ Object
#alpha_numeric?(lexeme) ⇒ Boolean
#consume ⇒ Object
#consume_digits ⇒ Object
#consume_escape_sequence ⇒ String

Read escape char literals, and transform them into the described character.
#consume_four_hex_digits ⇒ String

Consume and return 4 hex digits from the source.
#consume_unicode_escape_sequence ⇒ String

Consume a unicode escape that matches this ABNF grammar: www.rfc-editor.org/rfc/rfc9535.html#section-2.3.1.1-2.
#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ String

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.
#current_location ⇒ Object
#digit?(lexeme) ⇒ Boolean
#err(msg) ⇒ Lexer::Error

Return a Lexer::Error with the specified message, including the query and location.
#escapable?(char) ⇒ Boolean
#high_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “high-surrogate”.
#initialize(source) ⇒ Lexer constructor

A new instance of Lexer.
#lex_delimited_string(delimiter) ⇒ Token

String token.
#lex_identifier(ignore_keywords: false) ⇒ Object

Consume an alphanumeric string.
#lex_member_name_shorthand(ignore_keywords: false) ⇒ Token

Lex a member name that is found within dot notation.
#lex_number ⇒ Object

Consume a numeric string.
#lex_unescaped_identifier ⇒ Token

Parse an identifier string which is not within delimiters.
#lookahead(offset = 1) ⇒ Object
#low_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “low-surrogate”.
#name_char?(char) ⇒ Boolean

True if character is acceptable in a name selector using shorthand notation (i.e. no bracket notation). This is the same set as #name_first_char? except that it also allows digits.
#name_first_char?(char) ⇒ Boolean

True if character is suitable as the first character in a name selector using shorthand notation (i.e. no bracket notation).
#source_completed? ⇒ Boolean
#source_uncompleted? ⇒ Boolean
#start_tokenization ⇒ Object
#token_from_one_char_lex(lexeme) ⇒ Object
#token_from_one_or_two_char_lex(lexeme) ⇒ Token

Consumes an operator that could be either 1 or 2 chars in length.
#token_from_two_char_lex(lexeme) ⇒ Token

Consumes a 2 char operator.
#tokenize ⇒ Object

Read a token from the @source, increment the pointers.
#unescaped?(char) ⇒ Boolean

Return true if string matches the definition of “unescaped” from RFC9535.

Constructor Details

#initialize(source) ⇒ `Lexer`

# File 'lib/janeway/lexer.rb', line 67

def initialize(source)
  @source = source
  @tokens = []
  @next_p = 0
  @lexeme_start_p = 0
end

Instance Attribute Details

#lexeme_start_p ⇒ `Object`

Returns the value of attribute lexeme_start_p.



# File 'lib/janeway/lexer.rb', line 53

def lexeme_start_p
  @lexeme_start_p
end

#next_p ⇒ `Object`

Returns the value of attribute next_p.



# File 'lib/janeway/lexer.rb', line 53

def next_p
  @next_p
end

#source ⇒ `Object` (readonly)

Returns the value of attribute source.



# File 'lib/janeway/lexer.rb', line 52

def source
  @source
end

#tokens ⇒ `Object` (readonly)

Returns the value of attribute tokens.



# File 'lib/janeway/lexer.rb', line 52

def tokens
  @tokens
end

Class Method Details

.lex(query) ⇒ `Array<Token>`

Tokenize and return the token list.

Raises:

(ArgumentError)

# File 'lib/janeway/lexer.rb', line 59

def self.lex(query)
  raise ArgumentError, "expect string, got #{query.inspect}" unless query.is_a?(String)

  lexer = new(query)
  lexer.start_tokenization
  lexer.tokens
end

Instance Method Details

#after_source_end_location ⇒ `Object`



# File 'lib/janeway/lexer.rb', line 513

def after_source_end_location
  Location.new(next_p, 1)
end

#alpha_numeric?(lexeme) ⇒ `Boolean`



# File 'lib/janeway/lexer.rb', line 116

def alpha_numeric?(lexeme)
  ALPHABET.include?(lexeme) || DIGITS.include?(lexeme)
end

#consume ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 162

def consume
  c = lookahead
  @next_p += 1
  c
end

#consume_digits ⇒ `Object`



# File 'lib/janeway/lexer.rb', line 168

def consume_digits
  consume while digit?(lookahead)
end

#consume_escape_sequence ⇒ `String`

Read escape char literals, and transform them into the described character

# File 'lib/janeway/lexer.rb', line 215

def consume_escape_sequence
  raise err('Expect escape sequence') unless consume == '\\'

  char = consume
  case char
  when 'b' then "\b"
  when 'f' then "\f"
  when 'n' then "\n"
  when 'r' then "\r"
  when 't' then "\t"
  when '/', '\\', '"', "'" then char
  when 'u' then consume_unicode_escape_sequence
  else
    if unescaped?(char)
      raise err("Character #{char} must not be escaped")
    else
      # whatever this is, it is not allowed even when escaped
      raise err("Invalid character #{char.inspect}")
    end
  end
end

#consume_four_hex_digits ⇒ `String`

Consume and return 4 hex digits from the source. Either upper or lower case is accepted. No judgment is made here on whether the resulting sequence is valid, as long as it is 4 hex digits.

# File 'lib/janeway/lexer.rb', line 322

def consume_four_hex_digits
  hex_digits = []
  4.times do
    hex_digits << consume
    case hex_digits.last.ord
    when 0x30..0x39 then next # '0'..'9'
    when 0x41..0x46 then next # 'A'..'F' (0x40 is '@', which is not a hex digit)
    when 0x61..0x66 then next # 'a'..'f'
    else
      raise err("Invalid unicode escape sequence: \\u#{hex_digits.join}")
    end
  end
  raise err("Incomplete unicode escape sequence: \\u#{hex_digits.join}") if hex_digits.size < 4

  hex_digits.join
end
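
The ASCII ranges for hex digits are easy to get off by one, so the character test described above can be sketched on its own. This is a minimal standalone illustration; `hex_digit?` and `four_hex_digits?` are hypothetical helper names and are not part of Janeway.

```ruby
# Hypothetical helper (not part of Janeway): true if char is one ASCII hex digit.
def hex_digit?(char)
  case char.ord
  when 0x30..0x39 then true # '0'..'9'
  when 0x41..0x46 then true # 'A'..'F'
  when 0x61..0x66 then true # 'a'..'f'
  else false
  end
end

# Hypothetical helper: true if str is exactly four hex digits, in either case.
def four_hex_digits?(str)
  str.size == 4 && str.each_char.all? { |c| hex_digit?(c) }
end
```

Note that `'@'` (0x40) sits directly before `'A'` (0x41) in ASCII, so a range starting at 0x40 would wrongly accept it.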

#consume_unicode_escape_sequence ⇒ `String`

Consume a unicode escape that matches this ABNF grammar: www.rfc-editor.org/rfc/rfc9535.html#section-2.3.1.1-2

hexchar             = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
non-surrogate       = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG) /
                      ("D" %x30-37 2HEXDIG )
high-surrogate      = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate       = "D" ("C"/"D"/"E"/"F") 2HEXDIG

HEXDIG              = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

Both lowercase and uppercase are allowed. The grammar does not show this here, but clarifies it in a following note.

The preceding \u prefix has already been consumed.

# File 'lib/janeway/lexer.rb', line 254

def consume_unicode_escape_sequence
  # return a non-surrogate sequence
  hex_str = consume_four_hex_digits
  return hex_str.hex.chr('UTF-8') unless hex_str.upcase.start_with?('D')

  # hex string starts with D, but is still non-surrogate
  return [hex_str.hex].pack('U') if '01234567'.include?(hex_str[1])

  # hex value is in the high-surrogate or low-surrogate range.

  if high_surrogate?(hex_str)
    # valid, as long as it is followed by \u low-surrogate
    prefix = [consume, consume].join
    hex_str2 = consume_four_hex_digits

    # This is a high-surrogate followed by a low-surrogate, which is valid.
    # This is the UTF-16 method of representing certain high unicode codepoints.
    # However this specific byte sequence is not a valid way to represent that same
    # unicode character in the UTF-8 encoding.
    # The surrogate pair must be converted into the correct UTF-8 code point.
    # This returns a UTF-8 string containing a single unicode character.
    return convert_surrogate_pair_to_codepoint(hex_str, hex_str2) if prefix == '\\u' && low_surrogate?(hex_str2)

    # Not allowed to have high surrogate that is not followed by low surrogate
    raise err("Invalid unicode escape sequence: \\u#{hex_str2}")

  end
  # Not allowed to have low surrogate that is not preceded by high surrogate
  raise err("Invalid unicode escape sequence: \\u#{hex_str}")
end
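
The three cases in the grammar above (non-surrogate, high-surrogate, low-surrogate) correspond to numeric ranges of the 4-digit hex value. The following is a minimal sketch of that classification; `classify_hex` is a hypothetical name used only for illustration and does not exist in Janeway.

```ruby
# Hypothetical helper (not part of Janeway): classify the four hex digits
# of a \uXXXX escape according to the RFC 9535 hexchar grammar.
def classify_hex(hex_str)
  case hex_str.hex # String#hex is case-insensitive
  when 0xD800..0xDBFF then :high_surrogate # "D" ("8"/"9"/"A"/"B") 2HEXDIG
  when 0xDC00..0xDFFF then :low_surrogate  # "D" ("C"/"D"/"E"/"F") 2HEXDIG
  else :non_surrogate
  end
end

classify_hex('0041') # :non_surrogate
classify_hex('D7FF') # :non_surrogate ("D" followed by 0-7)
classify_hex('d83d') # :high_surrogate
classify_hex('DE00') # :low_surrogate
```

Only a high surrogate immediately followed by `\u` and a low surrogate forms a valid pair; a lone surrogate of either kind is an error.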

#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ `String`

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.

# File 'lib/janeway/lexer.rb', line 290

def convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex)
  [high_surrogate_hex, low_surrogate_hex].each do |hex_str|
    raise ArgumentError, "expect 4 hex digits, got #{hex_str.inspect}" unless hex_str.size == 4
  end

  # Calculate the code point from the surrogate pair values
  # algorithm from https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
  high = high_surrogate_hex.hex
  low = low_surrogate_hex.hex
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # convert integer codepoint to single character string
end
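
As a worked example of the arithmetic above, the pair `\uD83D\uDE00` gives (0xD83D - 0xD800) * 0x400 = 0xF400, plus (0xDE00 - 0xDC00) = 0x200, plus 0x10000, for a total of 0x1F600. The sketch below repeats the same calculation as a standalone function; `surrogate_pair_to_char` is a hypothetical name, not Janeway's method.

```ruby
# Hypothetical standalone helper mirroring the surrogate pair arithmetic.
def surrogate_pair_to_char(high_hex, low_hex)
  high = high_hex.hex
  low = low_hex.hex
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # single-character UTF-8 string
end

char = surrogate_pair_to_char('D83D', 'DE00')
puts format('U+%X', char.ord)                           # U+1F600
puts char.bytes.map { |b| format('%02X', b) }.join(' ') # F0 9F 98 80
```

The result is a 4-byte UTF-8 sequence, not the two 3-byte sequences that would come from encoding each surrogate separately.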

#current_location ⇒ `Object`



# File 'lib/janeway/lexer.rb', line 509

def current_location
  Location.new(lexeme_start_p, next_p - lexeme_start_p)
end

#digit?(lexeme) ⇒ `Boolean`



# File 'lib/janeway/lexer.rb', line 112

def digit?(lexeme)
  DIGITS.include?(lexeme)
end

#err(msg) ⇒ `Lexer::Error`

Return a Lexer::Error with the specified message, including the query and location.



# File 'lib/janeway/lexer.rb', line 521

def err(msg)
  Error.new(msg, @source, current_location)
end

#escapable?(char) ⇒ `Boolean`

# File 'lib/janeway/lexer.rb', line 434

def escapable?(char)
  case char.ord
  when 0x62 then true # "b": backspace
  when 0x66 then true # "f": form feed
  when 0x6E then true # "n": line feed
  when 0x72 then true # "r": carriage return
  when 0x74 then true # "t": horizontal tab
  when 0x2F then true # "/": slash
  when 0x5C then true # "\": backslash
  else false
  end
end

#high_surrogate?(hex_digits) ⇒ `Boolean`

Return true if the given 4 char hex string is “high-surrogate”

# File 'lib/janeway/lexer.rb', line 304

def high_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[D8 D9 DA DB].include?(hex_digits[0..1].upcase)
end

#lex_delimited_string(delimiter) ⇒ `Token`

# File 'lib/janeway/lexer.rb', line 174

def lex_delimited_string(delimiter)
  non_delimiter = %w[' "].reject { _1 == delimiter }.first

  literal_chars = []
  while lookahead != delimiter && source_uncompleted?
    # Transform escaped representation to literal chars
    next_char = lookahead
    literal_chars <<
      if next_char == '\\'
        if lookahead(2) == delimiter
          consume # \
          consume # delimiter
        elsif lookahead(2) == non_delimiter
          qtype = delimiter == '"' ? 'double' : 'single'
          raise err("Character #{non_delimiter} must not be escaped within #{qtype} quotes")
        else
          consume_escape_sequence # consumes multiple chars
        end
      elsif unescaped?(next_char)
        consume
      elsif %w[' "].include?(next_char) && next_char != delimiter
        consume
      else
        raise err("invalid character #{next_char.inspect}")
      end
  end
  raise err("Unterminated string error: #{literal_chars.join.inspect}") if source_completed?

  consume # closing delimiter

  # literal value omits delimiters and includes un-escaped values
  literal = literal_chars.join

  # lexeme value includes delimiters and literal escape characters
  lexeme = source[lexeme_start_p..(next_p - 1)]

  Token.new(:string, lexeme, literal, current_location)
end

#lex_identifier(ignore_keywords: false) ⇒ `Object`

Consume an alphanumeric string. If ignore_keywords, the result is always an :identifier token. Otherwise, keywords and function names will be recognized and tokenized as those types.

# File 'lib/janeway/lexer.rb', line 384

def lex_identifier(ignore_keywords: false)
  consume while alpha_numeric?(lookahead)

  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier) && !ignore_keywords
      identifier.to_sym
    elsif FUNCTIONS.include?(identifier) && !ignore_keywords
      :function
    else
      :identifier
    end

  Token.new(type, identifier, identifier, current_location)
end

#lex_member_name_shorthand(ignore_keywords: false) ⇒ `Token`

Lex a member name that is found within dot notation.

Recognize keywords and give them the correct type.

#lex_number ⇒ `Object`

Consume a numeric string. May be an integer, fractional, or exponent.

number = (int / "-0") [ frac ] [ exp ] ; decimal number
frac   = "." 1*DIGIT                   ; decimal fraction
exp    = "e" [ "-" / "+" ] 1*DIGIT     ; decimal exponent

# File 'lib/janeway/lexer.rb', line 343

def lex_number
  consume_digits

  # Look for a fractional part
  if lookahead == '.' && digit?(lookahead(2))
    consume # "."
    consume_digits
  end

  # Look for an exponent part
  if 'Ee'.include?(lookahead)
    consume # "e", "E"
    if %w[+ -].include?(lookahead)
      consume # "+" / "-"
    end
    unless digit?(lookahead)
      lexeme = source[lexeme_start_p..(next_p - 1)]
      raise err("Exponent 'e' must be followed by number: #{lexeme.inspect}")
    end
    consume_digits
  end

  lexeme = source[lexeme_start_p..(next_p - 1)]
  if lexeme.start_with?('0') && lexeme.size > 1
    raise err("Number may not start with leading zero: #{lexeme.inspect}")
  end

  literal =
    if lexeme.include?('.') || lexeme.downcase.include?('e')
      lexeme.to_f
    else
      lexeme.to_i
    end
  Token.new(:number, lexeme, literal, current_location)
end
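
The ABNF above can also be read as a single regular expression. The sketch below illustrates the grammar only, not the method's exact behavior, and it assumes the RFC 9535 rule `int = "0" / (["-"] DIGIT1 *DIGIT)`, which is not quoted on this page. `NUMBER_PATTERN` and `number_literal` are hypothetical names.

```ruby
# Sketch of the RFC 9535 number grammar as one regular expression.
# NUMBER_PATTERN and number_literal are hypothetical, not Janeway's API.
#   -?(?:0|[1-9][0-9]*)   int / "-0" (no leading zeros)
#   (?:\.[0-9]+)?         optional frac
#   (?:[eE][-+]?[0-9]+)?  optional exp
NUMBER_PATTERN = /\A-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?\z/

# Return an Integer or Float for a valid number string, or nil otherwise.
def number_literal(text)
  return nil unless NUMBER_PATTERN.match?(text)

  text.match?(/[.eE]/) ? text.to_f : text.to_i
end

number_literal('42')  # 42
number_literal('1.5') # 1.5
number_literal('1e2') # 100.0
number_literal('01')  # nil (leading zero)
number_literal('1.')  # nil (frac requires at least one digit)
```

As in the method above, a fraction or exponent produces a Float and anything else produces an Integer.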

#lex_unescaped_identifier ⇒ `Token`

Parse an identifier string which is not within delimiters. The standard set of unicode code points are allowed. No character escapes are allowed. Keywords and function names are ignored in this context.

# File 'lib/janeway/lexer.rb', line 405

def lex_unescaped_identifier
  consume while unescaped?(lookahead)
  identifier = source[lexeme_start_p..(next_p - 1)]
  Token.new(:identifier, identifier, identifier, current_location)
end

#lookahead(offset = 1) ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 120

def lookahead(offset = 1)
  lookahead_p = (next_p - 1) + offset
  return "\0" if lookahead_p >= source.length

  source[lookahead_p]
end
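
The interaction between `lookahead` and `consume` is easier to see in isolation. The following minimal sketch uses a hypothetical `Cursor` class (not part of Janeway) that keeps only the source string and the `next_p` pointer.

```ruby
# Hypothetical minimal cursor (not Janeway's class) showing lookahead and consume.
class Cursor
  attr_reader :next_p

  def initialize(source)
    @source = source
    @next_p = 0
  end

  # Peek at the character `offset` positions ahead without moving the pointer.
  # Returns "\0" once the position is past the end of the source.
  def lookahead(offset = 1)
    pos = (@next_p - 1) + offset
    return "\0" if pos >= @source.length

    @source[pos]
  end

  # Return the next character and advance the pointer by one.
  def consume
    char = lookahead
    @next_p += 1
    char
  end
end

cursor = Cursor.new('$.a')
cursor.lookahead    # "$"
cursor.lookahead(2) # "."
cursor.consume      # "$"
cursor.lookahead    # "."
```

Returning `"\0"` past the end of the input means callers can compare the result against characters without first checking for `nil`.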

#low_surrogate?(hex_digits) ⇒ `Boolean`

Return true if the given 4 char hex string is “low-surrogate”

# File 'lib/janeway/lexer.rb', line 311

def low_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[DC DD DE DF].include?(hex_digits[0..1].upcase)
end

#name_char?(char) ⇒ `Boolean`

True if character is acceptable in a name selector using shorthand notation (i.e. no bracket notation). This is the same set as #name_first_char? except that it also allows digits.

# File 'lib/janeway/lexer.rb', line 469

def name_char?(char)
  NAME_FIRST.include?(char) \
    || DIGITS.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end

#name_first_char?(char) ⇒ `Boolean`

True if character is suitable as the first character in a name selector using shorthand notation (i.e. no bracket notation).

Defined in RFC9535 by this ABNF grammar:

name-first = ALPHA /
             "_"   /
             %x80-D7FF /
                ; skip surrogate code points
             %xE000-10FFFF

# File 'lib/janeway/lexer.rb', line 459

def name_first_char?(char)
  NAME_FIRST.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end
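
Taken together, the two predicates describe a valid shorthand member name: one name-first character followed by any number of name characters. The sketch below is a standalone approximation which assumes that `NAME_FIRST` holds the ASCII letters and underscore, as the ABNF suggests. All three function names are hypothetical and are defined here only for illustration.

```ruby
# Hypothetical standalone predicates; they assume NAME_FIRST is ALPHA plus "_".
NON_ASCII_RANGES = [(0x80..0xD7FF), (0xE000..0x10FFFF)].freeze

def sketch_name_first?(char)
  char.match?(/\A[A-Za-z_]\z/) || NON_ASCII_RANGES.any? { |r| r.cover?(char.ord) }
end

def sketch_name_char?(char)
  sketch_name_first?(char) || char.match?(/\A[0-9]\z/)
end

# True if str could follow a dot in shorthand notation, e.g. $.foo1
def shorthand_name?(str)
  return false if str.empty?

  sketch_name_first?(str[0]) && str.each_char.all? { |c| sketch_name_char?(c) }
end

shorthand_name?('foo1')  # true
shorthand_name?('1foo')  # false (a digit cannot come first)
shorthand_name?('a-b')   # false ("-" needs bracket notation)
```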

#source_completed? ⇒ `Boolean`



# File 'lib/janeway/lexer.rb', line 501

def source_completed?
  next_p >= source.length # our pointer starts at 0, so the last char is length - 1.
end

#source_uncompleted? ⇒ `Boolean`



# File 'lib/janeway/lexer.rb', line 505

def source_uncompleted?
  !source_completed?
end

#start_tokenization ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 74

def start_tokenization
  if WHITESPACE.include?(@source[0]) || WHITESPACE.include?(@source[-1])
    raise err('JSONPath query may not start or end with whitespace')
  end

  tokenize while source_uncompleted?
  tokens << Token.new(:eof, '', nil, after_source_end_location)
end

#token_from_one_char_lex(lexeme) ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 127

def token_from_one_char_lex(lexeme)
  if %w[. -].include?(lexeme) && WHITESPACE.include?(lookahead)
    raise err("Operator #{lexeme.inspect} must not be followed by whitespace")
  end

  Token.new(OPERATORS.key(lexeme), lexeme, nil, current_location)
end

#token_from_one_or_two_char_lex(lexeme) ⇒ `Token`

Consumes an operator that could be either 1 or 2 chars in length

# File 'lib/janeway/lexer.rb', line 137

def token_from_one_or_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  if TWO_CHAR_LEX.include?(next_two_chars)
    consume
    if next_two_chars == '..' && WHITESPACE.include?(lookahead)
      raise err("Operator #{next_two_chars.inspect} must not be followed by whitespace")
    end
    Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
  else
    token_from_one_char_lex(lexeme)
  end
end

#token_from_two_char_lex(lexeme) ⇒ `Token`

Consumes a 2 char operator

# File 'lib/janeway/lexer.rb', line 152

def token_from_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  unless TWO_CHAR_LEX.include?(next_two_chars)
    raise err("Unknown operator \"#{lexeme}\"")
  end

  consume
  Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
end

#tokenize ⇒ `Object`

Read a token from the @source, increment the pointers.

# File 'lib/janeway/lexer.rb', line 84

def tokenize
  self.lexeme_start_p = next_p

  c = consume
  return if WHITESPACE.include?(c)

  token =
    if ONE_OR_TWO_CHAR_LEX.include?(c)
      token_from_one_or_two_char_lex(c)
    elsif ONE_CHAR_LEX.include?(c)
      token_from_one_char_lex(c)
    elsif TWO_CHAR_LEX_FIRST.include?(c)
      token_from_two_char_lex(c)
    elsif %w[" '].include?(c)
      lex_delimited_string(c)
    elsif digit?(c)
      lex_number
    elsif name_first_char?(c)
      lex_member_name_shorthand(ignore_keywords: tokens.last&.type == :dot)
    end

  if token
    tokens << token
  else
    raise err("Unknown character: #{c.inspect}")
  end
end

#unescaped?(char) ⇒ `Boolean`

Return true if string matches the definition of “unescaped” from RFC9535:

unescaped = %x20-21 /      ; see RFC 8259
               ; omit 0x22 "
            %x23-26 /
               ; omit 0x27 '
            %x28-5B /
               ; omit 0x5C \
            %x5D-D7FF /
               ; skip surrogate code points
            %xE000-10FFFF

# File 'lib/janeway/lexer.rb', line 422

def unescaped?(char)
  case char.ord
  when 0x20..0x21 then true # space, "!"
  when 0x23..0x26 then true # "#", "$", "%", "&"
  when 0x28..0x5B then true # "(" ... "["
  when 0x5D..0xD7FF then true # remaining ascii and lots of unicode code points
    # omit surrogate code points
  when 0xE000..0x10FFFF then true # many more unicode code points
  else false
  end
end
