Class: Janeway::Lexer

Inherits:

Object

Defined in: lib/janeway/lexer.rb

Overview

Transforms source code into tokens

Defined Under Namespace

Classes: Error

Constant Summary

OPERATORS =

{
  and: '&&',
  array_slice_separator: ':',
  child_end: ']',
  child_start: '[',
  current_node: '@',
  descendants: '..',
  dot: '.',
  equal: '==',
  filter: '?',
  greater_than: '>',
  greater_than_or_equal: '>=',
  group_end: ')',
  group_start: '(',
  less_than: '<',
  less_than_or_equal: '<=',
  minus: '-',
  not: '!',
  not_equal: '!=',
  or: '||',
  root: '$',
  union: ',',
  wildcard: '*',
}.freeze

ONE_CHAR_LEX =

OPERATORS.values.select { |lexeme| lexeme.size == 1 }.freeze

TWO_CHAR_LEX =

OPERATORS.values.select { |lexeme| lexeme.size == 2 }.freeze

TWO_CHAR_LEX_FIRST =

TWO_CHAR_LEX.map { |lexeme| lexeme[0] }.freeze

ONE_OR_TWO_CHAR_LEX =

(ONE_CHAR_LEX & TWO_CHAR_LEX_FIRST).freeze

WHITESPACE =

" \t\n\r"

KEYWORD =

%w[true false null].freeze

FUNCTIONS =

%w[length count match search value].freeze

ALPHABET =

Faster to check membership in a string than in an array of chars (benchmarked on Ruby 3.1.2).

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

DIGITS =

'0123456789'

NAME_FIRST =

Chars that may be used as the first letter of member-name-shorthand.

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_'
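
As a rough illustration of how the derived operator constants relate to OPERATORS, the sketch below rebuilds them from an abbreviated copy of the table. The shortened hash is an assumption made to keep the example small; the full table is the one listed above.

```ruby
# Abbreviated, illustrative copy of the OPERATORS table (not the full list).
OPERATORS = {
  and: '&&',
  descendants: '..',
  dot: '.',
  equal: '==',
  greater_than: '>',
  greater_than_or_equal: '>=',
  not: '!',
  not_equal: '!=',
  root: '$',
}.freeze

ONE_CHAR_LEX = OPERATORS.values.select { |lexeme| lexeme.size == 1 }.freeze
TWO_CHAR_LEX = OPERATORS.values.select { |lexeme| lexeme.size == 2 }.freeze
TWO_CHAR_LEX_FIRST = TWO_CHAR_LEX.map { |lexeme| lexeme[0] }.freeze

# Characters that are an operator on their own AND can begin a two-char operator.
ONE_OR_TWO_CHAR_LEX = (ONE_CHAR_LEX & TWO_CHAR_LEX_FIRST).freeze

p ONE_CHAR_LEX        # ["." , ">", "!", "$"] in table order
p TWO_CHAR_LEX        # the five two-char lexemes
p ONE_OR_TWO_CHAR_LEX # [".", ">", "!"]
```

Characters such as "&" and "=" appear only in TWO_CHAR_LEX_FIRST, so on their own they are not valid operators.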

Instance Attribute Summary

#lexeme_start_p ⇒ Object

Returns the value of attribute lexeme_start_p.
#next_p ⇒ Object

Returns the value of attribute next_p.
#source ⇒ Object readonly

Returns the value of attribute source.
#tokens ⇒ Object readonly

Returns the value of attribute tokens.

Class Method Summary

.lex(query) ⇒ Array<Token>

Tokenize and return the token list.

Instance Method Summary

#after_source_end_location ⇒ Object
#alpha_numeric?(lexeme) ⇒ Boolean
#consume ⇒ Object
#consume_digits ⇒ Object
#consume_escape_sequence ⇒ String

Read escape char literals, and transform them into the described character.
#consume_four_hex_digits ⇒ String

Consume and return 4 hex digits from the source.
#consume_unicode_escape_sequence ⇒ String

Consume a unicode escape that matches this ABNF grammar: www.rfc-editor.org/rfc/rfc9535.html#section-2.3.1.1-2.
#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ String

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.
#current_location ⇒ Object
#digit?(lexeme) ⇒ Boolean
#err(msg) ⇒ Lexer::Error

Return a Lexer::Error with the specified message, including the query and location.
#escapable?(char) ⇒ Boolean
#high_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “high-surrogate”.
#initialize(source) ⇒ Lexer constructor

A new instance of Lexer.
#lex_delimited_string(delimiter) ⇒ Token

String token.
#lex_identifier(ignore_keywords: false) ⇒ Object

Consume an alphanumeric string.
#lex_member_name_shorthand(ignore_keywords: false) ⇒ Token

Lex a member name that is found within dot notation.
#lex_number ⇒ Object

Consume a numeric string.
#lex_unescaped_identifier ⇒ Token

Parse an identifier string which is not within delimiters.
#lookahead(offset = 1) ⇒ Object
#low_surrogate?(hex_digits) ⇒ Boolean

Return true if the given 4 char hex string is “low-surrogate”.
#name_char?(char) ⇒ Boolean

True if character is acceptable in a name selector using shorthand notation (i.e. not bracket notation). This is the same set as #name_first_char? except that it also allows digits.
#name_first_char?(char) ⇒ Boolean

True if character is suitable as the first character in a name selector using shorthand notation (i.e. not bracket notation).
#source_completed? ⇒ Boolean
#source_uncompleted? ⇒ Boolean
#start_tokenization ⇒ Object
#token_from_one_char_lex(lexeme) ⇒ Object
#token_from_one_or_two_char_lex(lexeme) ⇒ Token

Consumes an operator that could be either 1 or 2 chars in length.
#token_from_two_char_lex(lexeme) ⇒ Token

Consumes a 2 char operator.
#tokenize ⇒ Object

Read a token from the @source, increment the pointers.
#unescaped?(char) ⇒ Boolean

Return true if the character matches the definition of “unescaped” from RFC9535.

Constructor Details

#initialize(source) ⇒ `Lexer`

Returns a new instance of Lexer.

# File 'lib/janeway/lexer.rb', line 67

def initialize(source)
  @source = source
  @tokens = []
  @next_p = 0
  @lexeme_start_p = 0
end

Instance Attribute Details

#lexeme_start_p ⇒ `Object`

Returns the value of attribute lexeme_start_p.



# File 'lib/janeway/lexer.rb', line 53

def lexeme_start_p
  @lexeme_start_p
end

#next_p ⇒ `Object`

Returns the value of attribute next_p.



# File 'lib/janeway/lexer.rb', line 53

def next_p
  @next_p
end

#source ⇒ `Object` (readonly)

Returns the value of attribute source.



# File 'lib/janeway/lexer.rb', line 52

def source
  @source
end

#tokens ⇒ `Object` (readonly)

Returns the value of attribute tokens.



# File 'lib/janeway/lexer.rb', line 52

def tokens
  @tokens
end

Class Method Details

.lex(query) ⇒ `Array<Token>`

Tokenize and return the token list.

Parameters:

query (String) —

jsonpath query

Returns:

(Array<Token>)

Raises:

(ArgumentError)

# File 'lib/janeway/lexer.rb', line 59

def self.lex(query)
  raise ArgumentError, "expect string, got #{query.inspect}" unless query.is_a?(String)

  lexer = new(query)
  lexer.start_tokenization
  lexer.tokens
end

Instance Method Details

#after_source_end_location ⇒ `Object`



# File 'lib/janeway/lexer.rb', line 515

def after_source_end_location
  Location.new(next_p, 1)
end

#alpha_numeric?(lexeme) ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/janeway/lexer.rb', line 114

def alpha_numeric?(lexeme)
  ALPHABET.include?(lexeme) || DIGITS.include?(lexeme)
end

#consume ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 159

def consume
  c = lookahead
  @next_p += 1
  c
end

#consume_digits ⇒ `Object`



# File 'lib/janeway/lexer.rb', line 165

def consume_digits
  consume while digit?(lookahead)
end

#consume_escape_sequence ⇒ `String`

Read escape char literals, and transform them into the described character

Returns:

(String) —

single character (possibly multi-byte)

# File 'lib/janeway/lexer.rb', line 214

def consume_escape_sequence
  raise err('Expect escape sequence') unless consume == '\\'

  char = consume
  case char
  when 'b' then "\b"
  when 'f' then "\f"
  when 'n' then "\n"
  when 'r' then "\r"
  when 't' then "\t"
  when '/', '\\', '"', "'" then char
  when 'u' then consume_unicode_escape_sequence
  else
    raise err("Character #{char} must not be escaped") if unescaped?(char)

    # whatever this is, it is not allowed even when escaped
    raise err("Invalid character #{char.inspect}")
  end
end
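
The single-character escapes handled above can be sketched as a lookup table. This is a standalone illustration, not the lexer's code: `SIMPLE_ESCAPES` and `unescape_simple` are hypothetical names, and the `\u` case is left out because it needs further input.

```ruby
# Hypothetical table: character following the backslash => literal character.
SIMPLE_ESCAPES = {
  'b' => "\b", 'f' => "\f", 'n' => "\n", 'r' => "\r", 't' => "\t",
  '/' => '/', '\\' => '\\', '"' => '"', "'" => "'"
}.freeze

# Return the literal character for a simple escape, or raise if the
# character is not one of the escapable characters.
def unescape_simple(char)
  SIMPLE_ESCAPES.fetch(char) do
    raise ArgumentError, "Character #{char.inspect} must not be escaped"
  end
end

p unescape_simple('n') # "\n"
p unescape_simple('/') # "/"
```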

#consume_four_hex_digits ⇒ `String`

Consume and return 4 hex digits from the source. Either upper or lower case is accepted. No judgment is made here on whether the resulting sequence is valid, as long as it is 4 hex digits.

Returns:

(String)

# File 'lib/janeway/lexer.rb', line 319

def consume_four_hex_digits
  hex_digits = []
  4.times do
    hex_digits << consume
    case hex_digits.last.ord
    when 0x30..0x39 then next # '0'..'9'
    when 0x41..0x46 then next # 'A'..'F'
    when 0x61..0x66 then next # 'a'..'f'
    else
      raise err("Invalid unicode escape sequence: \\u#{hex_digits.join}")
    end
  end
  raise err("Incomplete unicode escape sequence: \\u#{hex_digits.join}") if hex_digits.size < 4

  hex_digits.join
end
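
For comparison, the same "exactly four hex digits, either case" check can be expressed with a regular expression. This is a minimal sketch using a hypothetical helper name, `four_hex_digits?`; it validates a string that has already been read rather than consuming from the source.

```ruby
# Hypothetical helper: true if str is exactly four hex digits (either case).
# \h is Ruby's regexp shorthand for [0-9a-fA-F].
def four_hex_digits?(str)
  str.match?(/\A\h{4}\z/)
end

p four_hex_digits?('D83d') # true
p four_hex_digits?('00G0') # false
p four_hex_digits?('@000') # false, "@" (0x40) sits just below "A" (0x41)
```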

#consume_unicode_escape_sequence ⇒ `String`

Consume a unicode escape that matches this ABNF grammar: www.rfc-editor.org/rfc/rfc9535.html#section-2.3.1.1-2

hexchar             = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
non-surrogate       = ((DIGIT / "A"/"B"/"C" / "E"/"F") 3HEXDIG) /
                      ("D" %x30-37 2HEXDIG )
high-surrogate      = "D" ("8"/"9"/"A"/"B") 2HEXDIG
low-surrogate       = "D" ("C"/"D"/"E"/"F") 2HEXDIG

HEXDIG              = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

Both lower and uppercase are allowed. The grammar does not show this here, but a following note clarifies it.

The preceding `\u` prefix has already been consumed.

Returns:

(String) —

single character (possibly multi-byte)

# File 'lib/janeway/lexer.rb', line 251

def consume_unicode_escape_sequence
  # return a non-surrogate sequence
  hex_str = consume_four_hex_digits
  return hex_str.hex.chr('UTF-8') unless hex_str.upcase.start_with?('D')

  # hex string starts with D, but is still non-surrogate
  return [hex_str.hex].pack('U') if '01234567'.include?(hex_str[1])

  # hex value is in the high-surrogate or low-surrogate range.

  if high_surrogate?(hex_str)
    # valid, as long as it is followed by \u low-surrogate
    prefix = [consume, consume].join
    hex_str2 = consume_four_hex_digits

    # This is a high-surrogate followed by a low-surrogate, which is valid.
    # This is the UTF-16 method of representing certain high unicode codepoints.
    # However this specific byte sequence is not a valid way to represent that same
    # unicode character in the UTF-8 encoding.
    # The surrogate pair must be converted into the correct UTF-8 code point.
    # This returns a UTF-8 string containing a single unicode character.
    return convert_surrogate_pair_to_codepoint(hex_str, hex_str2) if prefix == '\\u' && low_surrogate?(hex_str2)

    # Not allowed to have high surrogate that is not followed by low surrogate
    raise err("Invalid unicode escape sequence: \\u#{hex_str2}")

  end
  # Not allowed to have low surrogate that is not preceded by high surrogate
  raise err("Invalid unicode escape sequence: \\u#{hex_str}")
end
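
The three branches of the grammar can be illustrated by classifying a 4-digit hex string by its numeric range. This is a sketch under the assumption that the input is already known to be four hex digits; `classify_hex` and `decode_non_surrogate` are hypothetical helper names, not part of the lexer.

```ruby
# Hypothetical classifier for a 4-digit hex string, following the ABNF above.
def classify_hex(hex_str)
  case hex_str.hex
  when 0xD800..0xDBFF then :high_surrogate
  when 0xDC00..0xDFFF then :low_surrogate
  else :non_surrogate
  end
end

# A non-surrogate value maps directly to a single UTF-8 character.
def decode_non_surrogate(hex_str)
  [hex_str.hex].pack('U')
end

p classify_hex('0041')         # :non_surrogate
p classify_hex('d83d')         # :high_surrogate
p classify_hex('DE09')         # :low_surrogate
p decode_non_surrogate('0041') # "A"
```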

#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ `String`

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.

Parameters:

high_surrogate_hex (String) —

string of hex digits, eg. “D83D”
low_surrogate_hex (String) —

string of hex digits, eg. “DE09”

Returns:

(String) —

UTF-8 string containing a single multi-byte unicode character, eg. “😉”

# File 'lib/janeway/lexer.rb', line 287

def convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex)
  [high_surrogate_hex, low_surrogate_hex].each do |hex_str|
    raise ArgumentError, "expect 4 hex digits, got #{hex_str.inspect}" unless hex_str.size == 4
  end

  # Calculate the code point from the surrogate pair values
  # algorithm from https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
  high = high_surrogate_hex.hex
  low = low_surrogate_hex.hex
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # convert integer codepoint to single character string
end
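
As a worked example of the arithmetic, the pair from the parameter documentation ("D83D", "DE09") can be evaluated step by step. The snippet is standalone and only repeats the formula shown above.

```ruby
high = 'D83D'.hex # 0xD83D
low = 'DE09'.hex  # 0xDE09

# (0xD83D - 0xD800) * 0x400 = 0x3D * 0x400 = 0xF400
# (0xDE09 - 0xDC00)         = 0x209
# 0xF400 + 0x209 + 0x10000  = 0x1F609
codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
char = [codepoint].pack('U') # integer code point to a single-character UTF-8 string

puts format('U+%X', codepoint) # U+1F609
puts char
```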

#current_location ⇒ `Object`



# File 'lib/janeway/lexer.rb', line 511

def current_location
  Location.new(lexeme_start_p, next_p - lexeme_start_p)
end

#digit?(lexeme) ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/janeway/lexer.rb', line 110

def digit?(lexeme)
  DIGITS.include?(lexeme)
end

#err(msg) ⇒ `Lexer::Error`

Return a Lexer::Error with the specified message, including the query and location.

Parameters:

msg (String) —

error message

Returns:

(Lexer::Error)



# File 'lib/janeway/lexer.rb', line 523

def err(msg)
  Error.new(msg, @source, current_location)
end

#escapable?(char) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/janeway/lexer.rb', line 431

def escapable?(char)
  case char.ord
  when 0x62 then true # backspace
  when 0x66 then true # form feed
  when 0x6E then true # line feed
  when 0x72 then true # carriage return
  when 0x74 then true # horizontal tab
  when 0x2F then true # slash
  when 0x5C then true # backslash
  else false
  end
end

#high_surrogate?(hex_digits) ⇒ `Boolean`

Return true if the given 4 char hex string is “high-surrogate”

Returns:

(Boolean)

# File 'lib/janeway/lexer.rb', line 301

def high_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[D8 D9 DA DB].include?(hex_digits[0..1].upcase)
end

#lex_delimited_string(delimiter) ⇒ `Token`

Returns string token.

Parameters:

delimiter (String) —

e.g. ' or "

Returns:

(Token) —

string token

# File 'lib/janeway/lexer.rb', line 171

def lex_delimited_string(delimiter)
  allowed_delimiters = %w[' "]
  # the "other" delimiter char, which is not currently being treated as a delimiter
  non_delimiter = allowed_delimiters.reject { |char| char == delimiter }.first

  literal_chars = []
  while lookahead != delimiter && source_uncompleted?
    # Transform escaped representation to literal chars
    next_char = lookahead
    literal_chars <<
      if next_char == '\\'
        if lookahead(2) == delimiter
          consume # \
          consume # delimiter
        elsif lookahead(2) == non_delimiter
          qtype = delimiter == '"' ? 'double' : 'single'
          raise err("Character #{non_delimiter} must not be escaped within #{qtype} quotes")
        else
          consume_escape_sequence # consumes multiple chars
        end
      elsif unescaped?(next_char)
        consume
      elsif allowed_delimiters.include?(next_char) && next_char != delimiter
        consume
      else
        raise err("invalid character #{next_char.inspect}")
      end
  end
  raise err("Unterminated string error: #{literal_chars.join.inspect}") if source_completed?

  consume # closing delimiter

  # literal value omits delimiters and includes un-escaped values
  literal = literal_chars.join

  # lexeme value includes delimiters and literal escape characters
  lexeme = source[lexeme_start_p..(next_p - 1)]

  Token.new(:string, lexeme, literal, current_location)
end

#lex_identifier(ignore_keywords: false) ⇒ `Object`

Consume an alphanumeric string. If `ignore_keywords` is true, the result is always an :identifier token. Otherwise, keywords and function names are recognized and tokenized as those types.

Parameters:

ignore_keywords (Boolean) (defaults to: false)

# File 'lib/janeway/lexer.rb', line 381

def lex_identifier(ignore_keywords: false)
  consume while alpha_numeric?(lookahead)

  identifier = source[lexeme_start_p..(next_p - 1)]
  type =
    if KEYWORD.include?(identifier) && !ignore_keywords
      identifier.to_sym
    elsif FUNCTIONS.include?(identifier) && !ignore_keywords
      :function
    else
      :identifier
    end

  Token.new(type, identifier, identifier, current_location)
end
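
The type selection can be shown in isolation. The sketch below copies the two word lists from the constant summary and uses a hypothetical helper, `identifier_type`, that returns only the token type instead of building a Token.

```ruby
KEYWORD = %w[true false null].freeze
FUNCTIONS = %w[length count match search value].freeze

# Hypothetical standalone version of the type selection in lex_identifier.
def identifier_type(identifier, ignore_keywords: false)
  return :identifier if ignore_keywords
  return identifier.to_sym if KEYWORD.include?(identifier)
  return :function if FUNCTIONS.include?(identifier)

  :identifier
end

p identifier_type('true')                          # :true
p identifier_type('length')                        # :function
p identifier_type('length', ignore_keywords: true) # :identifier
p identifier_type('store')                         # :identifier
```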

#lex_member_name_shorthand(ignore_keywords: false) ⇒ `Token`

Lex a member name that is found within dot notation. This name is not delimited and allows a subset of the characters that can appear in a delimited string.

Recognize keywords and give them the correct type.

Parameters:

ignore_keywords (Boolean) (defaults to: false)

Returns:

(Token)

#lex_number ⇒ `Object`

Consume a numeric string. May be an integer, fractional, or exponent.

number = (int / "-0") [ frac ] [ exp ] ; decimal number
frac   = "." 1*DIGIT                   ; decimal fraction
exp    = "e" [ "-" / "+" ] 1*DIGIT     ; decimal exponent

# File 'lib/janeway/lexer.rb', line 340

def lex_number
  consume_digits

  # Look for a fractional part
  if lookahead == '.' && digit?(lookahead(2))
    consume # "."
    consume_digits
  end

  # Look for an exponent part
  if 'Ee'.include?(lookahead)
    consume # "e", "E"
    if %w[+ -].include?(lookahead)
      consume # "+" / "-"
    end
    unless digit?(lookahead)
      lexeme = source[lexeme_start_p..(next_p - 1)]
      raise err("Exponent 'e' must be followed by number: #{lexeme.inspect}")
    end
    consume_digits
  end

  lexeme = source[lexeme_start_p..(next_p - 1)]
  if lexeme.start_with?('0') && lexeme.size > 1
    raise err("Number may not start with leading zero: #{lexeme.inspect}")
  end

  literal =
    if lexeme.include?('.') || lexeme.downcase.include?('e')
      lexeme.to_f
    else
      lexeme.to_i
    end
  Token.new(:number, lexeme, literal, current_location)
end
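
The number grammar can also be written as a single regular expression, which is handy for checking candidate lexemes by hand. This is an illustrative sketch, not how lex_number works: `NUMBER_PATTERN` and `number_literal` are hypothetical names, the optional leading "-" follows the RFC 9535 `int` rule, and the integer/float split mirrors the literal conversion shown above.

```ruby
# Illustrative pattern for: number = (int / "-0") [ frac ] [ exp ]
NUMBER_PATTERN = /\A(?:-?0|-?[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?\z/

# Return an Integer or Float for a valid number lexeme, or nil otherwise.
def number_literal(lexeme)
  return nil unless NUMBER_PATTERN.match?(lexeme)

  if lexeme.include?('.') || lexeme.downcase.include?('e')
    lexeme.to_f
  else
    lexeme.to_i
  end
end

p number_literal('42')    # 42
p number_literal('1.5e3') # 1500.0
p number_literal('01')    # nil (leading zero)
p number_literal('1e')    # nil (exponent without digits)
```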

#lex_unescaped_identifier ⇒ `Token`

Parse an identifier string which is not within delimiters. The standard set of unicode code points are allowed. No character escapes are allowed. Keywords and function names are ignored in this context.

Returns:

(Token)

# File 'lib/janeway/lexer.rb', line 402

def lex_unescaped_identifier
  consume while unescaped?(lookahead)
  identifier = source[lexeme_start_p..(next_p - 1)]
  Token.new(:identifier, identifier, identifier, current_location)
end

#lookahead(offset = 1) ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 118

def lookahead(offset = 1)
  lookahead_p = (next_p - 1) + offset
  return "\0" if lookahead_p >= source.length

  source[lookahead_p]
end
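
The pointer arithmetic is easiest to follow on a small stand-in. The `Cursor` class below is hypothetical and reproduces only `lookahead` and `consume` as documented on this page, so their interaction can be tried without the rest of the lexer.

```ruby
# Minimal stand-in for the lexer's cursor (hypothetical class for illustration).
class Cursor
  attr_reader :source, :next_p

  def initialize(source)
    @source = source
    @next_p = 0
  end

  # Peek at the character `offset` positions ahead without advancing.
  # lookahead(1) is the character that consume would return next.
  def lookahead(offset = 1)
    lookahead_p = (next_p - 1) + offset
    return "\0" if lookahead_p >= source.length

    source[lookahead_p]
  end

  # Return the next character and advance the pointer.
  def consume
    c = lookahead
    @next_p += 1
    c
  end
end

cursor = Cursor.new('$.a')
p cursor.lookahead    # "$"
p cursor.lookahead(2) # "."
p cursor.consume      # "$"
p cursor.lookahead    # "."
```

Past the end of the source, `lookahead` returns "\0" rather than nil, so callers can compare characters without nil checks.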

#low_surrogate?(hex_digits) ⇒ `Boolean`

Return true if the given 4 char hex string is “low-surrogate”

Returns:

(Boolean)

# File 'lib/janeway/lexer.rb', line 308

def low_surrogate?(hex_digits)
  return false unless hex_digits.size == 4

  %w[DC DD DE DF].include?(hex_digits[0..1].upcase)
end

#name_char?(char) ⇒ `Boolean`

True if character is acceptable in a name selector using shorthand notation (i.e. not bracket notation). This is the same set as #name_first_char? except that it also allows digits.

Parameters:

char (String) —

single character, possibly multi-byte

Returns:

(Boolean)

# File 'lib/janeway/lexer.rb', line 466

def name_char?(char)
  NAME_FIRST.include?(char) \
    || DIGITS.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end

#name_first_char?(char) ⇒ `Boolean`

True if character is suitable as the first character in a name selector using shorthand notation (i.e. not bracket notation).

Defined in RFC9535 by this ABNF grammar:

name-first = ALPHA /
             "_" /
             %x80-D7FF /
                ; skip surrogate code points
             %xE000-10FFFF

Parameters:

char (String) —

single character, possibly multi-byte

Returns:

(Boolean)

# File 'lib/janeway/lexer.rb', line 456

def name_first_char?(char)
  NAME_FIRST.include?(char) \
    || (0x80..0xD7FF).cover?(char.ord) \
    || (0xE000..0x10FFFF).cover?(char.ord)
end
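
Taken together, #name_first_char? and #name_char? decide whether a whole string is usable as a shorthand member name. The sketch below restates both predicates outside the class and adds a hypothetical helper, `shorthand_name?`, that applies them to a full string.

```ruby
NAME_FIRST = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_'
DIGITS = '0123456789'

def name_first_char?(char)
  NAME_FIRST.include?(char) ||
    (0x80..0xD7FF).cover?(char.ord) ||
    (0xE000..0x10FFFF).cover?(char.ord)
end

# Same set as name_first_char?, plus digits.
def name_char?(char)
  name_first_char?(char) || DIGITS.include?(char)
end

# Hypothetical helper: true if the whole string is a valid shorthand name.
def shorthand_name?(str)
  return false if str.empty?

  name_first_char?(str[0]) && str.each_char.all? { |c| name_char?(c) }
end

p shorthand_name?('item2') # true
p shorthand_name?('2item') # false, a digit may not come first
p shorthand_name?('a-b')   # false, "-" is not a name character
```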

#source_completed? ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/janeway/lexer.rb', line 503

def source_completed?
  next_p >= source.length # our pointer starts at 0, so the last char is length - 1.
end

#source_uncompleted? ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/janeway/lexer.rb', line 507

def source_uncompleted?
  !source_completed?
end

#start_tokenization ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 74

def start_tokenization
  raise err('JSONPath query is empty') if @source.empty?
  if WHITESPACE.include?(@source[0]) || WHITESPACE.include?(@source[-1])
    raise err('JSONPath query may not start or end with whitespace')
  end

  tokenize while source_uncompleted?
  tokens << Token.new(:eof, '', nil, after_source_end_location)
end

#token_from_one_char_lex(lexeme) ⇒ `Object`

# File 'lib/janeway/lexer.rb', line 125

def token_from_one_char_lex(lexeme)
  if %w[. -].include?(lexeme) && WHITESPACE.include?(lookahead)
    raise err("Operator #{lexeme.inspect} must not be followed by whitespace")
  end

  Token.new(OPERATORS.key(lexeme), lexeme, nil, current_location)
end

#token_from_one_or_two_char_lex(lexeme) ⇒ `Token`

Consumes an operator that could be either 1 or 2 chars in length

Returns:

(Token)

# File 'lib/janeway/lexer.rb', line 135

def token_from_one_or_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  if TWO_CHAR_LEX.include?(next_two_chars)
    consume
    if next_two_chars == '..' && WHITESPACE.include?(lookahead)
      raise err("Operator #{next_two_chars.inspect} must not be followed by whitespace")
    end

    Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
  else
    token_from_one_char_lex(lexeme)
  end
end
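
The "prefer the longer operator" rule used here can be sketched without tokens or pointers. `TWO_CHAR_OPS` and `operator_at` are hypothetical names for this illustration; the list of two-char lexemes is copied from the OPERATORS table.

```ruby
TWO_CHAR_OPS = %w[&& .. == >= <= != ||].freeze

# Hypothetical helper: return the longest operator lexeme starting at pos.
def operator_at(source, pos)
  pair = source[pos, 2]
  return pair if TWO_CHAR_OPS.include?(pair)

  source[pos]
end

p operator_at('>=1', 0) # ">="
p operator_at('>1', 0)  # ">"
p operator_at('..a', 0) # ".."
p operator_at('.a', 0)  # "."
```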

#token_from_two_char_lex(lexeme) ⇒ `Token`

Consumes a 2 char operator

Returns:

(Token)

# File 'lib/janeway/lexer.rb', line 151

def token_from_two_char_lex(lexeme)
  next_two_chars = [lexeme, lookahead].join
  raise err("Unknown operator \"#{lexeme}\"") unless TWO_CHAR_LEX.include?(next_two_chars)

  consume
  Token.new(OPERATORS.key(next_two_chars), next_two_chars, nil, current_location)
end

#tokenize ⇒ `Object`

Read a token from the @source, increment the pointers.

# File 'lib/janeway/lexer.rb', line 85

def tokenize
  self.lexeme_start_p = next_p

  c = consume
  return if WHITESPACE.include?(c)

  token =
    if ONE_OR_TWO_CHAR_LEX.include?(c)
      token_from_one_or_two_char_lex(c)
    elsif ONE_CHAR_LEX.include?(c)
      token_from_one_char_lex(c)
    elsif TWO_CHAR_LEX_FIRST.include?(c)
      token_from_two_char_lex(c)
    elsif %w[" '].include?(c)
      lex_delimited_string(c)
    elsif digit?(c)
      lex_number
    elsif name_first_char?(c)
      lex_member_name_shorthand(ignore_keywords: tokens.last&.type == :dot)
    end
  raise err("Unknown character: #{c.inspect}") unless token

  tokens << token
end

#unescaped?(char) ⇒ `Boolean`

Return true if string matches the definition of “unescaped” from RFC9535:

unescaped = %x20-21 /   ; see RFC 8259
               ; omit 0x22 "
            %x23-26 /
               ; omit 0x27 '
            %x28-5B /
               ; omit 0x5C \
            %x5D-D7FF /
               ; skip surrogate code points
            %xE000-10FFFF

Parameters:

char (String) —

single character, possibly multi-byte

Returns:

(Boolean)

# File 'lib/janeway/lexer.rb', line 419

def unescaped?(char)
  case char.ord
  when 0x20..0x21 then true # space, "!"
  when 0x23..0x26 then true # "#", "$", "%", "&"
  when 0x28..0x5B then true # "(" ... "["
  when 0x5D..0xD7FF then true # remaining ascii and lots of unicode code points
    # omit surrogate code points
  when 0xE000..0x10FFFF then true # many more unicode code points
  else false
  end
end
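
One way to see what the ranges exclude is to run the predicate over printable ASCII. The snippet below restates the method in compact form as a standalone function and lists the characters it rejects.

```ruby
# Compact standalone restatement of the ranges above.
def unescaped?(char)
  case char.ord
  when 0x20..0x21, 0x23..0x26, 0x28..0x5B, 0x5D..0xD7FF, 0xE000..0x10FFFF then true
  else false
  end
end

# Within printable ASCII, only the two quotes and the backslash are rejected.
rejected = (0x20..0x7E).map(&:chr).reject { |c| unescaped?(c) }
p rejected # ["\"", "'", "\\"]

# Control characters such as newline fall below 0x20 and are rejected too.
p unescaped?("\n") # false
```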
