Class: EBNF::LL1::Lexer

Inherits:
Object
Includes:
Unescape, Enumerable
Defined in:
lib/ebnf/ll1/lexer.rb

Overview

A lexical analyzer

Examples:

Tokenizing a Turtle string

terminals = [
  [:BLANK_NODE_LABEL, %r(_:(#{PN_LOCAL}))],
  ...
]
ttl = "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
lexer = EBNF::LL1::Lexer.tokenize(ttl, terminals)
lexer.each_token do |token|
  puts token.inspect
end

Tokenizing and returning a token stream

lexer = EBNF::LL1::Lexer.tokenize(...)
while some_condition # e.g., more tokens expected
  token = lexer.first # Get the current token
  token = lexer.shift # Get the current token and shift to the next
end

Handling error conditions

begin
  EBNF::LL1::Lexer.tokenize(query)
rescue EBNF::LL1::Lexer::Error => error
  warn error.inspect
end

Defined Under Namespace

Classes: Error, Terminal, Token

Constant Summary

Constants included from Unescape

Unescape::ECHAR, Unescape::ESCAPE_CHAR4, Unescape::ESCAPE_CHAR8, Unescape::ESCAPE_CHARS, Unescape::UCHAR

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Unescape

unescape

Constructor Details

#initialize(input = nil, terminals = nil, **options) ⇒ Lexer

Initializes a new lexer instance.

Parameters:

  • input (String, #to_s) (defaults to: nil)
  • terminals (Array<Array<Symbol, Regexp>, Terminal>) (defaults to: nil)

    Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.

  • options (Hash{Symbol => Object})

    A customizable set of options.

Options Hash (**options):

  • :whitespace (Regexp)

    Whitespace between tokens, including comments

Raises:

  • (Error) if no terminal patterns are defined


# File 'lib/ebnf/ll1/lexer.rb', line 94

def initialize(input = nil, terminals = nil, **options)
  @options        = options.dup
  @whitespace     = @options[:whitespace]
  @terminals      = terminals.map do |term|
    if term.is_a?(Array) && term.length == 3
      # Last element is options
      Terminal.new(term[0], term[1], **term[2])
    elsif term.is_a?(Array)
      Terminal.new(*term)
    else
      term
    end
  end

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @scanner = Scanner.new(input, **options)
end

Instance Attribute Details

#input ⇒ String

The current input string being processed.

Returns:

  • (String)


# File 'lib/ebnf/ll1/lexer.rb', line 123

def input
  @input
end

#options ⇒ Hash (readonly)

Any additional options for the lexer.

Returns:

  • (Hash)


# File 'lib/ebnf/ll1/lexer.rb', line 117

def options
  @options
end

#whitespace ⇒ Regexp (readonly)

The pattern defining whitespace, including comments; if unset, whitespace must be matched explicitly by terminals.

Returns:

  • (Regexp)

    defines whitespace, including comments, otherwise whitespace must be explicit in terminals



# File 'lib/ebnf/ll1/lexer.rb', line 39

def whitespace
  @whitespace
end

Class Method Details

.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer

Tokenizes the given `input` string or stream.

Parameters:

  • input (String, #to_s)
  • terminals (Array<Array<Symbol, Regexp>>)

    Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.

  • options (Hash{Symbol => Object})

Yields:

  • (lexer)

Yield Parameters:

  • lexer (Lexer)

Returns:

  • (Lexer)

Raises:

  • (Error)


# File 'lib/ebnf/ll1/lexer.rb', line 77

def self.tokenize(input, terminals, **options, &block)
  lexer = self.new(input, terminals, **options)
  block_given? ? block.call(lexer) : lexer
end

.unescape_codepoints(string) ⇒ String

Returns a copy of the given `input` string with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.

Parameters:

  • string (String)

Returns:

  • (String)



# File 'lib/ebnf/ll1/lexer.rb', line 49

def self.unescape_codepoints(string)
  ::EBNF::Unescape.unescape_codepoints(string)
end

.unescape_string(input) ⇒ String

Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.

Parameters:

  • input (String)

Returns:

  • (String)



# File 'lib/ebnf/ll1/lexer.rb', line 61

def self.unescape_string(input)
  ::EBNF::Unescape.unescape_string(input)
end

Instance Method Details

#each_token {|token| ... } ⇒ Enumerator

Also known as: each

Enumerates each token in the input string.

Yields:

  • (token)

Yield Parameters:

  • token (Token)
Returns:

  • (Enumerator)


# File 'lib/ebnf/ll1/lexer.rb', line 146

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end

#first(*types) ⇒ Token

Returns the first token in the input stream without consuming it.

Parameters:

  • types (Array[Symbol])

    Optional set of types for restricting terminals examined

Returns:

  • (Token)

# File 'lib/ebnf/ll1/lexer.rb', line 161

def first(*types)
  return nil unless scanner

  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return nil if scanner.eos?

    token = match_token(*types)

    if token.nil?
      lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
        input: scanner.rest[0..100], token: lexme, lineno: lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new(e.message,
    input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end

#lineno ⇒ Integer

The current line number (one-based).

Returns:

  • (Integer)


# File 'lib/ebnf/ll1/lexer.rb', line 220

def lineno
  scanner.lineno
end

#recover(*types) ⇒ Token

Skip input until a token is matched

Parameters:

  • types (Array[Symbol])

    Optional set of types for restricting terminals examined

Returns:

  • (Token)


# File 'lib/ebnf/ll1/lexer.rb', line 203

def recover(*types)
  until scanner.eos? || tok = match_token(*types)
    if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
      # No whitespace at the end; must be at the end of the string
      scanner.terminate
    else
      skip_whitespace
    end
  end
  scanner.unscan if tok
  first
end

#shift ⇒ Token

Returns the first token and shifts to the next.

Returns:

  • (Token)


# File 'lib/ebnf/ll1/lexer.rb', line 192

def shift
  cur = first
  @first = nil
  cur
end

#valid? ⇒ Boolean

Returns `true` if the input string is lexically valid.

To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.

Returns:

  • (Boolean)


# File 'lib/ebnf/ll1/lexer.rb', line 132

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end