Class: EBNF::LL1::Lexer

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/ebnf/ll1/lexer.rb

Overview

A lexical analyzer

Examples:

Tokenizing a Turtle string

terminals = [
  [:BLANK_NODE_LABEL, %r(_:(#{PN_LOCAL}))],
  ...
]
ttl = "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
lexer = EBNF::LL1::Lexer.tokenize(ttl, terminals)
lexer.each_token do |token|
  puts token.inspect
end

Tokenizing and returning a token stream

lexer = EBNF::LL1::Lexer.tokenize(...)
while some_condition
  token = lexer.first # Get the current token
  token = lexer.shift # Get the current token and shift to the next
end

Handling error conditions

begin
  EBNF::LL1::Lexer.tokenize(query)
rescue EBNF::LL1::Lexer::Error => error
  warn error.inspect
end

Defined Under Namespace

Classes: Error, Terminal, Token

Constant Summary

ESCAPE_CHARS =
{
  '\\t'   => "\t",  # \u0009 (tab)
  '\\n'   => "\n",  # \u000A (line feed)
  '\\r'   => "\r",  # \u000D (carriage return)
  '\\b'   => "\b",  # \u0008 (backspace)
  '\\f'   => "\f",  # \u000C (form feed)
  '\\"'  => '"',    # \u0022 (quotation mark, double quote mark)
  "\\'"  => '\'',   # \u0027 (apostrophe-quote, single quote mark)
  '\\\\' => '\\'    # \u005C (backslash)
}.freeze
ESCAPE_CHAR4 =

\uXXXX

/\\u(?:[0-9A-Fa-f]{4,4})/u.freeze
ESCAPE_CHAR8 =

\UXXXXXXXX

/\\U(?:[0-9A-Fa-f]{8,8})/u.freeze
ECHAR =

More liberal unescaping

/\\./u.freeze
UCHAR =
/#{ESCAPE_CHAR4}|#{ESCAPE_CHAR8}/n.freeze

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(input = nil, terminals = nil, **options) ⇒ Lexer

Initializes a new lexer instance.

Options Hash (**options):

  • :whitespace (Regexp)

    Whitespace between tokens, including comments

Raises:

  • (Error)


# File 'lib/ebnf/ll1/lexer.rb', line 118

def initialize(input = nil, terminals = nil, **options)
  @options        = options.dup
  @whitespace     = @options[:whitespace]
  @terminals      = terminals.map do |term|
    if term.is_a?(Array) && term.length == 3
      # Last element is options
      Terminal.new(term[0], term[1], **term[2])
    elsif term.is_a?(Array)
      Terminal.new(*term)
    else
      term
    end
  end

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @scanner = Scanner.new(input, **options)
end
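
Example (a construction sketch; the terminal names and regular expressions are invented for illustration, and :whitespace is the option documented above):

# Terminals may be Terminal instances, [type, regexp] pairs,
# or [type, regexp, options] triples, as the constructor shows.
terminals = [
  EBNF::LL1::Lexer::Terminal.new(:STRING, /"[^"]*"/),  # illustrative terminal
  [:INT,  /\d+/],                                      # illustrative terminal
  [:WORD, /[A-Za-z]+/]                                 # illustrative terminal
]
lexer = EBNF::LL1::Lexer.new('say "hi" 42', terminals, whitespace: /\s+/)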

Instance Attribute Details

#input ⇒ String

The current input string being processed.



# File 'lib/ebnf/ll1/lexer.rb', line 147

def input
  @input
end

#options ⇒ Hash (readonly)

Any additional options for the lexer.



# File 'lib/ebnf/ll1/lexer.rb', line 141

def options
  @options
end

#whitespace ⇒ Regexp (readonly)

Whitespace between tokens, including comments.

# File 'lib/ebnf/ll1/lexer.rb', line 53

def whitespace
  @whitespace
end

Class Method Details

.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer

Tokenizes the given `input` string or stream.

Yields:

  • (lexer)

Yield Parameters:

  • lexer (Lexer)

Raises:

  • (Error)


# File 'lib/ebnf/ll1/lexer.rb', line 101

def self.tokenize(input, terminals, **options, &block)
  lexer = self.new(input, terminals, **options)
  block_given? ? block.call(lexer) : lexer
end
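
Example (the terminal definitions are illustrative, not part of the library):

terminals = [[:WORD, /\w+/], [:COMMA, /,/]]
# Without a block, the new lexer is returned:
lexer = EBNF::LL1::Lexer.tokenize("a, b", terminals, whitespace: /\s+/)
# With a block, the new lexer is yielded to it instead:
EBNF::LL1::Lexer.tokenize("a, b", terminals, whitespace: /\s+/) do |lex|
  lex.each_token {|token| puts token.inspect}
end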

.unescape_codepoints(string) ⇒ String

Returns a copy of the given `input` string with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.



# File 'lib/ebnf/ll1/lexer.rb', line 63

def self.unescape_codepoints(string)
  string = string.dup
  string.force_encoding(Encoding::ASCII_8BIT) if string.respond_to?(:force_encoding)

  # Decode \uXXXX and \UXXXXXXXX code points:
  string = string.gsub(UCHAR) do |c|
    s = [(c[2..-1]).hex].pack('U*')
    s.respond_to?(:force_encoding) ? s.force_encoding(Encoding::ASCII_8BIT) : s
  end

  string.force_encoding(Encoding::UTF_8) if string.respond_to?(:force_encoding) 
  string
end
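
Example (single-quoted Ruby strings, so the backslashes are literal characters):

EBNF::LL1::Lexer.unescape_codepoints('\u0041')      # => "A"
EBNF::LL1::Lexer.unescape_codepoints('\U0001F600')  # => "\u{1F600}" (a single emoji character)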

.unescape_string(input) ⇒ String

Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.



# File 'lib/ebnf/ll1/lexer.rb', line 85

def self.unescape_string(input)
  input.gsub(ECHAR) { |escaped| ESCAPE_CHARS[escaped] || escaped[1..-1]}
end
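
Example (single-quoted, so each escape is two literal characters on input):

EBNF::LL1::Lexer.unescape_string('a\tb')  # => "a\tb" with a real tab character
EBNF::LL1::Lexer.unescape_string('\z')    # => "z"; unrecognized escapes simply drop the backslash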

Instance Method Details

#each_token {|token| ... } ⇒ Enumerator Also known as: each

Enumerates each token in the input string.

Yields:

  • (token)

Yield Parameters:

  • token (Token)


# File 'lib/ebnf/ll1/lexer.rb', line 170

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end
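
Example (illustrative terminals; with a block the stream is consumed as it is yielded, without a block an Enumerator over the remaining tokens is returned):

lexer = EBNF::LL1::Lexer.tokenize("a b c", [[:WORD, /\w+/]], whitespace: /\s+/)
lexer.each_token {|token| puts token.inspect}

enum = EBNF::LL1::Lexer.tokenize("a b c", [[:WORD, /\w+/]], whitespace: /\s+/).each_token
enum.map(&:inspect)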

#first(*types) ⇒ Token

Returns the first token in the input stream.



# File 'lib/ebnf/ll1/lexer.rb', line 185

def first(*types)
  return nil unless scanner

  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return nil if scanner.eos?

    token = match_token(*types)

    if token.nil?
      lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
        input: scanner.rest[0..100], token: lexme, lineno: lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new(e.message,
    input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end
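
Example: as the memoization of @first in the source suggests, first peeks at the current token without consuming it; shift (below) consumes it. The terminals here are illustrative.

lexer = EBNF::LL1::Lexer.new("1 + 2", [[:INT, /\d+/], [:OP, /[+]/]], whitespace: /\s+/)
lexer.first   # => the :INT token for "1" (peek only; calling it again returns the same token)
lexer.shift   # consume that token and advance
lexer.first   # => the :OP token for "+"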

#lineno ⇒ Integer

The current line number (one-based).



# File 'lib/ebnf/ll1/lexer.rb', line 244

def lineno
  scanner.lineno
end
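
Example (assuming the underlying Scanner starts counting at line 1):

lexer = EBNF::LL1::Lexer.tokenize("a\nb", [[:WORD, /\w+/]], whitespace: /\s+/)
lexer.lineno  # => 1; the value advances as the scanner consumes newlines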

#recover(*types) ⇒ Token

Skips input until a token is matched.



# File 'lib/ebnf/ll1/lexer.rb', line 227

def recover(*types)
  until scanner.eos? || tok = match_token(*types)
    if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
      # No whitespace at the end; must be the end of the string
      scanner.terminate
    else
      skip_whitespace
    end
  end
  scanner.unscan if tok
  first
end
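
Example of recovering after an invalid token (a sketch with illustrative terminals; "%%%" matches no terminal):

lexer = EBNF::LL1::Lexer.new("%%% valid", [[:WORD, /[A-Za-z]+/]], whitespace: /\s+/)
begin
  lexer.first
rescue EBNF::LL1::Lexer::Error
  lexer.recover  # => the :WORD token for "valid"
end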

#shift ⇒ Token

Returns the first token and shifts to the next.



# File 'lib/ebnf/ll1/lexer.rb', line 216

def shift
  cur = first
  @first = nil
  cur
end
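
Example of a pull-style loop (illustrative terminals), matching the token-stream example in the overview:

lexer = EBNF::LL1::Lexer.tokenize("a b c", [[:WORD, /\w+/]], whitespace: /\s+/)
while token = lexer.shift
  puts token.inspect  # each call returns the current token and advances to the next
end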

#valid? ⇒ Boolean

Returns `true` if the input string is lexically valid.

To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.



# File 'lib/ebnf/ll1/lexer.rb', line 156

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end
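
Example (illustrative terminals; the second input contains characters no terminal matches, and the third yields zero terminals):

terminals = [[:WORD, /[A-Za-z]+/]]
EBNF::LL1::Lexer.new("abc def", terminals, whitespace: /\s+/).valid?  # => true
EBNF::LL1::Lexer.new("abc %%%", terminals, whitespace: /\s+/).valid?  # => false
EBNF::LL1::Lexer.new("",        terminals, whitespace: /\s+/).valid?  # => false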