Class: EBNF::LL1::Lexer

Inherits: Object

Includes: Unescape, Enumerable

Defined in: lib/ebnf/ll1/lexer.rb

Overview

A lexical analyzer

Examples:

Tokenizing a Turtle string

terminals = [
  [:BLANK_NODE_LABEL, %r(_:(#{PN_LOCAL}))],
  ...
]
ttl = "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
lexer = EBNF::LL1::Lexer.tokenize(ttl, terminals)
lexer.each_token do |token|
  puts token.inspect
end

Tokenizing and returning a token stream

lexer = EBNF::LL1::Lexer.tokenize(...)
while lexer.first # nil once the input is exhausted
  token = lexer.first # Get the current token
  token = lexer.shift # Get the current token and shift to the next
end

Handling error conditions

begin
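  lexer = EBNF::LL1::Lexer.tokenize(query, terminals)
  lexer.each_token { |token| puts token.inspect } # invalid tokens raise while scanning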
rescue EBNF::LL1::Lexer::Error => error
  warn error.inspect
end

Defined Under Namespace

Classes: Error, Terminal, Token

Constant Summary

Constants included from Unescape

Unescape::ECHAR, Unescape::ESCAPE_CHAR4, Unescape::ESCAPE_CHAR8, Unescape::ESCAPE_CHARS, Unescape::UCHAR

Instance Attribute Summary

#input ⇒ String

The current input string being processed.

#options ⇒ Hash (readonly)

Any additional options for the lexer.

#whitespace ⇒ Regexp (readonly)

Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.

Class Method Summary

.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer

Tokenizes the given `input` string or stream.

.unescape_codepoints(string) ⇒ String

Returns a copy of the given `string` with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.

.unescape_string(input) ⇒ String

Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.

Instance Method Summary

#each_token {|token| ... } ⇒ Enumerator (also: #each)

Enumerates each token in the input string.

#first(*types) ⇒ Token

Returns the first token in the input stream.

#initialize(input = nil, terminals = nil, **options) ⇒ Lexer (constructor)

Initializes a new lexer instance.

#lineno ⇒ Integer

The current line number (one-based).

#recover(*types) ⇒ Token

Skips input until a token is matched.

#shift ⇒ Token

Returns the first token and shifts to the next.

#valid? ⇒ Boolean

Returns `true` if the input string is lexically valid.

Methods included from Unescape

unescape

Constructor Details

#initialize(input = nil, terminals = nil, **options) ⇒ `Lexer`

Initializes a new lexer instance.

Parameters:

input (String, #to_s) (defaults to: nil)
terminals (Array<Array<Symbol, Regexp>, Terminal>) (defaults to: nil) —

Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.
options (Hash{Symbol => Object})

A customizable set of options, listed below.

Options Hash (**options):

:whitespace (Regexp) —

Whitespace between tokens, including comments

Raises:

(Error)

# File 'lib/ebnf/ll1/lexer.rb', line 94

def initialize(input = nil, terminals = nil, **options)
  @options        = options.dup
  @whitespace     = @options[:whitespace]
  @terminals      = terminals.map do |term|
    if term.is_a?(Array) && term.length ==3
      # Last element is options
      Terminal.new(term[0], term[1], **term[2])
    elsif term.is_a?(Array)
      Terminal.new(*term)
    else
      term
    end
  end

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @scanner = Scanner.new(input, **options)
end
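
The way the constructor accepts two-element pairs, three-element entries with trailing options, and ready-made terminal objects can be sketched in isolation. This is a minimal stand-alone illustration, not the library's code: `TerminalSketch`, `normalize_terminals`, and the `:flag` option key are hypothetical names introduced here. The sketch also checks for a missing list before mapping, on the assumption that a `nil` argument should produce the "not defined" error; in the listing above, `terminals.map` runs before that check.

```ruby
# Stand-in for EBNF::LL1::Lexer::Terminal, for illustration only.
TerminalSketch = Struct.new(:type, :regexp, :options)

# Hypothetical helper mirroring the normalization step of the constructor.
def normalize_terminals(terminals)
  list = terminals || []
  raise ArgumentError, "Terminal patterns not defined" if list.empty?

  list.map do |term|
    case term
    when Array
      # [type, regexp] or [type, regexp, options]
      type, regexp, opts = term
      TerminalSketch.new(type, regexp, opts || {})
    else
      term # already a terminal object, passed through unchanged
    end
  end
end

terms = normalize_terminals([
  [:INTEGER, /[0-9]+/],
  [:STRING, /"[^"]*"/, { flag: true }], # :flag is a made-up option key
  [nil, /[.;,]/]                        # nil type: matches string terminals
])
terms.each { |t| puts "#{t.type.inspect} #{t.regexp.inspect} #{t.options.inspect}" }
```

Keeping the normalization in one place means the rest of a lexer only ever deals with terminal objects, whatever shape the caller supplied.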

Instance Attribute Details

#input ⇒ `String`

The current input string being processed.

Returns:

(String)



# File 'lib/ebnf/ll1/lexer.rb', line 123

def input
  @input
end

#options ⇒ `Hash` (readonly)

Any additional options for the lexer.

Returns:

(Hash)



# File 'lib/ebnf/ll1/lexer.rb', line 117

def options
  @options
end

#whitespace ⇒ `Regexp` (readonly)

Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.

Returns:

(Regexp) —

defines whitespace, including comments, otherwise whitespace must be explicit in terminals



# File 'lib/ebnf/ll1/lexer.rb', line 39

def whitespace
  @whitespace
end

Class Method Details

.tokenize(input, terminals, **options) {|lexer| ... } ⇒ `Lexer`

Tokenizes the given `input` string or stream.

Parameters:

input (String, #to_s)
terminals (Array<Array<Symbol, Regexp>>) —

Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.
options (Hash{Symbol => Object})

Yields:

(lexer)

Yield Parameters:

lexer (Lexer)

Returns:

(Lexer)

Raises:

(Lexer::Error) —

on invalid input

# File 'lib/ebnf/ll1/lexer.rb', line 77

def self.tokenize(input, terminals, **options, &block)
  lexer = self.new(input, terminals, **options)
  block_given? ? block.call(lexer) : lexer
end

.unescape_codepoints(string) ⇒ `String`

Returns a copy of the given `string` with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.

Parameters:

string (String)

Returns:

(String)
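
The replacement described above can be approximated with core Ruby alone. The following is a sketch under stated assumptions, not the library's implementation: the helper name `unescape_codepoints_sketch` and its regular expression are hypothetical.

```ruby
# Hypothetical stand-in, not the library method.
def unescape_codepoints_sketch(string)
  # \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits)
  string.gsub(/\\u(\h{4})|\\U(\h{8})/) do
    hex = Regexp.last_match(1) || Regexp.last_match(2)
    [hex.to_i(16)].pack('U') # encode the codepoint as a UTF-8 character
  end
end

# Single quotes keep the backslashes literal, as they would appear in raw input.
escaped = 'caf\u00E9 \U0001F600'
puts unescape_codepoints_sketch(escaped)
```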

.unescape_string(input) ⇒ `String`

Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.

Parameters:

input (String)

Returns:

(String)
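
As a rough illustration of string-escape replacement, the sketch below maps a handful of backslash sequences to their characters. The escape table is an assumption made for this example; the set the library actually recognises is defined by its `Unescape` constants and may differ. The name `unescape_string_sketch` is hypothetical.

```ruby
# Hypothetical stand-in, not the library method.
def unescape_string_sketch(input)
  # Assumed escape table for illustration only.
  table = {
    't' => "\t", 'n' => "\n", 'r' => "\r",
    '"' => '"', "'" => "'", '\\' => '\\'
  }
  input.gsub(/\\([tnr"'\\])/) { table[Regexp.last_match(1)] }
end

raw = 'line one\nline two\tindented'
puts unescape_string_sketch(raw)
```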

Instance Method Details

#each_token {|token| ... } ⇒ `Enumerator` Also known as: each

Enumerates each token in the input string.

Yields:

(token)

Yield Parameters:

token (Token)

Returns:

(Enumerator)

# File 'lib/ebnf/ll1/lexer.rb', line 146

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end

#first(*types) ⇒ `Token`

Returns the first token in the input stream.

Parameters:

types (Array[Symbol]) —

Optional set of types for restricting terminals examined

Returns:

(Token)

# File 'lib/ebnf/ll1/lexer.rb', line 161

def first(*types)
  return nil unless scanner

  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return nil if scanner.eos?

    token = match_token(*types)

    if token.nil?
      lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
        input: scanner.rest[0..100], token: lexme, lineno: lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new(e.message,
    input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end

#lineno ⇒ `Integer`

The current line number (one-based).

Returns:

(Integer)



# File 'lib/ebnf/ll1/lexer.rb', line 220

def lineno
  scanner.lineno
end

#recover(*types) ⇒ `Token`

Skips input until a token is matched.

Parameters:

types (Array[Symbol]) —

Optional set of types for restricting terminals examined

Returns:

(Token)

# File 'lib/ebnf/ll1/lexer.rb', line 203

def recover(*types)
  until scanner.eos? || tok = match_token(*types)
    if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
      # No whitespace at the end, must be at end of string
      scanner.terminate
    else
      skip_whitespace
    end
  end
  scanner.unscan if tok
  first
end
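
The skip-ahead strategy in `recover` can be tried out with Ruby's standard `StringScanner`. This is a simplified sketch, assuming a single pattern in place of the lexer's terminal list; `recover_sketch` is a hypothetical helper, not part of the library.

```ruby
require 'strscan'

# Hypothetical helper: advance past unrecognised chunks until `pattern`
# matches at the current position or the input runs out.
def recover_sketch(scanner, pattern, whitespace = /\s+/)
  until scanner.eos? || scanner.match?(pattern)
    # Skip past the current unrecognised "token" and its trailing whitespace.
    # If no whitespace remains, give up and move to the end of the input.
    scanner.terminate if scanner.skip_until(whitespace).nil?
  end
  scanner.eos? ? nil : scanner.scan(pattern)
end

scanner = StringScanner.new('@@@ ### 123 rest')
puts recover_sketch(scanner, /\d+/).inspect
puts scanner.rest.inspect
```

Resynchronising on whitespace is a coarse heuristic: it discards whole chunks, which keeps the loop simple at the cost of skipping any valid token that is glued to an invalid one.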

#shift ⇒ `Token`

Returns the first token and shifts to the next.

Returns:

(Token)

# File 'lib/ebnf/ll1/lexer.rb', line 192

def shift
  cur = first
  @first = nil
  cur
end
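
The relationship between `first` (peek) and `shift` (consume) is easier to see in a reduced form. The class below is a stand-alone sketch built on `StringScanner`, not the library class: `PeekLexerSketch` and its `[type, value]` token arrays are assumptions made for illustration.

```ruby
require 'strscan'

# Minimal stand-in lexer: `first` caches the lookahead token, and `shift`
# clears the cache so that the next call to `first` scans a new token.
class PeekLexerSketch
  def initialize(input, terminals)
    @scanner = StringScanner.new(input)
    @terminals = terminals # array of [type, regexp] pairs
  end

  def first
    @first ||= begin
      @scanner.skip(/\s+/)
      return nil if @scanner.eos?

      token = nil
      @terminals.each do |type, regexp|
        if (value = @scanner.scan(regexp))
          token = [type, value]
          break
        end
      end
      raise ArgumentError, "invalid token at #{@scanner.rest[0, 20].inspect}" if token.nil?
      token
    end
  end

  def shift
    cur = first
    @first = nil
    cur
  end
end

lexer = PeekLexerSketch.new('42 + 7', [[:INTEGER, /\d+/], [:PLUS, /\+/]])
p lexer.first # peeks without consuming
p lexer.first # same token again
p lexer.shift # consumes it
p lexer.shift # next token
```

Caching the lookahead means repeated calls to `first` do not advance the scanner, which is the behaviour a predictive parser relies on when it inspects the next token before choosing a production.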

#valid? ⇒ `Boolean`

Returns `true` if the input string is lexically valid.

To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.

Returns:

(Boolean)

# File 'lib/ebnf/ll1/lexer.rb', line 132

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end
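
Because the class includes `Enumerable`, `count` walks every token through `each`, so an invalid token anywhere in the input surfaces as an error during the count. The sketch below reproduces that pattern with a hypothetical `TokenStreamSketch` class that treats anything other than a run of digits as invalid; it illustrates the idea and is not the library's code.

```ruby
# Stand-in class showing how `valid?` can lean on Enumerable#count.
class TokenStreamSketch
  include Enumerable

  class Error < StandardError; end

  def initialize(words)
    @words = words
  end

  def each
    return enum_for(:each) unless block_given?

    @words.each do |word|
      # Assumed validity rule for this example: digits only.
      raise Error, "invalid token #{word.inspect}" unless word.match?(/\A\d+\z/)
      yield word
    end
  end

  def valid?
    !count.zero? # count iterates every token; an invalid one raises
  rescue Error
    false
  end
end

puts TokenStreamSketch.new(%w[1 22 333]).valid?
puts TokenStreamSketch.new([]).valid?
puts TokenStreamSketch.new(%w[1 x]).valid?
```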
