Class: EBNF::LL1::Lexer
- Inherits: Object
- Includes: Unescape, Enumerable
- Defined in: lib/ebnf/ll1/lexer.rb
Overview
A lexical analyzer.
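A minimal usage sketch (the terminal names and patterns here are illustrative, not part of the class; tokens are assumed to expose #type and #value via the Token class):

  require 'ebnf/ll1/lexer'

  # Illustrative terminal definitions: each entry is [type, regexp];
  # an optional third element supplies per-terminal options.
  terminals = [
    [:NUMBER, /\d+/],
    [:WORD,   /[a-z]+/]
  ]

  EBNF::LL1::Lexer.tokenize("hello 42", terminals, whitespace: /\s+/) do |lexer|
    lexer.each_token do |token|
      puts "#{token.type}: #{token.value}"
    end
  end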
Defined Under Namespace
Classes: Error, Terminal, Token
Constant Summary
Constants included from Unescape
Unescape::ECHAR, Unescape::ESCAPE_CHAR4, Unescape::ESCAPE_CHAR8, Unescape::ESCAPE_CHARS, Unescape::UCHAR
Instance Attribute Summary collapse
- #input ⇒ String
  The current input string being processed.
- #options ⇒ Hash (readonly)
  Any additional options for the lexer.
- #whitespace ⇒ Regexp (readonly)
  Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
Class Method Summary collapse
- .tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer
  Tokenizes the given `input` string or stream.
Instance Method Summary collapse
- #each_token {|token| ... } ⇒ Enumerator (also: #each)
  Enumerates each token in the input string.
- #first(*types) ⇒ Token
  Returns the first token in the input stream.
- #initialize(input = nil, terminals = nil, **options) ⇒ Lexer (constructor)
  Initializes a new lexer instance.
- #lineno ⇒ Integer
  The current line number (one-based).
- #recover(*types) ⇒ Token
  Skips input until a token is matched.
- #shift ⇒ Token
  Returns the first token and shifts to the next.
- #valid? ⇒ Boolean
  Returns `true` if the input string is lexically valid.
Methods included from Unescape
unescape, unescape_codepoints, unescape_string
Constructor Details
#initialize(input = nil, terminals = nil, **options) ⇒ Lexer
Initializes a new lexer instance.
# File 'lib/ebnf/ll1/lexer.rb', line 70

def initialize(input = nil, terminals = nil, **options)
  @options = options.dup
  @whitespace = @options[:whitespace]
  @terminals = terminals.map do |term|
    if term.is_a?(Array) && term.length == 3
      # Last element is options
      Terminal.new(term[0], term[1], **term[2])
    elsif term.is_a?(Array)
      Terminal.new(*term)
    else
      term
    end
  end

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @scanner = Scanner.new(input, **options)
end
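As a sketch of the accepted terminal forms (names and patterns illustrative), terminals may be given either as Terminal instances or as arrays that the constructor converts:

  terminals = [
    EBNF::LL1::Lexer::Terminal.new(:NUMBER, /\d+/),   # explicit Terminal instance
    [:WORD, /[a-z]+/]                                 # [type, regexp] array form
  ]
  lexer = EBNF::LL1::Lexer.new("abc 123", terminals, whitespace: /\s+/)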
Instance Attribute Details
#input ⇒ String
The current input string being processed.
# File 'lib/ebnf/ll1/lexer.rb', line 99

def input
  @input
end
#options ⇒ Hash (readonly)
Any additional options for the lexer.
# File 'lib/ebnf/ll1/lexer.rb', line 93

def options
  @options
end
#whitespace ⇒ Regexp (readonly)
Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
# File 'lib/ebnf/ll1/lexer.rb', line 39

def whitespace
  @whitespace
end
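For example, a whitespace pattern that also swallows '#'-style comments can be supplied at construction time (the pattern and terminals below are illustrative):

  ws = %r{(?:\s|#[^\n]*)+}m
  lexer = EBNF::LL1::Lexer.new("a # comment\nb", [[:WORD, /[a-z]+/]], whitespace: ws)
  lexer.whitespace  #=> the regexp passed above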
Class Method Details
.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer
Tokenizes the given `input` string or stream.
# File 'lib/ebnf/ll1/lexer.rb', line 53

def self.tokenize(input, terminals, **options, &block)
  lexer = self.new(input, terminals, **options)
  block_given? ? block.call(lexer) : lexer
end
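Both calling styles, sketched with an illustrative terminal set:

  # Without a block, the new lexer itself is returned:
  lexer = EBNF::LL1::Lexer.tokenize("a b", [[:WORD, /[a-z]+/]], whitespace: /\s+/)

  # With a block, the new lexer is yielded instead:
  EBNF::LL1::Lexer.tokenize("a b", [[:WORD, /[a-z]+/]], whitespace: /\s+/) do |lex|
    lex.each_token { |token| p token }
  end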
Instance Method Details
#each_token {|token| ... } ⇒ Enumerator Also known as: each
Enumerates each token in the input string.
# File 'lib/ebnf/ll1/lexer.rb', line 122

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end
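A usage sketch (terminals illustrative; tokens are assumed to expose #type and #value):

  lexer = EBNF::LL1::Lexer.new("a b c", [[:WORD, /[a-z]+/]], whitespace: /\s+/)
  lexer.each_token { |token| puts "#{token.type}: #{token.value}" }

  # Without a block, an Enumerator over the tokens is returned:
  # lexer.each_token.to_a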
#first(*types) ⇒ Token
Returns the first token in the input stream.
# File 'lib/ebnf/ll1/lexer.rb', line 137

def first(*types)
  return nil unless scanner

  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return nil if scanner.eos?

    token = match_token(*types)

    if token.nil?
      lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
                      input: scanner.rest[0..100], token: lexme, lineno: lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new(e.message,
                  input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end
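Unlike #shift, #first does not consume the token, so repeated calls return the same token (a sketch with illustrative terminals, assuming Token#value):

  lexer = EBNF::LL1::Lexer.new("a b", [[:WORD, /[a-z]+/]], whitespace: /\s+/)
  lexer.first.value  #=> "a"
  lexer.first.value  #=> "a"  (still not consumed)
  lexer.shift.value  #=> "a"  (consumed)
  lexer.first.value  #=> "b"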
#lineno ⇒ Integer
The current line number (one-based).
# File 'lib/ebnf/ll1/lexer.rb', line 196

def lineno
  scanner.lineno
end
#recover(*types) ⇒ Token
Skips input until a token is matched.
# File 'lib/ebnf/ll1/lexer.rb', line 179

def recover(*types)
  until scanner.eos? || tok = match_token(*types)
    if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
      # No whitespace at the end, must be at end of string
      scanner.terminate
    else
      skip_whitespace
    end
  end
  scanner.unscan if tok
  first
end
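A sketch of one possible recovery loop (terminals illustrative): on a lexical error, #recover skips the offending input and leaves the lexer positioned at the next recognizable token.

  lexer  = EBNF::LL1::Lexer.new("a ? b", [[:WORD, /[a-z]+/]], whitespace: /\s+/)
  tokens = []
  begin
    while token = lexer.shift
      tokens << token.value
    end
  rescue EBNF::LL1::Lexer::Error
    lexer.recover   # skip past the unrecognized "?" and continue
    retry
  end
  tokens  #=> ["a", "b"]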
#shift ⇒ Token
Returns the first token and shifts to the next.
# File 'lib/ebnf/ll1/lexer.rb', line 168

def shift
  cur = first
  @first = nil
  cur
end
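A typical peek-then-consume loop, pairing #first with #shift (a sketch; terminals and token attributes as assumed above):

  terminals = [[:NUMBER, /\d+/], [:WORD, /[a-z]+/]]
  lexer = EBNF::LL1::Lexer.new("1 x", terminals, whitespace: /\s+/)
  while (token = lexer.first)
    case token.type
    when :NUMBER then puts "number #{lexer.shift.value}"
    when :WORD   then puts "word #{lexer.shift.value}"
    end
  end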
#valid? ⇒ Boolean
Returns `true` if the input string is lexically valid.
To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.
# File 'lib/ebnf/ll1/lexer.rb', line 108

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end
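A sketch (terminals illustrative); note that counting tokens reads through the input stream:

  terminals = [[:WORD, /[a-z]+/]]
  EBNF::LL1::Lexer.new("abc def", terminals, whitespace: /\s+/).valid?  #=> true
  EBNF::LL1::Lexer.new("abc 123", terminals, whitespace: /\s+/).valid?  #=> false (invalid terminal)
  EBNF::LL1::Lexer.new("",        terminals, whitespace: /\s+/).valid?  #=> false (zero terminals)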