Class: RDF::LL1::Lexer

Inherits: Object
Includes: Enumerable
Defined in: lib/ebnf/ll1/lexer.rb

Overview

A lexical analyzer

Examples:

Tokenizing a Turtle string

terminals = [
  [:BLANK_NODE_LABEL, %r(_:(#{PN_LOCAL}))],
  ...
]
ttl = "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
lexer = RDF::LL1::Lexer.tokenize(ttl, terminals)
lexer.each_token do |token|
  puts token.inspect
end

Tokenizing and returning a token stream

lexer = RDF::LL1::Lexer.tokenize(...)
while some_condition
  token = lexer.first # Get the current token
  token = lexer.shift # Get the current token and shift to the next
end

Handling error conditions

begin
  RDF::Turtle::Lexer.tokenize(query)
rescue RDF::Turtle::Lexer::Error => error
  warn error.inspect
end

Defined Under Namespace

Modules: Encoding
Classes: Error, Token

Constant Summary

ESCAPE_CHARS =
{
  '\\t'   => "\t",  # \u0009 (tab)
  '\\n'   => "\n",  # \u000A (line feed)
  '\\r'   => "\r",  # \u000D (carriage return)
  '\\b'   => "\b",  # \u0008 (backspace)
  '\\f'   => "\f",  # \u000C (form feed)
  '\\"'  => '"',    # \u0022 (quotation mark, double quote mark)
  "\\'"  => '\'',   # \u0027 (apostrophe-quote, single quote mark)
  '\\\\' => '\\'    # \u005C (backslash)
}
ESCAPE_CHAR4 =

\uXXXX

/\\u(?:[0-9A-Fa-f]{4,4})/
ESCAPE_CHAR8 =

\UXXXXXXXX

/\\U(?:[0-9A-Fa-f]{8,8})/
ECHAR =

More liberal unescaping

/\\./
UCHAR =
/#{ESCAPE_CHAR4}|#{ESCAPE_CHAR8}/
COMMENT =
/#.*/
WS =
/ |\t|\r|\n/m
ML_START =

Beginning of terminals that may span lines

/\'\'\'|\"\"\"/

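As a standalone illustration of how these patterns compose (constants copied from the definitions above), the 4-digit and 8-digit escape forms both fall under UCHAR:

```ruby
ESCAPE_CHAR4 = /\\u(?:[0-9A-Fa-f]{4,4})/   # \uXXXX
ESCAPE_CHAR8 = /\\U(?:[0-9A-Fa-f]{8,8})/   # \UXXXXXXXX
UCHAR        = /#{ESCAPE_CHAR4}|#{ESCAPE_CHAR8}/
ML_START     = /\'\'\'|\"\"\"/             # long-string openers

'\u0041'.match?(UCHAR)       # 4-digit form matches
'\U0001F600'.match?(UCHAR)   # 8-digit form matches
'\u41'.match?(UCHAR)         # false: too few hex digits
"'''long'''".match?(ML_START)
```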
Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(input = nil, terminals = nil, options = {}) ⇒ Lexer

Initializes a new lexer instance.

Parameters:

  • input (String, #to_s) (defaults to: nil)
  • terminals (Array<Array<Symbol, Regexp>>) (defaults to: nil)

    Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.

  • options (Hash{Symbol => Object}) (defaults to: {})

Options Hash (options):

  • :whitespace (Regexp) — default: WS
  • :comment (Regexp) — default: COMMENT
  • :unescape_terms (Array<Symbol>) — default: []

    Terminal symbols whose matched values should have escape sequences unescaped

Raises:

  • (Error) if terminal patterns are not defined


# File 'lib/ebnf/ll1/lexer.rb', line 125

def initialize(input = nil, terminals = nil, options = {})
  @options        = options.dup
  @whitespace     = @options[:whitespace]     || WS
  @comment        = @options[:comment]        || COMMENT
  @unescape_terms = @options[:unescape_terms] || []
  @terminals      = terminals

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @lineno = 1
  @scanner = Scanner.new(input) do |string|
    string.force_encoding(Encoding::UTF_8) if string.respond_to?(:force_encoding)      # Ruby 1.9+
    string
  end
end
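The constructor's defaulting logic (dup the options hash, then fall back to the module constants with `||`) can be exercised in isolation; this is a sketch of the pattern, not the gem's API:

```ruby
WS      = / |\t|\r|\n/m
COMMENT = /#.*/

# Mirror of the option defaulting performed in #initialize
def lexer_options(options = {})
  opts = options.dup
  {
    whitespace:     opts[:whitespace]     || WS,
    comment:        opts[:comment]        || COMMENT,
    unescape_terms: opts[:unescape_terms] || []
  }
end

lexer_options(comment: /;.*/)   # overrides only the comment pattern
```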

Instance Attribute Details

#comment ⇒ Regexp

Defines the single-line comment pattern; defaults to COMMENT.

Returns:

  • (Regexp)

    defines single-line comment, defaults to COMMENT



# File 'lib/ebnf/ll1/lexer.rb', line 63

def comment
  @comment
end

#input ⇒ String

The current input string being processed.

Returns:

  • (String)


# File 'lib/ebnf/ll1/lexer.rb', line 151

def input
  @input
end

#lineno ⇒ Integer (readonly)

The current line number (one-based).

Returns:

  • (Integer)


# File 'lib/ebnf/ll1/lexer.rb', line 157

def lineno
  @lineno
end

#options ⇒ Hash (readonly)

Any additional options for the lexer.

Returns:

  • (Hash)


# File 'lib/ebnf/ll1/lexer.rb', line 145

def options
  @options
end

#whitespace ⇒ Regexp

Defines the whitespace pattern; defaults to WS.

Returns:

  • (Regexp)

    defines whitespace, defaults to WS



# File 'lib/ebnf/ll1/lexer.rb', line 58

def whitespace
  @whitespace
end

Class Method Details

.tokenize(input, terminals, options = {}) {|lexer| ... } ⇒ Lexer

Tokenizes the given `input` string or stream.

Parameters:

  • input (String, #to_s)
  • terminals (Array<Array<Symbol, Regexp>>)

    Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.

  • options (Hash{Symbol => Object}) (defaults to: {})

Yields:

  • (lexer)

Yield Parameters:

  • lexer (Lexer)

Returns:

  • (Lexer)

Raises:

  • (Error) on invalid input


# File 'lib/ebnf/ll1/lexer.rb', line 108

def self.tokenize(input, terminals, options = {}, &block)
  lexer = self.new(input, terminals, options)
  block_given? ? block.call(lexer) : lexer
end
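Since the real lexer needs a grammar's terminal table, here is a gem-free sketch of the same first-match-wins loop using Ruby's standard-library StringScanner; the terminals below are invented for illustration, with a nil symbol denoting a string terminal as described above:

```ruby
require 'strscan'

# Hypothetical [symbol, regexp] terminal table; first match wins.
TERMINALS = [
  [:IRIREF, /<[^>]*>/],
  [:PNAME,  /\w*:\w*/],
  [nil,     /\./]
]

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)                    # skip whitespace
    sym, re = TERMINALS.find { |_, r| scanner.match?(r) }
    raise "Invalid token at #{scanner.rest[0..20].inspect}" unless re
    tokens << [sym, scanner.scan(re)]
  end
  tokens
end

tokenize('<http://example/> a:b .')
# [[:IRIREF, "<http://example/>"], [:PNAME, "a:b"], [nil, "."]]
```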

.unescape_codepoints(string) ⇒ String

Returns a copy of the given `input` string with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.

Parameters:

  • string (String)

Returns:

  • (String)




# File 'lib/ebnf/ll1/lexer.rb', line 73

def self.unescape_codepoints(string)
  # Decode \uXXXX and \UXXXXXXXX code points:
  string = string.gsub(UCHAR) do |c|
    s = [(c[2..-1]).hex].pack('U*')
    s.respond_to?(:force_encoding) ? s.force_encoding(Encoding::ASCII_8BIT) : s
  end

  string.force_encoding(Encoding::UTF_8) if string.respond_to?(:force_encoding)      # Ruby 1.9+
  string
end
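The same transformation, minus the legacy encoding shims, works as a standalone sketch (UCHAR copied from the constants above):

```ruby
UCHAR = /\\u(?:[0-9A-Fa-f]{4,4})|\\U(?:[0-9A-Fa-f]{8,8})/

def unescape_codepoints(string)
  # Strip the \u or \U prefix, then pack the hex codepoint as UTF-8
  string.gsub(UCHAR) { |c| [c[2..-1].hex].pack('U*') }
end

unescape_codepoints('\u0048\u0069')   # => "Hi"
```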

.unescape_string(input) ⇒ String

Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.

Parameters:

  • input (String)

Returns:

  • (String)




# File 'lib/ebnf/ll1/lexer.rb', line 92

def self.unescape_string(input)
  input.gsub(ECHAR) { |escaped| ESCAPE_CHARS[escaped] || escaped[1..-1]}
end
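Equivalently, as a self-contained sketch (table and regexp copied from the constants above): known escapes map through ESCAPE_CHARS, and any other backslashed character simply loses its backslash.

```ruby
ESCAPE_CHARS = {
  '\\t' => "\t", '\\n' => "\n", '\\r' => "\r",
  '\\b' => "\b", '\\f' => "\f",
  '\\"' => '"',  "\\'" => "'",  '\\\\' => '\\'
}
ECHAR = /\\./

def unescape_string(input)
  input.gsub(ECHAR) { |escaped| ESCAPE_CHARS[escaped] || escaped[1..-1] }
end

unescape_string('line1\nline2')   # contains a real newline
unescape_string('\x')             # => "x" (unknown escape drops the backslash)
```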

Instance Method Details

#each_token {|token| ... } ⇒ Enumerator
Also known as: each

Enumerates each token in the input string.

Yields:

  • (token)

Yield Parameters:

  • token (Token)

Returns:

  • (Enumerator)


# File 'lib/ebnf/ll1/lexer.rb', line 180

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end
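The enumerator plumbing can be sketched with the usual `enum_for` idiom, here over a fixed token list rather than a scanner (ToyLexer is an invented stand-in):

```ruby
class ToyLexer
  include Enumerable

  def initialize(tokens)
    @tokens = tokens
  end

  # Yield each token, or return an Enumerator when no block is given
  def each_token(&block)
    return enum_for(:each_token) unless block_given?
    @tokens.each { |t| yield t }
  end
  alias_method :each, :each_token
end

lexer = ToyLexer.new(%i[A B C])
lexer.each_token.to_a    # [:A, :B, :C]
lexer.count              # 3, via Enumerable
```

Aliasing `each` to `each_token` is what makes Enumerable methods such as `count` (used by `#valid?` below) work on the lexer.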

#first ⇒ Token

Returns the first token in the input stream without consuming it.

Returns:

  • (Token)


# File 'lib/ebnf/ll1/lexer.rb', line 194

def first
  return nil unless scanner

  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return @scanner = nil if scanner.eos?

    token = match_token

    if token.nil?
      lexme = (scanner.rest.split(/#{@whitespace}|#{@comment}/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
        :input => scanner.rest[0..100], :token => lexme, :lineno => lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new("#{e.message} on line #{lineno + 1}",
    :input => (scanner.rest[0..100] rescue '??'), :token => lexme, :lineno => lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end

#recover ⇒ Token

Skips input until a token is matched.

Returns:

  • (Token)



# File 'lib/ebnf/ll1/lexer.rb', line 235

def recover
  until scanner.eos? do
    begin
      shift
      return first
    rescue Error, ArgumentError
      # Ignore errors until something scans, or EOS.
      scanner.pos = scanner.pos + 1
    end
  end
end
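The recovery strategy — bump the scanner position one character at a time until something matches or input ends — can be reproduced with a bare StringScanner (the terminal regexp is a made-up example):

```ruby
require 'strscan'

def recover(scanner, terminal)
  # Advance one character at a time until the terminal matches here, or EOS
  scanner.pos += 1 until scanner.eos? || scanner.match?(terminal)
  scanner.rest
end

s = StringScanner.new('%%garbage%% <http://example/>')
recover(s, /<[^>]*>/)    # => "<http://example/>"
```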

#shift ⇒ Token

Returns the first token and shifts to the next.

Returns:

  • (Token)



# File 'lib/ebnf/ll1/lexer.rb', line 225

def shift
  cur = first
  @first = nil
  cur
end

#valid? ⇒ Boolean

Returns `true` if the input string is lexically valid.

To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.

Returns:

  • (Boolean)


# File 'lib/ebnf/ll1/lexer.rb', line 166

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end