Class: Rlex::Lexer

Inherits:
Object
Defined in:
lib/rlex/lexer.rb

Overview

Implements a simple lexer using a StringScanner.

The lexer was written for use with Racc, a Ruby variant of Yacc. It has no code dependency on Racc, however, so it can also be used on its own or with other parser packages.

  • Ignored input takes precedence over rules and keywords: if a prefix of the remaining input is matched by an ignore pattern, it is skipped even if it is also a keyword or matched by a rule.

  • The lexer is greedy: if a prefix is matched by several rules or keywords, the lexer chooses the one that consumes the most input.

Examples:

Basic usage

# Define behavior
lexer = Lexer.new
lexer.ignore /\s+/                   # ignore whitespace
lexer.rule :word, /\w+/              # consider any text a 'word'
lexer.keyword :if                    # treat 'if' as a special keyword
lexer.keyword :lparen, "("           # any fixed input such as parentheses
lexer.keyword :rparen, ")"           #   may be defined as keywords

# Initialize with input
lexer.start "if ( foo ) bar"         # initialize the lexer with a string

# Iterate through tokens
lexer.next_token # => Token (type = :if,     value = 'if')
lexer.next_token # => Token (type = :lparen, value = '(')
lexer.next_token # => Token (type = :word,   value = 'foo')
lexer.next_token # => Token (type = :rparen, value = ')')
lexer.next_token # => Token (type = :word,   value = 'bar')
lexer.next_token # => EOF_TOKEN
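
Greedy matching sketch

The greedy behavior described in the overview can be illustrated without the gem. The following is a minimal sketch built directly on StringScanner; RULES, KEYWORDS, longest_match and first_token are hypothetical names chosen for this illustration and are not part of Rlex. It assumes that ties between a rule and a keyword are settled by looking up the matched text in a keyword table, which is what the next_token source suggests.

```ruby
require 'strscan'

# Hypothetical rule table: [name, pattern] pairs, as rule/keyword might register them.
RULES = [[:if, /if/], [:word, /\w+/]].freeze
# Hypothetical keyword table keyed by the exact matched text.
KEYWORDS = { 'if' => :if }.freeze

# Returns the [name, pattern] pair with the longest match at the scan
# position, or nil when no pattern matches there.
def longest_match(scanner, rules)
  candidates = rules.select { |_name, pattern| scanner.match?(pattern) }
  candidates.max_by { |_name, pattern| scanner.match?(pattern) }
end

# Returns [type, text] for the first token of the input.
def first_token(input)
  scanner = StringScanner.new(input)
  name, pattern = longest_match(scanner, RULES)
  raise "unexpected input <#{scanner.peek(5)}>" if pattern.nil?

  text = scanner.scan(pattern)
  # An exact keyword hit overrides the rule name; otherwise the rule wins.
  [KEYWORDS.fetch(text, name), text]
end

first_token('iffy') # => [:word, "iffy"]
first_token('if (') # => [:if, "if"]
```

In the first call the :word pattern consumes four characters against two for :if, so the longer match is chosen. In the second call both patterns consume "if", and the keyword table decides the token type.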


Constructor Details

#initialize ⇒ Lexer

Initializes an empty Lexer.



# File 'lib/rlex/lexer.rb', line 43

def initialize
  @ignored = []
  @rules = []
  @keywords = {}
end

Instance Method Details

#ignore(pattern) ⇒ Regexp

Note:

Ignored input takes precedence over rules and keywords: if a prefix of the remaining input is matched by an ignore pattern, it is skipped even if it is also a keyword or matched by a rule.

Instructs the lexer to ignore input matched by the specified pattern. If appropriate, call this multiple times to ignore several patterns.

Parameters:

  • pattern (Regexp)

    Pattern of input to ignore

Returns:

  • (Regexp)

    The specified pattern



# File 'lib/rlex/lexer.rb', line 61

def ignore(pattern)
  @ignored << pattern
  return pattern
end
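
The source of the helper that consumes ignored input is not shown on this page. As a rough sketch of how several ignore patterns could be applied in turn with StringScanner#skip, consider the following; skip_ignored and remaining_after_ignoring are hypothetical names for this illustration only, and the gem's own helper may work differently.

```ruby
require 'strscan'

# Hypothetical helper: tries each ignore pattern at the scan position and
# reports whether any of them consumed at least one character.
def skip_ignored(scanner, ignored)
  ignored.any? { |pattern| scanner.skip(pattern).to_i > 0 }
end

# Skips ignored prefixes repeatedly and returns what is left of the input.
def remaining_after_ignoring(input, ignored)
  scanner = StringScanner.new(input)
  loop { break unless skip_ignored(scanner, ignored) }
  scanner.rest
end

# Whitespace and '#' comments are both registered as ignorable.
remaining_after_ignoring("  # note\n  if x", [/\s+/, /#[^\n]*/]) # => "if x"
```

The `to_i > 0` guard treats a zero-length match like no match at all, so a pattern that can match the empty string does not cause an endless loop.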

#keyword(name = nil, kword) ⇒ Symbol

Note:

When the matched input is static, prefer a keyword over a rule for efficiency.

Defines a static sequence of input as a keyword.

Parameters:

  • name (optional, Symbol, #to_sym) (defaults to: nil)

    Unique name of the keyword. If omitted, the keyword text itself is used as the name

  • kword (String, #to_s)

    Sequence of input to match as a keyword

Returns:

  • (Symbol)

    The name of the keyword

Raises:

  • (ArgumentError)

    If the specified name is already used by another rule or keyword. The source listed for this method does not yet perform this check (see its @todo comment)



# File 'lib/rlex/lexer.rb', line 101

def keyword(name = nil, kword)
  # @todo Validate the keyword name
  kword_str = kword.to_s
  name = kword_str.to_sym if name.nil?   # kword is only required to respond to #to_s
  pattern = Regexp.new(Regexp.escape kword_str)
  rule name, pattern
  @keywords[kword_str] = Token.new name.to_sym, kword_str
  return name.to_sym
end
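
Two details of this method are easy to miss: Ruby binds a single argument to the required trailing parameter, and Regexp.escape makes punctuation such as "(" safe to use as a pattern. The sketch below isolates both; keyword_parts is a hypothetical stand-in that shares the signature of #keyword but registers nothing.

```ruby
# Hypothetical stand-in with the same parameter list as Lexer#keyword.
# It only returns the derived name and pattern.
def keyword_parts(name = nil, kword)
  kword_str = kword.to_s
  name = kword_str if name.nil?      # the keyword names itself when no name is given
  [name.to_sym, Regexp.new(Regexp.escape(kword_str))]
end

keyword_parts(:if)          # => [:if, /if/]
keyword_parts(:lparen, '(') # => [:lparen, /\(/]

# Without escaping, a lone parenthesis is not a valid pattern.
begin
  Regexp.new('(')
rescue RegexpError
  # raised because the group is never closed
end
```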

#next_token ⇒ Token

Returns the next token matched from the remaining input. If no input is left, or #start has not been called yet, EOF_TOKEN is returned.

Returns:

  • (Token)

    Next token or EOF_TOKEN

Raises:

  • (RuntimeError)

    If the remaining input is matched by no rule, keyword, or ignore pattern



# File 'lib/rlex/lexer.rb', line 135

def next_token
  return EOF_TOKEN if @scanner.nil? or @scanner.empty?
  return next_token if ignore_prefix?
  rule = greediest_rule
  if rule
    prefix = fetch_prefix_and_update_pos(rule.pattern)
    keyword = @keywords[prefix]
    type = keyword ? keyword.type : rule.name
    token = keyword ? keyword.value : prefix
    return Token.new(type, token, @line, @col - token.size)
  end
  raise "unexpected input <#{@scanner.peek(5)}>"
end
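
The token is created with @line and @col - token.size, which suggests that @col points just past the consumed text and that the subtraction recovers the token's starting column. The helper that maintains these counters is not shown here, so the following is only a sketch of one way to do it; advance is a hypothetical name, and columns are assumed to be zero-based because #start sets @col to 0.

```ruby
# Hypothetical position tracker: given the current line and column and the
# text just consumed, returns the line and column after that text.
def advance(line, col, text)
  newlines = text.count("\n")
  return [line, col + text.size] if newlines.zero?

  # After a newline, the column is the number of characters that follow the last one.
  [line + newlines, text.size - text.rindex("\n") - 1]
end

line, col = advance(1, 0, 'if')    # => [1, 2]
start_col = col - 'if'.size        # => 0, the column where 'if' began
advance(line, col, " \n  ")        # => [2, 2]
```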

#rule(name, pattern) ⇒ Symbol

Note:

When the matched input is static, prefer a keyword over a rule for efficiency.

Defines a rule to match the specified pattern.

Parameters:

  • name (Symbol, #to_sym)

    Unique name of rule

  • pattern (Regexp)

    Pattern of input to match

Returns:

  • (Symbol)

    The name of the rule

Raises:

  • (ArgumentError)

    If the specified name is already used by another rule or keyword. The source listed for this method does not yet perform this check (see its @todo comment)



# File 'lib/rlex/lexer.rb', line 79

def rule(name, pattern)
  # @todo Validate the rule name
  @rules << (Rule.new name.to_sym, pattern)
  return name.to_sym
end
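
The documentation promises an ArgumentError for duplicate names, while the listing above only carries a @todo for name validation. One possible shape for that check is sketched below. RuleRegistry and its Rule struct are hypothetical and written for this example; they are not the gem's classes.

```ruby
# Hypothetical registry showing how duplicate rule names could be rejected.
class RuleRegistry
  Rule = Struct.new(:name, :pattern)

  def initialize
    @rules = []
  end

  # Registers a rule and returns its name as a Symbol.
  # Raises ArgumentError when the name is already taken.
  def rule(name, pattern)
    sym = name.to_sym
    if @rules.any? { |existing| existing.name == sym }
      raise ArgumentError, "name already in use: #{sym.inspect}"
    end

    @rules << Rule.new(sym, pattern)
    sym
  end
end

registry = RuleRegistry.new
registry.rule('word', /\w+/)   # => :word
# registry.rule(:word, /\d+/)  # would raise ArgumentError
```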

#start(input) ⇒ String

Note:

This resets the lexer with a new StringScanner and resets the line and column counters, so any state related to previous input is lost. Registered rules, keywords, and ignore patterns are kept

Initializes the lexer with new input.

Parameters:

  • input (String)

    Input to scan for tokens

Returns:

  • (String)

    The specified input



# File 'lib/rlex/lexer.rb', line 120

def start(input)
  @line = 1
  @col = 0
  @scanner = StringScanner.new input
  return input
end
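
Because #start builds a fresh StringScanner, scanning always resumes at position 0 of the new input. The short sketch below shows that effect using StringScanner alone; positions_before_and_after_restart is a hypothetical helper written for this illustration.

```ruby
require 'strscan'

# Scans part of one input, then replaces the scanner the way #start does,
# and returns the scan position before and after the replacement.
def positions_before_and_after_restart
  scanner = StringScanner.new('first input')
  scanner.scan(/\w+/)                    # consumes "first"
  before = scanner.pos

  scanner = StringScanner.new('second')  # a new scanner knows nothing of the old input
  [before, scanner.pos]
end

positions_before_and_after_restart # => [5, 0]
```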