Class: EBNF::LL1::Lexer
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/ebnf/ll1/lexer.rb
Overview
A lexical analyzer
Defined Under Namespace
Classes: Error, Terminal, Token
Constant Summary collapse
- ESCAPE_CHARS =
{
  '\\t'  => "\t",  # \u0009 (tab)
  '\\n'  => "\n",  # \u000A (line feed)
  '\\r'  => "\r",  # \u000D (carriage return)
  '\\b'  => "\b",  # \u0008 (backspace)
  '\\f'  => "\f",  # \u000C (form feed)
  '\\"'  => '"',   # \u0022 (quotation mark, double quote mark)
  "\\'"  => '\'',  # \u0027 (apostrophe-quote, single quote mark)
  '\\\\' => '\\'   # \u005C (backslash)
}.freeze
- ESCAPE_CHAR4 =
\uXXXX
/\\u(?:[0-9A-Fa-f]{4,4})/.freeze
- ESCAPE_CHAR8 =
\UXXXXXXXX
/\\U(?:[0-9A-Fa-f]{8,8})/.freeze
- ECHAR =
More liberal unescaping
/\\./
- UCHAR =
/#{ESCAPE_CHAR4}|#{ESCAPE_CHAR8}/.freeze
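Because these are ordinary Regexp constants, their matching behavior can be checked in isolation. The sketch below reproduces the escape regexes outside the class and probes them (the constant definitions are copied from above; the probe strings are illustrative):

```ruby
# Regexes reproduced from the constants above.
ESCAPE_CHAR4 = /\\u(?:[0-9A-Fa-f]{4,4})/.freeze  # \uXXXX
ESCAPE_CHAR8 = /\\U(?:[0-9A-Fa-f]{8,8})/.freeze  # \UXXXXXXXX
UCHAR        = /#{ESCAPE_CHAR4}|#{ESCAPE_CHAR8}/.freeze

# Single-quoted strings keep the backslash literal, so these are the
# escape sequences exactly as they appear in source text:
puts '\u0041'.match?(UCHAR)      # => true  (backslash-u plus four hex digits)
puts '\U0001F600'.match?(UCHAR)  # => true  (backslash-U plus eight hex digits)
puts '\u41'.match?(UCHAR)        # => false (too few hex digits)
```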
Instance Attribute Summary collapse
-
#input ⇒ String
The current input string being processed.
-
#lineno ⇒ Integer
readonly
The current line number (one-based).
-
#options ⇒ Hash
readonly
Any additional options for the lexer.
-
#whitespace ⇒ Regexp
readonly
Defines whitespace, including comments, otherwise whitespace must be explicit in terminals.
Class Method Summary collapse
-
.tokenize(input, terminals, options = {}) {|lexer| ... } ⇒ Lexer
Tokenizes the given `input` string or stream.
-
.unescape_codepoints(string) ⇒ String
Returns a copy of the given `input` string with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
-
.unescape_string(input) ⇒ String
Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.
Instance Method Summary collapse
-
#each_token {|token| ... } ⇒ Enumerator
(also: #each)
Enumerates each token in the input string.
-
#first(*types) ⇒ Token
Returns first token in input stream.
-
#initialize(input = nil, terminals = nil, options = {}) ⇒ Lexer
constructor
Initializes a new lexer instance.
-
#recover(*types) ⇒ Token
Skip input until a token is matched.
-
#shift ⇒ Token
Returns first token and shifts to next.
-
#valid? ⇒ Boolean
Returns `true` if the input string is lexically valid.
Constructor Details
#initialize(input = nil, terminals = nil, options = {}) ⇒ Lexer
Initializes a new lexer instance.
# File 'lib/ebnf/ll1/lexer.rb', line 118

def initialize(input = nil, terminals = nil, options = {})
  @options = options.dup
  @whitespace = @options[:whitespace]
  @terminals = terminals.map do |term|
    term.is_a?(Array) ? Terminal.new(*term) : term
  end

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @lineno = 1
  @scanner = Scanner.new(input, options)
end
Instance Attribute Details
#input ⇒ String
The current input string being processed.
# File 'lib/ebnf/ll1/lexer.rb', line 141

def input
  @input
end
#lineno ⇒ Integer (readonly)
The current line number (one-based).
# File 'lib/ebnf/ll1/lexer.rb', line 147

def lineno
  @lineno
end
#options ⇒ Hash (readonly)
Any additional options for the lexer.
# File 'lib/ebnf/ll1/lexer.rb', line 135

def options
  @options
end
#whitespace ⇒ Regexp (readonly)
Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
# File 'lib/ebnf/ll1/lexer.rb', line 53

def whitespace
  @whitespace
end
Class Method Details
.tokenize(input, terminals, options = {}) {|lexer| ... } ⇒ Lexer
Tokenizes the given `input` string or stream.
# File 'lib/ebnf/ll1/lexer.rb', line 101

def self.tokenize(input, terminals, options = {}, &block)
  lexer = self.new(input, terminals, options)
  block_given? ? block.call(lexer) : lexer
end
.unescape_codepoints(string) ⇒ String
Returns a copy of the given `input` string with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
# File 'lib/ebnf/ll1/lexer.rb', line 63

def self.unescape_codepoints(string)
  string = string.dup
  string.force_encoding(Encoding::ASCII_8BIT) if string.respond_to?(:force_encoding)

  # Decode \uXXXX and \UXXXXXXXX code points:
  string = string.gsub(UCHAR) do |c|
    s = [(c[2..-1]).hex].pack('U*')
    s.respond_to?(:force_encoding) ? s.force_encoding(Encoding::ASCII_8BIT) : s
  end

  string.force_encoding(Encoding::UTF_8) if string.respond_to?(:force_encoding)
  string
end
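The heart of the method is the `gsub` over `UCHAR`: drop the leading `\u` or `\U`, parse the remaining hex digits, and pack the codepoint as UTF-8. A minimal standalone sketch of that core, omitting the `force_encoding` juggling the real method performs for older Rubies:

```ruby
# Standalone sketch of the unescape_codepoints core.
UCHAR = /\\u(?:[0-9A-Fa-f]{4,4})|\\U(?:[0-9A-Fa-f]{8,8})/.freeze

def unescape_codepoints(string)
  # c[2..-1] drops the leading "\u" or "\U"; pack('U') emits UTF-8.
  string.gsub(UCHAR) { |c| [c[2..-1].hex].pack('U') }
end

puts unescape_codepoints('\u0041\u0042')  # => "AB"
puts unescape_codepoints('\U0001F600')    # => U+1F600 (grinning face emoji)
```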
.unescape_string(input) ⇒ String
Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.
# File 'lib/ebnf/ll1/lexer.rb', line 85

def self.unescape_string(input)
  input.gsub(ECHAR) { |escaped| ESCAPE_CHARS[escaped] || escaped[1..-1] }
end
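A standalone sketch of the same logic, using an abridged escape table (the real `ESCAPE_CHARS` constant covers more sequences): known escapes map through the table, and unknown escapes simply lose their backslash.

```ruby
# Abridged version of the ESCAPE_CHARS table defined above.
ESCAPE_CHARS = { '\\t' => "\t", '\\n' => "\n", '\\r' => "\r" }.freeze
ECHAR = /\\./  # any backslash followed by one character

def unescape_string(input)
  # Table hit: substitute the real character. Miss: drop the backslash.
  input.gsub(ECHAR) { |escaped| ESCAPE_CHARS[escaped] || escaped[1..-1] }
end

puts unescape_string('a\nb').inspect  # prints "a\nb" (inspect of a real newline)
puts unescape_string('\q')            # => "q" (unknown escape, backslash dropped)
```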
Instance Method Details
#each_token {|token| ... } ⇒ Enumerator Also known as: each
Enumerates each token in the input string.
# File 'lib/ebnf/ll1/lexer.rb', line 170

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end
#first(*types) ⇒ Token
Returns first token in input stream.
# File 'lib/ebnf/ll1/lexer.rb', line 185

def first(*types)
  return nil unless scanner
  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return @scanner = nil if scanner.eos?

    token = match_token(*types)

    if token.nil?
      lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
        input: scanner.rest[0..100], token: lexme, lineno: lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new(e.message,
    input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end
#recover(*types) ⇒ Token
Skip input until a token is matched
# File 'lib/ebnf/ll1/lexer.rb', line 227

def recover(*types)
  until scanner.eos? || tok = match_token(*types)
    # Skip past current "token"
    if scanner.skip_until(@whitespace || /\s/m).nil?
      # No whitespace at the end; must be an end of string
      scanner.terminate
    else
      skip_whitespace
    end
  end
  scanner.unscan if tok
  first
end
#shift ⇒ Token
Returns first token and shifts to next.
# File 'lib/ebnf/ll1/lexer.rb', line 216

def shift
  cur = first
  @first = nil
  cur
end
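Together, `first` (peek), `shift` (consume), and `each_token` form a one-token-lookahead protocol. The hypothetical `ToyLexer` below (not part of the library) illustrates how the three interact over a fixed token list:

```ruby
# Hypothetical stand-in illustrating the first/shift/each_token protocol.
class ToyLexer
  include Enumerable

  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Peek at the next token without consuming it.
  def first
    @tokens.first
  end

  # Return the next token and advance past it.
  def shift
    @tokens.shift
  end

  # Drain the stream, as the real each_token does via repeated shift.
  def each_token
    return enum_for(:each_token) unless block_given?
    while token = shift
      yield token
    end
  end
  alias_method :each, :each_token
end

lexer = ToyLexer.new(%i[a b c])
lexer.first              # => :a (stream unchanged)
lexer.shift              # => :a (stream advanced)
puts lexer.to_a.inspect  # => [:b, :c]
```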
#valid? ⇒ Boolean
Returns `true` if the input string is lexically valid.
To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.
# File 'lib/ebnf/ll1/lexer.rb', line 156

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end