Class: RDF::LL1::Lexer
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/ebnf/ll1/lexer.rb
Overview
A lexical analyzer
Defined Under Namespace
- Modules: Encoding
- Classes: Error, Token
Constant Summary
- ESCAPE_CHARS =
  {
    '\\t'  => "\t",  # \u0009 (tab)
    '\\n'  => "\n",  # \u000A (line feed)
    '\\r'  => "\r",  # \u000D (carriage return)
    '\\b'  => "\b",  # \u0008 (backspace)
    '\\f'  => "\f",  # \u000C (form feed)
    '\\"'  => '"',   # \u0022 (quotation mark, double quote mark)
    "\\'"  => '\'',  # \u0027 (apostrophe-quote, single quote mark)
    '\\\\' => '\\'   # \u005C (backslash)
  }
- ESCAPE_CHAR4 =
  \uXXXX
  /\\u(?:[0-9A-Fa-f]{4,4})/
- ESCAPE_CHAR8 =
  \UXXXXXXXX
  /\\U(?:[0-9A-Fa-f]{8,8})/
- ECHAR =
More liberal unescaping
/\\./
- UCHAR =
/#{ESCAPE_CHAR4}|#{ESCAPE_CHAR8}/
- COMMENT =
/#.*/
- WS =
/ |\t|\r|\n/m
- ML_START =
Beginning of terminals that may span lines
/\'\'\'|\"\"\"/
Instance Attribute Summary
- #comment ⇒ Regexp
  Defines the single-line comment pattern; defaults to COMMENT.
- #input ⇒ String
  The current input string being processed.
- #lineno ⇒ Integer (readonly)
  The current line number (one-based).
- #options ⇒ Hash (readonly)
  Any additional options for the lexer.
- #whitespace ⇒ Regexp
  Defines the whitespace pattern; defaults to WS.
Class Method Summary
- .tokenize(input, terminals, options = {}) {|lexer| ... } ⇒ Lexer
  Tokenizes the given `input` string or stream.
- .unescape_codepoints(string) ⇒ String
  Returns a copy of the given `string` with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
- .unescape_string(input) ⇒ String
  Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.
Instance Method Summary
- #each_token {|token| ... } ⇒ Enumerator (also: #each)
  Enumerates each token in the input string.
- #first ⇒ Token
  Returns the first token in the input stream.
- #initialize(input = nil, terminals = nil, options = {}) ⇒ Lexer (constructor)
  Initializes a new lexer instance.
- #recover ⇒ Token
  Skips input until a token is matched.
- #shift ⇒ Token
  Returns the first token and shifts to the next.
- #valid? ⇒ Boolean
  Returns `true` if the input string is lexically valid.
Constructor Details
#initialize(input = nil, terminals = nil, options = {}) ⇒ Lexer
Initializes a new lexer instance.
# File 'lib/ebnf/ll1/lexer.rb', line 125

def initialize(input = nil, terminals = nil, options = {})
  @options = options.dup
  @whitespace = @options[:whitespace] || WS
  @comment = @options[:comment] || COMMENT
  @unescape_terms = @options[:unescape_terms] || []
  @terminals = terminals

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @lineno = 1
  @scanner = Scanner.new(input) do |string|
    string.force_encoding(Encoding::UTF_8) if string.respond_to?(:force_encoding) # Ruby 1.9+
    string
  end
end
Instance Attribute Details
#comment ⇒ Regexp
Defines the single-line comment pattern; defaults to COMMENT.
# File 'lib/ebnf/ll1/lexer.rb', line 63

def comment
  @comment
end
#input ⇒ String
The current input string being processed.
# File 'lib/ebnf/ll1/lexer.rb', line 151

def input
  @input
end
#lineno ⇒ Integer (readonly)
The current line number (one-based).
# File 'lib/ebnf/ll1/lexer.rb', line 157

def lineno
  @lineno
end
#options ⇒ Hash (readonly)
Any additional options for the lexer.
# File 'lib/ebnf/ll1/lexer.rb', line 145

def options
  @options
end
#whitespace ⇒ Regexp
Defines the whitespace pattern; defaults to WS.
# File 'lib/ebnf/ll1/lexer.rb', line 58

def whitespace
  @whitespace
end
Class Method Details
.tokenize(input, terminals, options = {}) {|lexer| ... } ⇒ Lexer
Tokenizes the given `input` string or stream.
# File 'lib/ebnf/ll1/lexer.rb', line 108

def self.tokenize(input, terminals, options = {}, &block)
  lexer = self.new(input, terminals, options)
  block_given? ? block.call(lexer) : lexer
end
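The terminal-driven scanning loop that tokenization ultimately performs can be illustrated with a standalone sketch built on Ruby's stdlib StringScanner. The `TERMINALS` table and `toy_tokenize` helper below are hypothetical stand-ins for illustration, not part of this library's API:

```ruby
require 'strscan'

# Hypothetical terminal table: [type, regexp] pairs, analogous in spirit
# to the terminal definitions passed to Lexer.tokenize.
TERMINALS = [
  [:INTEGER, /\d+/],
  [:PLUS,    /\+/]
]

# Toy tokenizer: repeatedly skip whitespace, then try each terminal in order.
def toy_tokenize(input)
  scanner = StringScanner.new(input)
  tokens  = []
  until scanner.eos?
    scanner.skip(/\s+/)
    break if scanner.eos?
    type, _re = TERMINALS.find { |_t, re| scanner.scan(re) }
    raise "Invalid token at #{scanner.rest[0..10].inspect}" unless type
    tokens << [type, scanner.matched]
  end
  tokens
end

toy_tokenize("1 + 23")  # => [[:INTEGER, "1"], [:PLUS, "+"], [:INTEGER, "23"]]
```

As in the real lexer, an input prefix that matches no terminal raises an error identifying the offending text.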
.unescape_codepoints(string) ⇒ String
Returns a copy of the given `string` with all `\uXXXX` and `\UXXXXXXXX` Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
# File 'lib/ebnf/ll1/lexer.rb', line 73

def self.unescape_codepoints(string)
  # Decode \uXXXX and \UXXXXXXXX code points:
  string = string.gsub(UCHAR) do |c|
    s = [(c[2..-1]).hex].pack('U*')
    s.respond_to?(:force_encoding) ? s.force_encoding(Encoding::ASCII_8BIT) : s
  end
  string.force_encoding(Encoding::UTF_8) if string.respond_to?(:force_encoding) # Ruby 1.9+
  string
end
.unescape_string(input) ⇒ String
Returns a copy of the given `input` string with all string escape sequences (e.g. `\n` and `\t`) replaced with their unescaped UTF-8 character counterparts.
# File 'lib/ebnf/ll1/lexer.rb', line 92

def self.unescape_string(input)
  input.gsub(ECHAR) { |escaped| ESCAPE_CHARS[escaped] || escaped[1..-1] }
end
Instance Method Details
#each_token {|token| ... } ⇒ Enumerator Also known as: each
Enumerates each token in the input string.
# File 'lib/ebnf/ll1/lexer.rb', line 180

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end
#first ⇒ Token
Returns the first token in the input stream.
# File 'lib/ebnf/ll1/lexer.rb', line 194

def first
  return nil unless scanner
  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return @scanner = nil if scanner.eos?

    token = match_token
    if token.nil?
      lexme = (scanner.rest.split(/#{@whitespace}|#{@comment}/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
                      :input => scanner.rest[0..100], :token => lexme, :lineno => lineno)
    end
    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new("#{e.message} on line #{lineno + 1}",
                  :input => (scanner.rest[0..100] rescue '??'), :token => lexme, :lineno => lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end
#recover ⇒ Token
Skips input until a token is matched.
# File 'lib/ebnf/ll1/lexer.rb', line 235

def recover
  until scanner.eos? do
    begin
      shift
      return first
    rescue Error, ArgumentError
      # Ignore errors until something scans, or EOS.
      scanner.pos = scanner.pos + 1
    end
  end
end
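The recovery strategy is simply to advance the scan position one character at a time until something matches or the input is exhausted. A standalone sketch of that loop using StringScanner (the `recover` helper signature here is hypothetical):

```ruby
require 'strscan'

# Advance past garbage one character at a time until the terminal matches,
# mirroring the pos += 1 loop in Lexer#recover.
def recover(scanner, terminal)
  until scanner.eos?
    begin
      return scanner.scan(terminal) || raise(ArgumentError, "no match")
    rescue ArgumentError
      scanner.pos = scanner.pos + 1  # skip one character and retry
    end
  end
  nil
end

scanner = StringScanner.new("???abc")
recover(scanner, /[a-z]+/)  # => "abc"
```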
#shift ⇒ Token
Returns the first token and shifts to the next.
# File 'lib/ebnf/ll1/lexer.rb', line 225

def shift
  cur = first
  @first = nil
  cur
end
#valid? ⇒ Boolean
Returns `true` if the input string is lexically valid.
To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.
# File 'lib/ebnf/ll1/lexer.rb', line 166

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end