Class: Rouge::RegexLexer Abstract
Overview
A stateful lexer that uses sets of regular expressions to tokenize a string. Most lexers are subclasses of RegexLexer.
Direct Known Subclasses
Lexers::ABAP, Lexers::Actionscript, Lexers::Apache, Lexers::AppleScript, Lexers::Awk, Lexers::Bsl, Lexers::C, Lexers::CMake, Lexers::CSS, Lexers::CSharp, Lexers::Ceylon, Lexers::Cfscript, Lexers::Clojure, Lexers::Coffeescript, Lexers::CommonLisp, Lexers::Conf, Lexers::Coq, Lexers::Crystal, Lexers::D, Lexers::Dart, Lexers::Diff, Lexers::Docker, Lexers::Dot, Lexers::Eiffel, Lexers::Elixir, Lexers::Elm, Lexers::Erlang, Lexers::FSharp, Lexers::Factor, Lexers::Fortran, Lexers::Gherkin, Lexers::Go, Lexers::GraphQL, Lexers::Groovy, Lexers::HTML, Lexers::HTTP, Lexers::Haml, Lexers::Haskell, Lexers::Hcl, Lexers::HyLang, Lexers::IDLang, Lexers::INI, Lexers::IO, Lexers::IgorPro, Lexers::JSON, Lexers::Java, Lexers::Javascript, Lexers::Jsonnet, Lexers::Julia, Lexers::Kotlin, Lexers::LLVM, Lexers::Liquid, Lexers::LiterateCoffeescript, Lexers::LiterateHaskell, Lexers::Lua, Lexers::M68k, Lexers::MXML, Lexers::Make, Lexers::Markdown, Lexers::Mathematica, Lexers::Matlab, Lexers::Moonscript, Lexers::Mosel, Lexers::Nasm, Lexers::Nginx, Lexers::Nim, Lexers::Nix, Lexers::OCaml, Lexers::Pascal, Lexers::Perl, Lexers::Plist, Lexers::Pony, Lexers::Praat, Lexers::Prolog, Lexers::Prometheus, Lexers::Properties, Lexers::Protobuf, Lexers::Puppet, Lexers::Python, Lexers::Q, Lexers::R, Lexers::Racket, Lexers::Ruby, Lexers::Rust, Lexers::SML, Lexers::SQF, Lexers::SQL, Lexers::SassCommon, Lexers::Scala, Lexers::Scheme, Lexers::Sed, Lexers::Sed::Regex, Lexers::Sed::Replacement, Lexers::Shell, Lexers::Sieve, Lexers::Slim, Lexers::Smalltalk, Lexers::Swift, Lexers::TCL, Lexers::TOML, Lexers::Tap, Lexers::TeX, Lexers::Tulip, Lexers::Turtle, Lexers::VHDL, Lexers::Vala, Lexers::Verilog, Lexers::VimL, Lexers::VisualBasic, Lexers::Wollok, Lexers::XML, Lexers::YAML, TemplateLexer
Defined Under Namespace
Classes: Rule, State, StateDSL
Constant Summary
- MAX_NULL_SCANS = 5
  The number of successive scans permitted without consuming the input stream. If this is exceeded, the match fails.
Constants included from Token::Tokens
Token::Tokens::Num, Token::Tokens::Str
Instance Attribute Summary
Attributes inherited from Lexer
Class Method Summary
- .append(name, &b) ⇒ Object
- .get_state(name) ⇒ Object
- .prepend(name, &b) ⇒ Object
- .replace_state(name, new_defn) ⇒ Object
- .start(&b) ⇒ Object
  Specify an action to be run every fresh lex.
- .start_procs ⇒ Object
  The routines to run at the beginning of a fresh lex.
- .state(name, &b) ⇒ Object
  Define a new state for this lexer with the given name.
- .state_definitions ⇒ Object
- .states ⇒ Object
  The states hash for this lexer.
Instance Method Summary
- #delegate(lexer, text = nil) ⇒ Object
  Delegate the lex to another lexer.
- #get_state(state_name) ⇒ Object
- #goto(state_name) ⇒ Object
  Replace the head of the stack with the given state.
- #group(tok) ⇒ Object
  Deprecated.
- #groups(*tokens) ⇒ Object
  Yield tokens corresponding to the matched groups of the current match.
- #in_state?(state_name) ⇒ Boolean
  Check if `state_name` is in the state stack.
- #pop!(times = 1) ⇒ Object
  Pop the state stack.
- #push(state_name = nil, &b) ⇒ Object
  Push a state onto the stack.
- #recurse(text = nil) ⇒ Object
- #reset! ⇒ Object
  Reset this lexer to its initial state.
- #reset_stack ⇒ Object
  Reset the stack back to `[:root]`.
- #stack ⇒ Object
  The state stack.
- #state ⇒ Object
  The current state, i.e. the one on top of the state stack.
- #state?(state_name) ⇒ Boolean
  Check if `state_name` is the state on top of the state stack.
- #step(state, stream) ⇒ Object
  Runs one step of the lex.
- #stream_tokens(str, &b) ⇒ Object
  This implements the lexer protocol, by yielding [token, value] pairs.
- #token(tok, val = @current_stream[0]) ⇒ Object
  Yield a token.
Methods inherited from Lexer
aliases, all, #as_bool, #as_lexer, #as_list, #as_string, #as_token, assert_utf8!, #bool_option, debug_enabled?, demo, demo_file, desc, detect?, disable_debug!, enable_debug!, filenames, find, find_fancy, guess, guess_by_filename, guess_by_mimetype, guess_by_source, guesses, #hash_option, #initialize, #lex, lex, #lexer_option, #list_option, mimetypes, option, option_docs, #string_option, tag, #tag, title, #token_option
Methods included from Token::Tokens
Constructor Details
This class inherits a constructor from Rouge::Lexer
Class Method Details
.append(name, &b) ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 196
def self.append(name, &b)
  name = name.to_s
  dsl = state_definitions[name] or raise "no such state #{name.inspect}"
  replace_state(name, dsl.appended(&b))
end
.get_state(name) ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 203
def self.get_state(name)
  return name if name.is_a? State

  states[name.to_sym] ||= begin
    defn = state_definitions[name.to_s] or raise "unknown state: #{name.inspect}"
    defn.to_state(self)
  end
end
.prepend(name, &b) ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 190
def self.prepend(name, &b)
  name = name.to_s
  dsl = state_definitions[name] or raise "no such state #{name.inspect}"
  replace_state(name, dsl.prepended(&b))
end
.replace_state(name, new_defn) ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 163
def self.replace_state(name, new_defn)
  states[name] = nil
  state_definitions[name] = new_defn
end
.start(&b) ⇒ Object
Specify an action to be run every fresh lex.
# File 'lib/rouge/regex_lexer.rb', line 179
def self.start(&b)
  start_procs << b
end
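As a rough illustration of how start blocks are collected per class and then run against an instance, here is a minimal sketch in plain Ruby. `StartDemoLexer`, `ParentDemo`, and `ChildDemo` are hypothetical names, not Rouge classes, and the inheritance walk via `superclass` only approximates what Rouge does with its `InheritableList`.

```ruby
# Hypothetical stand-in for a lexer base class; not part of Rouge.
class StartDemoLexer
  # Procs inherited from ancestors, followed by this class's own procs.
  def self.start_procs
    own = (@start_procs ||= [])
    inherited = superclass.respond_to?(:start_procs) ? superclass.start_procs : []
    inherited + own
  end

  # Register a block to run at the beginning of every fresh lex.
  def self.start(&block)
    (@start_procs ||= []) << block
  end

  attr_reader :log

  # Evaluate every registered block with this instance as self.
  def reset!
    @log = []
    self.class.start_procs.each { |pr| instance_eval(&pr) }
  end
end

class ParentDemo < StartDemoLexer
  start { @log << :parent }
end

class ChildDemo < ParentDemo
  start { @log << :child }
end

lexer = ChildDemo.new
lexer.reset!
p lexer.log
```

Because the blocks are evaluated with `instance_eval`, instance variables set inside a start block belong to the lexer instance being reset, which is why they suit per-lex bookkeeping.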
.start_procs ⇒ Object
The routines to run at the beginning of a fresh lex.
# File 'lib/rouge/regex_lexer.rb', line 170
def self.start_procs
  @start_procs ||= InheritableList.new(superclass.start_procs)
end
.state(name, &b) ⇒ Object
Define a new state for this lexer with the given name. The block will be evaluated in the context of a StateDSL.
# File 'lib/rouge/regex_lexer.rb', line 185
def self.state(name, &b)
  name = name.to_s
  state_definitions[name] = StateDSL.new(name, &b)
end
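The idea of evaluating a block "in the context of" a DSL object can be sketched with `instance_eval`. The classes below (`MiniStateDSL`, `DslDemoLexer`) and the two-argument `rule` method are hypothetical simplifications for illustration; the real StateDSL is richer than this.

```ruby
# Hypothetical miniature of a state-definition DSL; not Rouge's StateDSL.
class MiniStateDSL
  attr_reader :name, :rules

  def initialize(name, &defn)
    @name = name
    @rules = []
    instance_eval(&defn) # the block runs with this DSL object as self
  end

  # Record a [regex, token] pair; rules are kept in definition order.
  def rule(re, token)
    @rules << [re, token]
  end
end

class DslDemoLexer
  def self.state_definitions
    @state_definitions ||= {}
  end

  # State names are normalized to strings before being stored.
  def self.state(name, &b)
    name = name.to_s
    state_definitions[name] = MiniStateDSL.new(name, &b)
  end

  state :root do
    rule(/\d+/, :number)
    rule(/\s+/, :text)
  end
end

root = DslDemoLexer.state_definitions['root']
p root.rules.map(&:last)
```

Storing definitions under string keys means `state :root` and `state 'root'` refer to the same entry.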
.state_definitions ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 158
def self.state_definitions
  @state_definitions ||= InheritableHash.new(superclass.state_definitions)
end
.states ⇒ Object
The states hash for this lexer.
# File 'lib/rouge/regex_lexer.rb', line 154
def self.states
  @states ||= {}
end
Instance Method Details
#delegate(lexer, text = nil) ⇒ Object
Delegate the lex to another lexer. The #lex method will be called with `:continue` set to true, so that #reset! will not be called. In this way, a single lexer can be repeatedly delegated to while maintaining its own internal state stack.
# File 'lib/rouge/regex_lexer.rb', line 364
def delegate(lexer, text=nil)
  puts " delegating to #{lexer.inspect}" if @debug
  text ||= @current_stream[0]

  lexer.lex(text, :continue => true) do |tok, val|
    puts " delegated token: #{tok.inspect}, #{val.inspect}" if @debug
    yield_token(tok, val)
  end
end
#get_state(state_name) ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 213
def get_state(state_name)
  self.class.get_state(state_name)
end
#goto(state_name) ⇒ Object
Replace the head (top) of the stack with the given state.
# File 'lib/rouge/regex_lexer.rb', line 408
def goto(state_name)
  raise 'empty stack!' if stack.empty?

  puts " going to state :#{state_name} " if @debug
  stack[-1] = get_state(state_name)
end
#group(tok) ⇒ Object
Deprecated: this method now raises an error; use #groups instead. It formerly yielded a token with the next matched group, with subsequent calls yielding subsequent groups.
# File 'lib/rouge/regex_lexer.rb', line 343
def group(tok)
  raise "RegexLexer#group is deprecated: use #groups instead"
end
#groups(*tokens) ⇒ Object
Yield tokens corresponding to the matched groups of the current match.
# File 'lib/rouge/regex_lexer.rb', line 349
def groups(*tokens)
  tokens.each_with_index do |tok, i|
    yield_token(tok, @current_stream[i+1])
  end
end
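The mapping from capture groups to tokens can be tried in isolation with Ruby's standard `StringScanner`. In this sketch, `tokens_for_groups` is a hypothetical helper and the token names are plain symbols rather than Rouge token classes; it only shows the index arithmetic (`i + 1`) used in the listing.

```ruby
require 'strscan'

# Pair each capture group of the most recent match with a token name.
# Group 0 is the whole match, so the first token maps to group 1.
def tokens_for_groups(scanner, *tokens)
  tokens.each_with_index.map { |tok, i| [tok, scanner[i + 1]] }
end

scanner = StringScanner.new('name = 42')
scanner.skip(/(\w+)(\s*)(=)/)
p tokens_for_groups(scanner, :Name, :Text, :Operator)
```

For the concatenated token values to cover the whole match, the regular expression should place every matched character inside exactly one capture group.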
#in_state?(state_name) ⇒ Boolean
Check if `state_name` is in the state stack.
# File 'lib/rouge/regex_lexer.rb', line 423
def in_state?(state_name)
  state_name = state_name.to_s
  stack.any? do |state|
    state.name == state_name.to_s
  end
end
#pop!(times = 1) ⇒ Object
Pop the state stack. If a number is passed in, it will be popped that number of times.
# File 'lib/rouge/regex_lexer.rb', line 397
def pop!(times=1)
  raise 'empty stack!' if stack.empty?

  puts " popping stack: #{times}" if @debug

  stack.pop(times)

  nil
end
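The stack operations documented on this page (push, pop!, goto) can be pictured with a small stand-alone model. `StackDemo` is a hypothetical class that stores plain symbols instead of State objects; it is a sketch of the stack discipline only.

```ruby
# Hypothetical state stack using plain symbols instead of State objects.
class StackDemo
  attr_reader :stack

  def initialize
    @stack = [:root]
  end

  def push(name)
    stack.push(name)
  end

  # Array#pop(n) removes up to n elements from the end in one call.
  def pop!(times = 1)
    raise 'empty stack!' if stack.empty?

    stack.pop(times)
    nil
  end

  # Swap the top of the stack without changing its depth.
  def goto(name)
    raise 'empty stack!' if stack.empty?

    stack[-1] = name
  end
end

demo = StackDemo.new
demo.push(:string)
demo.push(:escape)
demo.goto(:interp) # stack is now [:root, :string, :interp]
demo.pop!(2)       # stack is now [:root]
p demo.stack
```

Note that `Array#pop(n)` does not complain when `n` exceeds the stack depth, so popping too many times leaves an empty stack, and the error only surfaces on the next operation that needs a current state.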
#push(state_name = nil, &b) ⇒ Object
Push a state onto the stack. If no state name is given and you’ve passed a block, a state will be dynamically created using the StateDSL.
# File 'lib/rouge/regex_lexer.rb', line 381
def push(state_name=nil, &b)
  push_state = if state_name
    get_state(state_name)
  elsif block_given?
    StateDSL.new(b.inspect, &b).to_state(self.class)
  else
    # use the top of the stack by default
    self.state
  end

  puts " pushing :#{push_state.name}" if @debug
  stack.push(push_state)
end
#recurse(text = nil) ⇒ Object
# File 'lib/rouge/regex_lexer.rb', line 374
def recurse(text=nil)
  delegate(self.class, text)
end
#reset! ⇒ Object
Reset this lexer to its initial state. This runs all of the start_procs.
# File 'lib/rouge/regex_lexer.rb', line 234
def reset!
  @stack = nil
  @current_stream = nil

  puts "start blocks" if @debug && self.class.start_procs.any?
  self.class.start_procs.each do |pr|
    instance_eval(&pr)
  end
end
#reset_stack ⇒ Object
Reset the stack back to `[:root]`.
# File 'lib/rouge/regex_lexer.rb', line 416
def reset_stack
  puts ' resetting stack' if @debug
  stack.clear
  stack.push get_state(:root)
end
#stack ⇒ Object
The state stack. This is initially the single state `[:root]`. It is an error for this stack to be empty.
# File 'lib/rouge/regex_lexer.rb', line 220
def stack
  @stack ||= [get_state(:root)]
end
#state ⇒ Object
The current state, i.e. the one on top of the state stack.
NB: if the state stack is empty, this will raise an error rather than returning nil.
# File 'lib/rouge/regex_lexer.rb', line 228
def state
  stack.last or raise 'empty stack!'
end
#state?(state_name) ⇒ Boolean
Check if `state_name` is the state on top of the state stack.
# File 'lib/rouge/regex_lexer.rb', line 431
def state?(state_name)
  state_name.to_s == state.name
end
#step(state, stream) ⇒ Object
Runs one step of the lex. Rules in the current state are tried until one matches, at which point its callback is called.
# File 'lib/rouge/regex_lexer.rb', line 289
def step(state, stream)
  state.rules.each do |rule|
    if rule.is_a?(State)
      puts " entering mixin #{rule.name}" if @debug
      return true if step(rule, stream)
      puts " exiting mixin #{rule.name}" if @debug
    else
      puts " trying #{rule.inspect}" if @debug

      # XXX HACK XXX
      # StringScanner's implementation of ^ is b0rken.
      # see http://bugs.ruby-lang.org/issues/7092
      # TODO: this doesn't cover cases like /(a|^b)/, but it's
      # the most common, for now...
      next if rule.beginning_of_line && !stream.beginning_of_line?

      if (size = stream.skip(rule.re))
        puts " got #{stream[0].inspect}" if @debug

        instance_exec(stream, &rule.callback)

        if size.zero?
          @null_steps += 1
          if @null_steps > MAX_NULL_SCANS
            puts " too many scans without consuming the string!" if @debug
            return false
          end
        else
          @null_steps = 0
        end

        return true
      end
    end
  end

  false
end
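The null-scan guard in this method is easiest to see with a rule that can match the empty string. The sketch below is a simplified, stand-alone version: `try_rules` and the `counter` hash are hypothetical, rules are bare `[regex, token]` pairs with no callbacks or mixins, and only the counting logic mirrors the listing.

```ruby
require 'strscan'

MAX_NULL_SCANS = 5 # same value as the constant documented on this page

# Try each [regex, token] rule at the current position and report the
# first hit. Returns [token, text] on a match, or nil when nothing matched
# or when too many successive matches consumed no input.
def try_rules(rules, stream, counter)
  rules.each do |re, token|
    size = stream.skip(re)
    next unless size

    if size.zero?
      counter[:null_steps] += 1
      return nil if counter[:null_steps] > MAX_NULL_SCANS
    else
      counter[:null_steps] = 0
    end
    return [token, stream[0]]
  end
  nil
end

counter = { null_steps: 0 }
stream = StringScanner.new('abc')
rules = [[/x*/, :maybe_x]] # matches the empty string at every position

results = Array.new(7) { try_rules(rules, stream, counter) }
p results.compact.length
p stream.pos
```

Five empty matches are tolerated and the sixth is reported as a failure, which is what lets the calling loop fall back to consuming a character instead of spinning forever on a rule that never advances.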
#stream_tokens(str, &b) ⇒ Object
This implements the lexer protocol, by yielding [token, value] pairs.
The process for lexing works as follows, until the stream is empty:
1. We look at the state on top of the stack (which by default is `[:root]`).
2. Each rule in that state is tried until one is successful. If one is found, that rule's callback is evaluated, which may yield tokens and manipulate the state stack. Otherwise, one character is consumed with an `Error` token, and we continue at (1.)
# File 'lib/rouge/regex_lexer.rb', line 256
def stream_tokens(str, &b)
  stream = StringScanner.new(str)

  @current_stream = stream
  @output_stream = b
  @states = self.class.states
  @null_steps = 0

  until stream.eos?
    if @debug
      puts "lexer: #{self.class.tag}"
      puts "stack: #{stack.map(&:name).map(&:to_sym).inspect}"
      puts "stream: #{stream.peek(20).inspect}"
    end

    success = step(state, stream)

    if !success
      puts " no match, yielding Error" if @debug
      b.call(Token::Tokens::Error, stream.getch)
    end
  end
end
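The lexing loop described for this method can be condensed into a few lines of plain Ruby. Everything here is an illustrative assumption: `RULES`, `mini_lex`, the symbol token names, and the `:pop`/push action encoding are hypothetical, standing in for Rouge's rule callbacks and token classes.

```ruby
require 'strscan'

# Hypothetical rule table: state name => ordered [regex, token, action] rules.
# An action is :pop, a state name to push, or nil to stay in the same state.
RULES = {
  root: [
    [/\s+/, :Text, nil],
    [/\d+/, :Num, nil],
    [/"/, :Str, :string]
  ],
  string: [
    [/[^"]+/, :Str, nil],
    [/"/, :Str, :pop]
  ]
}.freeze

def mini_lex(input)
  stream = StringScanner.new(input)
  stack = [:root]
  out = []

  until stream.eos?
    # Find the first rule of the top-of-stack state that matches here.
    hit = RULES.fetch(stack.last).find { |re, _tok, _act| stream.skip(re) }

    if hit
      _re, tok, act = hit
      out << [tok, stream[0]]
      if act == :pop
        stack.pop
      elsif act
        stack.push(act)
      end
    else
      # No rule matched: consume one character as an Error token.
      out << [:Error, stream.getch]
    end
  end
  out
end

p mini_lex('12 "hi" ?')
```

The Error fallback guarantees progress: every pass through the loop either applies a rule or consumes one character, so unmatched input produces Error tokens rather than stalling the loop.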
#token(tok, val = @current_stream[0]) ⇒ Object
Yield a token.
# File 'lib/rouge/regex_lexer.rb', line 335
def token(tok, val=@current_stream[0])
  yield_token(tok, val)
end