Class: Rouge::RegexLexer (Abstract)

Inherits: Lexer < Object

Defined in: lib/rouge/regex_lexer.rb

Overview

This class is abstract.

A stateful lexer that uses sets of regular expressions to tokenize a string. Most lexers are instances of RegexLexer.

Defined Under Namespace

Classes: Rule, State, StateDSL

Constant Summary

MAX_NULL_SCANS = 5

The number of successive scans permitted without consuming the input stream. If this is exceeded, the match fails.

Constants included from Token::Tokens

Token::Tokens::Num, Token::Tokens::Str

Class Method Summary

Instance Method Summary

Methods inherited from Lexer

aliases, all, analyze_text, assert_utf8!, #debug, default_options, demo, demo_file, desc, filenames, find, find_fancy, guess, guess_by_filename, guess_by_mimetype, guess_by_source, guesses, #initialize, #lex, lex, mimetypes, #option, #options, tag, #tag

Methods included from Token::Tokens

token

Constructor Details

This class inherits a constructor from Rouge::Lexer

Class Method Details

.append(state, &b) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 173

def self.append(state, &b)
  name = state.to_s
  dsl = state_definitions[name] or raise "no such state #{name.inspect}"
  replace_state(name, dsl.appended(&b))
end

.get_state(name) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 180

def self.get_state(name)
  return name if name.is_a? State

  name = name.to_s

  states[name] ||= begin
    defn = state_definitions[name] or raise "unknown state: #{name.inspect}"
    defn.to_state(self)
  end
end

.prepend(name, &b) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 167

def self.prepend(name, &b)
  name = name.to_s
  dsl = state_definitions[name] or raise "no such state #{name.inspect}"
  replace_state(name, dsl.prepended(&b))
end
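
For example, a subclass can modify a state it inherits from its parent lexer. The sketch below is illustrative only (the class name, tag, and rule are hypothetical): prepending places the new rule before the inherited rules, while .append would place it after them.

class RubyWithDirectives < Rouge::Lexers::Ruby
  tag 'ruby-directives'   # hypothetical tag

  # Tried before all of the rules the Ruby lexer defines for :root.
  prepend :root do
    rule %r/^#!directive.*$/, Comment::Preproc
  end
end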

.replace_state(name, new_defn) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 140

def self.replace_state(name, new_defn)
  states[name] = nil
  state_definitions[name] = new_defn
end

.start(&b) ⇒ Object

Specify an action to be run every fresh lex.

Examples:

start { puts "I'm lexing a new string!" }

# File 'lib/rouge/regex_lexer.rb', line 156

def self.start(&b)
  start_procs << b
end

.start_procsObject

The routines to run at the beginning of a fresh lex.

See Also:

  • .start

# File 'lib/rouge/regex_lexer.rb', line 147

def self.start_procs
  @start_procs ||= InheritableList.new(superclass.start_procs)
end

.state(name, &b) ⇒ Object

Define a new state for this lexer with the given name. The block will be evaluated in the context of a StateDSL.


# File 'lib/rouge/regex_lexer.rb', line 162

def self.state(name, &b)
  name = name.to_s
  state_definitions[name] = StateDSL.new(name, &b)
end
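
For example, a minimal, hypothetical lexer might be declared like this (the language, tag, and rules are illustrative only). Inside each block, `rule` comes from the StateDSL: it pairs a regex with a token type and, optionally, a state to push (or `:pop!` to pop).

class MyLang < Rouge::RegexLexer
  tag 'mylang'   # hypothetical tag

  state :root do
    rule %r/\s+/, Text::Whitespace
    rule %r/#.*$/, Comment::Single
    rule %r/\d+/, Num::Integer
    rule %r/"/, Str::Double, :string   # enter the :string state
    rule %r/\w+/, Name
  end

  state :string do
    rule %r/[^"\\]+/, Str::Double
    rule %r/\\./, Str::Escape
    rule %r/"/, Str::Double, :pop!     # leave the :string state
  end
end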

.state_definitionsObject


# File 'lib/rouge/regex_lexer.rb', line 135

def self.state_definitions
  @state_definitions ||= InheritableHash.new(superclass.state_definitions)
end

.statesObject

The states hash for this lexer.

See Also:

  • .state

# File 'lib/rouge/regex_lexer.rb', line 131

def self.states
  @states ||= {}
end

Instance Method Details

#delegate(lexer, text = nil) ⇒ Object

Delegate the lex to another lexer. The #lex method will be called with `:continue` set to true, so that #reset! will not be called. In this way, a single lexer can be repeatedly delegated to while maintaining its own internal state stack.

Parameters:

  • lexer (#lex)

    The lexer or lexer class to delegate to

  • text (String) (defaults to: nil)

    The text to delegate. This defaults to the last matched string.


# File 'lib/rouge/regex_lexer.rb', line 350

def delegate(lexer, text=nil)
  debug { "    delegating to #{lexer.inspect}" }
  text ||= @current_stream[0]

  lexer.lex(text, :continue => true) do |tok, val|
    debug { "    delegated token: #{tok.inspect}, #{val.inspect}" }
    yield_token(tok, val)
  end
end
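
As an illustrative sketch (the state and rule are hypothetical), an HTML-like lexer could hand the body of a <script> tag to Rouge's JavaScript lexer, keeping a single delegate instance so it can carry its own state across chunks:

state :script_content do
  rule %r{(.*?)(</script>)}m do |m|
    # Reuse one delegate instance so its internal state stack survives
    # between successive chunks.
    @js_lexer ||= Rouge::Lexers::Javascript.new(options)
    delegate @js_lexer, m[1]   # lex capture group 1 as JavaScript
    token Name::Tag, m[2]      # the closing tag itself
    pop!
  end
end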

#get_state(state_name) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 192

def get_state(state_name)
  self.class.get_state(state_name)
end

#goto(state_name) ⇒ Object

Replace the state on top of the stack with the given state.


# File 'lib/rouge/regex_lexer.rb', line 394

def goto(state_name)
  raise 'empty stack!' if stack.empty?
  stack[-1] = get_state(state_name)
end
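
Unlike #push, this swaps the current state instead of nesting a new one on top. A hypothetical sketch (the state names are illustrative):

state :front_matter do
  # A "---" separator ends the front matter and switches the lexer
  # into :body without growing the stack.
  rule %r/^-{3,}\s*$/ do
    token Punctuation
    goto :body
  end
end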

#group(tok) ⇒ Object

Yield a token with the next matched group. Subsequent calls to this method will yield subsequent groups.


# File 'lib/rouge/regex_lexer.rb', line 331

def group(tok)
  yield_token(tok, @current_stream[@group_count += 1])
end

#groups(*tokens) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 335

def groups(*tokens)
  tokens.each_with_index do |tok, i|
    yield_token(tok, @current_stream[i+1])
  end
end
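
A hypothetical sketch (the rule and token choices are illustrative): #groups yields one token per capture group in order, which is equivalent to calling #group once per group.

state :root do
  # "key = value" yields three tokens, one per capture group
  rule %r/(\w+)(\s*=\s*)(\w+)/ do
    groups Name::Attribute, Operator, Str
    # equivalently:
    #   group Name::Attribute
    #   group Operator
    #   group Str
  end
end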

#in_state?(state_name) ⇒ Boolean

Check if `state_name` is in the state stack.

Returns:

  • (Boolean)

# File 'lib/rouge/regex_lexer.rb', line 407

def in_state?(state_name)
  state_name = state_name.to_s
  stack.any? do |state|
    state.name == state_name.to_s
  end
end
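
This can be used inside a rule callback to make behaviour depend on how the current state was reached. A hypothetical sketch (the :directive state is illustrative):

rule %r/\n/ do
  token Text::Whitespace
  # only pop when this rule fired somewhere inside the :directive state
  pop! if in_state? :directive
end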

#pop!(times = 1) ⇒ Object

Pop the state stack. If a number is passed in, it will be popped that number of times.


# File 'lib/rouge/regex_lexer.rb', line 383

def pop!(times=1)
  raise 'empty stack!' if stack.empty?

  debug { "    popping stack: #{times}" }

  stack.pop(times)

  nil
end
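
A hypothetical sketch (the states are illustrative) popping two states at once, e.g. to leave a nested interpolation and its enclosing string together:

rule %r/\}"/ do
  token Str::Double
  pop! 2   # leave both :interpolation and :string
end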

#push(state_name = nil, &b) ⇒ Object

Push a state onto the stack. If no state name is given and you've passed a block, a state will be dynamically created using the StateDSL.


# File 'lib/rouge/regex_lexer.rb', line 367

def push(state_name=nil, &b)
  push_state = if state_name
    get_state(state_name)
  elsif block_given?
    StateDSL.new(b.inspect, &b).to_state(self.class)
  else
    # use the top of the stack by default
    self.state
  end

  debug { "    pushing #{push_state.name}" }
  stack.push(push_state)
end
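
A hypothetical sketch showing both forms (the rules are illustrative): pushing a named state from a rule, and pushing an anonymous state built from a block.

state :root do
  # push the named :string state when a quote is seen
  rule %r/"/, Str::Double, :string

  # push an anonymous, block-defined state for a braced section
  rule %r/\{/ do
    token Punctuation
    push do
      rule %r/\}/, Punctuation, :pop!
      rule %r/[^{}]+/, Text
    end
  end
end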

#recurse(text = nil) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 360

def recurse(text=nil)
  delegate(self.class, text)
end

#reset!Object

Reset this lexer to its initial state. This runs all of the start_procs.


# File 'lib/rouge/regex_lexer.rb', line 213

def reset!
  @stack = nil
  @current_stream = nil

  self.class.start_procs.each do |pr|
    instance_eval(&pr)
  end
end

#reset_stackObject

Reset the stack back to `[:root]`.


# File 'lib/rouge/regex_lexer.rb', line 400

def reset_stack
  debug { '    resetting stack' }
  stack.clear
  stack.push get_state(:root)
end

#run_callback(stream, callback, &output_stream) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 281

def run_callback(stream, callback, &output_stream)
  with_output_stream(output_stream) do
    @group_count = 0
    instance_exec(stream, &callback)
  end
end

#run_rule(rule, scanner, &b) ⇒ Object


# File 'lib/rouge/regex_lexer.rb', line 293

def run_rule(rule, scanner, &b)
  # XXX HACK XXX
  # StringScanner's implementation of ^ is b0rken.
  # see http://bugs.ruby-lang.org/issues/7092
  # TODO: this doesn't cover cases like /(a|^b)/, but it's
  # the most common, for now...
  return false if rule.beginning_of_line? && !scanner.beginning_of_line?

  if (@null_steps ||= 0) >= MAX_NULL_SCANS
    debug { "    too many scans without consuming the string!" }
    return false
  end

  scanner.scan(rule.re) or return false

  if scanner.matched_size.zero?
    @null_steps += 1
  else
    @null_steps = 0
  end

  true
end

#stackObject

The state stack. Initially this contains just the `:root` state. It is an error for this stack to be empty.

See Also:

  • #state

# File 'lib/rouge/regex_lexer.rb', line 199

def stack
  @stack ||= [get_state(:root)]
end

#stateObject

The current state, i.e. the one on top of the state stack.

NB: if the state stack is empty, this will raise an error rather than returning nil.


# File 'lib/rouge/regex_lexer.rb', line 207

def state
  stack.last or raise 'empty stack!'
end

#state?(state_name) ⇒ Boolean

Check if `state_name` is the state on top of the state stack.

Returns:

  • (Boolean)

# File 'lib/rouge/regex_lexer.rb', line 415

def state?(state_name)
  state_name.to_s == state.name
end

#step(state, stream, &b) ⇒ Object

Runs one step of the lex. Rules in the current state are tried until one matches, at which point its callback is called.

Returns:

  • true if a rule was tried successfully

  • false otherwise.


# File 'lib/rouge/regex_lexer.rb', line 257

def step(state, stream, &b)
  state.rules.each do |rule|
    case rule
    when State
      debug { "  entering mixin #{rule.name}" }
      return true if step(rule, stream, &b)
      debug { "  exiting  mixin #{rule.name}" }
    when Rule
      debug { "  trying #{rule.inspect}" }

      if run_rule(rule, stream)
        debug { "    got #{stream[0].inspect}" }

        run_callback(stream, rule.callback, &b)

        return true
      end
    end
  end

  false
end

#stream_tokens(str, &b) ⇒ Object

This implements the lexer protocol, by yielding [token, value] pairs.

The process for lexing works as follows, until the stream is empty:

  1. We look at the state on top of the stack (by default, the `:root` state).

  2. Each rule in that state is tried until one is successful. If one is found, that rule's callback is evaluated, which may yield tokens and manipulate the state stack. Otherwise, one character is consumed with an `'Error'` token, and we continue at step 1.


# File 'lib/rouge/regex_lexer.rb', line 234

def stream_tokens(str, &b)
  stream = StringScanner.new(str)

  @current_stream = stream

  until stream.eos?
    debug { "lexer: #{self.class.tag}" }
    debug { "stack: #{stack.map(&:name).inspect}" }
    debug { "stream: #{stream.peek(20).inspect}" }
    success = step(get_state(state), stream, &b)

    if !success
      debug { "    no match, yielding Error" }
      b.call(Token::Tokens::Error, stream.getch)
    end
  end
end
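
From the caller's side, this protocol is what #lex exposes: each yielded pair is a token type and the slice of input it covers. A minimal usage sketch (the input string is arbitrary):

require 'rouge'

lexer = Rouge::Lexers::Ruby.new
lexer.lex("x = 1\n").each do |tok, val|
  puts "#{tok.qualname}\t#{val.inspect}"
end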

#token(tok, val = :__absent__) ⇒ Object

Yield a token.

Parameters:

  • tok

    the token type

  • val (defaults to: :__absent__)

    (optional) the string value to yield. If absent, this defaults to the entire last match.


# File 'lib/rouge/regex_lexer.rb', line 324

def token(tok, val=:__absent__)
  val = @current_stream[0] if val == :__absent__
  yield_token(tok, val)
end
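
A hypothetical sketch showing both forms (the rules are illustrative): yielding the whole match by default, and passing an explicit value.

# value defaults to the entire match
rule %r/\d+\.\d+/ do
  token Num::Float
end

# explicit value, plus a state change
rule %r/<!--/ do
  token Comment, '<!--'
  push :comment
end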