Class: Rouge::RegexLexer Abstract

Inherits:
Lexer
  • Object
Defined in:
lib/rouge/regex_lexer.rb

Overview

This class is abstract.

A stateful lexer that uses sets of regular expressions to tokenize a string. Most lexers are subclasses of RegexLexer.

Defined Under Namespace

Classes: Rule, State, StateDSL

Constant Summary

MAX_NULL_SCANS = 5

The number of successive scans permitted without consuming the input stream. If this is exceeded, the match fails.


Methods inherited from Lexer

aliases, all, analyze_text, assert_utf8!, #debug, default_options, demo, demo_file, desc, filenames, find, find_fancy, guess, guess_by_filename, guess_by_mimetype, guess_by_source, #initialize, lex, #lex, mimetypes, #option, #options, register, #tag, tag

Constructor Details

This class inherits a constructor from Rouge::Lexer

Class Method Details

.get_state(name) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 157

def self.get_state(name)
  return name if name.is_a? State

  state = states[name.to_s]
  raise "unknown state: #{name}" unless state
  state.load!
end

.postprocess(toktype) {|tok, val| ... } ⇒ Object

Specify a filter to be applied as the lexer yields tokens.

Parameters:

  • toktype

    The token type to postprocess

Yields:

  • (tok, val)

    The token and the matched value. The block will be evaluated in the context of the lexer, and it must yield an equivalent token/value pair, usually by calling #token.



# File 'lib/rouge/regex_lexer.rb', line 138

def self.postprocess(toktype, &b)
  postprocesses << [Token[toktype], b]
end
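
The way a registered filter might be looked up and applied can be sketched without Rouge itself. This is a minimal illustration, not the library's implementation: `ToyFilterLexer`, its `emit` and `run` methods, and the use of plain strings as token types are all assumptions made for the example.

```ruby
# Toy model of postprocess dispatch. Token types are plain strings here,
# whereas the real lexer uses Token objects; every name is hypothetical.
class ToyFilterLexer
  def self.postprocesses
    @postprocesses ||= []
  end

  # Register a block to run for tokens of the given type.
  def self.postprocess(toktype, &block)
    postprocesses << [toktype, block]
  end

  postprocess('Name') { |tok, val| emit(tok, val.upcase) }

  def initialize
    @out = []
  end

  def emit(tok, val)
    @out << [tok, val]
  end

  # Pass each pair through the first filter registered for its type, if any.
  def run(pairs)
    pairs.each do |tok, val|
      _, processor = self.class.postprocesses.find { |t, _| t == tok }
      if processor
        instance_exec(tok, val, &processor) # block runs in the lexer's context
      else
        emit(tok, val)
      end
    end
    @out
  end
end

filtered = ToyFilterLexer.new.run([['Name', 'foo'], ['Text', ' ']])
```

Because the block is evaluated with `instance_exec`, it can call the instance's own emitting method, which is why the documentation asks the block to re-emit an equivalent token/value pair.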

.postprocesses ⇒ Object

Where the postprocess blocks are stored.

# File 'lib/rouge/regex_lexer.rb', line 144

def self.postprocesses
  @postprocesses ||= InheritableList.new(superclass.postprocesses)
end

.start(&b) ⇒ Object

Specify an action to be run at the start of every fresh lex.

Examples:

start { puts "I'm lexing a new string!" }


# File 'lib/rouge/regex_lexer.rb', line 126

def self.start(&b)
  start_procs << b
end

.start_procs ⇒ Object

The routines to run at the beginning of a fresh lex.

# File 'lib/rouge/regex_lexer.rb', line 117

def self.start_procs
  @start_procs ||= InheritableList.new(superclass.start_procs)
end
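
The listing builds each class's list on top of its superclass's list, so a subclass picks up the start blocks of its ancestors. The sketch below illustrates that idea under stated assumptions: `ToyInheritableList`, `ToyBase`, and `ToyChild` are hypothetical stand-ins, and the real InheritableList may differ in detail.

```ruby
# Hypothetical stand-in for an inheritable list: iterating yields the
# parent's entries first, then this list's own entries.
class ToyInheritableList
  include Enumerable

  def initialize(parent = [])
    @parent = parent
    @own = []
  end

  def <<(entry)
    @own << entry
    self
  end

  def each(&block)
    @parent.each(&block)
    @own.each(&block)
  end
end

class ToyBase
  def self.start_procs
    @start_procs ||= ToyInheritableList.new(
      superclass.respond_to?(:start_procs) ? superclass.start_procs : []
    )
  end

  def self.start(&block)
    start_procs << block
  end

  attr_reader :log

  # Run every registered start block in the context of this instance.
  def reset!
    @log = []
    self.class.start_procs.each { |pr| instance_eval(&pr) }
  end
end

class ToyChild < ToyBase
  start { @log << :child }
end

# Registered after ToyChild was defined, yet it still runs first,
# because the child's list reads through to the parent's list.
ToyBase.start { @log << :base }
```

Calling `reset!` on a `ToyChild` instance in this sketch runs the base block before the child block.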

.state(name, &b) ⇒ Object

Define a new state for this lexer with the given name. The block will be evaluated in the context of a StateDSL.



# File 'lib/rouge/regex_lexer.rb', line 151

def self.state(name, &b)
  name = name.to_s
  states[name] = State.new(name, &b)
end
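
As a rough illustration of evaluating a definition block in the context of a DSL object, the following self-contained sketch stores the block and only evaluates it when `load!` is first called. `ToyRule`, `ToyStateDSL`, `ToyState`, and `ToyLexerClass` are hypothetical names; the real State and StateDSL classes offer more than is shown here.

```ruby
# Hypothetical miniature of a state-definition DSL; not Rouge's classes.
ToyRule = Struct.new(:re, :callback)

class ToyStateDSL
  attr_reader :rules

  def initialize
    @rules = []
  end

  # Record a pattern and the block to run when it matches.
  def rule(re, &callback)
    @rules << ToyRule.new(re, callback)
  end
end

class ToyState
  attr_reader :name, :rules

  def initialize(name, &defn)
    @name = name
    @defn = defn
    @rules = nil
  end

  # Evaluate the definition block once, on first use, and return self.
  def load!
    return self if @rules

    dsl = ToyStateDSL.new
    dsl.instance_eval(&@defn)
    @rules = dsl.rules
    self
  end
end

class ToyLexerClass
  def self.states
    @states ||= {}
  end

  def self.state(name, &block)
    states[name.to_s] = ToyState.new(name.to_s, &block)
  end

  state :root do
    rule(/\d+/) { :number }
    rule(/\s+/) { :space }
  end
end
```

Deferring evaluation until `load!` means that defining many states costs little until a state is actually used.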

.states ⇒ Object

The states hash for this lexer.

# File 'lib/rouge/regex_lexer.rb', line 111

def self.states
  @states ||= {}
end

Instance Method Details

#delegate(lexer, text = nil) ⇒ Object

Delegate the lex to another lexer. The #lex method will be called with `:continue` set to true, so that #reset! will not be called. In this way, a single lexer can be repeatedly delegated to while maintaining its own internal state stack.

Parameters:

  • lexer (#lex)

    The lexer or lexer class to delegate to

  • text (String) (defaults to: nil)

    The text to delegate. This defaults to the last matched string.



# File 'lib/rouge/regex_lexer.rb', line 344

def delegate(lexer, text=nil)
  debug { "    delegating to #{lexer.inspect}" }
  text ||= @last_match[0]

  lexer.lex(text, :continue => true) do |tok, val|
    debug { "    delegated token: #{tok.inspect}, #{val.inspect}" }
    token(tok, val)
  end
end
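
The effect of continuing rather than resetting can be shown with a toy pair of objects. This is only a sketch: `ToyCountingLexer` and `ToyParent` are hypothetical, and the toy uses a keyword argument where the listing passes an options hash.

```ruby
# Hypothetical sub-lexer that keeps a running count as its internal state.
class ToyCountingLexer
  def initialize
    @seen = 0
  end

  def reset!
    @seen = 0
  end

  # Yield one [token, value] pair per word; reset first unless continuing.
  def lex(text, continue: false)
    reset! unless continue
    text.split.each do |word|
      @seen += 1
      yield "Word#{@seen}", word
    end
  end
end

class ToyParent
  attr_reader :output

  def initialize(child)
    @child = child
    @output = []
  end

  # Forward a chunk of text to the child and re-emit its tokens as our own.
  def delegate(text)
    @child.lex(text, continue: true) { |tok, val| @output << [tok, val] }
  end
end

parent = ToyParent.new(ToyCountingLexer.new)
parent.delegate('a b')
parent.delegate('c') # the child's count carries over from the first call
```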

#get_state(state_name) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 166

def get_state(state_name)
  self.class.get_state(state_name)
end

#group(tok) ⇒ Object

Yield a token with the next matched group. Subsequent calls to this method will yield subsequent groups.



# File 'lib/rouge/regex_lexer.rb', line 331

def group(tok)
  token(tok, @last_match[@group_count += 1])
end
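
The counter-based walk over capture groups can be tried directly against Ruby's StringScanner. In this sketch `ToyGroupReader` and `next_group` are hypothetical names; only `StringScanner#scan` and `StringScanner#[]` come from the standard library.

```ruby
require 'strscan'

# Hypothetical helper mirroring the counter idea: each call returns the
# next capture group of the scanner's last match.
class ToyGroupReader
  def initialize(scanner)
    @scanner = scanner
    @group_count = 0
  end

  def next_group
    @scanner[@group_count += 1]
  end
end

scanner = StringScanner.new('key = value')
scanner.scan(/(\w+)(\s*=\s*)(\w+)/)
reader = ToyGroupReader.new(scanner)
groups = [reader.next_group, reader.next_group, reader.next_group]
```

Since the counter is pre-incremented, the first call reads group 1, leaving group 0 (the whole match) for callers that want the entire matched text.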

#in_state?(state_name) ⇒ Boolean

Check if `state_name` is in the state stack.

Returns:

  • (Boolean)


# File 'lib/rouge/regex_lexer.rb', line 388

def in_state?(state_name)
  stack.map(&:name).include? state_name.to_s
end

#pop!(times = 1) ⇒ Object

Pop the state stack. If a number is passed in, the stack is popped that many times.



# File 'lib/rouge/regex_lexer.rb', line 373

def pop!(times=1)
  raise 'empty stack!' if stack.empty?

  debug { "    popping stack: #{times}" }
  times.times { stack.pop }
end

#push(state_name = nil, &b) ⇒ Object

Push a state onto the stack. If no state name is given and you’ve passed a block, a state will be dynamically created using the StateDSL.



# File 'lib/rouge/regex_lexer.rb', line 357

def push(state_name=nil, &b)
  push_state = if state_name
    get_state(state_name)
  elsif block_given?
    State.new(b.inspect, &b).load!
  else
    # use the top of the stack by default
    self.state
  end

  debug { "    pushing #{push_state.name}" }
  stack.push(push_state)
end
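
The stack operations documented here (#push, #pop!, #in_state?, #state?) can be modelled with a plain array. The sketch below is a simplification under stated assumptions: `ToyStack` is hypothetical and stores state names as strings rather than State objects, and it omits block-defined states.

```ruby
# Hypothetical state stack holding plain state names instead of State objects.
class ToyStack
  attr_reader :stack

  def initialize
    @stack = ['root']
  end

  # The state on top of the stack; an empty stack is an error.
  def state
    stack.last or raise 'empty stack!'
  end

  # With no argument, push another copy of the current top state.
  def push(name = nil)
    stack.push(name ? name.to_s : state)
  end

  def pop!(times = 1)
    raise 'empty stack!' if stack.empty?

    times.times { stack.pop }
  end

  # True if the state appears anywhere in the stack.
  def in_state?(name)
    stack.include?(name.to_s)
  end

  # True only if the state is on top of the stack.
  def state?(name)
    name.to_s == state
  end
end

lexer = ToyStack.new
lexer.push(:string)
lexer.push(:interpolation)
lexer.push      # duplicates the top state
lexer.pop!(2)   # leaves 'root' and 'string' on the stack
```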

#reset! ⇒ Object

Reset this lexer to its initial state. This runs all of the start_procs.



# File 'lib/rouge/regex_lexer.rb', line 187

def reset!
  @stack = nil

  self.class.start_procs.each do |pr|
    instance_eval(&pr)
  end
end

#reset_stack ⇒ Object

Reset the stack back to `[:root]`.



# File 'lib/rouge/regex_lexer.rb', line 381

def reset_stack
  debug { '    resetting stack' }
  stack.clear
  stack.push get_state(:root)
end

#run_callback(stream, callback, &output_stream) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 275

def run_callback(stream, callback, &output_stream)
  with_output_stream(output_stream) do
    @group_count = 0
    @last_match = stream
    instance_exec(stream, &callback)
    @last_match = nil
  end
end

#run_rule(rule, stream, &b) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 250

def run_rule(rule, stream, &b)
  case rule
  when String
    debug { "  entering mixin #{rule}" }
    res = step(get_state(rule), stream, &b)
    debug { "  exiting  mixin #{rule}" }
    res
  when Rule
    debug { "  trying #{rule.inspect}" }
    # XXX HACK XXX
    # StringScanner's implementation of ^ is b0rken.
    # see http://bugs.ruby-lang.org/issues/7092
    # TODO: this doesn't cover cases like /(a|^b)/, but it's
    # the most common, for now...
    return false if rule.beginning_of_line? && !stream.beginning_of_line?

    scan(stream, rule.re) do
      debug { "    got #{stream[0].inspect}" }

      run_callback(stream, rule.callback, &b)
    end
  end
end
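
The beginning-of-line guard in the listing can be illustrated in isolation. This is a sketch under assumptions: `ToyBolRule` and `toy_try_rule` are hypothetical, and the flag is set by hand here rather than derived from the pattern. `StringScanner#beginning_of_line?` is part of the standard library.

```ruby
require 'strscan'

# Hypothetical rule record: the flag marks patterns that start with ^.
ToyBolRule = Struct.new(:re, :beginning_of_line)

# Skip line-anchored rules unless the scanner sits at the start of a line,
# so the outcome does not depend on how the scanner itself treats ^.
def toy_try_rule(rule, scanner)
  return false if rule.beginning_of_line && !scanner.beginning_of_line?

  !scanner.scan(rule.re).nil?
end

comment = ToyBolRule.new(/^#.*/, true)
scanner = StringScanner.new("x # no\n# yes")
scanner.scan(/x /)
mid_line = toy_try_rule(comment, scanner)   # rejected by the guard
scanner.scan(/# no\n/)
line_start = toy_try_rule(comment, scanner) # allowed: previous char is "\n"
```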

#scan(scanner, re, &b) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 289

def scan(scanner, re, &b)
  @null_steps ||= 0

  if @null_steps >= MAX_NULL_SCANS
    debug { "    too many scans without consuming the string!" }
    return false
  end

  scanner.scan(re)

  if scanner.matched?
    if scanner.matched_size == 0
      @null_steps += 1
    else
      @null_steps = 0
    end

    yield self
    return true
  end

  return false
end
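
The null-scan guard above can be exercised on its own with a pattern that matches the empty string. The sketch assumes hypothetical names (`ToyNullGuard`, `TOY_MAX_NULL_SCANS`) and follows the listing in resetting the counter only after a match that consumes input.

```ruby
require 'strscan'

TOY_MAX_NULL_SCANS = 5 # same value as MAX_NULL_SCANS in this class

# Hypothetical wrapper that counts consecutive zero-width matches and
# refuses to match once the limit is reached.
class ToyNullGuard
  def initialize
    @null_steps = 0
  end

  def scan(scanner, re)
    return false if @null_steps >= TOY_MAX_NULL_SCANS
    return false unless scanner.scan(re)

    if scanner.matched_size.zero?
      @null_steps += 1
    else
      @null_steps = 0 # a consuming match resets the counter
    end
    true
  end
end

guard = ToyNullGuard.new
scanner = StringScanner.new('abc')
# /x*/ matches the empty string at position 0 every time.
results = Array.new(7) { guard.scan(scanner, /x*/) }
```

Without such a limit, a rule that keeps matching zero characters would leave the scanner at the same position indefinitely.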

#stack ⇒ Object

The state stack. This is initially the single state `[:root]`. It is an error for this stack to be empty.

# File 'lib/rouge/regex_lexer.rb', line 173

def stack
  @stack ||= [get_state(:root)]
end

#state ⇒ Object

The current state, i.e. the one on top of the state stack.

NB: if the state stack is empty, this will raise an error rather than returning nil.



# File 'lib/rouge/regex_lexer.rb', line 181

def state
  stack.last or raise 'empty stack!'
end

#state?(state_name) ⇒ Boolean

Check if `state_name` is the state on top of the state stack.

Returns:

  • (Boolean)


# File 'lib/rouge/regex_lexer.rb', line 393

def state?(state_name)
  state_name.to_s == state.name
end

#step(state, stream, &b) ⇒ Object

Runs one step of the lex. Rules in the current state are tried until one matches, at which point its callback is called.

Returns:

  • true if a rule was tried successfully

  • false otherwise.



# File 'lib/rouge/regex_lexer.rb', line 241

def step(state, stream, &b)
  state.rules.each do |rule|
    return true if run_rule(rule, stream, &b)
  end

  false
end

#stream_tokens(stream, &b) ⇒ Object

This implements the lexer protocol, by yielding [token, value] pairs.

The process for lexing works as follows, until the stream is empty:

  1. We look at the state on top of the stack (which by default is `[:root]`).

  2. Each rule in that state is tried until one is successful. If one is found, that rule’s callback is evaluated, which may yield tokens and manipulate the state stack. Otherwise, one character is consumed with an `'Error'` token, and we continue at (1).



# File 'lib/rouge/regex_lexer.rb', line 207

def stream_tokens(stream, &b)
  stream_without_postprocessing(stream) do |tok, val|
    _, processor = self.class.postprocesses.find { |t, _| t === tok }

    if processor
      with_output_stream(b) do
        instance_exec(tok, val, &processor)
      end
    else
      yield tok, val
    end
  end
end
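
The two-step lexing process described for this method can be sketched as a small standalone loop. This is an illustration under assumptions, not the library's code: `ToyLoopLexer`, its `STATES` table, and the string token names are hypothetical, and rules carry a simple push/pop action instead of a callback block.

```ruby
require 'strscan'

# Hypothetical miniature lexer. Each rule is [regex, token, action], where
# action is a state name to push, :pop, or nil.
class ToyLoopLexer
  STATES = {
    'root' => [
      [/\d+/, 'Num', nil],
      [/\s+/, 'Text', nil],
      [/"/, 'Str', 'string']
    ],
    'string' => [
      [/[^"]+/, 'Str', nil],
      [/"/, 'Str', :pop]
    ]
  }.freeze

  def lex(text)
    scanner = StringScanner.new(text)
    stack = ['root']
    tokens = []

    until scanner.eos?
      matched = false
      # 1. Look at the state on top of the stack.
      STATES.fetch(stack.last).each do |re, tok, action|
        # 2. Try each rule in order; the first match wins.
        next unless scanner.scan(re)

        tokens << [tok, scanner.matched]
        stack.pop if action == :pop
        stack.push(action) if action.is_a?(String)
        matched = true
        break
      end

      # No rule matched: consume one character as an Error token.
      tokens << ['Error', scanner.getch] unless matched
    end

    tokens
  end
end

tokens = ToyLoopLexer.new.lex('12 "a" ?')
```

The error fallback guarantees progress: even when no rule applies, one character is consumed on each pass, so the loop reaches the end of the input.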

#stream_without_postprocessing(stream, &b) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 222

def stream_without_postprocessing(stream, &b)
  until stream.eos?
    debug { "lexer: #{self.class.tag}" }
    debug { "stack: #{stack.map(&:name).inspect}" }
    debug { "stream: #{stream.peek(20).inspect}" }
    success = step(get_state(state), stream, &b)

    if !success
      debug { "    no match, yielding Error" }
      b.call(Token['Error'], stream.getch)
    end
  end
end

#token(tok, val = :__absent__) ⇒ Object

Yield a token.

Parameters:

  • tok

    the token type

  • val (defaults to: :__absent__)

    (optional) the string value to yield. If absent, this defaults to the entire last match.



# File 'lib/rouge/regex_lexer.rb', line 320

def token(tok, val=:__absent__)
  val = @last_match[0] if val == :__absent__
  val ||= ''

  raise 'no output stream' unless @output_stream

  @output_stream << [Token[tok], val] unless val.empty?
end
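
The sentinel default lets the method tell "no value given" apart from an explicit nil. The following sketch shows that distinction in isolation; `ToyEmitter` is hypothetical and stores the last match as a plain string rather than a scanner.

```ruby
# Hypothetical emitter showing the sentinel default: an omitted value falls
# back to the last match, while an explicit nil or '' emits nothing.
class ToyEmitter
  attr_reader :output

  def initialize(last_match)
    @last_match = last_match
    @output = []
  end

  def token(tok, val = :__absent__)
    val = @last_match if val == :__absent__
    val ||= ''
    @output << [tok, val] unless val.empty?
  end
end

emitter = ToyEmitter.new('foo')
emitter.token('Name')      # uses the last match
emitter.token('Text', nil) # explicit nil: nothing is emitted
emitter.token('Op', '+')   # explicit value
```

Had nil been the default, a caller passing nil on purpose (for example, an optional capture group that did not participate in the match) would have emitted the whole match instead of nothing.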