Class: Dhaka::Tokenizer

Inherits:
Object
Defined in:
lib/tokenizer/tokenizer.rb

Overview

This class contains a DSL for specifying tokenizers. Subclass it to implement tokenizers for specific grammars. Subclasses of this class may not be further subclassed.

Tokenizers are state machines whose states and actions are specified largely by hand. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state, which is the state the tokenizer starts in.
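
As a rough illustration of the state-machine idea described above, the following self-contained sketch mimics the methods documented on this page. It is not the Dhaka implementation: MiniTokenizer, MiniState, IDLE (standing in for Dhaka::TOKENIZER_IDLE_STATE), the for_characters helper, and the use of plain strings as tokens are all assumptions made for the example.

```ruby
# Hypothetical stand-ins, NOT the Dhaka classes: just enough to run the sketch.
IDLE = :idle # plays the role of Dhaka::TOKENIZER_IDLE_STATE

class MiniState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Assumed helper: register one action block for several characters.
  def for_characters(chars, &blk)
    chars.each { |c| @actions[c] = blk }
  end
end

class MiniTokenizer
  attr_accessor :accumulator
  attr_reader :tokens

  # One state table per subclass, with states created on first access.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = MiniState.new }
  end

  def self.for_state(name, &blk)
    states[name].instance_eval(&blk)
  end

  def self.tokenize(input)
    new(input).run
  end

  def initialize(input)
    @input = input
    @current_state = self.class.states[IDLE]
    @curr_char_index = 0
    @tokens = []
  end

  # Returns nil once the index is past the end of the input.
  def curr_char
    @input[@curr_char_index]
  end

  def advance
    @curr_char_index += 1
  end

  def switch_to(name)
    @current_state = self.class.states[name]
  end

  def run
    while curr_char
      blk = @current_state.actions[curr_char]
      raise ArgumentError, "unexpected #{curr_char.inspect} at #{@curr_char_index}" unless blk
      instance_eval(&blk)
    end
    tokens
  end
end

DIGITS = ('0'..'9').to_a

class NumberTokenizer < MiniTokenizer
  for_state IDLE do
    for_characters([' ']) { advance }
    for_characters(DIGITS) do
      self.accumulator = ''
      switch_to :in_number # do not advance: the new state consumes the digit
    end
  end

  for_state :in_number do
    for_characters(DIGITS) do
      self.accumulator += curr_char
      advance
    end
    for_characters([' ']) do
      tokens << accumulator # plain strings stand in for Token objects
      switch_to IDLE
    end
  end
end

p NumberTokenizer.tokenize('12 345 ') # prints ["12", "345"]
```

The trailing space in the input matters in this sketch: a number is only pushed onto tokens when a space is seen in the :in_number state, so a number still being accumulated when the input ends would not be emitted.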

Constructor Details

#initialize(input) ⇒ Tokenizer

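Not part of the public API (marked :nodoc: in the source). Creates a tokenizer for input that starts in the idle state, at the first character, with no tokens.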



# File 'lib/tokenizer/tokenizer.rb', line 66

def initialize(input) #:nodoc:
  @input = input
  @current_state = self.class.states[TOKENIZER_IDLE_STATE]
  @curr_char_index = 0
  @tokens = []
end

Instance Attribute Details

#accumulator ⇒ Object

A slot that can be used to accumulate characters when processing multi-character tokens.



# File 'lib/tokenizer/tokenizer.rb', line 62

def accumulator
  @accumulator
end

#tokens ⇒ Object (readonly)

The tokens shifted so far.



# File 'lib/tokenizer/tokenizer.rb', line 64

def tokens
  @tokens
end

Class Method Details

.for_state(state_name, &blk) ⇒ Object

Define the action for the state named state_name.



# File 'lib/tokenizer/tokenizer.rb', line 52

def self.for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
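
Because for_state passes its block to instance_eval, the block runs with the state object as self, so bare method calls inside it are sent to that state. The sketch below isolates that mechanism. SketchState and its on method are hypothetical names invented for the example; this page does not show what methods a Dhaka state object actually offers.

```ruby
# Hypothetical state object; only the instance_eval mechanics matter here.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Assumed registration helper, not confirmed by this page.
  def on(char, &blk)
    @actions[char] = blk
  end
end

state = SketchState.new

# Inside the block, self is `state`, so the bare calls to `on`
# are dispatched to SketchState#on.
state.instance_eval do
  on('+') { :plus }
  on('-') { :minus }
end

p state.actions.keys      # prints ["+", "-"]
p state.actions['+'].call # prints :plus
```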

.tokenize(input) ⇒ Object

Tokenizes the string input and returns an array of Token objects.



# File 'lib/tokenizer/tokenizer.rb', line 57

def self.tokenize(input)
  self.new(input).run
end

Instance Method Details

#advance ⇒ Object

Advance to the next character.



# File 'lib/tokenizer/tokenizer.rb', line 79

def advance
  @curr_char_index += 1
end

#curr_char ⇒ Object

The character currently being processed.



# File 'lib/tokenizer/tokenizer.rb', line 74

def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr
end
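
The `and` guard in curr_char means the method returns nil once the index runs past the end of the input, which is the condition that ends the `while curr_char` loop in run. The standalone sketch below reproduces the expression on a plain string; the names input and char_at are made up for the example. On Ruby 1.8, indexing a string with an integer returned a character code, which is presumably why the source calls chr; on Ruby 1.9 and later the index already yields a one-character string and chr leaves it unchanged.

```ruby
input = 'ab'

# Same expression as curr_char, applied to a local string (hypothetical helper).
char_at = lambda { |index| input[index] and input[index].chr }

p char_at.call(0) # prints "a"
p char_at.call(1) # prints "b"
p char_at.call(2) # prints nil, since index 2 is past the end of "ab"
```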

#run ⇒ Object

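Not part of the public API (marked :nodoc: in the source). Repeatedly looks up the action that the current state registers for the current character and evaluates it in the context of the tokenizer, until the input is exhausted, then returns the tokens. Raises UnrecognizedInputCharacterException if the current state has no action for the current character.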



# File 'lib/tokenizer/tokenizer.rb', line 88

def run #:nodoc:
  while curr_char
    blk = @current_state.actions[curr_char]
    raise UnrecognizedInputCharacterException.new(@input, @curr_char_index) unless blk
    instance_eval(&blk)
  end
  tokens
end
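
The lookup-then-raise pattern in run can be sketched on its own as follows. This is a simplified stand-in rather than Dhaka code: UnknownCharError, scan, and ACTIONS are hypothetical names, the table maps characters to symbols instead of blocks, and the loop advances by itself. In run as shown above, nothing advances the index except the actions, so an action that neither calls advance nor switches to a state that eventually does would keep the loop from terminating.

```ruby
# Hypothetical error type standing in for UnrecognizedInputCharacterException.
class UnknownCharError < StandardError
  attr_reader :input, :index

  def initialize(input, index)
    @input = input
    @index = index
    super("unrecognized character #{input[index].inspect} at index #{index}")
  end
end

# Walks `input`, looking up each character in `actions` (char => symbol),
# and raises as soon as a character has no entry.
def scan(input, actions)
  index = 0
  seen = []
  while (char = input[index])
    kind = actions[char]
    raise UnknownCharError.new(input, index) unless kind
    seen << kind
    index += 1
  end
  seen
end

ACTIONS = { '+' => :plus, '-' => :minus }

p scan('+-+', ACTIONS) # prints [:plus, :minus, :plus]

begin
  scan('+?', ACTIONS)
rescue UnknownCharError => e
  puts e.message # prints: unrecognized character "?" at index 1
end
```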

#switch_to(state_name) ⇒ Object

Change the active state of the tokenizer to the state identified by the symbol state_name.



# File 'lib/tokenizer/tokenizer.rb', line 84

def switch_to state_name
  @current_state = self.class.states[state_name]
end