Class: Dhaka::Tokenizer

Inherits: Object

Defined in: lib/tokenizer/tokenizer.rb

Overview

This class contains a DSL for specifying tokenizers. Subclass it to implement tokenizers for specific grammars. Subclasses of this class may not be further subclassed.

Tokenizers are state machines that are specified by hand. Each state of a tokenizer is identified by a Ruby symbol and maps input characters to actions. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer, which is the state it starts in.
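The state-machine pattern described here can be sketched without the library. The block below is a self-contained stand-in, not Dhaka itself: MiniTokenizer, State, IDLE, and in particular the for_characters helper are hypothetical names chosen for illustration, and tokens are plain strings rather than Token objects. The method bodies follow the source listings on this page.

```ruby
# Stand-in for the pattern only. MiniTokenizer, State, IDLE and
# for_characters are hypothetical names, not part of Dhaka's API.
IDLE = :idle # plays the role of Dhaka::TOKENIZER_IDLE_STATE

class State
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Register the same action block for each of the given characters.
  def for_characters(chars, &blk)
    chars.each { |c| @actions[c] = blk }
  end
end

class MiniTokenizer
  attr_accessor :accumulator
  attr_reader :tokens

  # One State per symbol, created on first use and stored per class.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = State.new }
  end

  def self.for_state(name, &blk)
    states[name].instance_eval(&blk)
  end

  def self.tokenize(input)
    new(input).run
  end

  def initialize(input)
    @input = input
    @current_state = self.class.states[IDLE]
    @index = 0
    @tokens = []
  end

  def curr_char
    @input[@index]
  end

  def advance
    @index += 1
  end

  def switch_to(name)
    @current_state = self.class.states[name]
  end

  def run
    while curr_char
      blk = @current_state.actions[curr_char]
      raise ArgumentError, "unexpected #{curr_char.inspect} at #{@index}" unless blk
      instance_eval(&blk)
    end
    tokens
  end
end

# A tokenizer for space-separated integers, specified state by state.
class IntegerTokenizer < MiniTokenizer
  digits = ('0'..'9').to_a

  for_state IDLE do
    for_characters([' ']) { advance }
    for_characters(digits) do
      self.accumulator = String.new
      switch_to :integer
    end
  end

  for_state :integer do
    for_characters(digits) do
      accumulator << curr_char
      advance
    end
    for_characters([' ']) do
      tokens << accumulator
      switch_to IDLE
    end
  end
end

p IntegerTokenizer.tokenize("12 345 ") # => ["12", "345"]
```

Note the trailing space in the input. In this sketch a token is only shifted when a delimiter is seen, so a number at the very end of the input would stay in the accumulator when the loop stops.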

Instance Attribute Summary

#accumulator ⇒ Object

A slot that can be used to accumulate characters when processing multi-character tokens.

#tokens ⇒ Object (readonly)

The tokens shifted so far.

Class Method Summary

.for_state(state_name, &blk) ⇒ Object

Define the action for the state named state_name.

.tokenize(input) ⇒ Object

Tokenizes the string input and returns an array of Token objects.

Instance Method Summary

#advance ⇒ Object

Advance to the next character.

#curr_char ⇒ Object

The character currently being processed.

#initialize(input) ⇒ Tokenizer constructor

Creates a tokenizer for input, starting in the idle state.

#run ⇒ Object

Runs the state machine over the input and returns the tokens.

#switch_to(state_name) ⇒ Object

Change the active state of the tokenizer to the state identified by the symbol state_name.

Constructor Details

#initialize(input) ⇒ `Tokenizer`

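Creates a tokenizer for input. The tokenizer starts in the idle state, positioned at the first character, with an empty token list.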

# File 'lib/tokenizer/tokenizer.rb', line 66

def initialize(input) #:nodoc:
  @input = input
  @current_state = self.class.states[TOKENIZER_IDLE_STATE]
  @curr_char_index = 0
  @tokens = []
end

Instance Attribute Details

#accumulator ⇒ `Object`

A slot that can be used to accumulate characters when processing multi-character tokens.

# File 'lib/tokenizer/tokenizer.rb', line 62

def accumulator
  @accumulator
end

#tokens ⇒ `Object` (readonly)

The tokens shifted so far.

# File 'lib/tokenizer/tokenizer.rb', line 64

def tokens
  @tokens
end

Class Method Details

.for_state(state_name, &blk) ⇒ `Object`

Define the action for the state named state_name.

# File 'lib/tokenizer/tokenizer.rb', line 52

def self.for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
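Because for_state passes its block to instance_eval on the state object, bare method calls inside the block are sent to that state rather than to the tokenizer class. A minimal sketch of that mechanism follows; SketchState and its on method are hypothetical stand-ins, not Dhaka's actual state class.

```ruby
# Hypothetical stand-in for a tokenizer state; not Dhaka's own class.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Hypothetical DSL method: store an action block under one character.
  def on(char, &blk)
    @actions[char] = blk
  end
end

state = SketchState.new

# Inside the block, self is the state, so the bare call to `on` reaches it.
state.instance_eval do
  on('+') { :plus }
end

p state.actions.keys      # => ["+"]
p state.actions['+'].call # => :plus
```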

.tokenize(input) ⇒ `Object`

Tokenizes the string input and returns an array of Token objects.

# File 'lib/tokenizer/tokenizer.rb', line 57

def self.tokenize(input)
  self.new(input).run
end

Instance Method Details

#advance ⇒ `Object`

Advance to the next character.

# File 'lib/tokenizer/tokenizer.rb', line 79

def advance
  @curr_char_index += 1
end

#curr_char ⇒ `Object`

The character currently being processed.

# File 'lib/tokenizer/tokenizer.rb', line 74

def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr
end
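The guard before .chr matters at the end of the input, where indexing returns nil. On Ruby 1.8, indexing a string with an integer returned a character code, which .chr turned into a one-character string; on Ruby 1.9 and later the index already yields a one-character string and .chr leaves it unchanged. The sketch below reproduces the expression as a free-standing helper (char_at is a hypothetical name) and shows the results on a current Ruby.

```ruby
# Same expression as curr_char, as a free-standing helper.
# char_at is a hypothetical name used only in this sketch.
def char_at(input, index)
  input[index] and input[index].chr
end

p char_at("ab", 0) # => "a"
p char_at("ab", 1) # => "b"
p char_at("ab", 2) # => nil (past the end, so the tokenizer loop stops)
```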

#run ⇒ `Object`

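Runs the tokenizer. While input remains, looks up the action that the current state defines for the current character and evaluates it in the context of the tokenizer. Raises UnrecognizedInputCharacterException if the current state defines no action for the character. Returns the tokens.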

# File 'lib/tokenizer/tokenizer.rb', line 88

def run #:nodoc:
  while curr_char
    blk = @current_state.actions[curr_char]
    raise UnrecognizedInputCharacterException.new(@input, @curr_char_index) unless blk
    instance_eval(&blk)
  end
  tokens
end
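The loop in run only ends when curr_char becomes nil, and only advance moves the position, so every character must eventually reach an action that calls advance. The sketch below illustrates that property with a simplified loop; run_with_cap, the lambda-based actions, and the step cap are all hypothetical and exist only so the non-terminating case can be shown safely.

```ruby
# Simplified dispatch loop. run_with_cap and max_steps are hypothetical;
# each action receives the current index and returns the next one.
def run_with_cap(input, actions, max_steps)
  index = 0
  steps = 0
  while index < input.length
    steps += 1
    return :step_cap_reached if steps > max_steps
    action = actions[input[index]]
    raise ArgumentError, "no action for #{input[index].inspect}" unless action
    index = action.call(index)
  end
  :finished
end

advancing = { 'a' => ->(i) { i + 1 } } # consumes the character
stuck     = { 'a' => ->(i) { i } }     # never moves forward

p run_with_cap("aaa", advancing, 10) # => :finished
p run_with_cap("aaa", stuck, 10)     # => :step_cap_reached
```

Without a cap, the second case would loop forever, which is the behavior to expect from a state whose action neither advances nor switches to a state that does.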

#switch_to(state_name) ⇒ `Object`

Change the active state of the tokenizer to the state identified by the symbol state_name.

# File 'lib/tokenizer/tokenizer.rb', line 84

def switch_to state_name
  @current_state = self.class.states[state_name]
end
