Class: UV::AbstractTokenizer

Inherits:
Object
  • Object
show all
Defined in:
lib/uv-rays/abstract_tokenizer.rb

Overview

AbstractTokenizer is similar to BufferedTokernizer however should only be used when there is no delimiter to work with. It uses a callback based system for application level tokenization without the heavy lifting.

Constant Summary collapse

DEFAULT_ENCODING =
'ASCII-8BIT'

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ AbstractTokenizer

Returns a new instance of AbstractTokenizer.

Raises:

  • (ArgumentError)


16
17
18
19
20
21
22
23
24
25
26
27
28
29
# File 'lib/uv-rays/abstract_tokenizer.rb', line 16

def initialize(options)
    @callback  = options[:callback]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose  = options[:verbose] if @size_limit
    @encoding   = options[:encoding] || DEFAULT_ENCODING

    raise ArgumentError, 'no callback provided' unless @callback

    reset
    if @indicator.is_a?(String)
        @indicator = String.new(@indicator).force_encoding(@encoding).freeze
    end
end

Instance Attribute Details

#callbackObject

Returns the value of attribute callback.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def callback
  @callback
end

#indicatorObject

Returns the value of attribute indicator.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def indicator
  @indicator
end

#size_limitObject

Returns the value of attribute size_limit.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def size_limit
  @size_limit
end

#verboseObject

Returns the value of attribute verbose.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def verbose
  @verbose
end

Instance Method Details

#bytesizeInteger



109
110
111
# File 'lib/uv-rays/abstract_tokenizer.rb', line 109

def bytesize
    @input.bytesize
end

#empty?Boolean



104
105
106
# File 'lib/uv-rays/abstract_tokenizer.rb', line 104

def empty?
    @input.empty?
end

#extract(data) ⇒ Object

Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator

Examples:


tokenizer.extract(data).
    map { |entity| Decode(entity) }.each { ... }


40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
# File 'lib/uv-rays/abstract_tokenizer.rb', line 40

def extract(data)
    data.force_encoding(@encoding)
    @input << data

    entities = []

    loop do
        found = false

        last = if @indicator
            check = @input.partition(@indicator)
            break unless check[1].length > 0

            check[2]
        else
            @input
        end

        result = @callback.call(last)

        if result
            found = true

            # Check for multi-byte indicator edge case
            case result
            when Integer
                entities << last[0...result]
                @input = last[result..-1]
            else
                entities << last
                reset
            end
        end

        break if not found
    end

    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
    if @size_limit && @input.size > @size_limit
        if @indicator.respond_to?(:length) # check for regex
            # save enough of the buffer that if one character of the indicator were
            # missing we would match on next extract (very much an edge case) and
            # best we can do with a full buffer.
            @input = @input[-(@indicator.length - 1)..-1]
        else
            reset
        end
        raise 'input buffer exceeded limit' if @verbose
    end

    return entities
end

#flushString

Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.



97
98
99
100
101
# File 'lib/uv-rays/abstract_tokenizer.rb', line 97

def flush
    buffer = @input
    reset
    buffer
end