Class: UV::AbstractTokenizer

Inherits:
Object
  • Object
show all
Defined in:
lib/uv-rays/abstract_tokenizer.rb

Overview

AbstractTokenizer is similar to BufferedTokernizer however should only be used when there is no delimiter to work with. It uses a callback based system for application level tokenization without the heavy lifting.

Constant Summary collapse

DEFAULT_ENCODING =
'ASCII-8BIT'.freeze

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ AbstractTokenizer

Returns a new instance of AbstractTokenizer.

Parameters:

  • options (Hash)

Raises:

  • (ArgumentError)


14
15
16
17
18
19
20
21
22
23
24
25
26
27
# File 'lib/uv-rays/abstract_tokenizer.rb', line 14

def initialize(options)
    @callback  = options[:callback]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose  = options[:verbose] if @size_limit
    @encoding   = options[:encoding] || DEFAULT_ENCODING

    raise ArgumentError, 'no indicator provided' unless @indicator
    raise ArgumentError, 'no callback provided' unless @callback

    @input = ''
    @input.force_encoding(@encoding)
    @indicator.force_encoding(@encoding) if @indicator.is_a?(String)
end

Instance Attribute Details

#callbackObject

Returns the value of attribute callback.



11
12
13
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def callback
  @callback
end

#indicatorObject

Returns the value of attribute indicator.



11
12
13
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def indicator
  @indicator
end

#size_limitObject

Returns the value of attribute size_limit.



11
12
13
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def size_limit
  @size_limit
end

#verboseObject

Returns the value of attribute verbose.



11
12
13
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def verbose
  @verbose
end

Instance Method Details

#empty?Boolean

Returns:

  • (Boolean)


95
96
97
# File 'lib/uv-rays/abstract_tokenizer.rb', line 95

def empty?
    @input.empty?
end

#extract(data) ⇒ Object

Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator

Examples:


tokenizer.extract(data).
    map { |entity| Decode(entity) }.each { ... }

Parameters:

  • data (String)


38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
# File 'lib/uv-rays/abstract_tokenizer.rb', line 38

def extract(data)
    data.force_encoding(@encoding)
    @input << data

    entities = []

    loop do
        found = false

        check = @input.partition(@indicator)
        break unless check[1].length > 0

        last = check[2]
        result = @callback.call(last)
        if result
            found = true

            # Check for multi-byte indicator edge case
            if result.is_a? Fixnum
                entities << last[0...result]
                @input = last[result..-1]
            else
                entities << last
                reset
            end
        end

        break if not found
    end

    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
    if @size_limit && @input.size > @size_limit
        if @indicator.respond_to?(:length) # check for regex
            # save enough of the buffer that if one character of the indicator were
            # missing we would match on next extract (very much an edge case) and
            # best we can do with a full buffer.
            @input = @input[-(@indicator.length - 1)..-1]
        else
            reset
        end
        raise 'input buffer exceeded limit' if @verbose
    end

    return entities
end

#flushString

Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.

Returns:

  • (String)


88
89
90
91
92
# File 'lib/uv-rays/abstract_tokenizer.rb', line 88

def flush
    buffer = @input
    reset
    buffer
end