Class: UV::AbstractTokenizer

Inherits:
Object
Defined in:
lib/uv-rays/abstract_tokenizer.rb

Overview

AbstractTokenizer is similar to BufferedTokenizer, but it should only be used when there is no delimiter to work with, only a message start indicator. It uses a callback-based system: the application decides where each message ends, and the tokenizer handles the buffering.
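
The callback contract can be sketched as follows, inferred from the #extract listing on this page: the callback receives the text that follows an indicator and returns the length of the complete message, or a falsy value when more data is needed. The length-prefixed framing and the name LENGTH_PREFIXED are assumptions made purely for illustration; they are not part of the library.

```ruby
# Hypothetical framing: the first byte after the indicator holds the payload
# length. The lambda returns the full message length (length byte plus
# payload) once enough bytes are buffered, or false to request more data.
LENGTH_PREFIXED = lambda do |segment|
  return false if segment.empty?

  needed = segment.getbyte(0) + 1
  segment.bytesize >= needed ? needed : false
end

complete   = "\x03abc".b # length byte 3 followed by three payload bytes
incomplete = "\x03ab".b  # one payload byte still missing

puts LENGTH_PREFIXED.call(complete).inspect   # 4
puts LENGTH_PREFIXED.call(incomplete).inspect # false
```

A lambda like this would presumably be passed as the :callback option, alongside an :indicator that marks where each message starts.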

Constant Summary

DEFAULT_ENCODING = 'ASCII-8BIT'.freeze

Constructor Details

#initialize(options) ⇒ AbstractTokenizer

Returns a new instance of AbstractTokenizer.

Parameters:

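  • options (Hash): :indicator (a String or Regexp marking the start of a message) and :callback (an object responding to #call) are required; :size_limit, :verbose and :encoding are optional, and :verbose is only honoured when :size_limit is set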

Raises:

  • (ArgumentError) if :indicator or :callback is missing


# File 'lib/uv-rays/abstract_tokenizer.rb', line 14

def initialize(options)
    @callback   = options[:callback]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose    = options[:verbose] if @size_limit
    @encoding   = options[:encoding] || DEFAULT_ENCODING

    raise ArgumentError, 'no indicator provided' unless @indicator
    raise ArgumentError, 'no callback provided' unless @callback

    @input = ''
    @input.force_encoding(@encoding)
    @indicator.force_encoding(@encoding) if @indicator.is_a?(String)
end
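
The option handling above has two consequences that are easy to miss: :verbose is silently dropped unless :size_limit is also given, and a missing :indicator is reported before a missing :callback. The sketch below shows both using a stand-in class, OptionSketch, which is a hypothetical name that copies only the option handling from the listing. It is not the library class and does no tokenizing.

```ruby
# Stand-in that mirrors only the constructor's option handling.
class OptionSketch
  DEFAULT_ENCODING = 'ASCII-8BIT'.freeze

  attr_reader :size_limit, :verbose, :encoding

  def initialize(options)
    @callback   = options[:callback]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose    = options[:verbose] if @size_limit
    @encoding   = options[:encoding] || DEFAULT_ENCODING

    raise ArgumentError, 'no indicator provided' unless @indicator
    raise ArgumentError, 'no callback provided' unless @callback
  end
end

callback = ->(segment) { segment.length } # placeholder callback

# :verbose is ignored here because no :size_limit accompanies it.
quiet = OptionSketch.new(indicator: "\x02", callback: callback, verbose: true)
p quiet.verbose # nil

# With :size_limit present, :verbose is kept.
loud = OptionSketch.new(indicator: "\x02", callback: callback,
                        size_limit: 1024, verbose: true)
p loud.verbose # true

# The indicator check runs first, so its message wins when both are missing.
begin
  OptionSketch.new({})
rescue ArgumentError => e
  puts e.message # no indicator provided
end
```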

Instance Attribute Details

#callback ⇒ Object

Returns the value of attribute callback.



# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def callback
  @callback
end

#indicator ⇒ Object

Returns the value of attribute indicator.



# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def indicator
  @indicator
end

#size_limit ⇒ Object

Returns the value of attribute size_limit.



# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def size_limit
  @size_limit
end

#verbose ⇒ Object

Returns the value of attribute verbose.



# File 'lib/uv-rays/abstract_tokenizer.rb', line 11

def verbose
  @verbose
end

Instance Method Details

#empty? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/uv-rays/abstract_tokenizer.rb', line 112

def empty?
    @input.empty?
end

#extract(data) ⇒ Object

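Extract takes an arbitrary string of input data and returns an array of tokenized entities, using a message start indicator to locate where each entity begins. Data that does not yet form a complete entity is kept in the input buffer for the next call.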

Examples:


# Decode stands for an application-defined conversion method
tokenizer.extract(data)
         .map { |entity| Decode(entity) }
         .each { |message| puts message.inspect }

Parameters:

  • data (String)


# File 'lib/uv-rays/abstract_tokenizer.rb', line 38

def extract(data)
    data.force_encoding(@encoding)
    @input << data

    messages = @input.split(@indicator, -1)
    if messages.length > 1
        messages.shift      # the first item will always be junk
        last = messages.pop # the last item may require buffering

        entities = []
        messages.each do |msg|
            result = @callback.call(msg)
            entities << msg[0...result] if result
        end

        # Check if buffering is required
        result = @callback.call(last)
        if result
            # Check for multi-byte indicator edge case
            if result.is_a? Integer
                entities << last[0...result]
                @input = last[result..-1]
            else
                reset
                entities << last
            end
        else
            # This will work with a regex
            index = if messages.last.nil?
                0
            else
                # Possible that rindex will not find a match
                check = @input[0...-last.length].rindex(messages.last)
                if check.nil?
                    0
                else
                    check + messages.last.length
                end
            end
            indicator_val = @input[index...-last.length]
            @input = indicator_val + last
        end
    else
        @input = messages.pop
        entities = messages
    end

    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
    if @size_limit && @input.size > @size_limit
        if @indicator.respond_to?(:length) # check for regex
            # save enough of the buffer that if one character of the indicator were
            # missing we would match on next extract (very much an edge case) and
            # best we can do with a full buffer.
            @input = @input[-(@indicator.length - 1)..-1]
        else
            reset
        end
        raise 'input buffer exceeded limit' if @verbose
    end

    return entities
end
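
Two steps of the listing above are easier to follow with concrete values: the `split(@indicator, -1)` call that separates junk, complete candidates and the trailing fragment, and the tail that is kept when the size limit is exceeded. The sketch below uses only core String methods. The sample bytes and constant names are hypothetical.

```ruby
INDICATOR = "\x02".b # hypothetical one-byte start marker (STX)

# A limit of -1 keeps trailing empty strings, so the last element is always
# the text after the final indicator, even when that text is empty.
MID_MESSAGE = "junk\x02first\x02sec".b.split(INDICATOR, -1)
ON_BOUNDARY = "junk\x02first\x02".b.split(INDICATOR, -1)

p MID_MESSAGE # ["junk", "first", "sec"]
p ON_BOUNDARY # ["junk", "first", ""]

# Size-limit fallback for a hypothetical two-byte indicator: keep the last
# (length - 1) bytes so an indicator split across two reads can still match.
WIDE_INDICATOR = "\xAA\xBB".b
OVERFLOWED     = "payload\xAA".b
KEPT_TAIL      = OVERFLOWED[-(WIDE_INDICATOR.length - 1)..-1]

p KEPT_TAIL == "\xAA".b # true
```

In both split results the first element is the junk that #extract discards with `shift`, and the last is the fragment it takes with `pop` and offers to the callback. Note that for a one-byte indicator the fallback range evaluates to `0..-1`, which selects the whole buffer, so the truncation appears to have no effect in that case.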

#flush ⇒ String

Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.

Returns:

  • (String)


# File 'lib/uv-rays/abstract_tokenizer.rb', line 105

def flush
    buffer = @input
    reset
    buffer
end
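
A minimal sketch of the flush behaviour, using a stand-in class with the hypothetical name BufferSketch. The `reset` helper is called by the listing but not shown on this page, so its body here is an assumption: it is taken to swap in a new empty string rather than clear the existing one.

```ruby
# Stand-in buffer that mirrors #flush and #empty? from the listings above.
class BufferSketch
  def initialize
    @input = String.new
  end

  # Append raw data to the buffer (simplified stand-in for #extract).
  def <<(data)
    @input << data
    self
  end

  def empty?
    @input.empty?
  end

  def flush
    buffer = @input
    reset
    buffer
  end

  private

  # Assumed behaviour: replace the buffer with a fresh empty string.
  def reset
    @input = String.new
  end
end

buffer = BufferSketch.new
buffer << 'partial'
p buffer.flush  # "partial"
p buffer.empty? # true
```

Under this assumption #flush hands back the old string object untouched. If `reset` cleared the string in place instead, the returned buffer would be emptied along with it.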