Class: UV::AbstractTokenizer

Inherits:
Object
  • Object
show all
Defined in:
lib/uv-rays/abstract_tokenizer.rb

Overview

AbstractTokenizer is similar to BufferedTokernizer however should only be used when there is no delimiter to work with. It uses a callback based system for application level tokenization without the heavy lifting.

Constant Summary collapse

DEFAULT_ENCODING =
'ASCII-8BIT'

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ AbstractTokenizer

Returns a new instance of AbstractTokenizer.

Parameters:

  • options (Hash)

Raises:

  • (ArgumentError)


16
17
18
19
20
21
22
23
24
25
26
27
28
29
# File 'lib/uv-rays/abstract_tokenizer.rb', line 16

def initialize(options)
    @callback  = options[:callback]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose  = options[:verbose] if @size_limit
    @encoding   = options[:encoding] || DEFAULT_ENCODING

    raise ArgumentError, 'no callback provided' unless @callback

    reset
    if @indicator.is_a?(String)
        @indicator = String.new(@indicator).force_encoding(@encoding).freeze
    end
end

Instance Attribute Details

#callbackObject

Returns the value of attribute callback.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def callback
  @callback
end

#indicatorObject

Returns the value of attribute indicator.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def indicator
  @indicator
end

#size_limitObject

Returns the value of attribute size_limit.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def size_limit
  @size_limit
end

#verboseObject

Returns the value of attribute verbose.



13
14
15
# File 'lib/uv-rays/abstract_tokenizer.rb', line 13

def verbose
  @verbose
end

Instance Method Details

#bytesizeInteger

Returns:

  • (Integer)


109
110
111
# File 'lib/uv-rays/abstract_tokenizer.rb', line 109

def bytesize
    @input.bytesize
end

#empty?Boolean

Returns:

  • (Boolean)


104
105
106
# File 'lib/uv-rays/abstract_tokenizer.rb', line 104

def empty?
    @input.empty?
end

#extract(data) ⇒ Object

Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator

Examples:


tokenizer.extract(data).
    map { |entity| Decode(entity) }.each { ... }

Parameters:

  • data (String)


40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
# File 'lib/uv-rays/abstract_tokenizer.rb', line 40

def extract(data)
    data.force_encoding(@encoding)
    @input << data

    entities = []

    loop do
        found = false

        last = if @indicator
            check = @input.partition(@indicator)
            break unless check[1].length > 0

            check[2]
        else
            @input
        end

        result = @callback.call(last)

        if result
            found = true

            # Check for multi-byte indicator edge case
            case result
            when Integer
                entities << last[0...result]
                @input = last[result..-1]
            else
                entities << last
                reset
            end
        end

        break if not found
    end

    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
    if @size_limit && @input.size > @size_limit
        if @indicator.respond_to?(:length) # check for regex
            # save enough of the buffer that if one character of the indicator were
            # missing we would match on next extract (very much an edge case) and
            # best we can do with a full buffer.
            @input = @input[-(@indicator.length - 1)..-1]
        else
            reset
        end
        raise 'input buffer exceeded limit' if @verbose
    end

    return entities
end

#flushString

Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.

Returns:

  • (String)


97
98
99
100
101
# File 'lib/uv-rays/abstract_tokenizer.rb', line 97

def flush
    buffer = @input
    reset
    buffer
end