Class: UV::AbstractTokenizer
- Inherits:
-
Object
- Object
- UV::AbstractTokenizer
- Defined in:
- lib/uv-rays/abstract_tokenizer.rb
Overview
AbstractTokenizer is similar to BufferedTokernizer however should only be used when there is no delimiter to work with. It uses a callback based system for application level tokenization without the heavy lifting.
Constant Summary collapse
- DEFAULT_ENCODING =
'ASCII-8BIT'.freeze
Instance Attribute Summary collapse
-
#callback ⇒ Object
Returns the value of attribute callback.
-
#indicator ⇒ Object
Returns the value of attribute indicator.
-
#size_limit ⇒ Object
Returns the value of attribute size_limit.
-
#verbose ⇒ Object
Returns the value of attribute verbose.
Instance Method Summary collapse
- #empty? ⇒ Boolean
-
#extract(data) ⇒ Object
Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator.
-
#flush ⇒ String
Flush the contents of the input buffer, i.e.
-
#initialize(options) ⇒ AbstractTokenizer
constructor
A new instance of AbstractTokenizer.
Constructor Details
#initialize(options) ⇒ AbstractTokenizer
Returns a new instance of AbstractTokenizer.
14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 14 def initialize() @callback = [:callback] @indicator = [:indicator] @size_limit = [:size_limit] @verbose = [:verbose] if @size_limit @encoding = [:encoding] || DEFAULT_ENCODING raise ArgumentError, 'no indicator provided' unless @indicator raise ArgumentError, 'no callback provided' unless @callback @input = '' @input.force_encoding(@encoding) @indicator.force_encoding(@encoding) if @indicator.is_a?(String) end |
Instance Attribute Details
#callback ⇒ Object
Returns the value of attribute callback.
11 12 13 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11 def callback @callback end |
#indicator ⇒ Object
Returns the value of attribute indicator.
11 12 13 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11 def indicator @indicator end |
#size_limit ⇒ Object
Returns the value of attribute size_limit.
11 12 13 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11 def size_limit @size_limit end |
#verbose ⇒ Object
Returns the value of attribute verbose.
11 12 13 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 11 def verbose @verbose end |
Instance Method Details
#empty? ⇒ Boolean
112 113 114 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 112 def empty? @input.empty? end |
#extract(data) ⇒ Object
Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator
38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 38 def extract(data) data.force_encoding(@encoding) @input << data = @input.split(@indicator, -1) if .length > 1 .shift # the first item will always be junk last = .pop # the last item may require buffering entities = [] .each do |msg| result = @callback.call(msg) entities << msg[0...result] if result end # Check if buffering is required result = @callback.call(last) if result # Check for multi-byte indicator edge case if result.is_a? Fixnum entities << last[0...result] @input = last[result..-1] else reset entities << last end else # This will work with a regex index = if .last.nil? 0 else # Possible that rindex will not find a match check = @input[0...-last.length].rindex(.last) if check.nil? 0 else check + .last.length end end indicator_val = @input[index...-last.length] @input = indicator_val + last end else @input = .pop entities = end # Check to see if the buffer has exceeded capacity, if we're imposing a limit if @size_limit && @input.size > @size_limit if @indicator.respond_to?(:length) # check for regex # save enough of the buffer that if one character of the indicator were # missing we would match on next extract (very much an edge case) and # best we can do with a full buffer. @input = @input[-(@indicator.length - 1)..-1] else reset end raise 'input buffer exceeded limit' if @verbose end return entities end |
#flush ⇒ String
Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.
105 106 107 108 109 |
# File 'lib/uv-rays/abstract_tokenizer.rb', line 105 def flush buffer = @input reset buffer end |