Class: UV::AbstractTokenizer
- Inherits: Object
  - Object
    - UV::AbstractTokenizer
- Defined in: lib/uv-rays/abstract_tokenizer.rb
Overview
AbstractTokenizer is similar to BufferedTokenizer, but should only be used when there is no delimiter to work with. It uses a callback-based system: the application decides where each token ends, and the tokenizer does the heavy lifting of buffering the input.
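As a rough illustration of the callback contract, going by the extract source on this page: an Integer return value means "the first N bytes are a token", any other truthy value means "the whole buffer is a token", and a falsy value means "wait for more data". The length-prefixed framing and the name `length_prefixed` below are assumptions made for this sketch, not part of uv-rays.

```ruby
# Hypothetical framing: one length byte followed by that many payload bytes.
length_prefixed = lambda do |buffer|
  return false if buffer.empty?        # nothing to inspect yet
  needed = buffer.getbyte(0) + 1       # length byte plus payload bytes
  buffer.bytesize >= needed ? needed : false
end

# A complete message followed by the start of another one.
complete = length_prefixed.call("\x03abc\x01".b)
# A message whose payload has not fully arrived.
partial = length_prefixed.call("\x03ab".b)
```

Here `complete` is the number of bytes the tokenizer should consume, while `partial` is false, so the data would stay buffered until the next call.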
Constant Summary
- DEFAULT_ENCODING = 'ASCII-8BIT'
Instance Attribute Summary
- #callback ⇒ Object
  Returns the value of attribute callback.
- #indicator ⇒ Object
  Returns the value of attribute indicator.
- #size_limit ⇒ Object
  Returns the value of attribute size_limit.
- #verbose ⇒ Object
  Returns the value of attribute verbose.
Instance Method Summary
- #bytesize ⇒ Integer
- #empty? ⇒ Boolean
- #extract(data) ⇒ Object
  Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator.
- #flush ⇒ String
  Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.
- #initialize(options) ⇒ AbstractTokenizer (constructor)
  Returns a new instance of AbstractTokenizer.
Constructor Details
#initialize(options) ⇒ AbstractTokenizer
Returns a new instance of AbstractTokenizer.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 16
    def initialize(options)
      @callback = options[:callback]
      @indicator = options[:indicator]
      @size_limit = options[:size_limit]
      @verbose = options[:verbose] if @size_limit
      @encoding = options[:encoding] || DEFAULT_ENCODING

      raise ArgumentError, 'no callback provided' unless @callback

      reset
      if @indicator.is_a?(String)
        @indicator = String.new(@indicator).force_encoding(@encoding).freeze
      end
    end
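The option handling in the constructor can be sketched in isolation. The snippet below replays the same steps on a plain Hash so the effects are visible without the gem; the option values and local variable names are made up for illustration.

```ruby
# Local stand-in for the DEFAULT_ENCODING constant documented above.
default_encoding = 'ASCII-8BIT'

# Hypothetical options; note that :verbose is given without :size_limit.
options = {
  indicator: "\x02",
  callback: ->(buffer) { buffer.bytesize >= 4 ? 4 : false },
  verbose: true
}

size_limit = options[:size_limit]
# :verbose is only read when a size limit is set, so it stays nil here.
verbose = options[:verbose] if size_limit
encoding = options[:encoding] || default_encoding

# A String indicator is copied, re-tagged with the encoding, and frozen.
indicator = String.new(options[:indicator]).force_encoding(encoding).freeze
```

Because the indicator is copied before being re-encoded, the caller's original string is left untouched.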
Instance Attribute Details
#callback ⇒ Object
Returns the value of attribute callback.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 13
    def callback
      @callback
    end
#indicator ⇒ Object
Returns the value of attribute indicator.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 13
    def indicator
      @indicator
    end
#size_limit ⇒ Object
Returns the value of attribute size_limit.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 13
    def size_limit
      @size_limit
    end
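When a size_limit is set and the buffer outgrows it, the extract source on this page keeps only the last `indicator.length - 1` characters for a String indicator, so that an indicator split across two reads can still match. A minimal sketch of that slice, using a made-up three-byte indicator and buffer contents:

```ruby
# Hypothetical values chosen for illustration only.
indicator = 'ABC'.b
buffer = 'noise-noise-AB'.b   # ends with the first two indicator bytes

# Same slice expression as in extract: keep the final length - 1 bytes.
kept = buffer[-(indicator.length - 1)..-1]

# The next read supplies the missing byte and the indicator matches again.
resumed = kept + 'C+payload'.b
_before, match, rest = resumed.partition(indicator)
```

An indicator that does not respond to `length`, such as a Regexp, takes the other branch in the source and the buffer is reset instead.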
#verbose ⇒ Object
Returns the value of attribute verbose.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 13
    def verbose
      @verbose
    end
Instance Method Details
#bytesize ⇒ Integer
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 109
    def bytesize
      @input.bytesize
    end
#empty? ⇒ Boolean
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 104
    def empty?
      @input.empty?
    end
#extract(data) ⇒ Object
Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 40
    def extract(data)
      data.force_encoding(@encoding)
      @input << data

      entities = []

      loop do
        found = false

        last = if @indicator
          check = @input.partition(@indicator)
          break unless check[1].length > 0
          check[2]
        else
          @input
        end

        result = @callback.call(last)

        if result
          found = true

          # Check for multi-byte indicator edge case
          case result
          when Integer
            entities << last[0...result]
            @input = last[result..-1]
          else
            entities << last
            reset
          end
        end

        break if not found
      end

      # Check to see if the buffer has exceeded capacity, if we're imposing a limit
      if @size_limit && @input.size > @size_limit
        if @indicator.respond_to?(:length) # check for regex
          # save enough of the buffer that if one character of the indicator were
          # missing we would match on next extract (very much an edge case) and
          # best we can do with a full buffer.
          @input = @input[-(@indicator.length - 1)..-1]
        else
          reset
        end
        raise 'input buffer exceeded limit' if @verbose
      end

      return entities
    end
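To make the loop easier to follow, here is a condensed stand-in that can be run without the gem. `SketchTokenizer` is a hypothetical class written for this illustration: it mirrors the indicator and callback handling of the source above but leaves out the size limit, and the fixed three-byte message format is an assumption.

```ruby
# Stand-in that mirrors the documented loop; it is NOT the uv-rays class.
class SketchTokenizer
  def initialize(indicator:, callback:)
    @indicator = indicator.b.freeze    # binary copy of the start indicator
    @callback = callback
    @input = String.new(encoding: Encoding::BINARY)
  end

  def extract(data)
    @input << data.b
    entities = []
    loop do
      _before, match, rest = @input.partition(@indicator)
      break if match.empty?            # no start indicator buffered yet
      result = @callback.call(rest)
      break unless result              # callback wants more data
      if result.is_a?(Integer)
        entities << rest[0...result]   # token is the first `result` bytes
        @input = rest[result..-1]      # keep the remainder for the next pass
      else
        entities << rest               # any other truthy value: take it all
        @input = String.new(encoding: Encoding::BINARY)
      end
    end
    entities
  end
end

# Hypothetical protocol: every message body is exactly 3 bytes long.
three_bytes = ->(buffer) { buffer.bytesize >= 3 ? 3 : false }
tokenizer = SketchTokenizer.new(indicator: "\x02", callback: three_bytes)

first = tokenizer.extract("junk\x02ab")      # body incomplete, nothing emitted
second = tokenizer.extract("c\x02xyz\x02q")  # completes "abc", then "xyz"
```

Anything before the first indicator ("junk" here) is dropped once a token is extracted, and the trailing incomplete message stays buffered for a later call.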
#flush ⇒ String
Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 97
    def flush
      buffer = @input
      reset
      buffer
    end
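The interplay of bytesize, empty? and flush can be sketched with a small stand-in. `SketchBuffer` is a hypothetical class for illustration, and its `reset` is an assumption: the real `reset` is not shown on this page, so the sketch simply swaps in a fresh empty binary string.

```ruby
# Hypothetical stand-in for the tokenizer's buffer handling, not the uv-rays class.
class SketchBuffer
  def initialize
    reset
  end

  # Append incoming data to the buffer as binary.
  def <<(data)
    @input << data.b
    self
  end

  def empty?
    @input.empty?
  end

  def bytesize
    @input.bytesize
  end

  # Hand back whatever is buffered and start over with an empty buffer.
  def flush
    buffer = @input
    reset
    buffer
  end

  private

  # Assumption: reset replaces the buffer with a new empty binary string.
  def reset
    @input = String.new(encoding: Encoding::BINARY)
  end
end

buffer = SketchBuffer.new
buffer << 'partial mess'
pending = buffer.bytesize    # bytes waiting for a complete token
leftover = buffer.flush      # returns the buffered bytes and empties the buffer
```

Flushing is useful when a connection closes mid-message and the application still wants to inspect the incomplete data.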