Class: UV::AbstractTokenizer
- Inherits: Object
  - Object
  - UV::AbstractTokenizer
- Defined in: lib/uv-rays/abstract_tokenizer.rb
Overview
AbstractTokenizer is similar to BufferedTokenizer, but should only be used when there is no delimiter to work with. It uses a callback-based system: the application supplies the logic that decides where a message ends, and the tokenizer handles the buffering.
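As a rough illustration of what such a callback might look like, the sketch below assumes a hypothetical wire format in which each message begins with a one-byte payload length. The format and the name `length_prefixed` are assumptions made for this example. The return convention (an Integer for the number of bytes to consume, a falsy value to wait for more data) follows the `extract` source shown under Instance Method Details.

```ruby
# Hypothetical framing: the first byte holds the payload length,
# and the payload follows immediately after it.
# Returns the total message size once the whole message is buffered,
# or false to signal that more data is needed.
length_prefixed = lambda do |buffer|
  return false if buffer.empty?

  total = 1 + buffer.getbyte(0) # length byte plus payload
  buffer.bytesize >= total ? total : false
end

# With the uv-rays gem loaded, the lambda would be passed in roughly like this
# (the "\x02" start indicator is also an assumption):
#
#   tokenizer = UV::AbstractTokenizer.new(indicator: "\x02", callback: length_prefixed)
#   messages  = tokenizer.extract(data)
```

Returning an Integer rather than `true` lets the tokenizer keep any bytes that follow the message in its buffer for the next call.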
Constant Summary
- DEFAULT_ENCODING = 'ASCII-8BIT'
Instance Attribute Summary
- #callback ⇒ Object
  Returns the value of attribute callback.
- #indicator ⇒ Object
  Returns the value of attribute indicator.
- #size_limit ⇒ Object
  Returns the value of attribute size_limit.
- #verbose ⇒ Object
  Returns the value of attribute verbose.
Instance Method Summary
- #empty? ⇒ Boolean
- #extract(data) ⇒ Object
  Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator.
- #flush ⇒ String
  Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.
- #initialize(options) ⇒ AbstractTokenizer (constructor)
  A new instance of AbstractTokenizer.
Constructor Details
#initialize(options) ⇒ AbstractTokenizer
Returns a new instance of AbstractTokenizer.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 15
    def initialize(options)
      @callback   = options[:callback]
      @indicator  = options[:indicator]
      @size_limit = options[:size_limit]
      @verbose    = options[:verbose] if @size_limit
      @encoding   = options[:encoding] || DEFAULT_ENCODING

      raise ArgumentError, 'no callback provided' unless @callback

      reset
      if @indicator.is_a?(String)
        @indicator = String.new(@indicator).force_encoding(@encoding).freeze
      end
    end
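The last step of the constructor copies a String indicator, re-tags the copy with the tokenizer's encoding, and freezes it. The sketch below reproduces only that step outside the class so its effect can be inspected. The helper name `normalize_indicator` is hypothetical and not part of the library.

```ruby
# Hypothetical helper mirroring the String branch of the constructor.
# Non-String indicators (for example a Regexp) are passed through untouched.
def normalize_indicator(indicator, encoding = 'ASCII-8BIT')
  return indicator unless indicator.is_a?(String)

  # String.new copies the string, so the caller's object keeps its encoding.
  String.new(indicator).force_encoding(encoding).freeze
end

original   = "\u00AB" # a UTF-8 string: one character, two bytes
normalized = normalize_indicator(original)

puts original.encoding   # the caller's string is left as UTF-8
puts normalized.encoding # the copy is tagged as binary
puts normalized.length   # a binary string counts bytes, so this is 2
```

Because the buffered input is forced to the same encoding in `extract`, the indicator and the buffer are compared byte for byte.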
Instance Attribute Details
#callback ⇒ Object
Returns the value of attribute callback.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 12
    def callback
      @callback
    end
#indicator ⇒ Object
Returns the value of attribute indicator.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 12
    def indicator
      @indicator
    end
#size_limit ⇒ Object
Returns the value of attribute size_limit.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 12
    def size_limit
      @size_limit
    end
#verbose ⇒ Object
Returns the value of attribute verbose.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 12
    def verbose
      @verbose
    end
Instance Method Details
#empty? ⇒ Boolean
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 103
    def empty?
      @input.empty?
    end
#extract(data) ⇒ Object
Extract takes an arbitrary string of input data and returns an array of tokenized entities using a message start indicator.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 39
    def extract(data)
      data.force_encoding(@encoding)
      @input << data

      entities = []

      loop do
        found = false

        last = if @indicator
          check = @input.partition(@indicator)
          break unless check[1].length > 0

          check[2]
        else
          @input
        end

        result = @callback.call(last)

        if result
          found = true

          # Check for multi-byte indicator edge case
          case result
          when Integer
            entities << last[0...result]
            @input = last[result..-1]
          else
            entities << last
            reset
          end
        end

        break if not found
      end

      # Check to see if the buffer has exceeded capacity, if we're imposing a limit
      if @size_limit && @input.size > @size_limit
        if @indicator.respond_to?(:length) # check for regex
          # save enough of the buffer that if one character of the indicator were
          # missing we would match on next extract (very much an edge case) and
          # best we can do with a full buffer.
          @input = @input[-(@indicator.length - 1)..-1]
        else
          reset
        end
        raise 'input buffer exceeded limit' if @verbose
      end

      return entities
    end
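The core of each loop iteration is a call to `String#partition` followed by a slice driven by the callback's return value. The sketch below walks through one iteration by hand, without the class. The buffer contents, the `"\x02"` indicator, and the value `4` standing in for a callback result are all assumptions chosen for illustration.

```ruby
indicator = "\x02".b
buffer    = "noise\x02\x03abc\x02".b

# partition splits at the first occurrence of the indicator.
# Everything before it is discarded when a token is found.
before, match, after = buffer.partition(indicator)

# Suppose a callback inspected `after` and returned the Integer 4,
# meaning "the message occupies the first four bytes".
consumed  = 4
token     = after[0...consumed]  # pushed onto the result array
remaining = after[consumed..-1]  # kept as the new input buffer

# When the indicator is absent, the match element is empty,
# which is the condition the loop uses to stop.
_, no_match, _ = "abc".b.partition(indicator)

p before, match, token, remaining, no_match
```

In this walkthrough the trailing indicator byte stays in `remaining`, so a later call can complete the next message once its body arrives.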
#flush ⇒ String
Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.
    # File 'lib/uv-rays/abstract_tokenizer.rb', line 96
    def flush
      buffer = @input
      reset
      buffer
    end