Class: Tokkens::Tokens
- Inherits: Object
  - Object
  - Tokkens::Tokens
- Defined in: lib/tokkens/tokens.rb
Overview
Converts a string token to a uniquely identifying sequential number.
Useful for working with a vector space model for text.
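As a rough illustration of why sequential numbers help with a vector space model, the sketch below turns a small document into a sparse term-frequency vector. It does not use this class: the `numbers` hash is a hypothetical stand-in that hands out the next number to each new token, starting at 1.

```ruby
# Stand-in numbering for illustration only: each new token gets the next
# number, starting at 1.
numbers = Hash.new { |h, token| h[token] = h.size + 1 }

doc = %w[the cat sat on the mat]

# Sparse term-frequency vector: token number => occurrences in this document.
vector = Hash.new(0)
doc.each { |token| vector[numbers[token]] += 1 }

# Render as "number:count" pairs in ascending order of token number.
puts vector.sort.map { |i, n| "#{i}:#{n}" }.join(' ')  # prints "1:2 2:1 3:1 4:1 5:1"
```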
Instance Attribute Summary
- #offset ⇒ Fixnum
  Number of first token.
Instance Method Summary
- #find(i, prefix: nil) ⇒ String, NilClass
  Return a token by number.
- #freeze! ⇒ Object
  Stop assigning new numbers to tokens.
- #frozen? ⇒ Boolean
  Whether the tokens are frozen or not.
- #get(s, **kwargs) ⇒ Fixnum, NilClass
  Return a number for a new or existing token.
- #indexes ⇒ Array<Fixnum>
  Return indexes for all of the current tokens.
- #initialize(offset: 1) ⇒ Tokens (constructor)
  A new instance of Tokens.
- #limit!(max_size: nil, min_occurence: nil) ⇒ Fixnum
  Limit the number of tokens.
- #load(filename) ⇒ Object
  Load tokens from file.
- #save(filename) ⇒ Object
  Save tokens to file.
- #thaw! ⇒ Object
  Allow new tokens to be created.
Constructor Details
#initialize(offset: 1) ⇒ Tokens
Returns a new instance of Tokens.
    # File 'lib/tokkens/tokens.rb', line 12
    def initialize(offset: 1)
      # liblinear can't use offset 0, libsvm doesn't mind to start at one
      @tokens = {}
      @offset = offset
      @next_number = offset
      @frozen = false
    end
Instance Attribute Details
#offset ⇒ Fixnum
Returns the number of the first token.
    # File 'lib/tokkens/tokens.rb', line 10
    def offset
      @offset
    end
Instance Method Details
#find(i, prefix: nil) ⇒ String, NilClass
Return a token by number.
This class is optimized for retrieving by token, not by number.
    # File 'lib/tokkens/tokens.rb', line 78
    def find(i, prefix: nil)
      @tokens.each do |s, data|
        if data[0] == i
          return (prefix && s.start_with?(prefix)) ? s[prefix.length..-1] : s
        end
      end
      nil
    end
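Because #find scans the whole table, every lookup by number costs a full pass. The following is a minimal sketch of the same scan on a plain Hash, next to an inverted index that could be built once when many lookups by number are needed. It assumes the table maps each token to a [number, count] pair, as the listing above suggests; `tokens`, `find_by_number` and `by_number` are hypothetical names, not part of the library.

```ruby
# Hypothetical stand-in for the internal table: token => [number, count].
tokens = {
  'title:hello' => [1, 3],
  'title:world' => [2, 1],
  'body:hello'  => [3, 5]
}

# Linear scan, mirroring the logic of #find.
def find_by_number(tokens, i, prefix: nil)
  tokens.each do |s, data|
    next unless data[0] == i
    # Strip the prefix only when the token actually starts with it.
    return prefix && s.start_with?(prefix) ? s[prefix.length..-1] : s
  end
  nil
end

# Inverted index (number => token), built once for repeated lookups.
by_number = tokens.each_with_object({}) { |(s, data), h| h[data[0]] = s }

puts find_by_number(tokens, 1, prefix: 'title:')  # prints "hello"
puts find_by_number(tokens, 3, prefix: 'title:')  # prints "body:hello"
puts by_number[2]                                 # prints "title:world"
```

The second call shows that a prefix which does not match leaves the token untouched.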
#freeze! ⇒ Object
Stop assigning new numbers to tokens.
    # File 'lib/tokkens/tokens.rb', line 23
    def freeze!
      @frozen = true
    end
#frozen? ⇒ Boolean
Returns whether the tokens are frozen.
    # File 'lib/tokkens/tokens.rb', line 37
    def frozen?
      @frozen
    end
#get(s, **kwargs) ⇒ Fixnum, NilClass
Return a number for a new or existing token.
When the token was seen before, the same number is returned. If the token is seen for the first time and the instance isn't #frozen?, a new number is assigned and returned; otherwise nil is returned. A nil or blank token always yields nil.
    # File 'lib/tokkens/tokens.rb', line 66
    def get(s, **kwargs)
      return if !s || s.strip == ''
      @frozen ? retrieve(s, **kwargs) : upsert(s, **kwargs)
    end
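The private helpers `retrieve` and `upsert` are not shown on this page. The sketch below illustrates the documented behaviour with a hypothetical stand-in class, `MiniTokens`; the helper bodies and the count bookkeeping are assumptions, not the library's implementation.

```ruby
# Hypothetical stand-in illustrating the documented behaviour of #get.
class MiniTokens
  def initialize(offset: 1)
    @tokens = {}          # token => [number, count]
    @next_number = offset
    @frozen = false
  end

  def freeze!
    @frozen = true
  end

  def get(s)
    return if !s || s.strip == ''
    @frozen ? retrieve(s) : upsert(s)
  end

  private

  # Assumed: look up only, never assign a number.
  def retrieve(s)
    data = @tokens[s]
    data && data[0]
  end

  # Assumed: assign the next number on first sight, count every occurrence.
  def upsert(s)
    unless @tokens.key?(s)
      @tokens[s] = [@next_number, 0]
      @next_number += 1
    end
    @tokens[s][1] += 1
    @tokens[s][0]
  end
end

t = MiniTokens.new
p t.get('foo')  # 1
p t.get('bar')  # 2
p t.get('foo')  # 1 (same token, same number)
t.freeze!
p t.get('baz')  # nil (unknown token while frozen)
p t.get('bar')  # 2 (known tokens still resolve)
```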
#indexes ⇒ Array<Fixnum>
Return indexes for all of the current tokens.
    # File 'lib/tokkens/tokens.rb', line 91
    def indexes
      @tokens.values.map(&:first)
    end
#limit!(max_size: nil, min_occurence: nil) ⇒ Fixnum
Limit the number of tokens.
46 47 48 49 50 51 52 53 54 55 |
# File 'lib/tokkens/tokens.rb', line 46 def limit!(max_size: nil, min_occurence: nil) # @todo raise if frozen if min_occurence @tokens.delete_if {|name, data| data[1] < min_occurence } end if max_size @tokens = Hash[@tokens.to_a.sort_by {|a| -a[1][1] }[0..(max_size-1)]] end @tokens.length end |
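The two pruning steps can be sketched on a plain Hash. The table contents here are made up for illustration, and the sketch uses `first(max_size)` in place of the range slice in the listing; the two agree for positive max_size.

```ruby
# Hypothetical token table: token => [number, count].
tokens = {
  'the' => [1, 40],
  'cat' => [2, 1],
  'sat' => [3, 7],
  'mat' => [4, 2]
}

min_occurence = 2  # spelling follows the method's keyword argument
max_size = 2

# Step 1: drop tokens seen fewer than min_occurence times.
tokens.delete_if { |_name, data| data[1] < min_occurence }

# Step 2: keep the max_size most frequent of what remains.
tokens = tokens.sort_by { |_name, data| -data[1] }.first(max_size).to_h

p tokens.keys                 # ["the", "sat"]
p tokens.values.map(&:first)  # [1, 3]
```

The surviving tokens keep their original numbers, so the numbering can have gaps after pruning.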
#load(filename) ⇒ Object
Load tokens from file.
The tokens are frozen after loading. All previously existing tokens are removed.
    # File 'lib/tokkens/tokens.rb', line 101
    def load(filename)
      File.open(filename) do |f|
        @tokens = {}
        f.each_line do |line|
          id, count, name = line.rstrip.split(/\s+/, 3)
          @tokens[name.strip] = [id.to_i, count]
        end
      end
      # safer
      freeze!
    end
#save(filename) ⇒ Object
Save tokens to file.
    # File 'lib/tokkens/tokens.rb', line 116
    def save(filename)
      File.open(filename, 'w') do |f|
        @tokens.each do |token, (index, count)|
          f.puts "#{index} #{count} #{token}"
        end
      end
    end
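The file format is one "number count token" line per token. A minimal round-trip sketch on a plain Hash, assuming the same [number, count] layout, might look like this. The table contents are hypothetical. Unlike the #load listing, which stores the count as the string read from the file, this sketch converts it with `to_i` so that the loaded table compares equal to the original.

```ruby
require 'tempfile'

# Hypothetical token table: token => [number, count].
tokens = { 'hello' => [1, 3], 'new york' => [2, 1] }
loaded = {}

Tempfile.create('tokens') do |f|
  # Write one "number count token" line per token, as #save does.
  tokens.each { |token, (index, count)| f.puts "#{index} #{count} #{token}" }
  f.flush

  # Read the lines back, splitting into at most three fields so that
  # whitespace inside the token itself is preserved.
  File.foreach(f.path) do |line|
    id, count, name = line.rstrip.split(/\s+/, 3)
    loaded[name] = [id.to_i, count.to_i]  # count.to_i is this sketch's choice
  end
end

p loaded == tokens  # true
```

Limiting the split to three fields is what lets a token such as "new york" survive the round trip.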
#thaw! ⇒ Object
Allow new tokens to be created.
    # File 'lib/tokkens/tokens.rb', line 30
    def thaw!
      @frozen = false
    end