Class: Tokkens::Tokens

Inherits:
Object
Defined in:
lib/tokkens/tokens.rb

Overview

Converts a string token to a uniquely identifying sequential number.

Useful for working with a vector space model for text.
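
To illustrate the vector space use mentioned above, here is a minimal sketch in plain Ruby. It does not call the gem: the number_for lambda is a hypothetical stand-in for the token-to-number mapping, and the index:value output follows the sparse format read by libsvm and liblinear.

```ruby
# Hypothetical stand-in for the token-to-number mapping; numbering starts at 1.
numbers = {}
next_number = 1
number_for = lambda do |token|
  numbers[token] ||= begin
    n = next_number
    next_number += 1
    n
  end
end

# Count occurrences per token number for one document (a bag of words).
words = %w[red apple red pear]
vector = Hash.new(0)
words.each { |w| vector[number_for.call(w)] += 1 }

# Format as a sparse "index:value" line with ascending indexes.
line = vector.sort.map { |index, count| "#{index}:#{count}" }.join(' ')
puts line # prints "1:2 2:1 3:1"
```

Because every distinct token keeps the same number, documents processed later map onto the same vector dimensions.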

Constructor Details

#initialize(offset: 1) ⇒ Tokens

Returns a new instance of Tokens. The offset is the number assigned to the first token; it defaults to 1.



# File 'lib/tokkens/tokens.rb', line 12

def initialize(offset: 1)
  # liblinear can't use offset 0, and libsvm doesn't mind starting at one
  @tokens = {}
  @offset = offset
  @next_number = offset
  @frozen = false
end

Instance Attribute Details

#offset ⇒ Fixnum

Returns the number of the first token.

Returns:

  • (Fixnum)

    Number of first token.



# File 'lib/tokkens/tokens.rb', line 10

def offset
  @offset
end

Instance Method Details

#find(i, prefix: nil) ⇒ String, NilClass

Return a token by number.

This class is optimized for retrieving by token, not by number: this method scans the stored tokens one by one until it finds a match.

Parameters:

  • i (Fixnum)

    number to return token for

  • prefix (String) (defaults to: nil)

    optional string to remove from beginning of token

Returns:

  • (String, NilClass)

    given token, or nil when not found



# File 'lib/tokkens/tokens.rb', line 78

def find(i, prefix: nil)
  @tokens.each do |s, data|
    if data[0] == i
      return (prefix && s.start_with?(prefix)) ? s[prefix.length..-1] : s
    end
  end
  nil
end
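
The lookup and prefix handling can be tried out on a plain Hash. The sketch below assumes the internal layout seen in the listings (token string mapped to a [number, count] pair); find_token and by_number are hypothetical names used only for illustration.

```ruby
# Assumed layout: token string => [number, occurrence count].
tokens = {
  'title:apple' => [1, 3],
  'title:pear'  => [2, 1],
  'body:apple'  => [3, 5]
}

# Linear scan for the first token whose number matches, as in the listing.
# The prefix is stripped only when the token actually starts with it.
def find_token(tokens, i, prefix: nil)
  s, _data = tokens.find { |_s, data| data[0] == i }
  return nil unless s
  prefix && s.start_with?(prefix) ? s[prefix.length..-1] : s
end

stripped  = find_token(tokens, 2, prefix: 'title:') # => "pear"
unchanged = find_token(tokens, 3, prefix: 'title:') # => "body:apple"
missing   = find_token(tokens, 9)                   # => nil

# For many lookups by number, an inverted Hash built once avoids the scan.
by_number = tokens.map { |s, data| [data[0], s] }.to_h
```

Each call walks the Hash, so the cost grows with the number of tokens; the inverted Hash trades memory for constant-time lookups.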

#freeze! ⇒ Object

Stop assigning new numbers to tokens.

# File 'lib/tokkens/tokens.rb', line 23

def freeze!
  @frozen = true
end

#frozen? ⇒ Boolean

Returns whether the tokens are frozen or not.

Returns:

  • (Boolean)

    Whether the tokens are frozen or not.

# File 'lib/tokkens/tokens.rb', line 37

def frozen?
  @frozen
end

#get(s, **kwargs) ⇒ Fixnum, NilClass

Return a number for a new or existing token.

When the token was seen before, the same number is returned. If the token is seen for the first time and this instance is not #frozen?, a new number is assigned and returned; otherwise nil is returned. A nil or blank token always returns nil.

Parameters:

  • s (String)

    token to return number for

  • kwargs (Hash)

    a customizable set of options

Options Hash (**kwargs):

  • :prefix (String)

    optional string to prepend to the token

Returns:

  • (Fixnum, NilClass)

    number for given token



# File 'lib/tokkens/tokens.rb', line 66

def get(s, **kwargs)
  return if !s || s.strip == ''
  @frozen ? retrieve(s, **kwargs) : upsert(s, **kwargs)
end
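
The listing delegates to retrieve and upsert, which are not shown on this page. The following sketch is therefore an assumption about their behaviour, based only on the description above: SketchTokens is a hypothetical stand-in class, not the gem's implementation, and counting occurrences only while unfrozen is a guess.

```ruby
# Hypothetical stand-in for illustration; not the gem's implementation.
class SketchTokens
  def initialize(offset: 1)
    @tokens = {}           # token => [number, occurrence count]
    @next_number = offset
    @frozen = false
  end

  def freeze!
    @frozen = true
  end

  def get(s, prefix: '')
    return if !s || s.strip == ''
    key = prefix + s
    data = @tokens[key]
    if data.nil?
      return nil if @frozen # unknown token while frozen
      data = @tokens[key] = [@next_number, 0]
      @next_number += 1
    end
    data[1] += 1 unless @frozen # assumption: count only while unfrozen
    data[0]
  end
end

t = SketchTokens.new
a = t.get('apple')                   # => 1
b = t.get('pear')                    # => 2
c = t.get('apple', prefix: 'title:') # => 3, the prefixed token is distinct
t.freeze!
d = t.get('apple')                   # => 1, known tokens still resolve
e = t.get('kiwi')                    # => nil, no new numbers when frozen
f = t.get('  ')                      # => nil, blank tokens are ignored
```

Freezing before processing unseen data keeps the numbering identical to the one used when the tokens were first collected.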

#indexes ⇒ Array<Fixnum>

Return indexes for all of the current tokens.

Returns:

  • (Array<Fixnum>)

    All current token numbers.

# File 'lib/tokkens/tokens.rb', line 91

def indexes
  @tokens.values.map(&:first)
end

#limit!(max_size: nil, min_occurence: nil) ⇒ Fixnum

Limit the number of tokens.

Parameters:

  • max_size (Fixnum) (defaults to: nil)

    Maximum number of tokens to retain; the most frequently seen tokens are kept

  • min_occurence (Fixnum) (defaults to: nil)

    Keep only tokens seen at least this many times

Returns:

  • (Fixnum)

    Number of tokens left



# File 'lib/tokkens/tokens.rb', line 46

def limit!(max_size: nil, min_occurence: nil)
  # @todo raise if frozen
  if min_occurence
    @tokens.delete_if {|name, data| data[1] < min_occurence }
  end
  if max_size
    # keep the max_size most frequent tokens (first(0) correctly keeps none)
    @tokens = @tokens.sort_by {|_name, data| -data[1] }.first(max_size).to_h
  end
  @tokens.length
end
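
As a rough illustration of both limits, the sketch below applies the same two steps to a plain Hash with the assumed token => [number, count] layout. The sample data is made up; the parameter spelling min_occurence matches the method signature.

```ruby
# Assumed layout: token string => [number, occurrence count].
tokens = {
  'the'   => [1, 40],
  'pear'  => [2, 2],
  'apple' => [3, 7],
  'kiwi'  => [4, 1]
}

# Step 1: drop tokens seen fewer than min_occurence times.
min_occurence = 2
tokens.delete_if { |_name, data| data[1] < min_occurence }
after_min = tokens.keys # => ["the", "pear", "apple"]

# Step 2: keep only the max_size most frequent tokens.
max_size = 2
tokens = tokens.sort_by { |_name, data| -data[1] }.first(max_size).to_h
after_max = tokens.keys # => ["the", "apple"]

# Numbers are not reassigned, so gaps can appear.
indexes = tokens.values.map(&:first) # => [1, 3]
```

Since the surviving tokens keep their original numbers, code that consumes the indexes should not assume they are contiguous after limiting.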

#load(filename) ⇒ Object

Load tokens from file.

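The tokens are frozen after loading, and all previously existing tokens are removed. Each line of the file holds a token's number, its occurrence count and the token itself, separated by whitespace.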

Parameters:

  • filename (String)

    Filename



# File 'lib/tokkens/tokens.rb', line 101

def load(filename)
  File.open(filename) do |f|
    @tokens = {}
    f.each_line do |line|
      id, count, name = line.rstrip.split(/\s+/, 3)
      # convert both fields to integers so that counts compare numerically in limit!
      @tokens[name.strip] = [id.to_i, count.to_i]
    end
  end
  # safer
  freeze!
end
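
The parsing step can be sketched on its own with a temporary file. The sample lines are made up, and converting the count with to_i is a choice made in this sketch so that the counts are numeric. The limit of 3 passed to split is what lets a token contain spaces.

```ruby
require 'tempfile'

tokens = {}

Tempfile.create('tokens') do |f|
  # Two sample lines in "number count token" form.
  f.puts '1 40 the'
  f.puts '2 7 new york'
  f.flush

  File.foreach(f.path) do |line|
    # Split into at most 3 fields, so the token keeps its inner spaces.
    id, count, name = line.rstrip.split(/\s+/, 3)
    tokens[name] = [id.to_i, count.to_i]
  end
end

tokens # => {"the"=>[1, 40], "new york"=>[2, 7]}
```

A blank line would leave name as nil here, so real input may need an extra guard before the assignment.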

#save(filename) ⇒ Object

Save tokens to file, one token per line: number, occurrence count and token, separated by spaces.

Parameters:

  • filename (String)

    Filename



# File 'lib/tokkens/tokens.rb', line 116

def save(filename)
  File.open(filename, 'w') do |f|
    @tokens.each do |token, (index, count)|
      f.puts "#{index} #{count} #{token}"
    end
  end
end
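
To show what the written file looks like, this sketch runs the same loop over a plain Hash and reads the result back. The data and the file name tokens.txt are made up for illustration.

```ruby
require 'tmpdir'

# Assumed layout: token string => [number, occurrence count].
tokens = { 'the' => [1, 40], 'new york' => [2, 7] }
contents = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'tokens.txt')
  File.open(path, 'w') do |f|
    # One line per token: number, count, then the token itself.
    tokens.each do |token, (index, count)|
      f.puts "#{index} #{count} #{token}"
    end
  end
  contents = File.read(path)
end

contents # => "1 40 the\n2 7 new york\n"
```

Writing the token last means a token containing spaces can still be recovered when the line is split into three fields.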

#thaw! ⇒ Object

Allow new tokens to be created.

# File 'lib/tokkens/tokens.rb', line 30

def thaw!
  @frozen = false
end