Class: TfIdfSimilarity::Token

Inherits:
String
  • Object
Defined in:
lib/tf-idf-similarity/token.rb

Overview

Note:

More filters from Solr could be added, along with stemming via Porter's Snowball.

A token.

Instance Method Summary

Instance Method Details

#classic_filter ⇒ Token

Returns a copy of the token with every period removed (as in acronyms) and a trailing English possessive ('s) stripped.

Returns:

  • (Token)

    a new token with periods and any trailing 's removed


# File 'lib/tf-idf-similarity/token.rb', line 48

def classic_filter
  self.class.new(self.gsub('.', '').chomp("'s"))
end
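
A short usage sketch may help here. `DemoToken` is a hypothetical stand-in class, defined only so the snippet runs without the gem installed; its method body is copied from the listing above. Because `gsub('.', '')` removes every period, decimal numbers are affected as well as acronyms.

```ruby
# DemoToken is a hypothetical stand-in for TfIdfSimilarity::Token.
class DemoToken < String
  def classic_filter
    self.class.new(self.gsub('.', '').chomp("'s"))
  end
end

puts DemoToken.new('U.S.A.').classic_filter   # USA
puts DemoToken.new("Ruby's").classic_filter   # Ruby
puts DemoToken.new('3.14').classic_filter     # 314 (all periods are removed)

# The result is wrapped in the receiver's class, so filters can be chained.
puts DemoToken.new('I.B.M.').classic_filter.class  # DemoToken
```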

#lowercase_filter ⇒ Token

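Returns a lowercase copy of the token. If UnicodeUtils is loaded, it performs the conversion; otherwise accented Latin capitals are mapped through a transliteration table before String#downcase is applied.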



# File 'lib/tf-idf-similarity/token.rb', line 36

def lowercase_filter
  self.class.new(defined?(UnicodeUtils) ? UnicodeUtils.downcase(self) : tr(
    "ÀÁÂÃÄÅĀĂĄÇĆĈĊČÐĎĐÈÉÊËĒĔĖĘĚĜĞĠĢĤĦÌÍÎÏĨĪĬĮĴĶĹĻĽĿŁÑŃŅŇŊÒÓÔÕÖØŌŎŐŔŖŘŚŜŞŠŢŤŦÙÚÛÜŨŪŬŮŰŲŴÝŶŸŹŻŽ",
    "àáâãäåāăąçćĉċčðďđèéêëēĕėęěĝğġģĥħìíîïĩīĭįĵķĺļľŀłñńņňŋòóôõöøōŏőŕŗřśŝşšţťŧùúûüũūŭůűųŵýŷÿźżž"
  ).downcase)
end
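
The fallback branch can be illustrated with a minimal sketch. `DemoToken` is a hypothetical stand-in class, and its transliteration table is deliberately abbreviated to five characters for readability; the listing above covers many more. The `defined?(UnicodeUtils)` check is kept so the snippet behaves the same whether or not that library is loaded.

```ruby
# DemoToken is a hypothetical stand-in for TfIdfSimilarity::Token.
# The tr table below is an abbreviated assumption, not the full table.
class DemoToken < String
  def lowercase_filter
    self.class.new(
      if defined?(UnicodeUtils)
        UnicodeUtils.downcase(self)
      else
        # Map accented capitals first, then lowercase the remaining letters.
        tr('ÀÉÎÕÜ', 'àéîõü').downcase
      end
    )
  end
end

puts DemoToken.new('ÉCOLE').lowercase_filter    # école
puts DemoToken.new('Über').lowercase_filter     # über
puts DemoToken.new('ÉCOLE').lowercase_filter.class  # DemoToken
```

The explicit table matters on Ruby versions whose `String#downcase` only handles ASCII letters; on versions with Unicode-aware case mapping the two steps give the same result.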

#valid? ⇒ Boolean

Note:

Some implementations ignore one- and two-letter words.

Returns false if the string consists entirely of digits, punctuation, whitespace or control characters, and true otherwise.

Returns:

  • (Boolean)

    whether the string is a token



# File 'lib/tf-idf-similarity/token.rb', line 18

def valid?
  !self[%r{
    \A
      (
       \d           | # number
       [[:cntrl:]]  | # control character
       [[:punct:]]  | # punctuation
       [[:space:]]    # whitespace
      )+
    \z
  }x]
end
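
As a rough illustration, the predicate can be used to discard non-word tokens from a list. `DemoToken` is a hypothetical stand-in class with the method body copied from the listing above; the sample words are invented for the example.

```ruby
# DemoToken is a hypothetical stand-in for TfIdfSimilarity::Token.
class DemoToken < String
  def valid?
    !self[%r{
      \A
        (
         \d           | # number
         [[:cntrl:]]  | # control character
         [[:punct:]]  | # punctuation
         [[:space:]]    # whitespace
        )+
      \z
    }x]
  end
end

words = %w[tf-idf 2013 ... ruby r2d2].map { |w| DemoToken.new(w) }
kept = words.select(&:valid?)
puts kept.inspect  # ["tf-idf", "ruby", "r2d2"]

# The pattern requires at least one character, so an empty string
# does not match it and is reported as valid.
puts DemoToken.new('').valid?  # true
```

A token is rejected only when every character falls into one of the four classes, so mixed tokens such as `r2d2` or `tf-idf` are kept.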