Class: TfIdfSimilarity::Token

Inherits:
SimpleDelegator
  • Object
Defined in:
lib/tf-idf-similarity/token.rb

Instance Method Summary

Instance Method Details

#classic_filter ⇒ Token

Returns a copy of the token with every period removed (as in acronyms) and a trailing English possessive ('s) stripped.

Returns:

  • (Token)

    a string with no English possessive or periods in acronyms

# File 'lib/tf-idf-similarity/token.rb', line 46

def classic_filter
  self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
end
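
To make the behaviour concrete, here is a minimal sketch of how the filter could be exercised. `SketchToken` is a hypothetical stand-in that copies the method body shown above, so the example runs without the gem installed; it is not the library's class.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (an assumption for
# illustration); it only copies the classic_filter body shown above.
class SketchToken < SimpleDelegator
  def classic_filter
    # A String pattern is matched literally, so every period is removed,
    # then one trailing possessive suffix ('s, `s or ’s) is dropped.
    self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
  end
end

acronym    = SketchToken.new('U.S.A.').classic_filter  # wraps "USA"
possessive = SketchToken.new("cat's").classic_filter   # wraps "cat"
decimal    = SketchToken.new('3.14').classic_filter    # wraps "314"
```

Because the period is passed to `gsub` as a plain string, periods are removed everywhere in the token, not only in acronyms, which is why `3.14` becomes `314` in this sketch.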

#lowercase_filter ⇒ Token

Returns a lowercase string.



# File 'lib/tf-idf-similarity/token.rb', line 37

def lowercase_filter
  self.class.new(UnicodeUtils.downcase(self))
end
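
The following sketch illustrates the same idea without the `unicode_utils` dependency. It swaps in Ruby's built-in `String#downcase`, which is Unicode-aware from Ruby 2.4 onward, in place of `UnicodeUtils.downcase`; `SketchToken` is a hypothetical stand-in, not the library's class.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (an assumption for
# illustration). It uses String#downcase instead of UnicodeUtils.downcase.
class SketchToken < SimpleDelegator
  def lowercase_filter
    # downcase is delegated to the wrapped String; the result is
    # wrapped again so further filters can be chained.
    self.class.new(downcase)
  end
end

lowered = SketchToken.new('ÉCOLE').lowercase_filter  # wraps "école"
```

Returning a new token rather than a bare string is what allows filters to be chained, for example a lowercase step followed by another filter.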

#valid? ⇒ Boolean

Note:

Some implementations ignore one- and two-letter words.

Returns false if every character is a digit, punctuation mark, whitespace or control character.

Returns:

  • (Boolean)

    whether the string is a token



# File 'lib/tf-idf-similarity/token.rb', line 19

def valid?
  !self[%r{
    \A
      (
       \d           | # number
       [[:cntrl:]]  | # control character
       [[:punct:]]  | # punctuation
       [[:space:]]    # whitespace
      )+
    \z
  }x]
end
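
As a rough illustration of which strings pass the check, the sketch below copies the method body into a hypothetical stand-in class, `SketchToken`, so that it runs without the gem; the sample inputs are assumptions chosen for the example.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (an assumption for
# illustration); it only copies the valid? body shown above.
class SketchToken < SimpleDelegator
  def valid?
    # String#[] with a Regexp returns the match or nil, so negating it
    # gives false only when the whole string consists of these classes.
    !self[%r{
      \A
        (
         \d           | # number
         [[:cntrl:]]  | # control character
         [[:punct:]]  | # punctuation
         [[:space:]]    # whitespace
        )+
      \z
    }x]
  end
end

word    = SketchToken.new('hello').valid?   # true: contains letters
mixed   = SketchToken.new('abc123').valid?  # true: not only digits
number  = SketchToken.new('2024').valid?    # false: digits only
time    = SketchToken.new('12:30').valid?   # false: digits and punctuation
spaces  = SketchToken.new(" \t").valid?     # false: whitespace only
```

The anchors `\A` and `\z` make the pattern cover the whole string, so a single letter anywhere in the token is enough for it to count as valid.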