Class: TfIdfSimilarity::Token

Inherits:
SimpleDelegator
  • Object
Defined in:
lib/tf-idf-similarity/token.rb

Instance Method Details

#classic_filter ⇒ Token

Returns a string with no English possessive or periods in acronyms.

Returns:

  • (Token)

    a string with no English possessive or periods in acronyms

# File 'lib/tf-idf-similarity/token.rb', line 49

def classic_filter
  self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
end
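
The behavior of #classic_filter can be tried without installing the gem. The following is a minimal sketch: `SketchToken` is a hypothetical stand-in for `TfIdfSimilarity::Token` that copies only the method body shown above.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not the gem's class).
class SketchToken < SimpleDelegator
  def classic_filter
    # Remove every period, then a trailing possessive ('s, `s or ’s).
    self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
  end
end

acronym    = SketchToken.new('U.S.A.').classic_filter  # wraps "USA"
possessive = SketchToken.new("dog's").classic_filter   # wraps "dog"
plural     = SketchToken.new("dogs'").classic_filter   # wraps "dogs'" (unchanged)
decimal    = SketchToken.new('3.14').classic_filter    # wraps "314"
```

Note that `gsub('.', '')` removes all periods, not only those in acronyms, and that only a possessive at the very end of the string is stripped.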

#lowercase_filter ⇒ Token

Returns a lowercase string.



# File 'lib/tf-idf-similarity/token.rb', line 40

def lowercase_filter
  self.class.new(UnicodeUtils.downcase(self))
end
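
A minimal sketch of #lowercase_filter follows. It makes two assumptions: `SketchToken` is a hypothetical stand-in for `TfIdfSimilarity::Token`, and `String#downcase` (Unicode-aware in Ruby 2.4 and later) is swapped in for `UnicodeUtils.downcase` so that the example has no gem dependency.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not the gem's class).
class SketchToken < SimpleDelegator
  def lowercase_filter
    # Assumption: String#downcase replaces UnicodeUtils.downcase here.
    self.class.new(self.downcase)
  end
end

original = SketchToken.new('HELLO')
lowered  = original.lowercase_filter                   # wraps "hello"
accented = SketchToken.new('ÉCOLE').lowercase_filter   # wraps "école"
```

The filter returns a new token and leaves the receiver unchanged, so filters can be chained.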

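#to_s ⇒ Object

Returns a lowercase string with no English possessive or periods, combining the effects of #lowercase_filter and #classic_filter without creating intermediate tokens.
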


# File 'lib/tf-idf-similarity/token.rb', line 53

def to_s
  # Don't call #lowercase_filter and #classic_filter to avoid creating unnecessary objects.
  UnicodeUtils.downcase(self).gsub('.', '').sub(/['`’]s\z/, '')
end
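
The sketch below illustrates that #to_s produces the same text as chaining the two filters, while returning a plain String. `SketchToken` is a hypothetical stand-in for `TfIdfSimilarity::Token`, and `String#downcase` is assumed in place of `UnicodeUtils.downcase`.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not the gem's class).
class SketchToken < SimpleDelegator
  def lowercase_filter
    self.class.new(self.downcase) # assumption: stands in for UnicodeUtils.downcase
  end

  def classic_filter
    self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
  end

  def to_s
    # One pass over plain strings; no intermediate token objects.
    self.downcase.gsub('.', '').sub(/['`’]s\z/, '')
  end
end

token      = SketchToken.new("N.A.S.A.'s")
normalized = token.to_s                                         # => "nasa"
chained    = token.lowercase_filter.classic_filter.__getobj__   # => "nasa"
```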

#valid? ⇒ Boolean

Note:

Some implementations ignore one- and two-letter words.

Returns false if every character is a digit, punctuation mark, whitespace or control character.

Returns:

  • (Boolean)

    whether the string is a token



# File 'lib/tf-idf-similarity/token.rb', line 22

def valid?
  !self[%r{
    \A
      (
       \d           | # number
       [[:cntrl:]]  | # control character
       [[:punct:]]  | # punctuation
       [[:space:]]    # whitespace
      )+
    \z
  }x]
end
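
A minimal sketch of how #valid? classifies a few inputs. `SketchToken` is a hypothetical stand-in for `TfIdfSimilarity::Token` that copies only the method body shown above.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not the gem's class).
class SketchToken < SimpleDelegator
  def valid?
    # String#[] with a regexp returns the match or nil; negating gives a Boolean.
    !self[%r{
      \A
        (
         \d           | # number
         [[:cntrl:]]  | # control character
         [[:punct:]]  | # punctuation
         [[:space:]]    # whitespace
        )+
      \z
    }x]
  end
end

word    = SketchToken.new('hello').valid?   # => true
mixed   = SketchToken.new('mp3').valid?     # => true  (contains letters)
year    = SketchToken.new('2013').valid?    # => false (digits only)
dots    = SketchToken.new('...').valid?     # => false (punctuation only)
blank   = SketchToken.new(" \t").valid?     # => false (whitespace only)
numeric = SketchToken.new('1,000.5').valid? # => false (digits and punctuation)
```

A single letter is enough to make a token valid. Because the pattern requires at least one character, an empty string does not match and is therefore also reported as valid in this sketch.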