Class: TfIdfSimilarity::Token
- Inherits: SimpleDelegator
  - Object
  - SimpleDelegator
  - TfIdfSimilarity::Token
- Defined in: lib/tf-idf-similarity/token.rb
Instance Method Summary
- #classic_filter ⇒ Token
  Returns a string with no English possessive or periods in acronyms.
- #lowercase_filter ⇒ Token
  Returns a lowercase string.
- #valid? ⇒ Boolean
  Returns a falsy value if all its characters are numbers, punctuation, whitespace or control characters.
Instance Method Details
#classic_filter ⇒ Token
Returns a token with a trailing English possessive ('s) removed and all periods stripped, including the periods in acronyms.
    # File 'lib/tf-idf-similarity/token.rb', line 46
    def classic_filter
      self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
    end
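As a rough illustration of what this filter does to typical inputs, the sketch below copies the method body into a hypothetical stand-in class, `SketchToken` (not part of the gem), so it runs without the library installed. It assumes only Ruby's standard `delegate` library.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token, reduced to the one
# method documented above. The class name is an assumption for this sketch.
class SketchToken < SimpleDelegator
  def classic_filter
    # Remove every period, then strip one trailing possessive suffix.
    self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
  end
end

acronym    = SketchToken.new('U.S.A.').classic_filter
possessive = SketchToken.new("Ruby's").classic_filter
decimal    = SketchToken.new('3.14').classic_filter

puts acronym.to_s     # USA
puts possessive.to_s  # Ruby
puts decimal.to_s     # 314
```

The last case shows that `gsub('.', '')` removes every period in the token, not only those inside acronyms. The result is wrapped in a new token, so further filters can be chained onto it.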
#lowercase_filter ⇒ Token
Returns a lowercase string.
    # File 'lib/tf-idf-similarity/token.rb', line 37
    def lowercase_filter
      self.class.new(UnicodeUtils.downcase(self))
    end
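The following is a minimal sketch of the same idea without the `UnicodeUtils` dependency. It swaps in Ruby's built-in `String#downcase`, which has handled non-ASCII case mapping since Ruby 2.4, so it approximates the method above and is not a copy of it. `SketchToken` is a hypothetical stand-in class, not part of the gem.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token.
class SketchToken < SimpleDelegator
  def lowercase_filter
    # Assumption: String#downcase replaces UnicodeUtils.downcase here.
    self.class.new(downcase)
  end
end

lowered = SketchToken.new('ÉCOLE').lowercase_filter

puts lowered.to_s   # école
puts lowered.class  # SketchToken
```

Returning a new token instead of a bare string keeps the delegator wrapper, which is what lets the filters be chained.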
#valid? ⇒ Boolean
Note: Some implementations ignore one- and two-letter words.
Returns a falsy value if all its characters are numbers, punctuation, whitespace or control characters.
    # File 'lib/tf-idf-similarity/token.rb', line 19
    def valid?
      !self[%r{
        \A
        (
          \d          | # number
          [[:cntrl:]] | # control character
          [[:punct:]] | # punctuation
          [[:space:]]   # whitespace
        )+
        \z
      }x]
    end
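To make the rule concrete, the sketch below applies the same regular expression to a handful of sample tokens. `SketchToken` is a hypothetical stand-in class (not part of the gem) that copies the method body, and the sample inputs are made up for illustration.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token.
class SketchToken < SimpleDelegator
  def valid?
    # String#[] with a regexp returns the match or nil; negating it gives
    # false when the whole token is digits, control characters,
    # punctuation or whitespace, and true otherwise.
    !self[%r{
      \A
      (
        \d          | # number
        [[:cntrl:]] | # control character
        [[:punct:]] | # punctuation
        [[:space:]]   # whitespace
      )+
      \z
    }x]
  end
end

samples = ['2024', '...', " \t", 'b2b', 'cat'].map { |s| SketchToken.new(s) }
kept = samples.select(&:valid?).map(&:to_s)

p kept  # ["b2b", "cat"]
```

A token such as `b2b` is kept because the pattern is anchored at both ends: one letter anywhere in the token is enough to prevent a match.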