Class: TfIdfSimilarity::Token
- Inherits: SimpleDelegator
  - Object
  - SimpleDelegator
  - TfIdfSimilarity::Token
- Defined in: lib/tf-idf-similarity/token.rb
Instance Method Summary
- #classic_filter ⇒ Token
  Returns a string with no English possessive or periods in acronyms.
- #lowercase_filter ⇒ Token
  Returns a lowercase string.
- #to_s ⇒ Object
- #valid? ⇒ Boolean
  Returns a falsy value if all its characters are numbers, punctuation, whitespace or control characters.
Instance Method Details
#classic_filter ⇒ Token
Returns a string with no English possessive or periods in acronyms.
    # File 'lib/tf-idf-similarity/token.rb', line 49
    def classic_filter
      self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
    end
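A minimal sketch of what #classic_filter does to a few inputs. `SketchToken` is a hypothetical stand-in for TfIdfSimilarity::Token, defined here only so the example runs without the gem; its method body is copied from the listing above.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not part of the gem).
class SketchToken < SimpleDelegator
  # Same body as the documented #classic_filter.
  def classic_filter
    self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
  end
end

acronym    = SketchToken.new('U.S.A.').classic_filter
possessive = SketchToken.new("O'Reilly's").classic_filter
decimal    = SketchToken.new('3.14').classic_filter

# Each result is a new SketchToken wrapping the filtered string.
[acronym, possessive, decimal].each { |token| puts token.__getobj__ }
```

Because `gsub('.', '')` takes a literal string pattern, it removes every period in the token, not only those in acronyms, so the decimal point in '3.14' is dropped as well. The possessive suffix is removed only at the very end of the token.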
#lowercase_filter ⇒ Token
Returns a lowercase string.
    # File 'lib/tf-idf-similarity/token.rb', line 40
    def lowercase_filter
      self.class.new(UnicodeUtils.downcase(self))
    end
#to_s ⇒ Object
    # File 'lib/tf-idf-similarity/token.rb', line 53
    def to_s
      # Don't call #lowercase_filter and #classic_filter to avoid creating unnecessary objects.
      UnicodeUtils.downcase(self).gsub('.', '').sub(/['`’]s\z/, '')
    end
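The following sketch illustrates how #to_s relates to the two filters: it should yield the same text as chaining them, but as a plain String and without building intermediate token objects. `SketchToken` is a hypothetical stand-in for the real class, and `String#downcase` is substituted for `UnicodeUtils.downcase` as an assumption so the example needs no extra gem.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not part of the gem).
class SketchToken < SimpleDelegator
  def lowercase_filter
    # Assumption: String#downcase stands in for UnicodeUtils.downcase.
    self.class.new(downcase)
  end

  def classic_filter
    self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
  end

  # Applies both transformations directly to the wrapped string.
  def to_s
    downcase.gsub('.', '').sub(/['`’]s\z/, '')
  end
end

token   = SketchToken.new("I.B.M.'s")
chained = token.lowercase_filter.classic_filter # two new SketchToken objects
direct  = token.to_s                            # one plain String

puts chained.__getobj__
puts direct
```

In this sketch the chained form returns a token that can be filtered further, while `to_s` returns a String. The original token is left unchanged in both cases.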
#valid? ⇒ Boolean
Note: Some implementations ignore one- and two-letter words.
Returns a falsy value if all its characters are numbers, punctuation, whitespace or control characters.
    # File 'lib/tf-idf-similarity/token.rb', line 22
    def valid?
      !self[%r{
        \A
        (
         \d           | # number
         [[:cntrl:]]  | # control character
         [[:punct:]]  | # punctuation
         [[:space:]]    # whitespace
        )+
        \z
      }x]
    end
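As a rough illustration of the rule above, the sketch below runs the same pattern, written on one line, over a handful of sample strings. `SketchToken` and the sample list are hypothetical and exist only for this example.

```ruby
require 'delegate'

# Hypothetical stand-in for TfIdfSimilarity::Token (not part of the gem).
class SketchToken < SimpleDelegator
  # Same pattern as the documented #valid?, without the extended-mode layout.
  def valid?
    !self[/\A(\d|[[:cntrl:]]|[[:punct:]]|[[:space:]])+\z/]
  end
end

samples = ['2024', '...', " \t", '3.14', 'hello', 'R2D2', '']

# Map each sample string to the result of #valid?.
results = samples.map { |text| [text, SketchToken.new(text).valid?] }.to_h

results.each { |text, ok| puts "#{text.inspect}: #{ok}" }
```

A token is rejected only when every character falls into one of the four classes, so 'R2D2' passes because it contains letters. Note that the empty string also passes in this sketch, since the `+` quantifier needs at least one character to match; callers that want to drop empty tokens would have to check for them separately.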