Class: TfIdfSimilarity::Token
- Inherits: String
- Hierarchy: Object → String → TfIdfSimilarity::Token
- Defined in: lib/tf-idf-similarity/token.rb
Overview
Note:
We can add more filters from Solr and stem using Porter's Snowball.
A token.
Instance Method Summary
- #classic_filter ⇒ Token
  Returns a string with no English possessive or periods in acronyms.
- #lowercase_filter ⇒ Token
  Returns a lowercase string.
- #valid? ⇒ Boolean
  Returns a falsy value if all its characters are numbers, punctuation, whitespace or control characters.
Instance Method Details
#classic_filter ⇒ Token
Returns a string with no English possessive or periods in acronyms.
# File 'lib/tf-idf-similarity/token.rb', line 48
def classic_filter
  self.class.new(self.gsub('.', '').chomp("'s"))
end
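As a rough illustration of what this filter does, here is a minimal sketch. `SketchToken` is a hypothetical stand-in for `TfIdfSimilarity::Token`, so the snippet runs without the gem installed; its method body is copied from the listing above.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Token (not part of the gem).
class SketchToken < String
  # Same logic as the documented method: remove every period,
  # then strip a trailing English possessive ('s).
  def classic_filter
    self.class.new(gsub('.', '').chomp("'s"))
  end
end

acronym    = SketchToken.new("U.S.A.").classic_filter
possessive = SketchToken.new("Ruby's").classic_filter
decimal    = SketchToken.new("3.14").classic_filter

puts acronym     # prints "USA"
puts possessive  # prints "Ruby"
puts decimal     # prints "314"
```

Note that `gsub('.', '')` removes every period, not only those in acronyms, so a decimal number loses its separator. The result is an instance of the same class as the receiver, which keeps the filters chainable.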
#lowercase_filter ⇒ Token
Returns a lowercase string.
# File 'lib/tf-idf-similarity/token.rb', line 36
def lowercase_filter
  self.class.new(defined?(UnicodeUtils) ? UnicodeUtils.downcase(self) : tr(
    "ÀÁÂÃÄÅĀĂĄÇĆĈĊČÐĎĐÈÉÊËĒĔĖĘĚĜĞĠĢĤĦÌÍÎÏĨĪĬĮĴĶĹĻĽĿŁÑŃŅŇŊÒÓÔÕÖØŌŎŐŔŖŘŚŜŞŠŢŤŦÙÚÛÜŨŪŬŮŰŲŴÝŶŸŹŻŽ",
    "àáâãäåāăąçćĉċčðďđèéêëēĕėęěĝğġģĥħìíîïĩīĭįĵķĺļľŀłñńņňŋòóôõöøōŏőŕŗřśŝşšţťŧùúûüũūŭůűųŵýŷÿźżž"
  ).downcase)
end
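The fallback branch (the one taken when `UnicodeUtils` is not loaded) can be sketched as follows. `SketchToken` is a hypothetical stand-in class, and the `UPPER`/`LOWER` tables are a deliberately abbreviated subset of the character lists above, chosen only for illustration.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Token (not part of the gem).
class SketchToken < String
  # Abbreviated, illustrative subset of the accented-letter tables.
  UPPER = "ÀÉÑÖÜ"
  LOWER = "àéñöü"

  # Map accented capitals with tr, then lowercase the remaining letters.
  def lowercase_filter
    self.class.new(tr(UPPER, LOWER).downcase)
  end
end

token = SketchToken.new("ÉCOLE Ñandú").lowercase_filter
puts token  # prints "école ñandú"
```

The explicit `tr` table matters on Ruby versions whose `String#downcase` only handles ASCII letters; on versions with Unicode-aware case mapping the final `downcase` would likely cover these characters on its own.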
#valid? ⇒ Boolean
Note:
Some implementations ignore one- and two-letter words.
Returns a falsy value if all its characters are numbers, punctuation, whitespace or control characters.
# File 'lib/tf-idf-similarity/token.rb', line 18
def valid?
  !self[%r{
    \A
    (
     \d           | # number
     [[:cntrl:]]  | # control character
     [[:punct:]]  | # punctuation
     [[:space:]]    # whitespace
    )+
    \z
  }x]
end
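A short sketch of how this check might be used to drop non-word tokens. `SketchToken` is a hypothetical stand-in class, and the regular expression is a compact, single-line form of the one in the listing above.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Token (not part of the gem).
class SketchToken < String
  # False when the whole string consists only of digits, control
  # characters, punctuation or whitespace; true otherwise.
  def valid?
    !self[/\A(\d|[[:cntrl:]]|[[:punct:]]|[[:space:]])+\z/]
  end
end

tokens = %w[the 42 -- cat's 3.14 !].map { |text| SketchToken.new(text) }
kept   = tokens.select(&:valid?)

puts kept.inspect  # prints ["the", "cat's"]
```

A token is rejected only when every character falls into one of the four classes, so a mixed token such as "cat's" is kept even though it contains punctuation.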