Class: Wuclan::Models::WordToken
- Inherits: TweetToken
  - Object
  - TweetToken
  - Wuclan::Models::WordToken
- Defined in:
- lib/wuclan/twitter/model/tweet/tweet_token.rb
Constant Summary
Constants included from TweetRegexes
TweetRegexes::RE_ATSIGNS, TweetRegexes::RE_DOMAIN_HEAD, TweetRegexes::RE_DOMAIN_TLD, TweetRegexes::RE_HASHTAGS, TweetRegexes::RE_PLEASE, TweetRegexes::RE_RETWEET, TweetRegexes::RE_RETWEET_ONLY, TweetRegexes::RE_RETWEET_OR_VIA, TweetRegexes::RE_RETWEET_WORDS, TweetRegexes::RE_RTWHORE, TweetRegexes::RE_SMILIES, TweetRegexes::RE_SMILIES_EYES, TweetRegexes::RE_SMILIES_MOUTH, TweetRegexes::RE_SMILIES_NOSE, TweetRegexes::RE_URL, TweetRegexes::RE_URL_HOSTPART, TweetRegexes::RE_URL_OKCHARS, TweetRegexes::RE_URL_QUERYCHARS, TweetRegexes::RE_URL_SCHEME_STRICT, TweetRegexes::RE_URL_UNRESERVED
Class Method Summary
- .extract_tokens!(str) ⇒ Object
This is pretty simpleminded.
Methods inherited from TweetToken
#initialize, #num_key_fields, #numeric_id_fields
Constructor Details
This class inherits a constructor from Wuclan::Models::TweetToken
Class Method Details
.extract_tokens!(str) ⇒ Object
This is pretty simpleminded.
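- Returns all words of three or more letters, lowercased.
- Punctuation other than apostrophes and @ is treated as a word break, so hyphenated words are split.
- A terminal 't or 's (as in "don't" and "it's") is tokenised together with its word; any other apostrophe is treated as a word break.
- FIXME: this doesn't leave str blank, as it should to behave like the other ! methods.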
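The FIXME above comes down to the difference between rebinding a local variable and mutating the caller's string. The sketch below illustrates that difference only. Both helper names are hypothetical and not part of Wuclan, and it assumes (the page does not show this) that "behaving like the other ! methods" means leaving the argument empty after extraction.

```ruby
# Hypothetical helpers for illustration; neither exists in Wuclan.

# Mirrors the pattern used by extract_tokens!: `str = ...` rebinds the
# local name, so the caller's String object is never modified.
def rebinding_extract(str)
  str = str.downcase
  str.split(/\s+/)
end

# One possible way to consume the input: String#replace mutates the
# receiver in place, so the caller sees an empty string afterwards.
def consuming_extract!(str)
  words = str.downcase.split(/\s+/)
  str.replace('')
  words
end

original = String.new("Hello World")
rebinding_extract(original)
puts original.inspect   # the caller's string is unchanged

consuming_extract!(original)
puts original.inspect   # the caller's string is now empty
```

Note that String#replace raises on a frozen string, so a consuming variant would need callers to pass a mutable string.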
# File 'lib/wuclan/twitter/model/tweet/tweet_token.rb', line 66

def self.extract_tokens! str
  return [] unless str
  str = str.downcase;
  # kill off all punctuation except 's
  # this includes hyphens (words are split)
  str = str.gsub(/[^\w\'@]+/, ' ').gsub(/\'([st])\b/, '!\1').gsub(/\'/, ' ').gsub(/!/, "'")
  # Busticate at whitespace
  words = str.strip.split(/\s+/)
  #
  words.reject{|w| w.blank? || (w.length < 3) }
end
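The substitution chain in the method is easier to follow one step at a time. The following is a standalone sketch of the same four gsub passes, not the library's code: the names normalize_for_tokens and sketch_tokens are made up for this example, and String#empty? stands in for blank? (which plain Ruby does not provide) so that the snippet runs without extra libraries.

```ruby
# Illustrative re-statement of the substitution chain; names are hypothetical.
def normalize_for_tokens(str)
  str.downcase
     .gsub(/[^\w'@]+/, ' ')     # runs of anything except word chars, ' and @ become one space
     .gsub(/'([st])\b/, '!\1')  # hide a word-final 's or 't behind a "!" placeholder
     .gsub(/'/, ' ')            # every remaining apostrophe becomes a space
     .gsub(/!/, "'")            # turn the placeholders back into apostrophes
end

def sketch_tokens(str)
  return [] unless str
  normalize_for_tokens(str)
    .strip
    .split(/\s+/)
    .reject { |w| w.empty? || w.length < 3 }  # drop words shorter than three characters
end

p sketch_tokens("Don't re-tweet it's @bob's 'quoted' stuff!")
# => ["don't", "tweet", "it's", "@bob's", "quoted", "stuff"]
```

Using "!" as the placeholder is safe here because the first pass has already replaced every literal "!" in the input with a space. The "re" from "re-tweet" disappears because the hyphen splits the word and the fragment is shorter than three characters.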