Class: Wuclan::Models::WordToken

Inherits:
TweetToken
Defined in:
lib/wuclan/twitter/model/tweet/tweet_token.rb

Constant Summary

Constants included from TweetRegexes

TweetRegexes::RE_ATSIGNS, TweetRegexes::RE_DOMAIN_HEAD, TweetRegexes::RE_DOMAIN_TLD, TweetRegexes::RE_HASHTAGS, TweetRegexes::RE_PLEASE, TweetRegexes::RE_RETWEET, TweetRegexes::RE_RETWEET_ONLY, TweetRegexes::RE_RETWEET_OR_VIA, TweetRegexes::RE_RETWEET_WORDS, TweetRegexes::RE_RTWHORE, TweetRegexes::RE_SMILIES, TweetRegexes::RE_SMILIES_EYES, TweetRegexes::RE_SMILIES_MOUTH, TweetRegexes::RE_SMILIES_NOSE, TweetRegexes::RE_URL, TweetRegexes::RE_URL_HOSTPART, TweetRegexes::RE_URL_OKCHARS, TweetRegexes::RE_URL_QUERYCHARS, TweetRegexes::RE_URL_SCHEME_STRICT, TweetRegexes::RE_URL_UNRESERVED

Class Method Summary

Methods inherited from TweetToken

#initialize, #num_key_fields, #numeric_id_fields

Constructor Details

This class inherits a constructor from Wuclan::Models::TweetToken

Class Method Details

.extract_tokens!(str) ⇒ Object

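A deliberately simple tokeniser.

Returns all words of three or more characters, lowercased. Word characters, @ and the apostrophes described below all count towards the length.

  • punctuation, including hyphens, separates words; @ is kept

  • a terminal 't or 's (as in “don’t” and “it’s”) stays attached to its word

  • FIXME: this doesn't leave str blank, as it should to behave like the other ! methods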
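The rules above can be tried out without loading Wuclan. The following is a minimal sketch, assuming only the behaviour shown in the source listing for this method; `extract_tokens_sketch` is a hypothetical standalone name, not part of the library.

```ruby
# Hypothetical standalone copy of the tokenising steps, for illustration only.
def extract_tokens_sketch(str)
  return [] unless str
  str = str.downcase
  str = str.gsub(/[^\w'@]+/, ' ')   # punctuation runs (hyphens too) become spaces
           .gsub(/'([st])\b/, '!\1') # protect a terminal 's or 't
           .gsub(/'/, ' ')           # every other apostrophe becomes a space
           .gsub(/!/, "'")           # restore the protected apostrophes
  # Split at whitespace and drop tokens shorter than three characters
  str.strip.split(/\s+/).reject { |w| w.length < 3 }
end

p extract_tokens_sketch("Don't panic, it's well-known!")
# => ["don't", "panic", "it's", "well", "known"]
p extract_tokens_sketch("I am @bob's pal")
# => ["@bob's", "pal"]
p extract_tokens_sketch(nil)
# => []
```

Note that "well-known" comes back as two tokens, and that "I" and "am" are dropped for being shorter than three characters.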



# File 'lib/wuclan/twitter/model/tweet/tweet_token.rb', line 66

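def self.extract_tokens!(str)
  return [] unless str
  str = str.downcase
  # Replace every run of characters other than word characters, apostrophes
  # and @ with a space. This includes hyphens, so hyphenated words are split.
  # Then protect a terminal 's or 't, turn the remaining apostrophes into
  # spaces, and restore the protected ones.
  str = str.gsub(/[^\w'@]+/, ' ').gsub(/'([st])\b/, '!\1').gsub(/'/, ' ').gsub(/!/, "'")
  # Split at whitespace
  words = str.strip.split(/\s+/)
  # Drop tokens shorter than three characters (this also covers empty ones)
  words.reject { |w| w.length < 3 }
end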
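
The FIXME follows from how Ruby treats the assignment `str = str.downcase`: it rebinds the local name to a new string and leaves the caller's string untouched. The sketch below contrasts that with an in-place approach. It assumes that "blank" in the FIXME means the empty string. Both helper names are hypothetical, and this is one possible way to get that behaviour, not the library's implementation.

```ruby
# Hypothetical helpers, not part of Wuclan.

# Rebinding: `str = ...` points the local name at a new string,
# so the string the caller passed in is left as it was.
def tokens_by_rebinding(str)
  str = str.downcase
  str.split(/\s+/)
end

# In place: String#replace overwrites the receiver's contents,
# so the caller's string is empty afterwards.
def tokens_and_clear!(str)
  tokens = str.downcase.split(/\s+/)
  str.replace('')
  tokens
end

text = 'Hello Twitter World'.dup  # dup gives an unfrozen copy of the literal
p tokens_by_rebinding(text)  # => ["hello", "twitter", "world"]
p text                       # => "Hello Twitter World"
p tokens_and_clear!(text)    # => ["hello", "twitter", "world"]
p text                       # => ""
```

`String#replace` raises on a frozen string, so a method written this way would need callers to pass a mutable string.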