Module: Wuclan::Models::TweetRegexes
- Included in:
- TweetToken
- Defined in:
- lib/wuclan/twitter/model/tweet/tweet_regexes.rb
Constant Summary
- RE_DOMAIN_HEAD =
Twitter accepts URLs somewhat idiosyncratically, probably for good reason: we rarely see ()![] in URLs, and in a status they are more likely to be punctuation.
This is what I’ve reverse engineered.
Notes:
- is.gd uses a trailing ‘-’ (to indicate ‘preview mode’): clever.
- pastoid.com uses a trailing ‘+’, and idek.net a trailing ‘~’, for no reason. Annoying.
Counterexamples:
-
'(?:[a-zA-Z0-9\-]+\.)+'
- RE_DOMAIN_TLD =
'(?:com|org|net|edu|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-zA-Z]{2})'
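To make the effect of these two fragments concrete, here is a minimal sketch that anchors them and tries a few domain names. `DOMAIN_ONLY` is a hypothetical helper introduced only for this illustration; it is not part of the module.

```ruby
# Values copied from the constant summary above.
RE_DOMAIN_HEAD = '(?:[a-zA-Z0-9\-]+\.)+'
RE_DOMAIN_TLD  = '(?:com|org|net|edu|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-zA-Z]{2})'

# DOMAIN_ONLY is a hypothetical helper for this sketch, not part of the module.
# It anchors the two fragments so the whole string must be a domain name.
DOMAIN_ONLY = /\A#{RE_DOMAIN_HEAD}#{RE_DOMAIN_TLD}\z/

DOMAIN_ONLY.match?('is.gd')            # => true  (any two-letter TLD is accepted)
DOMAIN_ONLY.match?('www.example.com')  # => true
DOMAIN_ONLY.match?('localhost')        # => false (at least one dot-terminated label is required)
DOMAIN_ONLY.match?('example.travel')   # => false (longer TLDs must be in the explicit list)
```

The last case suggests that any generic TLD outside the listed ones would be skipped by the URL patterns built on top of these fragments.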
- RE_URL_SCHEME_STRICT =
Looser alternative, not defined as a constant here: RE_URL_SCHEME = '[a-zA-Z]+'
'[a-zA-Z]{3,6}'
- RE_URL_UNRESERVED =
'a-zA-Z0-9' + '\-\._~'
- RE_URL_OKCHARS =
Excludes the characters ! $ & ( ) * [ ] |
RE_URL_UNRESERVED + '\'\+\,\;=' + '/%:@'
- RE_URL_QUERYCHARS =
RE_URL_OKCHARS + '&='
- RE_URL_HOSTPART =
"#{RE_URL_SCHEME_STRICT}://#{RE_DOMAIN_HEAD}#{RE_DOMAIN_TLD}"
- RE_URL =
%r{(
  #{RE_URL_HOSTPART}                            # Host
  (?:(?: \/ [#{RE_URL_OKCHARS}]+? )*?           # path: / delimited path segments
     (?: \/ [#{RE_URL_OKCHARS}]*[\w\-\+\~] )    # where the last one ends in a non-punctuation.
     |                                          # ... or no path segment
  )\/?                                          # with an optional trailing slash
  (?: \? [#{RE_URL_QUERYCHARS}]+ )?             # query: introduced by a ?, with &foo= delimited segments
  (?: \# [#{RE_URL_OKCHARS}]+ )?                # frag: introduced by a #
)}x
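The following sketch shows how the assembled pattern treats trailing punctuation in a status. The constants are copied from this page so the snippet runs on its own; the sample text and the `urls` variable are made up for illustration.

```ruby
# Constants copied from the summary above so the snippet is self-contained.
RE_DOMAIN_HEAD       = '(?:[a-zA-Z0-9\-]+\.)+'
RE_DOMAIN_TLD        = '(?:com|org|net|edu|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-zA-Z]{2})'
RE_URL_SCHEME_STRICT = '[a-zA-Z]{3,6}'
RE_URL_UNRESERVED    = 'a-zA-Z0-9' + '\-\._~'
RE_URL_OKCHARS       = RE_URL_UNRESERVED + '\'\+\,\;=' + '/%:@'
RE_URL_QUERYCHARS    = RE_URL_OKCHARS + '&='
RE_URL_HOSTPART      = "#{RE_URL_SCHEME_STRICT}://#{RE_DOMAIN_HEAD}#{RE_DOMAIN_TLD}"
RE_URL = %r{(
  #{RE_URL_HOSTPART}                            # host
  (?:(?: \/ [#{RE_URL_OKCHARS}]+? )*?           # path segments
     (?: \/ [#{RE_URL_OKCHARS}]*[\w\-\+\~] )    # last segment ends in a non-punctuation character
     |                                          # or no path at all
  )\/?                                          # optional trailing slash
  (?: \? [#{RE_URL_QUERYCHARS}]+ )?             # query string
  (?: \# [#{RE_URL_OKCHARS}]+ )?                # fragment
)}x

# Made-up status text: a trailing comma, a closing paren and a trailing "!".
text = 'Check http://example.com/foo/bar.html, then (see http://is.gd/abc-) ' \
       'and http://example.com/search?q=ruby&lang=en!'

# scan returns one single-element array per match (one capture group).
urls = text.scan(RE_URL).flatten
# => ["http://example.com/foo/bar.html", "http://is.gd/abc-",
#     "http://example.com/search?q=ruby&lang=en"]
```

The comma, the closing parenthesis and the exclamation mark are left out of the matches, while the trailing ‘-’ of the is.gd preview link is kept.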
- RE_HASHTAGS =
A hash sign following a non-alphanum_ character (or at the start of the line), followed by any number of alpha, num, or -_.+:= characters, and ending in an alphanum_ character.
This is overly generous to those dorky triple tags (geo:lat=69.3), but we’ll soldier on somehow.
%r{(?:^|\W)\#([a-zA-Z0-9\-_\.+:=]+\w)(?:\W|$)}
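A short sketch of the hashtag pattern in use, with the constant copied from above. The sample strings are invented. The third call illustrates a consequence of the pattern consuming the separator after each tag: when it is used with `String#scan`, a tag that directly follows another tag after a single space is not picked up.

```ruby
# Copied from the constant summary above.
RE_HASHTAGS = %r{(?:^|\W)\#([a-zA-Z0-9\-_\.+:=]+\w)(?:\W|$)}

# Ordinary tags and a "triple tag" are both captured.
'Loving #ruby and #geo:lat=69.3 today'.scan(RE_HASHTAGS).flatten
# => ["ruby", "geo:lat=69.3"]

# A hash in the middle of a word is not a tag.
'a#b c'.scan(RE_HASHTAGS).flatten
# => []

# The trailing (?:\W|$) consumes the space, so the next tag has no
# preceding separator left to match against.
'#foo #bar'.scan(RE_HASHTAGS).flatten
# => ["foo"]
```

Because the capture is `[...]+\w`, a tag also needs at least two characters after the hash to match.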
- RE_RETWEET_WORDS =
Retweets and Retweet Whores
See ARetweetsB for more info.
A retweet:
  RT @interesting_user Something so witty Dorothy Parker would just give up
  Oh yeah and so's your mom (via @sixth_grader)
  retweeting @ogre: KEGGER TONITE RT pls
    ^^^ this is not a rtwhore; it matches first as a retweet
and rtwhores:
  retweet please: Hey here's something I'm whoring xxx
  KEGGER TONITE RT pls
or semantically incorrect matches such as (actual example):
  @somebody lol, love the 'please retweet' ending!
Things that don’t match:
  retweet is silly, @i_think_youre_dumb
  misspell the name of my Sony Via
'rt|retweet|retweeting'
- RE_RETWEET_ONLY =
%r{(?:#{RE_RETWEET_WORDS})}
- RE_RETWEET_OR_VIA =
%r{(?:#{RE_RETWEET_WORDS}|via|from)}
- RE_PLEASE =
%r{(?:please|plz|pls)}
- RE_RETWEET =
%r{\b#{RE_RETWEET_OR_VIA}\W*@(\w+)\b}i
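The sketch below applies the retweet pattern to a few of the documented examples, with the constants copied exactly as listed on this page. One caveat worth checking against the real module: Ruby interpolates a `Regexp` object through `Regexp#to_s`, which embeds that object's own flags as `(?-mix:...)`, so the outer `i` flag does not reach the interpolated word list. `RE_RETWEET_CI` is a hypothetical variant, not part of the module, that interpolates the plain string instead.

```ruby
# Copied from the constant summary above.
RE_RETWEET_WORDS  = 'rt|retweet|retweeting'
RE_RETWEET_OR_VIA = %r{(?:#{RE_RETWEET_WORDS}|via|from)}
RE_RETWEET        = %r{\b#{RE_RETWEET_OR_VIA}\W*@(\w+)\b}i

"Oh yeah and so's your mom (via @sixth_grader)"[RE_RETWEET, 1]  # => "sixth_grader"
'retweeting @ogre: KEGGER TONITE RT pls'[RE_RETWEET, 1]         # => "ogre"
'misspell the name of my Sony Via'[RE_RETWEET, 1]               # => nil

# The interpolated Regexp keeps its own (case-sensitive) flags:
RE_RETWEET.source
# => "\\b(?-mix:(?:rt|retweet|retweeting|via|from))\\W*@(\\w+)\\b"
'RT @interesting_user Something witty'[RE_RETWEET, 1]           # => nil

# Hypothetical variant: interpolate the String, so /i covers the words too.
RE_RETWEET_CI = %r{\b(?:#{RE_RETWEET_WORDS}|via|from)\W*@(\w+)\b}i
'RT @interesting_user Something witty'[RE_RETWEET_CI, 1]        # => "interesting_user"
```

If the constants are defined exactly as shown, upper-case `RT` would only match with something like the second form.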
- RE_RTWHORE =
%r{ \b#{RE_RETWEET_ONLY}\W*#{RE_PLEASE}\b | \b#{RE_PLEASE}\W*#{RE_RETWEET_ONLY}\b}ix
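As a rough illustration of the two alternations (retweet word then a please word, or the reverse), the sketch below runs the pattern over lower-case variants of the documented examples. The constants are copied as listed. Since the sub-patterns are interpolated as `Regexp` objects, Ruby embeds their own flags and the outer `i` flag does not apply inside them, which the last line demonstrates.

```ruby
# Copied from the constant summary above.
RE_RETWEET_WORDS = 'rt|retweet|retweeting'
RE_RETWEET_ONLY  = %r{(?:#{RE_RETWEET_WORDS})}
RE_PLEASE        = %r{(?:please|plz|pls)}
RE_RTWHORE       = %r{
    \b#{RE_RETWEET_ONLY}\W*#{RE_PLEASE}\b
  | \b#{RE_PLEASE}\W*#{RE_RETWEET_ONLY}\b}ix

# Retweet word followed by a "please" word.
"retweet please: Hey here's something"[RE_RTWHORE]                 # => "retweet please"
'kegger tonite rt pls'[RE_RTWHORE]                                 # => "rt pls"

# "Please" word followed by a retweet word (the semantically incorrect case).
"@somebody lol, love the 'please retweet' ending!"[RE_RTWHORE]     # => "please retweet"

# With the constants exactly as listed, upper-case RT is not recognised.
'KEGGER TONITE RT pls'[RE_RTWHORE]                                 # => nil
```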
- RE_ATSIGNS =
An @ sign that follows either the start of the line or a non-alphanum_ character; the capture is the string of [a-zA-Z0-9_] characters after it.
Note carefully: we demand a preceding character (or start of line): \b would match the @ inside an email address, which we don’t want.
Making an exception for RT@im_cramped_for_space.
All retweets
%r{(?:^|\W|#{RE_RETWEET_OR_VIA})@(\w+)\b}
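The sketch below exercises the at-sign pattern on invented sample strings, with the constants copied as listed. Note that the pattern as shown carries no `i` flag, so the cramped-retweet exception applies to the lower-case words only.

```ruby
# Copied from the constant summary above.
RE_RETWEET_WORDS  = 'rt|retweet|retweeting'
RE_RETWEET_OR_VIA = %r{(?:#{RE_RETWEET_WORDS}|via|from)}
RE_ATSIGNS        = %r{(?:^|\W|#{RE_RETWEET_OR_VIA})@(\w+)\b}

# Mentions at the start of the line, after a space, and after a paren.
'@alice thanks! cc @bob_2 (and @carol)'.scan(RE_ATSIGNS).flatten
# => ["alice", "bob_2", "carol"]

# The @ in an email address follows a word character, so it is skipped.
'mail someone@example.com'.scan(RE_ATSIGNS).flatten
# => []

# The cramped-retweet exception, in lower case.
'rt@im_cramped_for_space'.scan(RE_ATSIGNS).flatten
# => ["im_cramped_for_space"]

# No /i flag on this pattern, so the upper-case form is not covered.
'RT@im_cramped_for_space'.scan(RE_ATSIGNS).flatten
# => []
```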
- RE_SMILIES_EYES =
Smilies !!! ^_^
"\\:8;"
- RE_SMILIES_NOSE =
"\\-=\\*o"
- RE_SMILIES_MOUTH =
"DP@Oo\\(\\)\\[\\]\\|\\{\\}\\/\\\\"
- RE_SMILIES =
%r{
  (?:^|\W)                        # non-smilie character
  ((?:
    >?
    [#{RE_SMILIES_EYES}]          # eyes
    [#{RE_SMILIES_NOSE}]?         # nose, maybe
    [#{RE_SMILIES_MOUTH}] )       # mouth
  |(?:
    [#{RE_SMILIES_MOUTH}]         # mouth
    [#{RE_SMILIES_NOSE}]?         # nose, maybe
    [#{RE_SMILIES_EYES}]          # eyes
    <? )
  |(?: =[#{RE_SMILIES_MOUTH}])    # =) (=
  |(?: [#{RE_SMILIES_MOUTH}]=)    # =) (=
  |(?: \^[_\-]\^ )                # kawaaaaiiii!
  |(?: :[,\']\( )                 # snif
  |(?: <3 )                       # heart
  |(?: \\m/ )                     # rawk
  |(?: x-\( )                     # dead
  )
  (?:\W|$)
}x
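To show how the eyes, nose and mouth classes combine, the sketch below rebuilds the pattern from the constants on this page and scans a few invented strings. The requirement for a preceding non-word character (or line start) is what keeps the `:/` in a URL from being read as a smiley.

```ruby
# Copied from the constant summary above.
RE_SMILIES_EYES  = "\\:8;"
RE_SMILIES_NOSE  = "\\-=\\*o"
RE_SMILIES_MOUTH = "DP@Oo\\(\\)\\[\\]\\|\\{\\}\\/\\\\"
RE_SMILIES = %r{
  (?:^|\W)                        # separator before the smiley
  ((?: >? [#{RE_SMILIES_EYES}] [#{RE_SMILIES_NOSE}]? [#{RE_SMILIES_MOUTH}] )     # eyes, nose?, mouth
  |(?: [#{RE_SMILIES_MOUTH}] [#{RE_SMILIES_NOSE}]? [#{RE_SMILIES_EYES}] <? )     # mirrored form
  |(?: =[#{RE_SMILIES_MOUTH}])    # equals sign as eyes
  |(?: [#{RE_SMILIES_MOUTH}]=)    # mirrored equals form
  |(?: \^[_\-]\^ )                # kawaii
  |(?: :[,\']\( )                 # crying
  |(?: <3 )                       # heart
  |(?: \\m/ )                     # rock hand
  |(?: x-\( )                     # dead
  )
  (?:\W|$)
}x

'great day :-) love it <3'.scan(RE_SMILIES).flatten   # => [":-)", "<3"]
'^_^ so happy ;)'.scan(RE_SMILIES).flatten            # => ["^_^", ";)"]
'bye (-:'.scan(RE_SMILIES).flatten                    # => ["(-:"]
'see http://example.com'.scan(RE_SMILIES).flatten     # => []
```

As with the other patterns that consume a trailing separator, two smilies separated by a single space would likely yield only the first one when collected with `String#scan`.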