Module: Wuclan::Models::TweetRegexes

Included in:
TweetToken
Defined in:
lib/wuclan/twitter/model/tweet/tweet_regexes.rb

Constant Summary

RE_DOMAIN_HEAD =

Twitter accepts URLs somewhat idiosyncratically, probably for good reason: we rarely see ()![] in URLs, and in a status those characters are more likely punctuation.

This is what I’ve reverse engineered.

Notes:

  • is.gd uses a trailing ‘-’ (to indicate ‘preview mode’): clever.

  • pastoid.com uses a trailing ‘+’, and idek.net a trailing ~ for no reason. annoying.

Counterexamples:

'(?:[a-zA-Z0-9\-]+\.)+'
RE_DOMAIN_TLD =
'(?:com|org|net|edu|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-zA-Z]{2})'
RE_URL_SCHEME_STRICT =

RE_URL_SCHEME = '[a-zA-Z]+'

'[a-zA-Z]{3,6}'
RE_URL_UNRESERVED =
'a-zA-Z0-9'   + '\-\._~'
RE_URL_OKCHARS =

Deliberately not included: ! $ & ( ) * [ ] |

RE_URL_UNRESERVED + '\'\+\,\;=' + '/%:@'
RE_URL_QUERYCHARS =
RE_URL_OKCHARS    + '&='
RE_URL_HOSTPART =
"#{RE_URL_SCHEME_STRICT}://#{RE_DOMAIN_HEAD}#{RE_DOMAIN_TLD}"
RE_URL =
%r{(
          #{RE_URL_HOSTPART}                   # Host
     (?:(?: \/ [#{RE_URL_OKCHARS}]+?          )*?    # path:  / delimited path segments
  (?: \/ [#{RE_URL_OKCHARS}]*[\w\-\+\~] )      #        where the last one ends in a non-punctuation.
 |                                             #        ... or no path segment
                                        )\/?   #        with an optional trailing slash
  (?: \? [#{RE_URL_QUERYCHARS}]+  )?           # query: introduced by a ?, with &foo= delimited segments
  (?: \# [#{RE_URL_OKCHARS}]+     )?           # frag:  introduced by a #
)}x
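
As a rough illustration of how the URL pieces above combine, the sketch below copies the constants from this page and pulls the first URL out of a few made-up statuses. The helper name `first_url` and the sample strings are assumptions for the example, not part of the library.

```ruby
# Constants copied from this page; helper and sample statuses are hypothetical.
RE_DOMAIN_HEAD       = '(?:[a-zA-Z0-9\-]+\.)+'
RE_DOMAIN_TLD        = '(?:com|org|net|edu|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-zA-Z]{2})'
RE_URL_SCHEME_STRICT = '[a-zA-Z]{3,6}'
RE_URL_UNRESERVED    = 'a-zA-Z0-9' + '\-\._~'
RE_URL_OKCHARS       = RE_URL_UNRESERVED + '\'\+\,\;=' + '/%:@'
RE_URL_QUERYCHARS    = RE_URL_OKCHARS + '&='
RE_URL_HOSTPART      = "#{RE_URL_SCHEME_STRICT}://#{RE_DOMAIN_HEAD}#{RE_DOMAIN_TLD}"
RE_URL = %r{(
  #{RE_URL_HOSTPART}                           # host
  (?:(?: \/ [#{RE_URL_OKCHARS}]+? )*?          # path segments
     (?: \/ [#{RE_URL_OKCHARS}]*[\w\-\+\~] )   # last one ends in a non-punctuation
   |                                           # ... or no path at all
  )\/?
  (?: \? [#{RE_URL_QUERYCHARS}]+ )?            # query
  (?: \# [#{RE_URL_OKCHARS}]+ )?               # fragment
)}x

# Hypothetical helper: return the first URL in a status, or nil.
def first_url(status)
  status[RE_URL, 1]
end

comma_url   = first_url("see http://example.com/foo/bar.html, it rocks")
preview_url = first_url("preview at http://is.gd/abc-.")
query_url   = first_url("http://example.com/search?q=ruby&lang=en now")
no_url      = first_url("no link here")

puts comma_url    # http://example.com/foo/bar.html
puts preview_url  # http://is.gd/abc-
puts query_url    # http://example.com/search?q=ruby&lang=en
```

The trailing comma and period are left out because the last path segment must end in a word character, `-`, `+` or `~`, while the is.gd-style trailing `-` is kept. Note that this rule applies to the path only: punctuation directly after a fragment would still be swallowed.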
RE_HASHTAGS =

A hash that follows a non-alphanum_ character (or the start of the line), followed by one or more of alpha, num, -_.+:= and ending in an alphanum_.

This is overly generous to those dorky triple tags (geo:lat=69.3), but we’ll soldier on somehow.

%r{(?:^|\W)\#([a-zA-Z0-9\-_\.+:=]+\w)(?:\W|$)}
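
A small sketch of the hashtag pattern in use, assuming it is applied with `String#scan`; the sample statuses are made up for illustration.

```ruby
# Pattern copied from this page; sample statuses are hypothetical.
RE_HASHTAGS = %r{(?:^|\W)\#([a-zA-Z0-9\-_\.+:=]+\w)(?:\W|$)}

status = "Loving #ruby and #geo:lat=69.3 today"
tags = status.scan(RE_HASHTAGS).flatten
p tags  # ["ruby", "geo:lat=69.3"]

# The trailing (?:\W|$) consumes the delimiter, so with scan a tag that
# follows another after a single space is not picked up.
adjacent = "#aa #bb".scan(RE_HASHTAGS).flatten
p adjacent  # ["aa"]
```

The second call shows a consequence of matching the surrounding delimiters rather than using lookarounds: callers who need every tag in a run of adjacent tags may want to pad the separators or rescan the remainder.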
RE_RETWEET_WORDS =

Retweets and Retweet Whores

See ARetweetsB for more info.

A retweet

RT @interesting_user Something so witty Dorothy Parker would just give up
Oh yeah and so's your mom (via @sixth_grader)
retweeting @ogre: KEGGER TONITE RT pls
  ^^^ this is not a rtwhore; it matches first as a retweet

and rtwhores

retweet please: Hey here's something I'm whoring xxx
KEGGER TONITE RT pls

or semantically-incorrect matches such as (actual example):

@somebody lol, love the 'please retweet' ending!

Things that don’t match:

retweet is silly, @i_think_youre_dumb
 misspell the name of my Sony Via
'rt|retweet|retweeting'
RE_RETWEET_ONLY =
%r{(?:#{RE_RETWEET_WORDS})}
RE_RETWEET_OR_VIA =
%r{(?:#{RE_RETWEET_WORDS}|via|from)}
RE_PLEASE =
%r{(?:please|plz|pls)}
RE_RETWEET =
%r{\b#{RE_RETWEET_OR_VIA}\W*@(\w+)\b}i
RE_RTWHORE =
%r{
  \b#{RE_RETWEET_ONLY}\W*#{RE_PLEASE}\b
| \b#{RE_PLEASE}\W*#{RE_RETWEET_ONLY}\b}ix
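
The ordering described above (retweet first, then rtwhore) can be sketched as follows. The `classify` helper is hypothetical and only illustrates one way to apply the two patterns. It lowercases the status first: interpolating a Regexp object embeds that object's own flags as `(?-mix:...)`, so with the definitions as listed on this page the trailing `i` does not reach the retweet words.

```ruby
# Constants copied from this page; classify is a hypothetical helper.
RE_RETWEET_WORDS  = 'rt|retweet|retweeting'
RE_RETWEET_ONLY   = %r{(?:#{RE_RETWEET_WORDS})}
RE_RETWEET_OR_VIA = %r{(?:#{RE_RETWEET_WORDS}|via|from)}
RE_PLEASE         = %r{(?:please|plz|pls)}
RE_RETWEET        = %r{\b#{RE_RETWEET_OR_VIA}\W*@(\w+)\b}i
RE_RTWHORE        = %r{
  \b#{RE_RETWEET_ONLY}\W*#{RE_PLEASE}\b
| \b#{RE_PLEASE}\W*#{RE_RETWEET_ONLY}\b}ix

# Try the retweet pattern first, then fall back to the rtwhore pattern.
def classify(status)
  text = status.downcase
  if (m = RE_RETWEET.match(text))
    [:retweet, m[1]]
  elsif RE_RTWHORE.match?(text)
    [:rtwhore, nil]
  else
    [:none, nil]
  end
end

p classify("RT @interesting_user Something so witty")           # [:retweet, "interesting_user"]
p classify("retweeting @ogre: KEGGER TONITE RT pls")            # [:retweet, "ogre"]
p classify("KEGGER TONITE RT pls")                              # [:rtwhore, nil]
p classify("@somebody lol, love the 'please retweet' ending!")  # [:rtwhore, nil]
p classify("retweet is silly, @i_think_youre_dumb")             # [:none, nil]

# Without lowercasing, the embedded case-sensitive group does not match "RT".
uppercase_match = RE_RETWEET.match("RT @interesting_user")
```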
RE_ATSIGNS =

Matches an @ that follows either the start of the line or a non-alphanum_ character, capturing the string of [a-zA-Z0-9_] that follows it.

Note carefully: we demand a preceding character (or start of line): \b would match inside an email address, which we don’t want.

Making an exception for RT@im_cramped_for_space.

All retweets

%r{(?:^|\W|#{RE_RETWEET_OR_VIA})@(\w+)\b}
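
To make the note about `\b` concrete, the sketch below compares the pattern from this page with a naive `\b@` version on a made-up status. The email address and user names are placeholders chosen for the example.

```ruby
# Constants copied from this page; the status text is hypothetical.
RE_RETWEET_WORDS  = 'rt|retweet|retweeting'
RE_RETWEET_OR_VIA = %r{(?:#{RE_RETWEET_WORDS}|via|from)}
RE_ATSIGNS        = %r{(?:^|\W|#{RE_RETWEET_OR_VIA})@(\w+)\b}

status = "hey @alice, mail bob@example.com or rt@im_cramped_for_space"

mentions = status.scan(RE_ATSIGNS).flatten
p mentions  # ["alice", "im_cramped_for_space"]

# Naive alternative for comparison: a word boundary before the @ picks up the
# email domain and misses the ordinary mention.
naive = status.scan(/\b@(\w+)/).flatten
p naive     # ["example", "im_cramped_for_space"]
```

Because the retweet-word alternative has no leading `\b`, an address whose local part happens to end in one of those words (for example in `rt`) would still be picked up, so this is a heuristic rather than a strict filter.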
RE_SMILIES_EYES =

Smilies !!! ^_^

"\\:8;"
RE_SMILIES_NOSE =
"\\-=\\*o"
RE_SMILIES_MOUTH =
"DP@Oo\\(\\)\\[\\]\\|\\{\\}\\/\\\\"
RE_SMILIES =
%r{
 (?:^|\W)                       # non-smilie character
 ( (?:
     >?
     [#{RE_SMILIES_EYES}]       # eyes
     [#{RE_SMILIES_NOSE}]?      # nose, maybe
     [#{RE_SMILIES_MOUTH}] )    # mouth
  |(?:
     [#{RE_SMILIES_MOUTH}]      # mouth
     [#{RE_SMILIES_NOSE}]?      # nose, maybe
     [#{RE_SMILIES_EYES}]       # eyes
     <? )
  |(?: =[#{RE_SMILIES_MOUTH}])  # e.g. =)
  |(?: [#{RE_SMILIES_MOUTH}]=)  # e.g. (=
  |(?: \^[_\-]\^ )              # kawaaaaiiii!
  |(?: :[,\']\( )               # snif
  |(?: <3 )                     # heart
  |(?: \\m/ )                   # rawk
  |(?: x-\( )                   # dead
 )
 (?:\W|$)
}x
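
A short sketch of the smiley pattern applied with `String#scan`. The constants are copied from this page with the regex layout condensed; the sample status is made up.

```ruby
# Constants copied from this page; the status text is hypothetical.
RE_SMILIES_EYES  = "\\:8;"
RE_SMILIES_NOSE  = "\\-=\\*o"
RE_SMILIES_MOUTH = "DP@Oo\\(\\)\\[\\]\\|\\{\\}\\/\\\\"
RE_SMILIES = %r{
 (?:^|\W)                                  # non-smilie character
 ( (?: >? [#{RE_SMILIES_EYES}] [#{RE_SMILIES_NOSE}]? [#{RE_SMILIES_MOUTH}] )
  |(?: [#{RE_SMILIES_MOUTH}] [#{RE_SMILIES_NOSE}]? [#{RE_SMILIES_EYES}] <? )
  |(?: =[#{RE_SMILIES_MOUTH}])
  |(?: [#{RE_SMILIES_MOUTH}]=)
  |(?: \^[_\-]\^ )
  |(?: :[,\']\( )
  |(?: <3 )
  |(?: \\m/ )
  |(?: x-\( )
 )
 (?:\W|$)
}x

status = "good morning :-) i <3 ruby ^_^"
smilies = status.scan(RE_SMILIES).flatten
p smilies  # [":-)", "<3", "^_^"]
```

As with the hashtag pattern, the surrounding `\W` characters are consumed by each match, so two smilies separated by a single space may not both be returned by one `scan`.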