Class: TwitterFriends::Scrape::ExpandedUrl
- Inherits: Struct
  - Object
  - Struct
  - TwitterFriends::Scrape::ExpandedUrl
- Defined in: lib/wuclan/shorturl/shorturl_request.rb
Constant Summary
- RE_URL_SANE_CHARS =
These are all the characters that belong in a URL
Addressable::URI::CharacterClasses::UNRESERVED + Addressable::URI::CharacterClasses::RESERVED + '%'
- RE_URL_ILLEGAL_BUT_WHATEVER_DOOD_CHARS =
These are illegal but are found in URLs. We’re going to let them through. Note that ‘ ’ space is one of the tolerated miscreants.
'\{\}\|\^\` '
- TINY_URLISHES_RE =
The major shortening services
Do any of the mainstream shorteners use in-band characters besides \w alphanumerics and - dash? (idek.net uses a ~ and pastoid.com a +, but they are not popular enough to justify the annoyance of allowing extra chars.)
%r{\Ahttp://(
  | tinyurl.com # 4969626
  | is.gd # 406718
  | bit.ly # 298590
  | twurl.nl # 169796
  | snipurl.com # 107961
  | tr.im # 38793
  | snurl.com # 37576
  | snipr.com # 26897
  | jijr.com # 20965
  | cli.gs # 19700
  | budurl.com # 19402
  | xrl.us # 11621
  # | tiny.cc # 9140 # tiny.cc borks fetcher
  | zi.ma # 8148
  | s3nt.com # 6922
  | ow.ly # 6848
  | poprl.com # 6666
  | piurl.com # 5262
  | ur1.ca # 4435
  | short.to # 4105
  | urlenco.de # 4087
  | zz.gd # 4045
  | rubyurl.com # 3766
  | uris.jp # 2749
  | ub0.cc # 2607
  | twurl.cc # 2545
  | moourl.com # 2280
  | rurl.org # 2271
  | url.ie # 2156
  )/([\w\-]+)}ix
Instance Attribute Summary
- #dest_url ⇒ Object
  Returns the value of attribute dest_url.
- #scraped_at ⇒ Object
  Returns the value of attribute scraped_at.
- #src_url ⇒ Object
  Returns the value of attribute src_url.
Class Method Summary
- .match_tinyurlish(url) ⇒ Object
- .new_if_tinyurlish(url) ⇒ Object
  If the base part looks like a tinyurlish, return an instantiated object. Otherwise, return nil.
- .scrub_url(url) ⇒ Object
  Replace all url-insane characters by their %encoding.
Instance Method Summary
- #fix_isgd_url! ⇒ Object
  is.gd urls use a terminal ‘-’ to indicate ‘preview’ – but we want the destination, so strip that.
- #fix_src_url! ⇒ Object
  Handle some known edge cases / simplifications with short urls.
- #num_key_fields ⇒ Object
  src_url uniquely identifies us.
Instance Attribute Details
#dest_url ⇒ Object
Returns the value of attribute dest_url
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 15
    def dest_url
      @dest_url
    end
#scraped_at ⇒ Object
Returns the value of attribute scraped_at
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 15
    def scraped_at
      @scraped_at
    end
#src_url ⇒ Object
Returns the value of attribute src_url
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 15
    def src_url
      @src_url
    end
Class Method Details
.match_tinyurlish(url) ⇒ Object
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 103
    def self.match_tinyurlish url
      m = TINY_URLISHES_RE.match(url) or return
      host, path = m.captures
      "http://#{host.downcase}/#{path}"
    end
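To illustrate how the two capture groups become a normalized short URL, here is a minimal sketch. The names SKETCH_TINY_RE and sketch_match_tinyurlish are hypothetical, and the pattern is a trimmed stand-in that covers only three of the services listed in TINY_URLISHES_RE (with the dots escaped); it is not the library's constant.

```ruby
# Hypothetical, trimmed-down pattern: three hosts only, for illustration.
# Group 1 captures the host, group 2 the short key.
SKETCH_TINY_RE = %r{\Ahttp://(
    tinyurl\.com
  | is\.gd
  | bit\.ly
  )/([\w\-]+)}ix

# Returns the normalized short URL, or nil when the URL does not match.
def sketch_match_tinyurlish(url)
  m = SKETCH_TINY_RE.match(url) or return
  host, path = m.captures
  # The host is lowercased; the key keeps its case and loses any trailing path.
  "http://#{host.downcase}/#{path}"
end

normalized = sketch_match_tinyurlish('http://TinyURL.com/aaASDF/A-BUNCH_OF_BOGOSITY')
unmatched  = sketch_match_tinyurlish('http://example.com/aaASDF')
```

Because the key group is `[\w\-]+`, matching stops at the first character outside that class, which is why everything after the key is dropped.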
.new_if_tinyurlish(url) ⇒ Object
If the base part looks like a tinyurlish, return an instantiated object. Otherwise, return nil.
This will happily turn
http://tinyurl.com/aaASDF/A-BUNCH_OF_BOGOSITY
into just the tinyurl.com/aaASDF
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 117
    def self.new_if_tinyurlish url
      src_url = match_tinyurlish(url) or return
      new(src_url, nil, nil)
    end
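The construct-or-nil behaviour can be tried in isolation with a small stand-in class. SketchExpandedUrl and SKETCH_RE are hypothetical names, and the pattern again covers only three hosts; the real class draws on the full TINY_URLISHES_RE.

```ruby
# Hypothetical stand-in for the documented class: a three-field Struct.
class SketchExpandedUrl < Struct.new(:src_url, :dest_url, :scraped_at)
  # Trimmed pattern for illustration only.
  SKETCH_RE = %r{\Ahttp://(tinyurl\.com|is\.gd|bit\.ly)/([\w\-]+)}i

  def self.match_tinyurlish(url)
    m = SKETCH_RE.match(url) or return
    host, path = m.captures
    "http://#{host.downcase}/#{path}"
  end

  # Builds an instance with only src_url filled in, or returns nil.
  def self.new_if_tinyurlish(url)
    src_url = match_tinyurlish(url) or return
    new(src_url, nil, nil)
  end
end

hit  = SketchExpandedUrl.new_if_tinyurlish('http://tinyurl.com/aaASDF/A-BUNCH_OF_BOGOSITY')
miss = SketchExpandedUrl.new_if_tinyurlish('http://example.com/some/page')
```

Here `hit.src_url` holds the trimmed short URL while dest_url and scraped_at are left nil, and `miss` is nil, so callers can filter candidates with a simple truthiness check.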
.scrub_url(url) ⇒ Object
Replace all url-insane characters by their %encoding. We don’t really care here whether the URLs do anything: we just want to remove stuff that absosmurfly don’t belong.
This code is stolen from Addressable::URI, which unfortunately has a bug in exactly this method (fixed here). (addressable.rubyforge.org) Note that we are /not/ re-encoding characters like ‘%’ – it’s assumed that the url is encoded, but perhaps poorly.
In practice the illegal characters most often seen are those in RE_URL_ILLEGAL_BUT_WHATEVER_DOOD_CHARS plus
<>"\t\\
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 44
    def self.scrub_url url
      url.gsub(/[^#{RE_URL_SANE_CHARS+RE_URL_ILLEGAL_BUT_WHATEVER_DOOD_CHARS}]/) do |sequence|
        sequence.unpack('C*').map{ |c| ("%%%02x"%c).upcase }.join("")
      end
    end
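The following sketch shows the same percent-encoding idea without depending on the addressable gem. The constants SKETCH_SANE_CHARS and SKETCH_TOLERATED_CHARS and the method sketch_scrub_url are hypothetical; the sane set is written out by hand as the RFC 3986 unreserved and reserved characters plus '%', on the assumption that this is what the Addressable character classes expand to.

```ruby
# Assumption: hand-written RFC 3986 unreserved + reserved characters plus '%',
# standing in for the Addressable::URI::CharacterClasses constants.
SKETCH_SANE_CHARS      = 'a-zA-Z0-9\-\._~' + ':/\?#\[\]@' + '!\$&\'\(\)\*\+,;=' + '%'
# Illegal in URLs but tolerated, including the space character.
SKETCH_TOLERATED_CHARS = '\{\}\|\^\` '

# Percent-encodes every character outside the two sets, byte by byte.
# Existing '%' characters are left alone, so encoded input is not double-encoded.
def sketch_scrub_url(url)
  url.gsub(/[^#{SKETCH_SANE_CHARS + SKETCH_TOLERATED_CHARS}]/) do |sequence|
    sequence.unpack('C*').map { |c| format('%%%02X', c) }.join
  end
end

scrubbed  = sketch_scrub_url('http://example.com/a<b>"c')
untouched = sketch_scrub_url('http://example.com/a b%20{c}')
```

Unpacking with 'C*' yields one integer per byte, so a multi-byte character becomes several %XX escapes.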
Instance Method Details
#fix_isgd_url! ⇒ Object
is.gd urls use a terminal ‘-’ to indicate ‘preview’ – but we want the destination, so strip that.
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 60
    def fix_isgd_url!
      self.src_url.gsub!(%r{(http://is.gd/\w+)[-/]}, '\1')
    end
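A standalone sketch of the same substitution, using a hypothetical helper named sketch_fix_isgd_url. It uses the non-mutating `gsub` so that a value is always returned (`gsub!` returns nil when nothing is replaced), and it escapes the dot in the host, which the documented pattern does not.

```ruby
# Hypothetical helper: strips a trailing '-' (is.gd preview marker) or '/'
# that directly follows the short key, keeping only the captured prefix.
def sketch_fix_isgd_url(src_url)
  src_url.gsub(%r{(http://is\.gd/\w+)[-/]}, '\1')
end

preview = sketch_fix_isgd_url('http://is.gd/abc12-')
slashed = sketch_fix_isgd_url('http://is.gd/abc12/')
plain   = sketch_fix_isgd_url('http://is.gd/abc12')
```

All three calls yield the bare `http://is.gd/abc12` form; URLs on other hosts pass through unchanged.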
#fix_src_url! ⇒ Object
Handle some known edge cases / simplifications with short urls
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 53
    def fix_src_url!
      fix_isgd_url!
    end
#num_key_fields ⇒ Object
src_url uniquely identifies us
    # File 'lib/wuclan/shorturl/shorturl_request.rb', line 17
    def num_key_fields() 1 end