Class: TwitterFriends::Scrape::ExpandedUrl

Inherits:
Struct
  • Object
Defined in:
lib/wuclan/shorturl/shorturl_request.rb

Constant Summary

RE_URL_SANE_CHARS =

These are all the characters that belong in a URL

Addressable::URI::CharacterClasses::UNRESERVED +
Addressable::URI::CharacterClasses::RESERVED   + '%'
RE_URL_ILLEGAL_BUT_WHATEVER_DOOD_CHARS =

These are illegal but are found in URLs. We’re going to let them through. Note that ‘ ’ space is one of the tolerated miscreants.

'\{\}\|\^\` '
TINY_URLISHES_RE =

The major shortening services

Do any of the mainstream shorteners use in-band characters besides \w (alphanumerics and underscore) and - (dash)? idek.net uses a ~ and pastoid.com a +, but they are not popular enough to justify the annoyance of allowing extra chars.

%r{\Ahttp://(
| tinyurl.com                   # 4969626
| is.gd                         #  406718
| bit.ly                        #  298590
| twurl.nl                      #  169796
| snipurl.com                   #  107961
| tr.im                         #   38793
| snurl.com                     #   37576
| snipr.com                     #   26897
| jijr.com                      #   20965
| cli.gs                        #   19700
| budurl.com                    #   19402
| xrl.us                        #   11621
# | tiny.cc                     #    9140  # tiny.cc borks fetcher
| zi.ma                         #    8148
| s3nt.com                      #    6922
| ow.ly                         #    6848
| poprl.com                     #    6666
| piurl.com                     #    5262
| ur1.ca                        #    4435
| short.to                      #    4105
| urlenco.de                    #    4087
| zz.gd                         #    4045
| rubyurl.com                   #    3766
| uris.jp                       #    2749
| ub0.cc                        #    2607
| twurl.cc                      #    2545
| moourl.com                    #    2280
| rurl.org                      #    2271
| url.ie                        #    2156
)/([\w\-]+)}ix

Instance Attribute Details

#dest_url ⇒ Object

Returns the value of attribute dest_url

Returns:

  • (Object)

    the current value of dest_url



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 15

def dest_url
  @dest_url
end

#scraped_at ⇒ Object

Returns the value of attribute scraped_at

Returns:

  • (Object)

    the current value of scraped_at



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 15

def scraped_at
  @scraped_at
end

#src_url ⇒ Object

Returns the value of attribute src_url

Returns:

  • (Object)

    the current value of src_url



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 15

def src_url
  @src_url
end

Class Method Details

.match_tinyurlish(url) ⇒ Object



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 103

def self.match_tinyurlish url
  m = TINY_URLISHES_RE.match(url) or return
  host, path = m.captures
  "http://#{host.downcase}/#{path}"
end
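
To make the behavior concrete, here is a minimal sketch of the same matching logic. `DEMO_TINY_RE` and `demo_match_tinyurlish` are hypothetical names invented for this example; the pattern lists only three of the hosts, and its dots are escaped, which the pattern shown above does not do.

```ruby
# Hypothetical, trimmed-down stand-in for TINY_URLISHES_RE: three hosts only.
# Dots are escaped here so that "." matches only a literal dot.
DEMO_TINY_RE = %r{\Ahttp://(
    tinyurl\.com
  | is\.gd
  | bit\.ly
  )/([\w\-]+)}ix

# Same logic as match_tinyurlish, written against the demo pattern.
def demo_match_tinyurlish(url)
  m = DEMO_TINY_RE.match(url) or return   # nil when the host is not listed
  host, path = m.captures
  "http://#{host.downcase}/#{path}"       # only the host is downcased
end

demo_match_tinyurlish('http://TinyURL.com/aaASDF')  # => "http://tinyurl.com/aaASDF"
demo_match_tinyurlish('http://example.com/aaASDF')  # => nil
```

The path keeps its case because short-URL keys are generally case-sensitive, while host names are not.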

.new_if_tinyurlish(url) ⇒ Object

If the base part looks like a tinyurlish, return an instantiated object. Otherwise, return nil.

This will happily turn

http://tinyurl.com/aaASDF/A-BUNCH_OF_BOGOSITY

into just http://tinyurl.com/aaASDF



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 117

def self.new_if_tinyurlish url
  src_url = match_tinyurlish(url) or return
  new(src_url, nil, nil)
end
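
The truncation described above can be reproduced with a small stand-in. `SketchExpandedUrl` and `SKETCH_TINY_RE` are hypothetical names for this sketch, and the pattern recognises a single host; the real class uses TINY_URLISHES_RE.

```ruby
# Hypothetical one-host pattern, standing in for TINY_URLISHES_RE.
SKETCH_TINY_RE = %r{\Ahttp://(tinyurl\.com)/([\w\-]+)}i

# Hypothetical stand-in for ExpandedUrl with the same three fields.
SketchExpandedUrl = Struct.new(:src_url, :dest_url, :scraped_at) do
  # Canonical short URL, or nil if the host is not recognised.
  def self.match_tinyurlish(url)
    m = SKETCH_TINY_RE.match(url) or return
    host, path = m.captures
    "http://#{host.downcase}/#{path}"
  end

  # New instance with only src_url filled in, or nil.
  def self.new_if_tinyurlish(url)
    src_url = match_tinyurlish(url) or return
    new(src_url, nil, nil)
  end
end

short = SketchExpandedUrl.new_if_tinyurlish('http://tinyurl.com/aaASDF/A-BUNCH_OF_BOGOSITY')
short.src_url   # => "http://tinyurl.com/aaASDF"
short.dest_url  # => nil
SketchExpandedUrl.new_if_tinyurlish('http://example.com/x')  # => nil
```

The trailing segment is dropped because the pattern is anchored only at the start and the key capture stops at the first character outside `[\w\-]`.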

.scrub_url(url) ⇒ Object

Replace all url-insane characters with their %-encoding. We don’t really care here whether the URLs do anything: we just want to remove stuff that absosmurfly doesn’t belong.

This code is stolen from Addressable::URI (addressable.rubyforge.org), which unfortunately has a bug in exactly this method (fixed here). Note that we are /not/ re-encoding characters like ‘%’: it’s assumed that the url is already encoded, but perhaps poorly.

In practice the illegal characters most often seen are those in RE_URL_ILLEGAL_BUT_WHATEVER_DOOD_CHARS plus

<>"\t\\


# File 'lib/wuclan/shorturl/shorturl_request.rb', line 44

def self.scrub_url url
  url.gsub(/[^#{RE_URL_SANE_CHARS+RE_URL_ILLEGAL_BUT_WHATEVER_DOOD_CHARS}]/) do |sequence|
    sequence.unpack('C*').map{ |c| ("%%%02x"%c).upcase }.join("")
  end
end
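
As a rough illustration of the scrubbing, the sketch below runs without the Addressable gem. The `SKETCH_*` constants are assumptions: hand-written approximations of the RFC 3986 unreserved and reserved sets, not the actual Addressable::URI::CharacterClasses values, and `sketch_scrub_url` is a hypothetical name.

```ruby
# Assumption: hand-written approximations of the character classes used above.
SKETCH_UNRESERVED = 'a-zA-Z0-9\-\._~'
SKETCH_RESERVED   = ':/?#\[\]@!$&\'()*+,;='
SKETCH_TOLERATED  = '\{\}\|\^\` '   # illegal but let through, including space

# Matches any single character outside the sane and tolerated sets.
# '%' is allowed so that existing %-escapes are not re-encoded.
SKETCH_INSANE_RE = Regexp.new("[^#{SKETCH_UNRESERVED}#{SKETCH_RESERVED}%#{SKETCH_TOLERATED}]")

def sketch_scrub_url(url)
  url.gsub(SKETCH_INSANE_RE) do |sequence|
    # Encode every byte of the offending character as %XX (uppercase hex).
    sequence.unpack('C*').map { |c| format('%%%02X', c) }.join
  end
end

sketch_scrub_url('http://example.com/a<b>"c')   # => "http://example.com/a%3Cb%3E%22c"
sketch_scrub_url('http://example.com/a b%20c')  # => "http://example.com/a b%20c"
```

Working byte by byte means a multi-byte character turns into several %XX groups, one per byte.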

Instance Method Details

#fix_isgd_url! ⇒ Object

is.gd urls use a terminal ‘-’ to indicate ‘preview’ – but we want the destination, so strip that.



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 60

def fix_isgd_url!
  self.src_url.gsub!(%r{(http://is.gd/\w+)[-/]}, '\1')
end
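
A short sketch of the effect, using a hypothetical one-field struct `SketchUrl` in place of the real class. The dots in the pattern are escaped here, which is a deviation from the listing above.

```ruby
# Hypothetical stand-in struct; only src_url matters for this example.
SketchUrl = Struct.new(:src_url) do
  # Same substitution as fix_isgd_url!, but with the dots escaped.
  # gsub! edits src_url in place and returns nil when nothing changed.
  def fix_isgd_url!
    src_url.gsub!(%r{(http://is\.gd/\w+)[-/]}, '\1')
  end
end

u = SketchUrl.new('http://is.gd/abc12-'.dup)  # .dup gives a mutable string
u.fix_isgd_url!
u.src_url  # => "http://is.gd/abc12"
```

Because `gsub!` returns nil when no preview marker is present, callers should read `src_url` afterwards rather than rely on the return value.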

#fix_src_url! ⇒ Object

Handle some known edge cases / simplifications with short urls



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 53

def fix_src_url!
  fix_isgd_url!
end

#num_key_fields ⇒ Object

src_url uniquely identifies us



# File 'lib/wuclan/shorturl/shorturl_request.rb', line 17

def num_key_fields() 1  end