ContentUrls

Find and rewrite URLs in different types of content.

ContentUrls was developed to address two use cases:

  • Find each URL in content retrieved from a website in order to spider and find all content on the website.

  • Rewrite each URL in content retrieved from a website in order to make a working local copy of the website.

Features

  • Three types of content: HTML, CSS and JavaScript

    • HTML content

      • <a> tag href attribute

      • <area> tag href attribute

      • <body> tag background attribute

      • <embed> tag src attribute

      • <frame> tag src attribute

      • <iframe> tag src attribute

      • <img> tag src attribute

      • <link> tag href attribute

      • <meta> tag content attribute containing URL

      • <object> tag data attribute

      • <script> tag src attribute

      • style attribute of any tag (parsed as CSS content)

      • body of <style> tag (parsed as CSS content)

      • body of <script> tag when type or language attribute identifies JavaScript (parsed as JavaScript content)

  • CSS content

    • url() notation

  • JavaScript content

    • URI module’s REGEXP

  • Can convert relative URLs to absolute URLs by providing resource URL

  • Can convert relative URLs to absolute URLs when base URL found in HTML content

Examples

Find URLs in an HTML document

Provide the HTML content and the content type and obtain an array of unique URLs.

ContentUrls.urls(html, 'text/html').each do |url|
  puts "Found URL: #{url}"
end

Rewrite URLs in an HTML document

Provide the HTML content, the content type, and a block to rewrite each URL’s extension.

rewritten_html = ContentUrls.rewrite_each_url(html, 'text/html') {|url| url.sub(/.htm/, '.html'}

Requirements

  • nokogiri

  • css_parser

  • rkelly

Development

To test and develop this gem, additional requirements are:

  • bundler

  • jeweler

  • rake

  • rdoc

  • rspec

  • yard

Goals for ContentUrls

  • Include support for:

    • Acrobat (.pdf)

    • Flash (.swf)

    • Microsoft Office (.doc, .xls, .ppt)

    • text (regular expression for URLs)

  • Capture links retrieved from a headless web browser which executes the code (JavaScript, etc.)

Contributing to content_urls

  • Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet.

  • Check out the issue tracker to make sure someone already hasn’t requested it and/or contributed it.

  • Fork the project.

  • Start a feature/bugfix branch.

  • Commit and push until you are happy with your contribution.

  • Make sure to add tests for it. This is important so I don’t unintentionally break it in a future version.

  • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.

Copyright © 2013 Dennis Sutch. See LICENSE.txt for further details.