Class: ContentUrls

Inherits:
Object
Defined in:
lib/content_urls.rb,
lib/content_urls/version.rb,
lib/content_urls/parsers/css_parser.rb,
lib/content_urls/parsers/html_parser.rb,
lib/content_urls/parsers/java_script_parser.rb

Overview

ContentUrls parses various file types (HTML, CSS, JavaScript, …) for URLs and provides methods for iterating through URLs and changing URLs.

Defined Under Namespace

Modules: Version

Classes: CssParser, HtmlParser, JavaScriptParser, StyleParser

Class Method Details

.base_url(content, type) ⇒ String

Returns the base URL found in the content, or nil if none is available.

Examples:

Parse HTML code for base URL

content = '<html><head><base href="/home/">'
puts "Found base URL: #{ContentUrls.base_url(content, 'text/html')}"
# => "Found base URL: /home/"

Parameters:

  • content (String)

    the content.

  • type (String)

    the media type of the content.

Returns:

  • (String, nil)

    the base URL found in the content, or nil if there is no parser for the media type or the parser does not report a base URL.
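
The "if available" condition covers two cases that appear in the source listing for this method: no parser handles the media type, or the parser does not implement base. The following is a minimal sketch of that guard, not the library's implementation. DemoHtmlParser, DemoCssParser, DEMO_PARSERS and demo_base_url are hypothetical names introduced for illustration, and the regular expression only handles a double-quoted href.

```ruby
# Hypothetical stand-in parsers; they are not part of ContentUrls.
module DemoHtmlParser
  # Returns the href of the first <base> tag, or nil if there is none.
  def self.base(content)
    content[/<base\s+href="([^"]*)"/i, 1]
  end
end

module DemoCssParser
  # Deliberately has no .base method: CSS carries no base URL.
end

DEMO_PARSERS = {
  'text/html' => DemoHtmlParser,
  'text/css'  => DemoCssParser
}.freeze

# Mirrors the guard in the listing: nil unless a parser exists
# for the type and that parser responds to :base.
def demo_base_url(content, type)
  parser = DEMO_PARSERS[type]
  return nil unless parser && parser.respond_to?(:base)
  parser.base(content)
end

p demo_base_url('<html><head><base href="/home/">', 'text/html')  # => "/home/"
p demo_base_url('a { color: red }', 'text/css')                   # => nil
p demo_base_url('anything', 'application/pdf')                    # => nil
```

Callers should therefore treat a nil result as "no base URL" rather than as an error.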



# File 'lib/content_urls.rb', line 73

def self.base_url(content, type)
  base = nil
  if (parser = get_parser(type))
    if (parser.respond_to?(:base))
      base = parser.base(content)
    end
  end
  base
end

.rewrite_each_url(content, type, &block) ⇒ Object

Rewrites each URL in the content by calling the supplied block with each URL and substituting the block's return value. If the block returns nil, the URL is left unchanged.

Examples:

Rewrite URLs in HTML code

content = '<html><a href="index.htm">Home</a></html>'
content = ContentUrls.rewrite_each_url(content, 'text/html') {|url| 'gone.html'}
puts "Rewritten: #{content}"
# => "Rewritten: <html><a href="gone.html">Home</a></html>"

Parameters:

  • content (String)

    the content.

  • type (String)

    the media type of the content.
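
The block contract can be illustrated without the library. This is a simplified sketch, assuming a hypothetical helper rewrite_hrefs that only understands double-quoted href attributes; the real parsers are not shown on this page and may work differently. What it demonstrates is the fallback seen in the source listing: a nil return from the block keeps the original URL.

```ruby
# Hypothetical stand-in for a parser: rewrites double-quoted href values only.
def rewrite_hrefs(content)
  content.gsub(/href="([^"]*)"/) do
    url = Regexp.last_match(1)
    replacement = yield url
    # A nil replacement means "leave this URL as it is".
    %(href="#{replacement.nil? ? url : replacement}")
  end
end

html = '<a href="index.htm">Home</a> <a href="about.html">About</a>'

# Rename .htm links to .html and leave every other link untouched.
result = rewrite_hrefs(html) do |url|
  url.end_with?('.htm') ? "#{url}l" : nil
end

puts result
# => <a href="index.html">Home</a> <a href="about.html">About</a>
```

Returning nil for URLs that need no change keeps the block short when only a subset of links has to be rewritten.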



# File 'lib/content_urls.rb', line 95

def self.rewrite_each_url(content, type, &block)
  if (parser = get_parser(type))
    parser.rewrite_each_url(content) do |url|
      replacement = yield url
      (replacement.nil? ? url : replacement)
    end
  end
  content
end

.to_absolute(url, base_url) ⇒ Object

Converts a relative URL to an absolute URL using base_url (for example, the content’s original location, or the href attribute of an HTML document’s base tag). A trailing anchor is removed from the URL, and nil is returned if url is nil.

Examples:

Obtain the absolute URL of “../index.html” for a page obtained from “http://example.com/folder/sample.html”

puts ContentUrls.to_absolute("../index.html", "http://example.com/folder/sample.html")
# => "http://example.com/index.html"
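
The same resolution can be reproduced with the standard library alone. The sketch below assumes a hypothetical helper named resolve_url; it is not the library's method. It strips a trailing fragment and merges the result into the base, but it skips the decode and re-encode step from the source listing, because URI.encode and URI.decode were removed in Ruby 3.0.

```ruby
require 'uri'

# Hypothetical helper, not part of ContentUrls.
def resolve_url(url, base_url)
  return nil if url.nil?

  stripped = url.to_s.sub(/#[^#]*\z/, '')   # drop a trailing fragment
  absolute = URI(base_url).merge(stripped)  # standard relative-reference resolution
  absolute.path = '/' if absolute.path.empty?
  absolute.to_s
end

puts resolve_url('../index.html', 'http://example.com/folder/sample.html')
# => http://example.com/index.html
puts resolve_url('page.html#top', 'http://example.com/folder/sample.html')
# => http://example.com/folder/page.html
```

URLs that are already absolute pass through unchanged, because merging an absolute URI into a base returns the absolute URI itself.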


# File 'lib/content_urls.rb', line 111

def self.to_absolute(url, base_url)
  return nil if url.nil?

  url = URI.encode(URI.decode(url.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')))  # remove anchor
  absolute = URI(base_url).merge(url)
  absolute.path = '/' if absolute.path.empty?
  absolute.to_s
end

.urls(content, type, options = {}) ⇒ Array

Returns the URLs found in the content.

Examples:

Parse content obtained from a robot

require 'net/http'

response = Net::HTTP.get_response(URI('http://example.com/sample-1'))
puts "URLs found at http://example.com/sample-1:"
ContentUrls.urls(response.body, response.content_type).each do |url|
  puts "  #{url}"
end
# => [a list of URLs found in the content located at http://example.com/sample-1]

Parse HTML code for URLs

content = '<html><a href="index.html">Home</a></html>'
ContentUrls.urls(content, 'text/html').each do |url|
  puts "Found URL: #{url}"
end
# => "Found URL: index.html"

Parse HTML code for URLs, changing each to an absolute URL based on the address of the original resource

content = '<html><a href="index.html">Home</a></html>'
ContentUrls.urls(content, 'text/html', content_url: 'http://www.example.com/sample.html').each do |url|
  puts "Found URL: #{url}"
end
# => "Found URL: http://www.example.com/index.html"

Parameters:

  • content (String)

    the content.

  • type (String)

    the media type of the content.

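  • options (Hash) (defaults to: {})

    the options for manipulating returned URLs: :use_base_url (default false) resolves URLs against an absolute base URL found in the content, and :content_url (default nil) resolves them against the address the content was obtained from.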

Returns:

  • (Array)

    the unique URLs found in the content.
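
How the two options interact is easiest to see in isolation. The following sketch follows the precedence in the source listing for this method as read here: an absolute base URL from the content wins, a relative one is ignored, and otherwise an absolute content_url is used. The helpers effective_base and absolutize are hypothetical names for illustration and are not part of ContentUrls.

```ruby
require 'uri'

# Hypothetical helper: picks the URL that relative links are resolved against.
def effective_base(base_href, content_url)
  base = base_href.to_s
  return base if !base.empty? && URI(base).absolute?

  origin = content_url.to_s
  return origin if !origin.empty? && URI(origin).absolute?

  nil  # nothing usable: URLs are returned as found
end

# Hypothetical helper: resolves each URL against the chosen base, if any.
def absolutize(urls, base)
  return urls if base.nil?
  urls.map { |url| URI.join(base, url).to_s }
end

page = 'http://www.example.com/sample.html'

# A relative <base href> is ignored, so the page address is used.
p absolutize(['index.html'], effective_base('/home/', page))
# => ["http://www.example.com/index.html"]

# An absolute <base href> takes precedence over the page address.
p absolutize(['logo.png'], effective_base('http://cdn.example.com/assets/', page))
# => ["http://cdn.example.com/assets/logo.png"]

# With neither option usable, URLs come back unchanged.
p absolutize(['index.html'], effective_base(nil, nil))
# => ["index.html"]
```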



# File 'lib/content_urls.rb', line 39

def self.urls(content, type, options = {})
  options = {
      :use_base_url => false,
      :content_url => nil,
  }.merge(options)
  urls = []
  if (parser = get_parser(type))
    base = base_url(content, type) if options[:use_base_url]
    base = '' if URI(base || '').relative?
    if options[:content_url]
      content_url = URI(options[:content_url]) rescue ''
      content_url = '' if URI(content_url).relative?
      base = URI.join(content_url, base)
    end
    if URI(base).relative?
      parser.urls(content).each { |url| urls << url }
    else
      parser.urls(content).each { |url| urls << URI.join( base, url).to_s }
    end
  end
  urls
end