Class: ContentUrls::HtmlParser

Inherits:
Object
Defined in:
lib/content_urls/parsers/html_parser.rb

Overview

HtmlParser finds and rewrites URLs in HTML content.

Implementation note:

The methods in this class use Nokogiri to identify URLs. Nokogiri normalizes the HTML when it serializes the document, so rewritten content may differ from the input in more than just its URLs (for example, missing tags may be added).

Class Method Details

.base(content) ⇒ String

Returns the base URL/target for all relative URLs in the HTML content.

Parameters:

  • content (String)

    the HTML content.

Returns:

  • (String)

    the URL/target found in the content, or nil if the content cannot be parsed or has no base href.



# File 'lib/content_urls/parsers/html_parser.rb', line 38

def self.base(content)
  doc = Nokogiri::HTML(content) if content rescue nil
  return nil if !doc

  base = doc.search('//head/base/@href').to_s.strip
  base = nil if base && base.empty?
  base
end
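
The value returned by .base is typically used to resolve the relative URLs found in the same document. The following is a minimal sketch of that step using only Ruby's standard URI library; the absolutize helper and the example.com URLs are illustrative assumptions and are not part of ContentUrls.

```ruby
require 'uri'

# Hypothetical helper (not part of ContentUrls): resolves +url+ against
# +base+, where +base+ is the String or nil returned by HtmlParser.base.
def absolutize(url, base)
  return url if base.nil?            # no base href, leave the URL untouched
  URI.join(base, url).to_s
rescue URI::Error
  url                                # unparsable input is returned unchanged
end

base = 'http://example.com/docs/'    # stand-in for ContentUrls::HtmlParser.base(html)
puts absolutize('index.htm', base)   # http://example.com/docs/index.htm
puts absolutize('/about.htm', base)  # http://example.com/about.htm
puts absolutize('index.htm', nil)    # index.htm
```

Returning the input unchanged when no base is available keeps the helper safe to call on documents without a base tag.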

.rewrite_each_url(content, &block) ⇒ Object

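Rewrites each URL in the HTML content by calling the supplied block with each URL and substituting the block's return value. A block that returns nil leaves that URL unchanged. Returns the rewritten HTML as a String, or nil if the content cannot be parsed.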

Examples:

Rewrite URLs in HTML code

html = '<html><a href="index.htm">Click me</a></html>'
html = ContentUrls::HtmlParser.rewrite_each_url(html) {|url| 'index.php'}
puts "Rewritten: #{html}"
# => "Rewritten: <html><a href="index.php">Click me</a></html>"

Parameters:

  • content (String)

    the HTML content.
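
Because the block receives each URL and returns its replacement, the rewriting rule can live in an ordinary method. The sketch below shows one possible rule that maps ".htm" paths to ".php" and leaves everything else alone; the htm_to_php method is a hypothetical example, not something ContentUrls provides.

```ruby
require 'uri'

# Hypothetical rewriting rule (not part of ContentUrls): changes a trailing
# ".htm" in the URL path to ".php" and returns every other URL unchanged.
def htm_to_php(url)
  uri = URI.parse(url)
  return url if uri.path.nil? || !uri.path.end_with?('.htm')

  uri.path = uri.path.sub(/\.htm\z/, '.php')
  uri.to_s                           # the query string and fragment are preserved
rescue URI::Error
  url                                # leave URLs that URI cannot parse untouched
end

puts htm_to_php('index.htm?x=1')     # index.php?x=1
puts htm_to_php('style.css')         # style.css
```

With the library loaded, such a method could be supplied as the block, for example ContentUrls::HtmlParser.rewrite_each_url(html, &method(:htm_to_php)).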



# File 'lib/content_urls/parsers/html_parser.rb', line 57

def self.rewrite_each_url(content, &block)
  doc = Nokogiri::HTML(content) if content rescue nil
  return nil if !doc

  # TODO: handle href attribute of base tag
  #  - should href URL be changed?
  #  - should relative URLs be modified using base?
  #  - how should rewritten relative URLs be handled?

  @@parser_definition.each do |type, definition|
    doc.search(definition[:xpath]).each do |obj|
      if definition.has_key?(:attribute)  # use tag attribute if provided
        value = obj[definition[:attribute]]
      else  # otherwise use tag's content
        value = obj.to_s
      end
      next if value.nil? or value.strip.empty?

      if definition.has_key?(:parser)  # parse value using parser
        ContentUrls.rewrite_each_url(value, definition[:parser]) { |url| yield url }

      elsif definition.has_key?(:attribute)  # rewrite the URL within the attribute

        if definition.has_key?(:url_regex)  # use regex to obtain URL
          if (match = definition[:url_regex].match(value))
            url = yield match[:url]
            next if url.nil? or url.to_s == match.to_s  # don't change URL
            obj[definition[:attribute]] = match.pre_match + url.to_s + match.post_match
          end

        else  # value is the URL
          next if value =~ /^#/  # do not capture anchors within the content being parsed
          url = yield value
          next if url.nil? or url.to_s == value  # don't change URL (match is not set in this branch)
          #obj[definition[:attribute]] = url.to_s
          obj.set_attribute(definition[:attribute], url.to_s)
        end
      else
        $stderr.puts "WARNING: unable to rewrite URL for #{value.to_s}"
      end
    end
  end
  return doc.to_s
end
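
The :url_regex branch above splices the rewritten URL back into the attribute value using the text before and after the regex match. The sketch below isolates that technique in plain Ruby. The pattern and the rewrite_in_value helper are assumptions made for illustration; the actual patterns belong to the parser definition, which is not shown on this page.

```ruby
# Hypothetical pattern for illustration only. The lookbehind keeps "url="
# out of the match, so the whole match equals the named :url group.
META_REFRESH_URL = /(?<=url=)(?<url>\S+)/

# Hypothetical helper mirroring the pre_match/post_match splice: yields the
# URL found in +value+ and returns +value+ with the block's result in place.
def rewrite_in_value(value, regex)
  match = regex.match(value)
  return value unless match          # no URL in this value

  new_url = yield match[:url]
  return value if new_url.nil? || new_url.to_s == match[:url]

  match.pre_match + new_url.to_s + match.post_match
end

result = rewrite_in_value('5; url=old.htm', META_REFRESH_URL) do |url|
  url.sub(/\.htm\z/, '.php')
end
puts result                          # 5; url=old.php
```

The splice replaces the entire match, so a pattern that matches more than the URL itself would also drop the surrounding text; the lookbehind in this sketch avoids that.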

.urls(content) ⇒ Array

Returns the URLs found in the HTML content.

Examples:

Parse HTML code for URLs

html = '<html><a href="index.htm">Click me</a></html>'
ContentUrls::HtmlParser.urls(html).each do |url|
  puts "Found URL: #{url}"
end
# => "Found URL: index.htm"

Parameters:

  • content (String)

    the HTML content.

Returns:

  • (Array)

    the unique URLs found in the content.



# File 'lib/content_urls/parsers/html_parser.rb', line 23

def self.urls(content)
  doc = Nokogiri::HTML(content) if content rescue nil
  urls = []
  return urls if !doc

  rewrite_each_url(content) { |url| urls << url; url }
  urls.uniq!
  urls
end
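
The Array returned by .urls mixes absolute and relative URLs exactly as they appear in the document. As a rough sketch of a common follow-up step, the hypothetical partition_urls helper below (not part of ContentUrls) separates the two kinds using the standard URI library; the sample list stands in for a real .urls result.

```ruby
require 'uri'

# Hypothetical helper (not part of ContentUrls): splits +urls+ into
# [absolute, relative]. URLs that URI cannot parse count as relative.
def partition_urls(urls)
  urls.partition do |url|
    begin
      URI.parse(url).absolute?       # true only when the URL has a scheme
    rescue URI::InvalidURIError
      false
    end
  end
end

found = ['index.htm', 'http://example.com/a.png', '/css/site.css']  # stand-in for HtmlParser.urls(html)
absolute, relative = partition_urls(found)
puts "Absolute: #{absolute.inspect}"
puts "Relative: #{relative.inspect}"
```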