Class: ContentUrls::HtmlParser
- Inherits: Object
  - Object
  - ContentUrls::HtmlParser
- Defined in: lib/content_urls/parsers/html_parser.rb
Overview
HtmlParser finds and rewrites URLs in HTML content.
Implementation note:
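The methods in this class use Nokogiri to identify URLs. Nokogiri normalizes HTML when it serializes a parsed document, so rewritten content may differ from the input in more than its URLs (for example, a missing DOCTYPE or html/body wrapper may be added).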
Class Method Summary
- .base(content) ⇒ String
  Returns the base URL/target for all relative URLs in the HTML content.
- .rewrite_each_url(content, &block) ⇒ Object
  Rewrites each URL in the HTML content by calling the supplied block with each URL.
- .urls(content) ⇒ Array
  Returns the URLs found in the HTML content.
Class Method Details
.base(content) ⇒ String
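Returns the base URL/target for all relative URLs in the HTML content, read from the href attribute of the base element in the document head. Returns nil if the content is nil, cannot be parsed, or has no non-empty base href.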
# File 'lib/content_urls/parsers/html_parser.rb', line 38

def self.base(content)
  doc = Nokogiri::HTML(content) if content rescue nil
  return nil if !doc
  base = doc.search('//head/base/@href').to_s.strip
  base = nil if base && base.empty?
  base
end
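A caller typically uses the value returned by .base to resolve relative URLs. The following is a minimal sketch of that step using only Ruby's standard `uri` library. The helper name `resolve_against_base` and the example URLs are assumptions for illustration and are not part of ContentUrls.

```ruby
require 'uri'

# Hypothetical helper (not part of ContentUrls): resolve a URL found in a page
# against the base URL, if the page declared one.
def resolve_against_base(url, base)
  return url if base.nil? # no <base href>, so leave the URL as found
  URI.join(base, url).to_s
end

# Assumed example value, standing in for what HtmlParser.base might return.
base = 'http://example.com/docs/'

puts resolve_against_base('images/logo.png', base) # relative path
puts resolve_against_base('/about', base)          # root-relative path
puts resolve_against_base('images/logo.png', nil)  # no base declared
```

Because .base returns nil when no usable base href exists, the nil check keeps the helper safe to call on any page.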
.rewrite_each_url(content, &block) ⇒ Object
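Rewrites each URL in the HTML content by calling the supplied block with each URL and substituting the block's return value. A URL is left unchanged when the block returns nil, and fragment-only URLs (those starting with #) in plain URL attributes are skipped. Returns the rewritten HTML as a String, or nil if the content is nil or cannot be parsed.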
# File 'lib/content_urls/parsers/html_parser.rb', line 57

def self.rewrite_each_url(content, &block)
  doc = Nokogiri::HTML(content) if content rescue nil
  return nil if !doc

  # TODO: handle href attribute of base tag
  # - should href URL be changed?
  # - should relative URLs be modified using base?
  # - how should rewritten relative URLs be handled?

  @@parser_definition.each do |type, definition|
    doc.search(definition[:xpath]).each do |obj|
      if definition.has_key?(:attribute) # use tag attribute if provided
        value = obj[definition[:attribute]]
      else # otherwise use tag's content
        value = obj.to_s
      end
      next if value.nil? or value.strip.empty?

      if definition.has_key?(:parser) # parse value using parser
        ContentUrls.rewrite_each_url(value, definition[:parser]) { |url| yield url }
      elsif definition.has_key?(:attribute) # rewrite the URL within the attribute
        if definition.has_key?(:url_regex) # use regex to obtain URL
          if (match = definition[:url_regex].match(value))
            url = yield match[:url]
            next if url.nil? or url.to_s == match.to_s # don't change URL
            obj[definition[:attribute]] = match.pre_match + url.to_s + match.post_match
          end
        else # value is the URL
          next if value =~ /^#/ # do not capture anchors within the content being parsed
          url = yield value
          next if url.nil? or url.to_s == value # don't change URL
          #obj[definition[:attribute]] = url.to_s
          obj.set_attribute(definition[:attribute], url.to_s)
        end
      else
        $stderr.puts "WARNING: unable to rewrite URL for #{value.to_s}"
      end
    end
  end
  return doc.to_s
end
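The :url_regex branch above rewrites a URL embedded in a larger attribute value by splicing the new URL between the text before and after the match. The sketch below reproduces that splice in isolation with plain Ruby. The pattern `URL_REGEX`, the helper `rewrite_embedded_url`, and the sample value are all hypothetical: the actual entries of @@parser_definition are not shown in the source. Unlike the original, this sketch compares the new URL against the named `url` capture when deciding whether anything changed.

```ruby
# Hypothetical pattern: matches the URL that follows "url=" in a value such as
# a meta refresh content attribute. The lookbehind keeps "url=" out of the
# match itself, so pre_match ends with "url=".
URL_REGEX = /(?<=url=)(?<url>[^;\s]+)/i

# Hypothetical helper: yields the embedded URL and splices the block's result
# back into the surrounding text.
def rewrite_embedded_url(value, url_regex)
  match = url_regex.match(value)
  return value unless match

  new_url = yield match[:url]
  return value if new_url.nil? || new_url.to_s == match[:url] # nothing to change

  match.pre_match + new_url.to_s + match.post_match
end

value = '5; url=http://example.com/old'
result = rewrite_embedded_url(value, URL_REGEX) { |url| url.sub('/old', '/new') }
puts result
```

The splice only preserves the surrounding text when the whole match is the URL, which is why the sketch uses a lookbehind rather than matching the "url=" prefix directly.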
.urls(content) ⇒ Array
Returns the unique URLs found in the HTML content, or an empty Array if the content is nil or cannot be parsed.
# File 'lib/content_urls/parsers/html_parser.rb', line 23

def self.urls(content)
  doc = Nokogiri::HTML(content) if content rescue nil
  urls = []
  return urls if !doc
  rewrite_each_url(content) { |url| urls << url; url }
  urls.uniq!
  urls
end
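The method above collects URLs by reusing the rewriter: it passes a block that records each URL and returns it unchanged, then removes duplicates. The sketch below shows the same pattern in plain Ruby. `toy_rewrite_each_url` is a deliberately simplified, regex-based stand-in that only sees double-quoted href and src attributes. It is not the Nokogiri-based implementation, and both helper names are hypothetical.

```ruby
# Toy stand-in for HtmlParser.rewrite_each_url (NOT the real implementation):
# yields the value of each double-quoted href/src attribute and substitutes
# the block's return value, keeping the original URL when the block returns nil.
def toy_rewrite_each_url(content)
  content.gsub(/\b(href|src)="([^"]*)"/) do
    attr = Regexp.last_match(1)
    url = Regexp.last_match(2)
    new_url = yield url
    %(#{attr}="#{new_url.nil? ? url : new_url}")
  end
end

# Same pattern as .urls: record every URL, hand it back unchanged, de-duplicate.
def toy_urls(content)
  found = []
  toy_rewrite_each_url(content) { |url| found << url; url }
  found.uniq
end

html = '<a href="/a">A</a> <img src="/logo.png"> <a href="/a">A again</a>'
p toy_urls(html)
```

Building the collector on top of the rewriter means both methods agree on what counts as a URL, at the cost of parsing and rewriting the document just to read it.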