Class: MetaInspector::Scraper
- Inherits: Object
- Defined in: lib/meta_inspector/scraper.rb
Instance Attribute Summary
-
#content_type ⇒ Object
readonly
Returns the value of attribute content_type.
-
#errors ⇒ Object
readonly
Returns the value of attribute errors.
-
#host ⇒ Object
readonly
Returns the value of attribute host.
-
#root_url ⇒ Object
readonly
Returns the value of attribute root_url.
-
#scheme ⇒ Object
readonly
Returns the value of attribute scheme.
-
#url ⇒ Object
readonly
Returns the value of attribute url.
Instance Method Summary
-
#charset ⇒ Object
Returns the charset from the meta tags, checking first <meta charset='utf-8' /> and then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.
-
#description ⇒ Object
A description getter that first checks for a meta description and, if none is present, guesses by grabbing the first paragraph longer than 120 characters.
-
#document ⇒ Object
Returns the original, unparsed document.
-
#external_links ⇒ Object
External links found on the page, as absolute URLs.
-
#feed ⇒ Object
Returns the parsed document meta rss links.
-
#image ⇒ Object
Returns the image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.
-
#images ⇒ Object
Images found on the page, as absolute URLs.
-
#initialize(url, options = {}) ⇒ Scraper
constructor
Initializes a new MetaInspector::Scraper instance, setting the URL to the one given. If no scheme is given, it is set to http:// by default. Options: timeout (defaults to 20 seconds) and html_content_only (whether an exception should be raised if the request content-type is not text/html).
-
#internal_links ⇒ Object
Internal links found on the page, as absolute URLs.
-
#links ⇒ Object
Links found on the page, as absolute URLs.
-
#method_missing(method_name) ⇒ Object
Scrapers for all meta_tags in the form of “meta_name” are automatically defined.
-
#parsed? ⇒ Boolean
Returns true if parsing has been successful.
-
#parsed_document ⇒ Object
Returns the whole parsed document.
-
#title ⇒ Object
Returns the parsed document title, from the content of the <title> tag.
-
#to_hash ⇒ Object
Returns all parsed data as a nested Hash.
Constructor Details
#initialize(url, options = {}) ⇒ Scraper
Initializes a new MetaInspector::Scraper instance, setting the URL to the one given. If no scheme is given, it is set to http:// by default. Options:
> timeout: defaults to 20 seconds
> html_content_only: whether an exception should be raised if the request content-type is not text/html. Defaults to false
# File 'lib/meta_inspector/scraper.rb', line 18
def initialize(url, options = {})
  url = encode_url(url)
  @url = URI.parse(url).scheme.nil? ? 'http://' + url : url
  @scheme = URI.parse(@url).scheme
  @host = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout = options[:timeout] || 20
  @data = Hashie::Rash.new('url' => @url)
  @errors = []
  @html_content_only = options[:html_content_only] || false
end
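The constructor's URL handling can be tried in isolation. The following is a minimal sketch using only Ruby's standard `uri` library; the helper name `normalize_url` is hypothetical and not part of MetaInspector, and the `encode_url` step is left out because its implementation is not shown here.

```ruby
require 'uri'

# Hypothetical helper (not part of MetaInspector) that mirrors how the
# constructor derives the url, scheme, host and root_url values.
def normalize_url(url)
  # Prepend http:// only when the given string has no scheme at all.
  full   = URI.parse(url).scheme.nil? ? 'http://' + url : url
  parsed = URI.parse(full)
  {
    url:      full,
    scheme:   parsed.scheme,
    host:     parsed.host,
    root_url: "#{parsed.scheme}://#{parsed.host}/"
  }
end

info = normalize_url('example.com/some/page')
puts info[:url]       # http://example.com/some/page
puts info[:root_url]  # http://example.com/
```

A URL that already carries a scheme, such as `https://example.com/a`, passes through unchanged.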
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(method_name) ⇒ Object
Scrapers for all meta_tags in the form of “meta_name” are automatically defined. This has been tested for:
meta name: keywords, description, robots, generator
meta http-equiv: content-language, Content-Type
It will first try with meta name=“…” and, if nothing is found, with meta http-equiv=“…”, substituting “_” with “-”.
TODO: define respond_to? to return true on the meta_name methods
# File 'lib/meta_inspector/scraper.rb', line 132
def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    # special treatment for og:
    if key =~ /^og_(.*)/
      key = "og:#{$1}"
    end
    unless @data.meta
      @data.meta!.name!
      @data.meta!.property!
      parsed_document.xpath("//meta").each do |element|
        if element.attributes["content"]
          if element.attributes["name"]
            @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
          end
          if element.attributes["property"]
            @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
          end
        end
      end
    end
    @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
  else
    super
  end
end
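To see the dynamic lookup without fetching a page, here is a toy sketch. The class `MetaLookup` is hypothetical: it takes the name and property tags as plain hashes instead of scraping them, and it adds `respond_to_missing?` as one possible way to address the TODO above. It is not how the library itself resolves that TODO.

```ruby
# Toy class, not MetaInspector itself: meta tags are supplied as plain
# hashes rather than being read from a parsed document.
class MetaLookup
  def initialize(names, properties)
    @names = names
    @properties = properties
  end

  # meta_keywords  -> looks up "keywords"
  # meta_og_image  -> looks up "og:image"
  def method_missing(method_name, *args)
    if (match = method_name.to_s.match(/\Ameta_(.*)/))
      key = match[1]
      key = "og:#{$1}" if key =~ /\Aog_(.*)/
      @names[key.downcase] || @properties[key.downcase]
    else
      super
    end
  end

  # Makes respond_to? return true for the dynamic meta_ methods.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('meta_') || super
  end
end

page = MetaLookup.new(
  { 'keywords' => 'ruby, scraping' },
  { 'og:image' => 'http://example.com/logo.png' }
)
puts page.meta_keywords
puts page.meta_og_image
puts page.respond_to?(:meta_robots)
```

Any method that does not start with `meta_` still falls through to `super` and raises `NoMethodError` as usual.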
Instance Attribute Details
#content_type ⇒ Object (readonly)
Returns the value of attribute content_type.
# File 'lib/meta_inspector/scraper.rb', line 11
def content_type
  @content_type
end
#errors ⇒ Object (readonly)
Returns the value of attribute errors.
# File 'lib/meta_inspector/scraper.rb', line 11
def errors
  @errors
end
#host ⇒ Object (readonly)
Returns the value of attribute host.
# File 'lib/meta_inspector/scraper.rb', line 11
def host
  @host
end
#root_url ⇒ Object (readonly)
Returns the value of attribute root_url.
# File 'lib/meta_inspector/scraper.rb', line 11
def root_url
  @root_url
end
#scheme ⇒ Object (readonly)
Returns the value of attribute scheme.
# File 'lib/meta_inspector/scraper.rb', line 11
def scheme
  @scheme
end
#url ⇒ Object (readonly)
Returns the value of attribute url.
# File 'lib/meta_inspector/scraper.rb', line 11
def url
  @url
end
Instance Method Details
#charset ⇒ Object
Returns the charset from the meta tags, checking first <meta charset='utf-8' /> and then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
# File 'lib/meta_inspector/scraper.rb', line 81
def charset
  @data.charset ||= (charset_from_meta_charset || charset_from_content_type)
end
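The lookup order can be illustrated with a small stand-in. This sketch uses regular expressions on a raw HTML string, which is an assumption made to keep it dependency-free; the library works on a parsed document, and the function name `charset_from_html` is hypothetical.

```ruby
# Regex-based stand-in for the lookup order: <meta charset> first,
# then the charset inside a Content-Type http-equiv tag.
def charset_from_html(html)
  from_meta_charset = html[/<meta\s+charset=["']?([\w-]+)/i, 1]
  from_content_type = html[/<meta[^>]+http-equiv=["']?Content-Type["']?[^>]*charset=([\w-]+)/i, 1]
  from_meta_charset || from_content_type
end

puts charset_from_html(%q(<meta charset='utf-8' />))
puts charset_from_html(%q(<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />))
```

When both tags are present, the value from `<meta charset>` wins; when neither is present, the result is `nil`.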
#description ⇒ Object
A description getter that first checks for a meta description and, if none is present, guesses by grabbing the first paragraph longer than 120 characters
# File 'lib/meta_inspector/scraper.rb', line 38
def description
  meta_description.nil? ? secondary_description : meta_description
end
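The fallback rule can be sketched on plain strings. The implementation of `secondary_description` is not shown on this page, so the function below is a hypothetical approximation: it assumes the paragraphs are already available as an array of text and applies the 120-character threshold from the description above.

```ruby
# Hypothetical stand-in: prefer the meta description, otherwise pick the
# first paragraph whose text is longer than 120 characters.
def pick_description(meta_description, paragraphs)
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > 120 }
end

paragraphs = ['Too short.', 'x' * 121, 'y' * 200]
puts pick_description('From the meta tag', paragraphs)
puts pick_description(nil, paragraphs).length
```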
#document ⇒ Object
Returns the original, unparsed document
# File 'lib/meta_inspector/scraper.rb', line 105
def document
  @document ||= Timeout::timeout(@timeout) {
    req = open(@url)

    @content_type = @data.content_type = req.content_type

    if @html_content_only && @content_type != "text/html"
      raise "The url provided contains #{@content_type} content instead of text/html content"
    end

    req.read
  }

rescue SocketError
  add_fatal_error 'Socket error: The url provided does not exist or is temporarily unavailable'
rescue TimeoutError
  add_fatal_error 'Timeout!!!'
rescue Exception => e
  add_fatal_error "Scraping exception: #{e.message}"
end
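The pattern here is to run the fetch under a time limit and record failures instead of letting them propagate. A minimal sketch of that pattern follows, with no network access. The class `GuardedFetch` and its `add_error` helper are hypothetical; `add_fatal_error` is not shown on this page, so the assumption that it appends to `errors` and returns `nil` is mine.

```ruby
require 'timeout'

# Hypothetical wrapper: runs a block under a time limit and records
# failures in an errors array instead of raising them.
class GuardedFetch
  attr_reader :errors

  def initialize(limit)
    @limit = limit
    @errors = []
  end

  def run
    Timeout.timeout(@limit) { yield }
  rescue Timeout::Error
    add_error 'Timeout!!!'
  rescue StandardError => e
    add_error "Scraping exception: #{e.message}"
  end

  private

  # Assumed behaviour: remember the message and return nil to the caller.
  def add_error(message)
    @errors << message
    nil
  end
end

fetch = GuardedFetch.new(0.1)
fetch.run { 'page body' }   # returns the block's value
fetch.run { sleep 1 }       # exceeds the limit, recorded as a timeout
fetch.run { raise 'boom' }  # recorded as a scraping exception
p fetch.errors
```

The sketch rescues `Timeout::Error` and `StandardError`, which is a deliberate narrowing compared with the `rescue Exception` in the listing above.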
#external_links ⇒ Object
External links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 53
def external_links
  @data.external_links ||= links.select {|link| URI.parse(link).host != @host }
end
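Both external_links and internal_links rely on the same host comparison. As a small sketch of that comparison using only the standard `uri` library (the helper name `split_links` is hypothetical):

```ruby
require 'uri'

# Hypothetical helper: splits absolute URLs into those on the given host
# (internal) and those on any other host (external).
def split_links(links, host)
  internal, external = links.partition { |link| URI.parse(link).host == host }
  { internal: internal, external: external }
end

links = ['http://example.com/about', 'http://other.org/', 'http://example.com/']
result = split_links(links, 'example.com')
p result[:internal]  # ["http://example.com/about", "http://example.com/"]
p result[:external]  # ["http://other.org/"]
```

Note that the comparison is on the exact host string, so `www.example.com` and `example.com` count as different hosts.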
#feed ⇒ Object
Returns the parsed document meta rss links
# File 'lib/meta_inspector/scraper.rb', line 63
def feed
  @data.feed ||= parsed_document.xpath("//link").select{ |link|
    link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
  }.map { |link|
    absolutify_url(link.attributes["href"].value)
  }.first rescue nil
end
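The selection rule is: take the first link element whose type attribute mentions atom or rss. A sketch of that rule on plain hashes follows; the hash layout and the name `first_feed` are assumptions for illustration, whereas the library reads these attributes from parsed link elements and also makes the href absolute.

```ruby
# Hypothetical stand-in: each link is a hash with optional :type and :href.
def first_feed(links)
  match = links.find { |link| link[:type] && link[:type] =~ /(atom|rss)/ }
  match && match[:href]
end

links = [
  { type: 'text/css', href: '/style.css' },
  { type: 'application/rss+xml', href: '/feed.rss' },
  { type: 'application/atom+xml', href: '/feed.atom' }
]
puts first_feed(links)  # /feed.rss
```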
#image ⇒ Object
Returns the image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/
# File 'lib/meta_inspector/scraper.rb', line 74
def image
  meta_og_image
end
#images ⇒ Object
Images found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 58
def images
  @data.images ||= parsed_images.map{ |i| absolutify_url(i) }
end
#internal_links ⇒ Object
Internal links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 48
def internal_links
  @data.internal_links ||= links.select {|link| URI.parse(link).host == @host }
end
#links ⇒ Object
Links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 43
def links
  @data.links ||= parsed_links.map { |l| absolutify_url(unrelativize_url(l)) }
end
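The helpers absolutify_url and unrelativize_url are not shown on this page. As a rough illustration of what turning page links into absolute URLs involves, the sketch below uses `URI.join` from the standard library; this is a swapped-in technique, not the library's implementation, and the name `absolutize` is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolves an href against the URL of the page it
# was found on. Absolute hrefs come back unchanged.
def absolutize(base_url, href)
  URI.join(base_url, href).to_s
end

base = 'http://example.com/blog/'
puts absolutize(base, 'post-1')                   # http://example.com/blog/post-1
puts absolutize(base, '/about')                   # http://example.com/about
puts absolutize(base, '//cdn.example.com/x.js')   # http://cdn.example.com/x.js
puts absolutize(base, 'https://other.org/page')   # https://other.org/page
```

Protocol-relative hrefs (those starting with `//`) inherit the scheme of the base URL.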
#parsed? ⇒ Boolean
Returns true if parsing has been successful
# File 'lib/meta_inspector/scraper.rb', line 93
def parsed?
  !@parsed_document.nil?
end
#parsed_document ⇒ Object
Returns the whole parsed document
# File 'lib/meta_inspector/scraper.rb', line 98
def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
rescue Exception => e
  add_fatal_error "Parsing exception: #{e.message}"
end
#title ⇒ Object
Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag
# File 'lib/meta_inspector/scraper.rb', line 32
def title
  @data.title ||= parsed_document.css('title').inner_html.gsub(/\t|\n|\r/, '') rescue nil
end
#to_hash ⇒ Object
Returns all parsed data as a nested Hash
# File 'lib/meta_inspector/scraper.rb', line 86
def to_hash
  # TODO: find a better option to populate the data to the Hash
  image;images;feed;links;charset;title;;internal_links;external_links
  @data.to_hash
end