Class: MetaInspector::Scraper
- Inherits: Object
  - Object
  - MetaInspector::Scraper
- Defined in: lib/meta_inspector/scraper.rb
Instance Attribute Summary
- #allow_redirections ⇒ Object (readonly)
  Returns the value of attribute allow_redirections.
- #content_type ⇒ Object (readonly)
  Returns the content_type of the fetched document.
- #errors ⇒ Object (readonly)
  Returns the value of attribute errors.
- #host ⇒ Object (readonly)
  Returns the value of attribute host.
- #html_content_only ⇒ Object (readonly)
  Returns the value of attribute html_content_only.
- #root_url ⇒ Object (readonly)
  Returns the value of attribute root_url.
- #scheme ⇒ Object (readonly)
  Returns the value of attribute scheme.
- #timeout ⇒ Object (readonly)
  Returns the value of attribute timeout.
- #url ⇒ Object (readonly)
  Returns the value of attribute url.
- #verbose ⇒ Object (readonly)
  Returns the value of attribute verbose.
Instance Method Summary
- #charset ⇒ Object
  Returns the charset from the meta tags, looking first for <meta charset='utf-8' /> and then for <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.
- #description ⇒ Object
  A description getter that first checks for a meta description and, if none is present, guesses by using the first paragraph with more than 120 characters.
- #document ⇒ Object
  Returns the original, unparsed document.
- #external_links ⇒ Object
  External links found on the page, as absolute URLs.
- #feed ⇒ Object
  Returns the feed link found in the document's meta tags, trying RSS first and then Atom.
- #image ⇒ Object
  Returns the parsed image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.
- #images ⇒ Object
  Images found on the page, as absolute URLs.
- #initialize(url, options = {}) ⇒ Scraper (constructor)
  Initializes a new instance of MetaInspector, setting the URL to the one given. Options: timeout (defaults to 20 seconds); html_content_only (whether an exception should be raised if the response content type is not text/html).
- #internal_links ⇒ Object
  Internal links found on the page, as absolute URLs.
- #links ⇒ Object
  Links found on the page, as absolute URLs.
- #ok? ⇒ Boolean
  Returns true if there are no errors.
- #parsed_document ⇒ Object
  Returns the whole parsed document.
- #title ⇒ Object
  Returns the parsed document title, from the content of the <title> tag.
- #to_hash ⇒ Object
  Returns all parsed data as a nested Hash.
Constructor Details
#initialize(url, options = {}) ⇒ Scraper
Initializes a new instance of MetaInspector, setting the URL to the one given. Options:
- timeout: defaults to 20 seconds
- html_content_only: whether an exception should be raised if the response content type is not text/html. Defaults to false
- allow_redirections: when :safe, allows HTTP => HTTPS redirections. When :all, it also allows HTTPS => HTTP
- document: the HTML of the URL as a string
- verbose: whether errors should be logged to the screen
# File 'lib/meta_inspector/scraper.rb', line 23
def initialize(url, options = {})
  options = defaults.merge(options)
  @url = with_default_scheme(normalize_url(url))
  @scheme = URI.parse(@url).scheme
  @host = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout = options[:timeout]
  @data = Hashie::Rash.new
  @errors = []
  @html_content_only = options[:html_content_only]
  @allow_redirections = options[:allow_redirections]
  @verbose = options[:verbose]
  @document = options[:document]
end
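The way the constructor merges options over defaults and derives scheme, host and root_url can be tried in isolation. This is a minimal sketch using only Ruby's standard library; `build_settings` and `DEFAULTS` are hypothetical names, and the default values are assumed from the option descriptions rather than taken from the library's `defaults` helper.

```ruby
require 'uri'

# Assumed defaults, taken from the option descriptions (not from the library).
DEFAULTS = { timeout: 20, html_content_only: false }.freeze

# Hypothetical helper, not part of MetaInspector.
def build_settings(url, options = {})
  options = DEFAULTS.merge(options) # caller-supplied values win
  uri = URI.parse(url)
  {
    scheme:   uri.scheme,
    host:     uri.host,
    root_url: "#{uri.scheme}://#{uri.host}/",
    timeout:  options[:timeout]
  }
end

settings = build_settings('https://example.com/blog/post?id=1', timeout: 5)
puts settings[:root_url] # https://example.com/
```

Parsing the URL once and reusing the result avoids the repeated URI.parse calls, which is a stylistic choice of this sketch only.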
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(method_name) ⇒ Object (private)
Scrapers for all meta tags in the form of "meta_name" are automatically defined. This has been tested for meta name (keywords, description, robots, generator) and meta http-equiv (content-language, Content-Type).
It will first try with meta name="…" and, if nothing is found, with meta http-equiv="…", substituting "_" by "-". For OpenGraph (og_) and Twitter Card (twitter_) tags, "_" is replaced by ":" in the key.
TODO: define respond_to? to return true on the meta_name methods.
# File 'lib/meta_inspector/scraper.rb', line 152
def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    #special treatment for opengraph (og:) and twitter card (twitter:) tags
    key.gsub!("_",":") if key =~ /^og_(.*)/ || key =~ /^twitter_(.*)/

    @data..name && (@data..name[key.downcase]) || (@data..property && @data..property[key.downcase])
  else
    super
  end
end
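The dynamic lookup described above, together with the respond_to? behaviour mentioned in the TODO, might look roughly like the following. This is a sketch under stated assumptions: `MetaTagReader` is a hypothetical stand-in class, and plain Hashes replace the Hashie::Rash store used by the scraper.

```ruby
# Hypothetical stand-in for the scraper; not part of MetaInspector.
class MetaTagReader
  META_METHOD = /\Ameta_(.+)\z/

  def initialize(names, properties)
    @names = names           # values of <meta name="...">
    @properties = properties # values of <meta property="...">
  end

  def method_missing(method_name, *args)
    match = method_name.to_s.match(META_METHOD)
    return super unless match

    key = match[1]
    # og_ and twitter_ keys use ":" as their separator in the HTML
    key = key.tr('_', ':') if key.start_with?('og_', 'twitter_')
    @names[key.downcase] || @properties[key.downcase]
  end

  # Makes respond_to? agree with the dynamic meta_* readers
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.match?(META_METHOD) || super
  end
end

reader = MetaTagReader.new(
  { 'keywords' => 'ruby, scraping' },
  { 'og:image' => 'http://example.com/logo.png' }
)
puts reader.meta_keywords
puts reader.meta_og_image
puts reader.respond_to?(:meta_robots)
```

Defining respond_to_missing? rather than overriding respond_to? is the usual Ruby idiom, since it also keeps `method(:meta_keywords)` working.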
Instance Attribute Details
#allow_redirections ⇒ Object (readonly)
Returns the value of attribute allow_redirections.
# File 'lib/meta_inspector/scraper.rb', line 14
def allow_redirections
  @allow_redirections
end
#content_type ⇒ Object (readonly)
Returns the content_type of the fetched document
# File 'lib/meta_inspector/scraper.rb', line 126
def content_type
  @content_type
end
#errors ⇒ Object (readonly)
Returns the value of attribute errors.
# File 'lib/meta_inspector/scraper.rb', line 13
def errors
  @errors
end
#host ⇒ Object (readonly)
Returns the value of attribute host.
# File 'lib/meta_inspector/scraper.rb', line 13
def host
  @host
end
#html_content_only ⇒ Object (readonly)
Returns the value of attribute html_content_only.
# File 'lib/meta_inspector/scraper.rb', line 13
def html_content_only
  @html_content_only
end
#root_url ⇒ Object (readonly)
Returns the value of attribute root_url.
# File 'lib/meta_inspector/scraper.rb', line 13
def root_url
  @root_url
end
#scheme ⇒ Object (readonly)
Returns the value of attribute scheme.
# File 'lib/meta_inspector/scraper.rb', line 13
def scheme
  @scheme
end
#timeout ⇒ Object (readonly)
Returns the value of attribute timeout.
# File 'lib/meta_inspector/scraper.rb', line 13
def timeout
  @timeout
end
#url ⇒ Object (readonly)
Returns the value of attribute url.
# File 'lib/meta_inspector/scraper.rb', line 13
def url
  @url
end
#verbose ⇒ Object (readonly)
Returns the value of attribute verbose.
# File 'lib/meta_inspector/scraper.rb', line 14
def verbose
  @verbose
end
Instance Method Details
#charset ⇒ Object
Returns the charset from the meta tags, looking for it in the following order:
1. <meta charset='utf-8' />
2. <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
# File 'lib/meta_inspector/scraper.rb', line 86
def charset
  @charset ||= ( || charset_from_content_type)
end
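The lookup order described for the charset can be illustrated without any dependencies. In this sketch `charset_from_html` is a hypothetical function, and regular expressions stand in for the parsed document that the scraper works on, so it should be read as an approximation of the order of checks and not as the library's implementation.

```ruby
# Hypothetical, dependency-free sketch; not part of MetaInspector.
def charset_from_html(html)
  # 1. <meta charset="...">
  direct = html[/<meta\s+charset=["']?([\w-]+)/i, 1]
  return direct if direct

  # 2. <meta http-equiv="Content-Type" content="text/html; charset=...">
  content = html[/<meta\s+http-equiv=["']Content-Type["']\s+content=["']([^"']*)["']/i, 1]
  content && content[/charset=([\w-]+)/i, 1]
end

puts charset_from_html('<meta charset="utf-8" />')
puts charset_from_html('<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />')
```

The function returns nil when neither tag is present.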
#description ⇒ Object
A description getter that first checks for a meta description and, if none is present, guesses by using the first paragraph with more than 120 characters.
# File 'lib/meta_inspector/scraper.rb', line 47
def description
  .nil? ? secondary_description : 
end
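The fallback rule can be sketched as follows. The names `describe` and `MIN_LENGTH` are hypothetical, and paragraphs are passed in as plain strings instead of being read from a parsed document, so this only illustrates the selection logic.

```ruby
# Hypothetical sketch of the fallback; not part of MetaInspector.
MIN_LENGTH = 120

def describe(meta_description, paragraphs)
  return meta_description unless meta_description.nil?

  # First paragraph strictly longer than 120 characters, or nil
  paragraphs.find { |text| text.length > MIN_LENGTH }
end

long_paragraph = 'Lorem ipsum dolor sit amet. ' * 5 # 140 characters
puts describe(nil, ['Too short.', long_paragraph]).length
```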
#document ⇒ Object
Returns the original, unparsed document
# File 'lib/meta_inspector/scraper.rb', line 115
def document
  @document ||= if html_content_only && content_type != "text/html"
    raise "The url provided contains #{content_type} content instead of text/html content" and nil
  else
    request.read
  end
rescue Exception => e
  add_fatal_error "Scraping exception: #{e.message}"
end
#external_links ⇒ Object
External links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 62
def external_links
  @external_links ||= links.select {|link| host_from_url(link) != host }
end
#feed ⇒ Object
Returns the feed link found in the document's meta tags, trying RSS first and then Atom.
# File 'lib/meta_inspector/scraper.rb', line 79
def feed
  @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
end
#image ⇒ Object
Returns the parsed image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.
# File 'lib/meta_inspector/scraper.rb', line 74
def image
   || 
end
#images ⇒ Object
Images found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 67
def images
  @images ||= parsed_images.map{ |i| absolutify_url(i) }
end
#internal_links ⇒ Object
Internal links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 57
def internal_links
  @internal_links ||= links.select {|link| host_from_url(link) == host }
end
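Internal and external links are separated by comparing each link's host with the page host. A minimal sketch of that comparison, using only the standard library, is shown here; `host_of` and `split_links` are hypothetical names, and `host_of` is only an assumption about what a helper like host_from_url might do.

```ruby
require 'uri'

# Hypothetical helper: returns the host of a URL, or nil if it cannot be parsed.
def host_of(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

# Splits absolute links into those on the given host and all others.
def split_links(links, host)
  internal, external = links.partition { |link| host_of(link) == host }
  { internal: internal, external: external }
end

result = split_links(
  ['http://example.com/about', 'http://other.org/', 'http://example.com/faq'],
  'example.com'
)
puts result[:internal].inspect
puts result[:external].inspect
```

Using partition walks the list once, whereas two separate select calls walk it twice; the outcome is the same.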
#links ⇒ Object
Links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 52
def links
  @links ||= parsed_links.map{ |l| absolutify_url(unrelativize_url(l)) }.compact.uniq
end
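The map, compact, uniq pipeline can be reproduced with URI.join from the standard library. In this sketch `absolutify` and `absolute_links` are hypothetical helpers; how the library's own absolutify_url and unrelativize_url behave is not shown in this page, so treat the resolution rules here as an assumption.

```ruby
require 'uri'

# Hypothetical helper: resolves href against base_url, nil if it cannot be parsed.
def absolutify(base_url, href)
  URI.join(base_url, href).to_s
rescue URI::Error
  nil
end

# Resolves every href, drops unparseable ones, and removes duplicates.
def absolute_links(base_url, hrefs)
  hrefs.map { |href| absolutify(base_url, href) }.compact.uniq
end

puts absolute_links(
  'http://example.com/blog/',
  ['post', '/about', 'http://other.org/x', '/about']
).inspect
```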
#ok? ⇒ Boolean
Returns true if there are no errors
# File 'lib/meta_inspector/scraper.rb', line 131
def ok?
  errors.empty?
end
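The pattern of collecting errors instead of raising them, and then checking ok?, can be sketched like this. `Fetcher` and its `attempt` method are hypothetical; the sketch assumes that a helper such as add_fatal_error appends a message to the errors list, which this page does not confirm.

```ruby
# Hypothetical class illustrating error collection; not part of MetaInspector.
class Fetcher
  attr_reader :errors

  def initialize
    @errors = []
  end

  # Runs the block; on failure records the message and returns nil.
  def attempt
    yield
  rescue StandardError => e
    @errors << "Scraping exception: #{e.message}"
    nil
  end

  def ok?
    errors.empty?
  end
end

fetcher = Fetcher.new
fetcher.attempt { raise 'connection refused' }
puts fetcher.ok?
puts fetcher.errors.inspect
```

The sketch rescues StandardError, which is the conventional choice in application code; the listings in this page rescue Exception.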
#parsed_document ⇒ Object
Returns the whole parsed document
# File 'lib/meta_inspector/scraper.rb', line 108
def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
rescue Exception => e
  add_fatal_error "Parsing exception: #{e.message}"
end
#title ⇒ Object
Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag.
# File 'lib/meta_inspector/scraper.rb', line 41
def title
  @title ||= parsed_document.css('title').inner_text rescue nil
end
#to_hash ⇒ Object
Returns all parsed data as a nested Hash
# File 'lib/meta_inspector/scraper.rb', line 91
def to_hash
  {
    'url' => url,
    'title' => title,
    'links' => links,
    'internal_links' => internal_links,
    'external_links' => external_links,
    'images' => images,
    'charset' => charset,
    'feed' => feed,
    'content_type' => content_type
  }.merge @data.to_hash
end