Class: MetaInspector::Scraper

Inherits: Object

Defined in: lib/meta_inspector/scraper.rb

Instance Attribute Summary

#allow_redirections ⇒ Object readonly

Returns the value of attribute allow_redirections.
#content_type ⇒ Object readonly

Returns the content_type of the fetched document.
#errors ⇒ Object readonly

Returns the value of attribute errors.
#host ⇒ Object readonly

Returns the value of attribute host.
#html_content_only ⇒ Object readonly

Returns the value of attribute html_content_only.
#root_url ⇒ Object readonly

Returns the value of attribute root_url.
#scheme ⇒ Object readonly

Returns the value of attribute scheme.
#timeout ⇒ Object readonly

Returns the value of attribute timeout.
#url ⇒ Object readonly

Returns the value of attribute url.
#verbose ⇒ Object readonly

Returns the value of attribute verbose.

Instance Method Summary

#charset ⇒ Object

Returns the charset from the meta tags, checking <meta charset="utf-8" /> first and then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.
#description ⇒ Object

A description getter that first checks for a meta description and, if none is present, guesses by taking the first paragraph with more than 120 characters.
#document ⇒ Object

Returns the original, unparsed document.
#external_links ⇒ Object

External links found on the page, as absolute URLs.
#feed ⇒ Object

Returns the feed link from the document's meta tags, trying RSS first and then Atom.
#image ⇒ Object

Returns the image from Facebook's Open Graph property tags, falling back to the Twitter card image. Most major websites now define this property, and it is usually very relevant. See the docs at developers.facebook.com/docs/opengraph/.
#images ⇒ Object

Images found on the page, as absolute URLs.
#initialize(url, options = {}) ⇒ Scraper constructor

Initializes a new instance of MetaInspector, setting the URL to the one given. Options include timeout (defaults to 20 seconds) and html_content_only (whether an exception should be raised if the response content-type is not text/html).
#internal_links ⇒ Object

Internal links found on the page, as absolute URLs.
#links ⇒ Object

Links found on the page, as absolute URLs.
#ok? ⇒ Boolean

Returns true if there are no errors.
#parsed_document ⇒ Object

Returns the whole parsed document.
#title ⇒ Object

Returns the parsed document title, from the content of the <title> tag.
#to_hash ⇒ Object

Returns all parsed data as a nested Hash.

Constructor Details

#initialize(url, options = {}) ⇒ `Scraper`

Initializes a new instance of MetaInspector, setting the URL to the one given. Options:

> timeout: defaults to 20 seconds

> html_content_only: whether an exception should be raised if the response content-type is not text/html. Defaults to false

> allow_redirections: when :safe, allows HTTP => HTTPS redirections. When :all, it also allows HTTPS => HTTP

> document: the HTML of the URL as a string

> verbose: whether errors should be logged to the screen

# File 'lib/meta_inspector/scraper.rb', line 23

def initialize(url, options = {})
  options   = defaults.merge(options)

  @url      = with_default_scheme(normalize_url(url))
  @scheme   = URI.parse(@url).scheme
  @host     = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout  = options[:timeout]
  @data     = Hashie::Rash.new
  @errors   = []
  @html_content_only  = options[:html_content_only]
  @allow_redirections = options[:allow_redirections]
  @verbose            = options[:verbose]
  @document           = options[:document]
end
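
The derivation of scheme, host and root_url in the constructor can be tried in isolation with Ruby's standard URI library. This is a minimal sketch, not the library's code: with_default_scheme is a private helper whose source is not shown on this page, so the version below is a hypothetical stand-in that assumes "http" as the default scheme.

```ruby
require 'uri'

# Hypothetical stand-in for the private helper of the same name:
# prepend a scheme when the caller passes a bare host such as "example.com".
# The "http" default is an assumption made for this sketch.
def with_default_scheme(url)
  url.match?(%r{\A[a-z][a-z0-9+.-]*://}i) ? url : "http://#{url}"
end

url      = with_default_scheme('example.com/blog/post?id=1')
parsed   = URI.parse(url)
scheme   = parsed.scheme
host     = parsed.host
root_url = "#{scheme}://#{host}/"   # same interpolation as the constructor

puts url
puts root_url
```

Parsing the URL once and reusing the result avoids the two separate URI.parse calls seen in the constructor, without changing the values produced.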

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(method_name) ⇒ `Object` (private)

Scrapers for all meta tags are automatically defined as methods of the form "meta_name". This has been tested for meta name: keywords, description, robots, generator; and for meta http-equiv: content-language, Content-Type.

It first tries meta name="…" and, if nothing is found, meta http-equiv="…", substituting "_" with "-".

TODO: define respond_to? to return true on the meta_name methods

# File 'lib/meta_inspector/scraper.rb', line 152

def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1

    # special treatment for opengraph (og:) and twitter card (twitter:) tags
    key.gsub!("_",":") if key =~ /^og_(.*)/ || key =~ /^twitter_(.*)/

    scrape_meta_data

    @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
  else
    super
  end
end
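
The TODO above asks for respond_to? to return true for the meta_name methods. One common way to get that in Ruby is to pair method_missing with respond_to_missing?. The sketch below shows the pattern on a hypothetical MetaTagReader class that is backed by two plain hashes; it is not part of MetaInspector and does not scrape anything.

```ruby
# MetaTagReader is a hypothetical stand-in used only to show the pattern.
class MetaTagReader
  def initialize(name_tags, property_tags)
    @name_tags     = name_tags      # e.g. { "keywords" => "ruby, scraping" }
    @property_tags = property_tags  # e.g. { "og:image" => "http://..." }
  end

  private

  # Lets respond_to?(:meta_anything) answer true, matching method_missing.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('meta_') || super
  end

  def method_missing(method_name, *args)
    name = method_name.to_s
    return super unless name.start_with?('meta_')

    key = name.sub(/\Ameta_/, '')
    # Open Graph and Twitter card keys use ":" separators (og:image).
    key = key.tr('_', ':') if key.start_with?('og_', 'twitter_')
    key = key.downcase
    @name_tags[key] || @property_tags[key]
  end
end

reader = MetaTagReader.new(
  { 'keywords' => 'ruby, scraping' },
  { 'og:image' => 'http://example.com/cover.png' }
)

puts reader.meta_keywords
puts reader.meta_og_image
puts reader.respond_to?(:meta_keywords)
```

Any method name that does not start with meta_ still falls through to super and raises NoMethodError as usual.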

Instance Attribute Details

#allow_redirections ⇒ `Object` (readonly)

Returns the value of attribute allow_redirections.

# File 'lib/meta_inspector/scraper.rb', line 14

def allow_redirections
  @allow_redirections
end

#content_type ⇒ `Object` (readonly)

Returns the content_type of the fetched document.

# File 'lib/meta_inspector/scraper.rb', line 126

def content_type
  @content_type
end

#errors ⇒ `Object` (readonly)

Returns the value of attribute errors.

# File 'lib/meta_inspector/scraper.rb', line 13

def errors
  @errors
end

#host ⇒ `Object` (readonly)

Returns the value of attribute host.

# File 'lib/meta_inspector/scraper.rb', line 13

def host
  @host
end

#html_content_only ⇒ `Object` (readonly)

Returns the value of attribute html_content_only.

# File 'lib/meta_inspector/scraper.rb', line 13

def html_content_only
  @html_content_only
end

#root_url ⇒ `Object` (readonly)

Returns the value of attribute root_url.

# File 'lib/meta_inspector/scraper.rb', line 13

def root_url
  @root_url
end

#scheme ⇒ `Object` (readonly)

Returns the value of attribute scheme.

# File 'lib/meta_inspector/scraper.rb', line 13

def scheme
  @scheme
end

#timeout ⇒ `Object` (readonly)

Returns the value of attribute timeout.

# File 'lib/meta_inspector/scraper.rb', line 13

def timeout
  @timeout
end

#url ⇒ `Object` (readonly)

Returns the value of attribute url.

# File 'lib/meta_inspector/scraper.rb', line 13

def url
  @url
end

#verbose ⇒ `Object` (readonly)

Returns the value of attribute verbose.

# File 'lib/meta_inspector/scraper.rb', line 14

def verbose
  @verbose
end

Instance Method Details

#charset ⇒ `Object`

Returns the charset from the meta tags, checking <meta charset="utf-8" /> first and then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.

# File 'lib/meta_inspector/scraper.rb', line 86

def charset
  @charset ||= (charset_from_meta_charset || charset_from_content_type)
end
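
The lookup order can be illustrated without the library. The two helpers below reuse the names from the method body, but their implementations are assumptions: they use simplified regular expressions on a raw HTML string, whereas the library's private helpers are not shown on this page and presumably work on the parsed document.

```ruby
# Hypothetical, regex-based stand-ins for the private helpers.
def charset_from_meta_charset(html)
  html[/<meta\s+charset=["']?([^"'\s\/>]+)/i, 1]
end

def charset_from_content_type(html)
  html[/<meta\s+http-equiv=["']?Content-Type["']?[^>]*charset=([^"'\s\/>]+)/i, 1]
end

# Same fallback order as the method above: <meta charset> wins.
def charset(html)
  charset_from_meta_charset(html) || charset_from_content_type(html)
end

html5  = '<meta charset="utf-8" />'
legacy = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />'

puts charset(html5)
puts charset(legacy)
```

A page with neither tag yields nil in this sketch, because both helpers return nil when their pattern does not match.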

#description ⇒ `Object`

A description getter that first checks for a meta description and, if none is present, guesses by taking the first paragraph with more than 120 characters.

# File 'lib/meta_inspector/scraper.rb', line 47

def description
  meta_description.nil? ? secondary_description : meta_description
end
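
As a rough illustration of the fallback, the sketch below takes the meta description and the page paragraphs as plain arguments. The function name and its parameters are hypothetical; secondary_description itself is not shown on this page, so the "first paragraph longer than 120 characters" rule is taken from the description above rather than from its source.

```ruby
# Hypothetical stand-in: paragraphs is an array of paragraph strings.
def pick_description(meta_description, paragraphs)
  # Only nil triggers the fallback, mirroring the nil? check above.
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > 120 }
end

paragraphs = ['Too short.', 'x' * 121, 'y' * 200]

puts pick_description('From the meta tag', paragraphs)
puts pick_description(nil, paragraphs).length
```

Because the check is nil? rather than empty?, an empty meta description is returned as-is instead of falling back to a paragraph.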

#document ⇒ `Object`

Returns the original, unparsed document.

# File 'lib/meta_inspector/scraper.rb', line 115

def document
  @document ||= if html_content_only && content_type != "text/html"
                  raise "The url provided contains #{content_type} content instead of text/html content" and nil
                else
                  request.read
                end
  rescue Exception => e
    add_fatal_error "Scraping exception: #{e.message}"
end
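
The method above records failures instead of letting them propagate, which is what makes #errors and #ok? useful to callers. A minimal sketch of that pattern follows. Fetcher and its loader block are hypothetical, and the sketch appends to the errors array directly because add_fatal_error is not shown on this page. It also rescues StandardError rather than Exception, a deliberate difference so that signals and interrupts are not swallowed.

```ruby
# Hypothetical class illustrating "collect errors, do not raise".
class Fetcher
  attr_reader :errors

  def initialize(&loader)
    @loader = loader   # block that returns the document or raises
    @errors = []
  end

  def document
    @document ||= @loader.call
  rescue StandardError => e
    @errors << "Scraping exception: #{e.message}"
    nil
  end

  def ok?
    errors.empty?
  end
end

good = Fetcher.new { '<html></html>' }
bad  = Fetcher.new { raise IOError, 'connection timed out' }

good_result = good.document
bad_result  = bad.document

puts good.ok?
puts bad.ok?
puts bad.errors.first
```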

#external_links ⇒ `Object`

External links found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 62

def external_links
  @external_links ||= links.select {|link| host_from_url(link) != host }
end

#feed ⇒ `Object`

Returns the feed link from the document's meta tags, trying RSS first and then Atom.

# File 'lib/meta_inspector/scraper.rb', line 79

def feed
  @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
end

#image ⇒ `Object`

Returns the image from Facebook's Open Graph property tags, falling back to the Twitter card image. Most major websites now define this property, and it is usually very relevant. See the docs at developers.facebook.com/docs/opengraph/.

# File 'lib/meta_inspector/scraper.rb', line 74

def image
  meta_og_image || meta_twitter_image
end

#images ⇒ `Object`

Images found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 67

def images
  @images ||= parsed_images.map{ |i| absolutify_url(i) }
end

#internal_links ⇒ `Object`

Internal links found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 57

def internal_links
  @internal_links ||= links.select {|link| host_from_url(link) == host }
end
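
Internal and external links are complementary selections over the same list, split by comparing each link's host with the page host. The sketch below does both in one pass with partition. host_from_url is a private helper not shown on this page, so the version here is a hypothetical stand-in built on URI.parse.

```ruby
require 'uri'

# Hypothetical stand-in for the private helper of the same name.
def host_from_url(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

host  = 'example.com'
links = [
  'http://example.com/about',
  'http://www.example.com/',
  'https://other.org/page'
]

# First array: hosts equal to the page host; second array: everything else.
internal, external = links.partition { |link| host_from_url(link) == host }

puts internal.inspect
puts external.inspect
```

In this sketch the comparison is an exact string match, so http://www.example.com/ lands in the external list when the page host is example.com.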

#links ⇒ `Object`

Links found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 52

def links
  @links ||= parsed_links.map{ |l| absolutify_url(unrelativize_url(l)) }.compact.uniq
end
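
The map, compact, uniq chain can be tried with the standard library alone. In the sketch below absolutify_url is a hypothetical stand-in based on URI.join that takes the base URL as an explicit argument and returns nil for hrefs it cannot parse. The library's own absolutify_url and unrelativize_url helpers are not shown on this page and may behave differently.

```ruby
require 'uri'

# Hypothetical stand-in: resolve href against base, nil when unparseable.
def absolutify_url(base, href)
  URI.join(base, href).to_s
rescue URI::Error
  nil
end

base  = 'http://example.com/blog/'
hrefs = ['/about', 'post-1', 'http://other.org/', '/about', 'http://bad host/']

links = hrefs.map { |href| absolutify_url(base, href) } # resolve each href
             .compact                                   # drop unparseable ones
             .uniq                                      # drop duplicates

puts links.inspect
```

compact removes the nil produced by the malformed href, and uniq collapses the repeated /about entry, so five hrefs become three absolute URLs.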

#ok? ⇒ `Boolean`

Returns true if there are no errors.

Returns:

(Boolean)

# File 'lib/meta_inspector/scraper.rb', line 131

def ok?
  errors.empty?
end

#parsed_document ⇒ `Object`

Returns the whole parsed document.

# File 'lib/meta_inspector/scraper.rb', line 108

def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
  rescue Exception => e
    add_fatal_error "Parsing exception: #{e.message}"
end

#title ⇒ `Object`

Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag.

# File 'lib/meta_inspector/scraper.rb', line 41

def title
  @title ||= parsed_document.css('title').inner_text rescue nil
end

#to_hash ⇒ `Object`

Returns all parsed data as a nested Hash.

# File 'lib/meta_inspector/scraper.rb', line 91

def to_hash
  scrape_meta_data

  {
    'url' => url,
    'title' => title,
    'links' => links,
    'internal_links' => internal_links,
    'external_links' => external_links,
    'images' => images,
    'charset' => charset,
    'feed' => feed,
    'content_type' => content_type
  }.merge @data.to_hash
end
