Class: MetaInspector::Scraper

Inherits: Object

Defined in: lib/meta_inspector/scraper.rb

Instance Attribute Summary

#content_type ⇒ Object readonly

Returns the value of attribute content_type.
#errors ⇒ Object readonly

Returns the value of attribute errors.
#host ⇒ Object readonly

Returns the value of attribute host.
#root_url ⇒ Object readonly

Returns the value of attribute root_url.
#scheme ⇒ Object readonly

Returns the value of attribute scheme.
#url ⇒ Object readonly

Returns the value of attribute url.

Instance Method Summary

#charset ⇒ Object

Returns the charset from the meta tags, checking them in this order: <meta charset="utf-8" />, then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.
#description ⇒ Object

A description getter that first checks for a meta description and, if none is present, guesses one by taking the first paragraph longer than 120 characters.
#document ⇒ Object

Returns the original, unparsed document.
#external_links ⇒ Object

External links found on the page, as absolute URLs.
#feed ⇒ Object

Returns the absolute URL of the first RSS or Atom feed found in the document's <link> tags, or nil if there is none.
#image ⇒ Object

Returns the image from Facebook's Open Graph og:image property tag. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.
#images ⇒ Object

Images found on the page, as absolute URLs.
#initialize(url, options = {}) ⇒ Scraper constructor

Initializes a new MetaInspector::Scraper for the given URL. If the URL has no scheme, http:// is used by default. Options: => timeout: defaults to 20 seconds => html_content_only: whether to raise an exception when the response content type is not text/html; defaults to false.
#internal_links ⇒ Object

Internal links found on the page, as absolute URLs.
#links ⇒ Object

Links found on the page, as absolute URLs.
#method_missing(method_name) ⇒ Object

Scrapers for all meta_tags in the form of “meta_name” are automatically defined.
#parsed? ⇒ Boolean

Returns true if parsing has been successful.
#parsed_document ⇒ Object

Returns the whole parsed document.
#title ⇒ Object

Returns the parsed document title, from the content of the <title> tag.
#to_hash ⇒ Object

Returns all parsed data as a nested Hash.

Constructor Details

#initialize(url, options = {}) ⇒ `Scraper`

Initializes a new MetaInspector::Scraper for the given URL. If the URL has no scheme, http:// is used by default. Options:

=> timeout: defaults to 20 seconds

=> html_content_only: whether to raise an exception when the response content type is not text/html. Defaults to false. The source comment names this option html_content_type_only, but the constructor reads options[:html_content_only].

# File 'lib/meta_inspector/scraper.rb', line 18

def initialize(url, options = {})
  url       = encode_url(url)
  @url      = URI.parse(url).scheme.nil? ? 'http://' + url : url
  @scheme   = URI.parse(@url).scheme
  @host     = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout  = options[:timeout] || 20
  @data     = Hashie::Rash.new('url' => @url)
  @errors   = []
  @html_content_only = options[:html_content_only] || false
end
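The URL handling in the constructor can be tried in isolation. The following is a minimal sketch using only the standard library; normalize_url is a hypothetical helper, not part of MetaInspector, and it skips the encode_url step, whose implementation is not shown here.

```ruby
require 'uri'

# Hypothetical helper mirroring the constructor's scheme defaulting
# and root URL derivation. Not part of MetaInspector.
def normalize_url(url)
  # Prepend http:// only when the URL carries no scheme of its own.
  full   = URI.parse(url).scheme.nil? ? "http://#{url}" : url
  parsed = URI.parse(full)
  {
    url:      full,
    scheme:   parsed.scheme,
    host:     parsed.host,
    root_url: "#{parsed.scheme}://#{parsed.host}/"
  }
end

info = normalize_url('example.com/page')
puts info[:url]       # the URL with the default scheme applied
puts info[:root_url]  # scheme and host only
```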

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(method_name) ⇒ `Object`

Scrapers for all meta tags are automatically defined as methods of the form "meta_name". This has been tested for:

meta name: keywords, description, robots, generator

meta http-equiv: content-language, Content-Type

It will first try with meta name="…" and, if nothing is found, with meta http-equiv="…", substituting "_" with "-".

TODO: define respond_to? to return true on the meta_name methods

# File 'lib/meta_inspector/scraper.rb', line 132

def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    #special treatment for og:
    if key =~ /^og_(.*)/
      key = "og:#{$1}"
    end
    unless @data.meta
      @data.meta!.name!
      @data.meta!.property!
      parsed_document.xpath("//meta").each do |element|
        if element.attributes["content"]
          if element.attributes["name"]
            @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
          end

          if element.attributes["property"]
            @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
          end
        end
      end
    end
    @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
  else
    super
  end
end
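The dispatch pattern above can be illustrated without fetching a page. This is a minimal sketch, assuming the meta tags have already been collected into two plain Hashes; MetaLookup is a hypothetical class, not part of MetaInspector. It also defines respond_to_missing?, which is one possible way to address the TODO about respond_to?.

```ruby
# Hypothetical stand-in for the scraper's meta_* lookup.
class MetaLookup
  # name_tags and property_tags map lowercase tag keys to content values.
  def initialize(name_tags, property_tags)
    @name     = name_tags
    @property = property_tags
  end

  def method_missing(method_name, *args)
    if method_name.to_s =~ /\Ameta_(.*)/
      key = $1
      # meta_og_image looks up the "og:image" property.
      key = "og:#{$1}" if key =~ /\Aog_(.*)/
      @name[key.downcase] || @property[key.downcase]
    else
      super
    end
  end

  # Makes respond_to?(:meta_anything) return true.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('meta_') || super
  end
end

lookup = MetaLookup.new(
  { 'description' => 'A sample page' },
  { 'og:image' => 'http://example.com/a.png' }
)
puts lookup.meta_description
puts lookup.meta_og_image
```

Methods that do not start with meta_ still fall through to super and raise NoMethodError, as in the original.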

Instance Attribute Details

#content_type ⇒ `Object` (readonly)

Returns the value of attribute content_type.




# File 'lib/meta_inspector/scraper.rb', line 11

def content_type
  @content_type
end

#errors ⇒ `Object` (readonly)

Returns the value of attribute errors.




# File 'lib/meta_inspector/scraper.rb', line 11

def errors
  @errors
end

#host ⇒ `Object` (readonly)

Returns the value of attribute host.




# File 'lib/meta_inspector/scraper.rb', line 11

def host
  @host
end

#root_url ⇒ `Object` (readonly)

Returns the value of attribute root_url.




# File 'lib/meta_inspector/scraper.rb', line 11

def root_url
  @root_url
end

#scheme ⇒ `Object` (readonly)

Returns the value of attribute scheme.




# File 'lib/meta_inspector/scraper.rb', line 11

def scheme
  @scheme
end

#url ⇒ `Object` (readonly)

Returns the value of attribute url.




# File 'lib/meta_inspector/scraper.rb', line 11

def url
  @url
end

Instance Method Details

#charset ⇒ `Object`

Returns the charset from the meta tags, checking them in this order: <meta charset="utf-8" />, then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.




# File 'lib/meta_inspector/scraper.rb', line 81

def charset
  @data.charset ||= (charset_from_meta_charset || charset_from_content_type)
end
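The second lookup in that order pulls the charset out of a Content-Type value. The private helpers charset_from_meta_charset and charset_from_content_type are not shown on this page, so the following is only a sketch of how such an extraction could work; extract_charset and its regular expression are assumptions, not the library's code.

```ruby
# Hypothetical helper: extracts the charset parameter from a
# Content-Type value such as "text/html; charset=windows-1252".
# Returns nil when no charset parameter is present.
def extract_charset(content_type_value)
  match = content_type_value.to_s.match(/charset=["']?([\w-]+)/i)
  match && match[1]
end

puts extract_charset('text/html; charset=windows-1252')
```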

#description ⇒ `Object`

A description getter that first checks for a meta description and, if none is present, guesses one by taking the first paragraph longer than 120 characters.




# File 'lib/meta_inspector/scraper.rb', line 38

def description
  meta_description.nil? ? secondary_description : meta_description
end
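The fallback rule can be sketched over plain strings. secondary_description is a private method not shown on this page, so pick_description below is a hypothetical illustration of the documented behaviour; returning nil when no paragraph qualifies is an assumption.

```ruby
# Hypothetical helper: prefers the meta description, otherwise picks the
# first paragraph longer than min_length characters (nil if none qualifies).
def pick_description(meta_description, paragraphs, min_length = 120)
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > min_length }
end

paragraphs = ['Short intro.', 'x' * 150]
puts pick_description(nil, paragraphs).length
```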

#document ⇒ `Object`

Returns the original, unparsed document

# File 'lib/meta_inspector/scraper.rb', line 105

def document
  @document ||= Timeout::timeout(@timeout) {
    req = open(@url)
    @content_type = @data.content_type = req.content_type

    if @html_content_only && @content_type != "text/html"
       raise "The url provided contains #{@content_type} content instead of text/html content"
    end

    req.read
  }

  rescue SocketError
    add_fatal_error 'Socket error: The url provided does not exist or is temporarily unavailable'
  rescue TimeoutError
    add_fatal_error 'Timeout!!!'
  rescue Exception => e
    add_fatal_error "Scraping exception: #{e.message}"
end
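The timeout and error-collection pattern used by #document can be tried without a network request. This is a minimal sketch with a hypothetical fetch_with_timeout helper that takes a block in place of the HTTP call and appends messages to a plain Array in place of add_fatal_error. It rescues Timeout::Error and StandardError, a narrower choice than the rescue Exception in the source.

```ruby
require 'timeout'

# Hypothetical helper: runs the block under a time limit and records
# failures in the errors array instead of raising. Returns nil on failure.
def fetch_with_timeout(seconds, errors)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  errors << 'Timeout!!!'
  nil
rescue StandardError => e
  errors << "Scraping exception: #{e.message}"
  nil
end

errors = []
body = fetch_with_timeout(0.05, errors) { sleep 1; '<html></html>' }
puts body.inspect
puts errors.inspect
```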

#external_links ⇒ `Object`

External links found on the page, as absolute URLs




# File 'lib/meta_inspector/scraper.rb', line 53

def external_links
  @data.external_links ||= links.select {|link| URI.parse(link).host != @host }
end

#feed ⇒ `Object`

Returns the absolute URL of the first RSS or Atom feed found in the document's <link> tags, or nil if there is none.

# File 'lib/meta_inspector/scraper.rb', line 63

def feed
  @data.feed ||= parsed_document.xpath("//link").select{ |link|
      link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
    }.map { |link|
      absolutify_url(link.attributes["href"].value)
    }.first rescue nil
end
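The selection logic can be sketched over plain Hashes standing in for the parsed <link> elements. first_feed is a hypothetical helper, and using URI.join to make the href absolute is an assumption, since absolutify_url is a private method not shown on this page.

```ruby
require 'uri'

# Hypothetical helper: link_tags is an Array of Hashes with 'type' and
# 'href' keys. Returns the first atom/rss href as an absolute URL, or nil.
def first_feed(link_tags, base_url)
  tag = link_tags.find { |t| t['type'].to_s =~ /(atom|rss)/ && t['href'] }
  tag && URI.join(base_url, tag['href']).to_s
end

tags = [
  { 'type' => 'text/css', 'href' => '/style.css' },
  { 'type' => 'application/rss+xml', 'href' => '/feed.xml' }
]
puts first_feed(tags, 'http://example.com/')
```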

#image ⇒ `Object`

Returns the image from Facebook's Open Graph og:image property tag. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.




# File 'lib/meta_inspector/scraper.rb', line 74

def image
  meta_og_image
end

#images ⇒ `Object`

Images found on the page, as absolute URLs




# File 'lib/meta_inspector/scraper.rb', line 58

def images
  @data.images ||= parsed_images.map{ |i| absolutify_url(i) }
end

#internal_links ⇒ `Object`

Internal links found on the page, as absolute URLs




# File 'lib/meta_inspector/scraper.rb', line 48

def internal_links
  @data.internal_links ||= links.select {|link| URI.parse(link).host == @host }
end
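Both #internal_links and #external_links classify a link by comparing its host with the scraper's host. The sketch below shows that comparison over a plain Array of absolute URLs; split_links is a hypothetical helper that produces both lists in one pass, whereas the library computes them separately.

```ruby
require 'uri'

# Hypothetical helper: returns [internal, external], where internal links
# are those whose host equals the given host.
def split_links(links, host)
  links.partition { |link| URI.parse(link).host == host }
end

internal, external = split_links(
  ['http://example.com/a', 'http://other.org/b', 'http://example.com/c'],
  'example.com'
)
puts internal.inspect
puts external.inspect
```

Because the comparison is an exact host match, a link to a subdomain such as www.example.com counts as external when the host is example.com.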

#links ⇒ `Object`

Links found on the page, as absolute URLs




# File 'lib/meta_inspector/scraper.rb', line 43

def links
  @data.links ||= parsed_links.map { |l| absolutify_url(unrelativize_url(l)) }
end

#parsed? ⇒ `Boolean`

Returns true if parsing has been successful

Returns:

(Boolean)




# File 'lib/meta_inspector/scraper.rb', line 93

def parsed?
  !@parsed_document.nil?
end

#parsed_document ⇒ `Object`

Returns the whole parsed document

# File 'lib/meta_inspector/scraper.rb', line 98

def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
  rescue Exception => e
    add_fatal_error "Parsing exception: #{e.message}"
end

#title ⇒ `Object`

Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag.




# File 'lib/meta_inspector/scraper.rb', line 32

def title
  @data.title ||= parsed_document.css('title').inner_html.gsub(/\t|\n|\r/, '') rescue nil
end

#to_hash ⇒ `Object`

Returns all parsed data as a nested Hash

# File 'lib/meta_inspector/scraper.rb', line 86

def to_hash
  # TODO: find a better option to populate the data to the Hash
  image;images;feed;links;charset;title;meta_keywords;internal_links;external_links
  @data.to_hash
end
