Class: MetaInspector::Scraper

Inherits:
Object
Defined in:
lib/meta_inspector/scraper.rb

Constructor Details

#initialize(url, options = {}) ⇒ Scraper

Initializes a new MetaInspector::Scraper for the given URL. If the URL has no scheme, http:// is prepended by default. Options:

> timeout: number of seconds to wait for the document. Defaults to 20

> html_content_only: whether an exception should be raised when the response content-type is not text/html. Defaults to false



# File 'lib/meta_inspector/scraper.rb', line 18

def initialize(url, options = {})
  url       = encode_url(url)
  @url      = URI.parse(url).scheme.nil? ? 'http://' + url : url
  @scheme   = URI.parse(@url).scheme
  @host     = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout  = options[:timeout] || 20
  @data     = Hashie::Rash.new('url' => @url)
  @errors   = []
  @html_content_only = options[:html_content_only] || false
end
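
The scheme-defaulting and root URL steps of the constructor can be tried in isolation. The following is a minimal sketch using only Ruby's standard uri library. The helper names with_default_scheme and root_url_for are hypothetical and not part of MetaInspector, and the encode_url step is left out.

```ruby
require 'uri'

# Hypothetical helper: prepend http:// when the URL has no scheme,
# mirroring the ternary used in the constructor.
def with_default_scheme(url)
  URI.parse(url).scheme.nil? ? "http://#{url}" : url
end

# Hypothetical helper: build the root URL from the scheme and host.
def root_url_for(url)
  uri = URI.parse(with_default_scheme(url))
  "#{uri.scheme}://#{uri.host}/"
end

puts with_default_scheme('example.com')
puts root_url_for('https://example.com/some/page')
```

Because the scheme is filled in before the host is read, a bare host such as example.com still yields a usable root URL.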

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(method_name) ⇒ Object

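Scrapers for all meta tags are defined automatically as methods of the form meta_name, for example meta_keywords or meta_description. According to the source comment, this has been tested for meta name (keywords, description, robots, generator) and meta http-equiv (content-language, Content-Type).

The source comment describes trying meta name="…" first and, if nothing is found, meta http-equiv="…", substituting "_" with "-". The listing below instead indexes the name and property attributes, falls back from name to property, and rewrites a leading og_ to og: (so meta_og_image reads og:image). TODO: define respond_to? to return true on the meta_name methods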



# File 'lib/meta_inspector/scraper.rb', line 132

def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    #special treatment for og:
    if key =~ /^og_(.*)/
      key = "og:#{$1}"
    end
    unless @data.meta
      @data.meta!.name!
      @data.meta!.property!
      parsed_document.xpath("//meta").each do |element|
        if element.attributes["content"]
          if element.attributes["name"]
            @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
          end

          if element.attributes["property"]
            @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
          end
        end
      end
    end
    @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
  else
    super
  end
end
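
The TODO about respond_to? is commonly addressed in Ruby by defining respond_to_missing? alongside method_missing. The sketch below shows that pattern on a hypothetical MetaTagReader class backed by a plain Hash. It is an illustration under that assumption, not MetaInspector's implementation, and the class, constant, and key_for helper are invented for the example.

```ruby
# Hypothetical stand-in for the scraper: tags are supplied as a Hash
# keyed by downcased meta name or property.
class MetaTagReader
  META_METHOD = /\Ameta_(.+)\z/

  def initialize(tags)
    @tags = tags
  end

  # Answers meta_* calls from the Hash; anything else goes to super.
  def method_missing(method_name, *args)
    match = META_METHOD.match(method_name.to_s)
    return super unless match

    @tags[key_for(match[1])]
  end

  # Makes respond_to?(:meta_description) return true.
  def respond_to_missing?(method_name, include_private = false)
    META_METHOD.match?(method_name.to_s) || super
  end

  private

  # Same special case as the listing: og_image becomes og:image.
  def key_for(suffix)
    suffix.sub(/\Aog_/, 'og:').downcase
  end
end

reader = MetaTagReader.new('description' => 'A page', 'og:image' => 'http://example.com/a.png')
puts reader.meta_description
puts reader.respond_to?(:meta_og_image)
```

Using the same regular expression in both methods keeps respond_to? consistent with what method_missing actually handles.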

Instance Attribute Details

#content_type ⇒ Object (readonly)

Returns the value of attribute content_type.



# File 'lib/meta_inspector/scraper.rb', line 11

def content_type
  @content_type
end

#errors ⇒ Object (readonly)

Returns the value of attribute errors.



# File 'lib/meta_inspector/scraper.rb', line 11

def errors
  @errors
end

#host ⇒ Object (readonly)

Returns the value of attribute host.



# File 'lib/meta_inspector/scraper.rb', line 11

def host
  @host
end

#root_url ⇒ Object (readonly)

Returns the value of attribute root_url.



# File 'lib/meta_inspector/scraper.rb', line 11

def root_url
  @root_url
end

#scheme ⇒ Object (readonly)

Returns the value of attribute scheme.



# File 'lib/meta_inspector/scraper.rb', line 11

def scheme
  @scheme
end

#url ⇒ Object (readonly)

Returns the value of attribute url.



# File 'lib/meta_inspector/scraper.rb', line 11

def url
  @url
end

Instance Method Details

#charset ⇒ Object

Returns the charset from the meta tags, looking for it in the following order:

> <meta charset='utf-8' />

> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />



# File 'lib/meta_inspector/scraper.rb', line 81

def charset
  @data.charset ||= (charset_from_meta_charset || charset_from_content_type)
end
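
The bodies of charset_from_meta_charset and charset_from_content_type are not shown on this page. As a rough sketch of the lookup order only, the following versions work on a raw HTML string with regular expressions. Taking the HTML as an argument, the charset_for wrapper, and the patterns themselves are assumptions made for the example, and the real private methods may work differently.

```ruby
# Hypothetical helper: reads <meta charset="..."> from raw HTML.
def charset_from_meta_charset(html)
  html[/<meta\s+charset=["']?([\w-]+)/i, 1]
end

# Hypothetical helper: reads the charset out of a
# <meta http-equiv="Content-Type" content="...; charset=..."> tag.
def charset_from_content_type(html)
  html[/<meta\s+http-equiv=["']?Content-Type["']?\s+content=["'][^"']*charset=([\w-]+)/i, 1]
end

# The || chain gives <meta charset> priority over the Content-Type tag.
def charset_for(html)
  charset_from_meta_charset(html) || charset_from_content_type(html)
end

puts charset_for('<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />')
```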

#description ⇒ Object

A description getter that first checks for a meta description and, if none is present, guesses one by grabbing the first paragraph longer than 120 characters



# File 'lib/meta_inspector/scraper.rb', line 38

def description
  meta_description.nil? ? secondary_description : meta_description
end
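
The fallback rule can be illustrated without parsing any HTML. In this sketch, pick_description is a hypothetical helper that receives the meta description (or nil) and an Array of paragraph texts in document order. The body of secondary_description is not shown on this page, so this is only an approximation of the behaviour described above.

```ruby
# Hypothetical helper: prefer the meta description, otherwise return
# the first paragraph longer than min_length characters (or nil).
def pick_description(meta_description, paragraphs, min_length = 120)
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > min_length }
end

paragraphs = ['Too short.', 'x' * 150]
puts pick_description(nil, paragraphs).length
```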

#document ⇒ Object

Returns the original, unparsed document



# File 'lib/meta_inspector/scraper.rb', line 105

def document
  @document ||= Timeout::timeout(@timeout) {
    req = open(@url)
    @content_type = @data.content_type = req.content_type

    if @html_content_only && @content_type != "text/html"
       raise "The url provided contains #{@content_type} content instead of text/html content"
    end

    req.read
  }

  rescue SocketError
    add_fatal_error 'Socket error: The url provided does not exist or is temporarily unavailable'
  rescue TimeoutError
    add_fatal_error 'Timeout!!!'
  rescue Exception => e
    add_fatal_error "Scraping exception: #{e.message}"
end
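
The timeout and error-collection pattern used by document can be exercised without any network access. The sketch below wraps an arbitrary block instead of an HTTP request. The fetch_with_timeout helper and its errors argument are hypothetical. It uses Timeout::Error, the spelling newer Ruby versions expect, and rescues StandardError, which is narrower than the Exception rescued in the listing.

```ruby
require 'timeout'

# Hypothetical helper: run the block under a time limit and record a
# message in errors instead of letting the exception propagate.
def fetch_with_timeout(seconds, errors)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  errors << 'Timeout!!!'
  nil
rescue StandardError => e
  errors << "Scraping exception: #{e.message}"
  nil
end

errors = []
fetch_with_timeout(0.05, errors) { sleep 1 }
puts errors.inspect
```

Timeout::Error is rescued before StandardError because it is itself a StandardError subclass; with the clauses reversed, the timeout message would never be recorded.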

#external_links ⇒ Object

External links found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 53

def external_links
  @data.external_links ||= links.select {|link| URI.parse(link).host != @host }
end
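
The host comparison used here is the same test that internal_links applies with the opposite result. As a small sketch, the hypothetical split_links_by_host helper below separates both groups in one pass using only the standard uri library. It assumes every link is already an absolute, well-formed URL.

```ruby
require 'uri'

# Hypothetical helper: returns [internal, external] for the given host.
def split_links_by_host(links, host)
  links.partition { |link| URI.parse(link).host == host }
end

links = ['http://example.com/about', 'http://other.org/page']
internal, external = split_links_by_host(links, 'example.com')
puts internal.inspect
puts external.inspect
```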

#feed ⇒ Object

Returns the absolute URL of the first RSS or Atom feed declared in the document's <link> tags, or nil if none is found



# File 'lib/meta_inspector/scraper.rb', line 63

def feed
  @data.feed ||= parsed_document.xpath("//link").select{ |link|
      link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
    }.map { |link|
      absolutify_url(link.attributes["href"].value)
    }.first rescue nil
end
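
The selection rule in feed can be shown without Nokogiri. In this sketch each <link> element is represented by a plain Hash of its attributes, which is an assumption made for the example. The first_feed_href helper is hypothetical and skips the absolutify_url step.

```ruby
# Hypothetical helper: link_tags is an Array of Hashes standing in for
# <link> elements; returns the href of the first atom/rss link or nil.
def first_feed_href(link_tags)
  feed_tag = link_tags.find { |tag| tag['type'].to_s =~ /(atom|rss)/ }
  feed_tag && feed_tag['href']
end

tags = [
  { 'rel' => 'stylesheet', 'type' => 'text/css', 'href' => '/site.css' },
  { 'rel' => 'alternate', 'type' => 'application/rss+xml', 'href' => '/feed.xml' }
]
puts first_feed_href(tags)
```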

#image ⇒ Object

Returns the image declared in Facebook's Open Graph og:image meta property. Most major websites now define this property, and it is usually highly relevant to the page. See the documentation at developers.facebook.com/docs/opengraph/



# File 'lib/meta_inspector/scraper.rb', line 74

def image
  meta_og_image
end

#images ⇒ Object

Images found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 58

def images
  @data.images ||= parsed_images.map{ |i| absolutify_url(i) }
end

#internal_links ⇒ Object

Internal links found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 48

def internal_links
  @data.internal_links ||= links.select {|link| URI.parse(link).host == @host }
end

#links ⇒ Object

Links found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 43

def links
  @data.links ||= parsed_links.map { |l| absolutify_url(unrelativize_url(l)) }
end
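
The helpers absolutify_url and unrelativize_url are not shown on this page. As a sketch of what turning a link into an absolute URL can look like, the hypothetical absolutify helper below relies on URI.join from the standard library, which resolves relative paths and protocol-relative links (those starting with //) against a base URL. Whether the real helpers behave this way is an assumption.

```ruby
require 'uri'

# Hypothetical helper: resolve link against root_url. Links that are
# already absolute are returned unchanged.
def absolutify(link, root_url)
  URI.join(root_url, link).to_s
end

root = 'http://example.com/'
['/about', '//cdn.example.com/app.js', 'http://other.org/x'].each do |link|
  puts absolutify(link, root)
end
```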

#parsed? ⇒ Boolean

Returns true if parsing has been successful

Returns:

  • (Boolean)


# File 'lib/meta_inspector/scraper.rb', line 93

def parsed?
  !@parsed_document.nil?
end

#parsed_document ⇒ Object

Returns the whole parsed document



# File 'lib/meta_inspector/scraper.rb', line 98

def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
  rescue Exception => e
    add_fatal_error "Parsing exception: #{e.message}"
end

#title ⇒ Object

Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag



# File 'lib/meta_inspector/scraper.rb', line 32

def title
  @data.title ||= parsed_document.css('title').inner_html.gsub(/\t|\n|\r/, '') rescue nil
end
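
The gsub call in title removes tab, newline, and carriage return characters rather than replacing them with spaces. The short sketch below applies the same expression to a plain String; the clean_title helper is hypothetical.

```ruby
# Hypothetical helper applying the same substitution as the listing.
def clean_title(raw)
  raw.gsub(/\t|\n|\r/, '')
end

puts clean_title("\n\tExample Domain\r\n")
```

One consequence is that a title broken across two lines is joined with no space between the parts.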

#to_hash ⇒ Object

Returns all parsed data as a nested Hash



# File 'lib/meta_inspector/scraper.rb', line 86

def to_hash
  # TODO: find a better option to populate the data to the Hash
  image;images;feed;links;charset;title;meta_keywords;internal_links;external_links
  @data.to_hash
end
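
The getters are called before @data.to_hash because each one fills @data lazily with ||=, so a value that has never been requested is not yet stored. The sketch below reproduces that behaviour with a hypothetical LazyPage class and a plain Hash in place of Hashie::Rash. The class, its fixed values, and the raw_data method are invented for the example.

```ruby
# Hypothetical class: each getter memoizes its value into @data.
class LazyPage
  def initialize
    @data = {}
  end

  def title
    @data['title'] ||= 'Example Domain'
  end

  def links
    @data['links'] ||= ['http://example.com/about']
  end

  # Calls every getter first so that all values are present.
  def to_hash
    title
    links
    @data.dup
  end

  # Shows what is stored without triggering the getters.
  def raw_data
    @data.dup
  end
end

page = LazyPage.new
puts page.raw_data.inspect
puts page.to_hash.inspect
```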