Class: MetaInspector::Scraper

Inherits:
Object
Defined in:
lib/meta_inspector/scraper.rb

Constructor Details

#initialize(url, options = {}) ⇒ Scraper

Initializes a new instance of MetaInspector::Scraper, setting the URL to the one given.

Options:

> timeout: defaults to 20 seconds

> html_content_only: whether an exception should be raised if the content-type of the fetched document is not text/html. Defaults to false

> allow_redirections: when :safe, allows HTTP => HTTPS redirections. When :all, it also allows HTTPS => HTTP

> document: the html of the url as a string

> verbose: whether errors should be logged to the screen



# File 'lib/meta_inspector/scraper.rb', line 22

def initialize(url, options = {})
  options   = defaults.merge(options)

  @url      = with_default_scheme(encode_url(url))
  @scheme   = URI.parse(@url).scheme
  @host     = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout  = options[:timeout]
  @data     = Hashie::Rash.new
  @errors   = []
  @html_content_only  = options[:html_content_only]
  @allow_redirections = options[:allow_redirections]
  @verbose            = options[:verbose]
  @document           = options[:document]
end
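
As a rough illustration of what the constructor derives from its arguments, the sketch below reproduces the option merge and the root URL computation with the standard library only. The `default_options`, `merged_options` and `root_url_for` names are assumptions made for this example. The real `defaults` method is not shown on this page, so only the two default values documented in the option list above are included.

```ruby
require 'uri'

# Hypothetical stand-in for the private `defaults` method, limited to the
# two default values documented in the option list.
def default_options
  { timeout: 20, html_content_only: false }
end

# Caller-supplied options win over the defaults, as with defaults.merge(options).
def merged_options(options = {})
  default_options.merge(options)
end

# Builds the root URL from the scheme and host, as the constructor does.
def root_url_for(url)
  uri = URI.parse(url)
  "#{uri.scheme}://#{uri.host}/"
end

options = merged_options(timeout: 5)
puts options[:timeout]            # 5
puts options[:html_content_only]  # false
puts root_url_for('http://example.com/some/page?x=1')  # http://example.com/
```

The sketch skips `encode_url` and `with_default_scheme`, so it expects a URL that already carries a scheme.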

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(method_name) ⇒ Object (private)

Scrapers for all meta tags, in the form "meta_name", are automatically defined. This has been tested for:

> meta name: keywords, description, robots, generator

> meta http-equiv: content-language, Content-Type

It will first try with meta name="…" and, if nothing is found, with meta http-equiv="…", substituting "_" by "-".

TODO: define respond_to? to return true on the meta_name methods



# File 'lib/meta_inspector/scraper.rb', line 151

def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    key = "og:#{$1}" if key =~ /^og_(.*)/ # special treatment for og:

    

    @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
  else
    super
  end
end
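
The name-to-key mapping in the listing above can be tried in isolation. The following is a minimal sketch, not the library's implementation: `MetaLookup` is a hypothetical class in which two plain hashes stand in for `@data.meta.name` and `@data.meta.property`. It also shows one possible way to address the TODO, using Ruby's `respond_to_missing?` hook.

```ruby
# Hypothetical stand-in for the scraper; plain hashes replace @data.meta.
class MetaLookup
  def initialize(name: {}, property: {})
    @name = name
    @property = property
  end

  # Answers meta_* calls from the name hash first, then the property hash.
  def method_missing(method_name, *args)
    key = meta_key_for(method_name)
    return super if key.nil?

    @name[key] || @property[key]
  end

  # Makes respond_to? return true for the dynamic meta_* methods.
  def respond_to_missing?(method_name, include_private = false)
    !meta_key_for(method_name).nil? || super
  end

  private

  # Maps :meta_description to "description" and :meta_og_image to "og:image".
  # Returns nil when the name does not start with "meta_".
  def meta_key_for(method_name)
    return nil unless method_name.to_s =~ /^meta_(.*)/

    key = $1
    key = "og:#{$1}" if key =~ /^og_(.*)/
    key.downcase
  end
end

page = MetaLookup.new(
  name: { 'description' => 'An example page' },
  property: { 'og:image' => 'http://example.com/logo.png' }
)
puts page.meta_description             # An example page
puts page.meta_og_image                # http://example.com/logo.png
puts page.respond_to?(:meta_keywords)  # true
```

Calls that do not match the `meta_` prefix still reach `super` and raise `NoMethodError`, as in the original listing.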

Instance Attribute Details

#allow_redirections ⇒ Object (readonly)

Returns the value of attribute allow_redirections.



# File 'lib/meta_inspector/scraper.rb', line 13

def allow_redirections
  @allow_redirections
end

#content_type ⇒ Object (readonly)

Returns the content_type of the fetched document



# File 'lib/meta_inspector/scraper.rb', line 125

def content_type
  @content_type
end

#errors ⇒ Object (readonly)

Returns the value of attribute errors.



# File 'lib/meta_inspector/scraper.rb', line 12

def errors
  @errors
end

#host ⇒ Object (readonly)

Returns the value of attribute host.



# File 'lib/meta_inspector/scraper.rb', line 12

def host
  @host
end

#html_content_only ⇒ Object (readonly)

Returns the value of attribute html_content_only.



# File 'lib/meta_inspector/scraper.rb', line 12

def html_content_only
  @html_content_only
end

#root_url ⇒ Object (readonly)

Returns the value of attribute root_url.



# File 'lib/meta_inspector/scraper.rb', line 12

def root_url
  @root_url
end

#scheme ⇒ Object (readonly)

Returns the value of attribute scheme.



# File 'lib/meta_inspector/scraper.rb', line 12

def scheme
  @scheme
end

#timeout ⇒ Object (readonly)

Returns the value of attribute timeout.



# File 'lib/meta_inspector/scraper.rb', line 12

def timeout
  @timeout
end

#url ⇒ Object (readonly)

Returns the value of attribute url.



# File 'lib/meta_inspector/scraper.rb', line 12

def url
  @url
end

#verbose ⇒ Object (readonly)

Returns the value of attribute verbose.



# File 'lib/meta_inspector/scraper.rb', line 13

def verbose
  @verbose
end

Instance Method Details

#charset ⇒ Object

Returns the charset from the meta tags, looking for it in the following order:

> <meta charset='utf-8' />

> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />



# File 'lib/meta_inspector/scraper.rb', line 85

def charset
  @charset ||= (charset_from_meta_charset || charset_from_content_type)
end
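
The lookup order described above can be sketched without an HTML parser. This is a simplified illustration under stated assumptions: `charset_param` and `pick_charset` are hypothetical helpers, and the values of the two meta tags are passed in as plain strings rather than read from the parsed document, as the private `charset_from_meta_charset` and `charset_from_content_type` methods presumably do.

```ruby
# Extracts the charset parameter from a Content-Type value such as
# "text/html; charset=windows-1252". Returns nil when there is none.
def charset_param(content)
  match = content.to_s.match(/charset=([\w-]+)/i)
  match && match[1]
end

# Prefers the <meta charset> value and falls back to the http-equiv one.
def pick_charset(meta_charset, http_equiv_content)
  meta_charset || charset_param(http_equiv_content)
end

puts pick_charset('utf-8', 'text/html; charset=windows-1252')  # utf-8
puts pick_charset(nil, 'text/html; charset=windows-1252')      # windows-1252
```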

#description ⇒ Object

A description getter that first checks for a meta description and, if none is present, guesses one by looking for the first paragraph with more than 120 characters.



# File 'lib/meta_inspector/scraper.rb', line 46

def description
  meta_description.nil? ? secondary_description : meta_description
end
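
The fallback rule can be expressed on plain strings, as in the sketch below. `pick_description` is a hypothetical helper, and the paragraphs are assumed to be already extracted as an array of strings. The private `secondary_description` method is not shown on this page, so returning nil when no paragraph qualifies is an assumption of this example.

```ruby
# Returns the meta description when present; otherwise the first paragraph
# longer than min_length characters, or nil if no paragraph qualifies.
def pick_description(meta_description, paragraphs, min_length = 120)
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > min_length }
end

long_paragraph = 'word ' * 30   # 150 characters
puts pick_description('From the meta tag', [long_paragraph])  # From the meta tag
puts pick_description(nil, ['Too short', long_paragraph]) == long_paragraph  # true
```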

#document ⇒ Object

Returns the original, unparsed document



# File 'lib/meta_inspector/scraper.rb', line 114

def document
  @document ||= if html_content_only && content_type != "text/html"
                  raise "The url provided contains #{content_type} content instead of text/html content" and nil
                else
                  request.read
                end
  rescue Exception => e
    add_fatal_error "Scraping exception: #{e.message}"
end
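
The listing above does not let the content-type exception escape: it is rescued and recorded as an error, which is what `#ok?` later reports. A simplified sketch of that pattern follows. `DocumentReader` is a hypothetical class, the body is passed in instead of being fetched with `request.read`, and appending to `@errors` is an assumed stand-in for the private `add_fatal_error` helper, which is not shown on this page.

```ruby
# Hypothetical reader that records failures instead of raising them.
class DocumentReader
  attr_reader :errors

  def initialize(content_type:, body:, html_content_only: false)
    @content_type = content_type
    @body = body
    @html_content_only = html_content_only
    @errors = []
  end

  # Returns the body, or nil after recording an error when HTML is required
  # and the content type is something else.
  def document
    @document ||= begin
      if @html_content_only && @content_type != 'text/html'
        raise "The url provided contains #{@content_type} content instead of text/html content"
      end
      @body
    end
  rescue StandardError => e
    @errors << "Scraping exception: #{e.message}"
    nil
  end

  def ok?
    errors.empty?
  end
end

reader = DocumentReader.new(content_type: 'application/pdf', body: '%PDF', html_content_only: true)
reader.document
puts reader.ok?  # false
```

The sketch rescues `StandardError` rather than `Exception`, a deliberate difference from the listing so that signals and interrupts are not swallowed.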

#external_links ⇒ Object

External links found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 61

def external_links
  @external_links ||= links.select {|link| host_from_url(link) != host }
end
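
External and internal links are split by comparing each link's host with the scraper's own host. The sketch below shows the idea with the standard library. `host_of` is a hypothetical replacement for the private `host_from_url` helper, whose actual behaviour is not shown on this page, and `split_links` is a name made up for this example.

```ruby
require 'uri'

# Hypothetical equivalent of host_from_url; returns nil for unparsable URLs.
def host_of(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

# Splits absolute links into those on the given host and all the others.
def split_links(links, host)
  internal, external = links.partition { |link| host_of(link) == host }
  { internal: internal, external: external }
end

links = ['http://example.com/a', 'http://other.org/b', 'http://example.com/c']
result = split_links(links, 'example.com')
puts result[:internal].length  # 2
puts result[:external].first   # http://other.org/b
```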

#feed ⇒ Object

Returns the feed link found in the parsed document's meta tags, looking for an RSS link first and falling back to an Atom link



# File 'lib/meta_inspector/scraper.rb', line 78

def feed
  @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
end

#image ⇒ Object

Returns the image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/



# File 'lib/meta_inspector/scraper.rb', line 73

def image
  meta_og_image
end

#images ⇒ Object

Images found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 66

def images
  @images ||= parsed_images.map{ |i| absolutify_url(i) }
end

#internal_links ⇒ Object

Internal links found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 56

def internal_links
  @internal_links ||= links.select {|link| host_from_url(link) == host }
end

#links ⇒ Object

Links found on the page, as absolute URLs



# File 'lib/meta_inspector/scraper.rb', line 51

def links
  @links ||= parsed_links.map{ |l| absolutify_url(unrelativize_url(l)) }.compact
end
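
Turning the raw href values into absolute URLs and dropping the ones that cannot be resolved can be approximated with `URI.join` from the standard library. This is an assumption-laden sketch: the private `absolutify_url` and `unrelativize_url` helpers are not shown on this page and may work differently, and `absolutify` and `absolute_links` are names invented for this example.

```ruby
require 'uri'

# Resolves a link against the page URL; returns nil when it cannot be parsed.
def absolutify(base_url, link)
  URI.join(base_url, link).to_s
rescue URI::Error
  nil
end

# Maps every raw link to an absolute URL and drops the unresolvable ones,
# mirroring the map followed by compact in the listing above.
def absolute_links(base_url, raw_links)
  raw_links.map { |link| absolutify(base_url, link) }.compact
end

raw = ['other.html', '/top', 'http://other.org/x', 'bad link with spaces']
puts absolute_links('http://example.com/dir/page.html', raw)
```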

#ok? ⇒ Boolean

Returns true if there are no errors

Returns:

  • (Boolean)


# File 'lib/meta_inspector/scraper.rb', line 130

def ok?
  errors.empty?
end

#parsed_document ⇒ Object

Returns the whole parsed document



# File 'lib/meta_inspector/scraper.rb', line 107

def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
  rescue Exception => e
    add_fatal_error "Parsing exception: #{e.message}"
end

#title ⇒ Object

Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag.



# File 'lib/meta_inspector/scraper.rb', line 40

def title
  @title ||= parsed_document.css('title').inner_html.gsub(/\t|\n|\r/, '') rescue nil
end

#to_hash ⇒ Object

Returns all parsed data as a nested Hash



# File 'lib/meta_inspector/scraper.rb', line 90

def to_hash
  

  {
    'url' => url,
    'title' => title,
    'links' => links,
    'internal_links' => internal_links,
    'external_links' => external_links,
    'images' => images,
    'charset' => charset,
    'feed' => feed,
    'content_type' => content_type
  }.merge @data.to_hash
end