MetaInspector

MetaInspector is a gem for web scraping. Give it a URL and it returns metadata from that page.

Dependencies

MetaInspector uses the nokogiri gem to parse HTML. You can install it from GitHub.

Run the following if you haven’t already:

gem sources -a http://gems.github.com

Then install the gem:

sudo gem install tenderlove-nokogiri

If you’re on Ubuntu, you might need to install these packages before installing nokogiri:

sudo aptitude install libxslt-dev libxml2 libxml2-dev

Installation

Run the following if you haven’t already:

gem sources -a http://gems.github.com

Then install the gem:

sudo gem install jaimeiniesta-metainspector

Usage

Initialize a MetaInspector instance with a URL like this:

page = MetaInspector.new('http://pagerankalert.com')

You can then read the scraped data like this:

page.address       # URL of the page
page.title         # title of the page, as string
page.description   # meta description, as string
page.keywords      # meta keywords, as string
page.links         # array of strings, with every link found on the page
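
Since page.links is a plain array of strings, it can be post-processed with ordinary Ruby. The following is a minimal sketch, not part of MetaInspector: the helper name external_links and the sample data are assumptions made for illustration, and it assumes the array may mix absolute, relative and non-HTTP links.

```ruby
require 'uri'

# Hypothetical helper (not part of MetaInspector): keep only absolute
# http(s) links whose host differs from the host of the page itself.
def external_links(page_address, links)
  page_host = URI.parse(page_address).host
  links.select do |link|
    begin
      uri = URI.parse(link)
      uri.is_a?(URI::HTTP) && !uri.host.nil? && uri.host != page_host
    rescue URI::InvalidURIError
      false # skip anything that is not a parseable URI
    end
  end
end

# Sample data standing in for page.address and page.links
address = 'http://pagerankalert.com'
links = [
  'http://pagerankalert.com/signup',
  '/about',
  'http://www.nuvio.cz/',
  'mailto:someone@example.com'
]

puts external_links(address, links).inspect
```

With a real instance, the call would presumably look like external_links(page.address, page.links).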

The full scraped document is accessible from:

page.document # Nokogiri document that you can use to get any element from the page

Examples

You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example session in irb:

$ irb
>> require 'metainspector'
=> true

>> page = MetaInspector.new('http://pagerankalert.com')
=> #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>

>> page.title
=> "PageRankAlert.com :: Track your pagerank changes"

>> page.description
=> "Track your PageRank(TM) changes and receive alert by email"

>> page.keywords
=> "pagerank, seo, optimization, google"

>> page.links.size
=> 31

>> page.links[30]
=> "http://www.nuvio.cz/"

>> page.document.class
=> Nokogiri::HTML::Document
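
The queue-based spider mentioned above can be sketched roughly as follows. This is an illustration only, not the script shipped in the samples folder: the crawl method, its max_pages parameter and the canned FAKE_WEB link graph are all assumptions, chosen so that the sketch runs without network access.

```ruby
require 'set'

# Hypothetical sketch of a queue-driven spider. `fetch_links` is any callable
# that takes a URL and returns an array of link strings; with MetaInspector
# it could be ->(url) { MetaInspector.new(url).links }.
def crawl(start_url, fetch_links, max_pages = 10)
  queue   = [start_url] # URLs waiting to be visited, first in, first out
  visited = Set.new     # URLs already processed
  order   = []          # visit order, returned to the caller

  until queue.empty? || order.size >= max_pages
    url = queue.shift
    next if visited.include?(url)

    visited << url
    order << url
    fetch_links.call(url).each do |link|
      queue << link unless visited.include?(link)
    end
  end

  order
end

# Canned link graph standing in for real pages, so the sketch runs offline.
FAKE_WEB = {
  'http://a.example' => ['http://b.example', 'http://c.example'],
  'http://b.example' => ['http://a.example', 'http://c.example'],
  'http://c.example' => []
}.freeze

fetcher = ->(url) { FAKE_WEB.fetch(url, []) }
puts crawl('http://a.example', fetcher).inspect
```

Passing the link fetcher in as a callable keeps the traversal logic separate from the scraping, so the same loop can be exercised against canned data before pointing it at live pages.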

To Do

Copyright © 2009 Jaime Iniesta, released under the MIT license