# grubby
Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See the API summary below, or browse the full documentation.
## Examples
The following example scrapes the Hacker News front page:
require "grubby"
class HackerNews < Grubby::PageScraper
scrapes(:items) do
page.search!(".athing").map{|item| HackerNewsItem.new(item) }
end
end
class HackerNewsItem < Grubby::Scraper
scrapes(:title) { @row1.at!(".storylink").text }
scrapes(:submitter) { @row2.at!(".hnuser").text }
scrapes(:story_uri) { URI.join(@base_uri, @row1.at!(".storylink")["href"]) }
scrapes(:comments_uri) { URI.join(@base_uri, @row2.at!(".age a")["href"]) }
def initialize(source)
@row1 = source
@row2 = source.next_sibling
@base_uri = source.document.url
super
end
end
```ruby
grubby = Grubby.new

# The following line will raise an exception if anything goes wrong
# during the scraping process.  For example, if the structure of the
# HTML does not match expectations, either due to a bad assumption or
# due to a site-wide change, the script will terminate immediately with
# a relevant error message.  This prevents bad values from propagating
# and causing hard-to-trace errors.
hn = HackerNews.new(grubby.get("https://news.ycombinator.com/news"))

puts hn.items.take(10).map(&:title) # your scraping logic goes here
```
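The fail-fast behavior in this example comes from the bang-style query methods (`search!`, `at!`). As a rough sketch of the difference, compare them with Nokogiri's plain `at`, which quietly returns `nil` when nothing matches. This assumes `at!` is available on the page object just as `search!` is in the example above; the exact exception class raised by the bang variants is not specified here.

```ruby
require "grubby"

grubby = Grubby.new
page = grubby.get("https://news.ycombinator.com/news")

# Nokogiri's plain `at` returns nil when a selector matches nothing,
# which lets bad values propagate silently.
page.at(".no-such-element")   # => nil

# The bang variant used in the example above raises instead, so a bad
# assumption about the page structure fails immediately.
page.at!(".no-such-element")  # raises with a relevant error message
```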
## Core API
- `Grubby`
- `Scraper`
- `PageScraper`
- `JsonScraper`
- `Nokogiri::XML::Searchable`
- `Mechanize::Page`
- `Mechanize::Page::Link`
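The `PageScraper` workflow is shown in the example above. For JSON endpoints, a `JsonScraper` subclass can be written in the same style. The sketch below is a minimal illustration only: it assumes that `Grubby::JsonScraper` exposes the parsed JSON via a `json` accessor (analogous to `PageScraper`'s `page`) and that `Grubby#get` handles JSON responses; the endpoint URL and field names are hypothetical.

```ruby
require "grubby"

# Hypothetical scraper for a JSON API response shaped like
# {"id": 1, "title": "..."} -- the field names are illustrative only.
class ItemScraper < Grubby::JsonScraper
  scrapes(:id)    { json.fetch("id") }
  scrapes(:title) { json.fetch("title") }
end

grubby = Grubby.new

# Assumes Grubby#get parses JSON responses into a source object that
# JsonScraper accepts, mirroring how the PageScraper example above is
# fed the result of `grubby.get`.
item = ItemScraper.new(grubby.get("https://example.com/api/items/1.json"))
puts item.title
```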
## Supplemental API

grubby uses several gems which extend core Ruby objects with convenience methods. These methods automatically become available when you require grubby. See each gem's own documentation for the specific methods it provides.
## Installation
Install the gem from RubyGems:

```bash
$ gem install grubby
```

Then require it in your Ruby script:

```ruby
require "grubby"
```
## Contributing
Run `rake test` to run the tests. You can also run `rake irb` for an interactive prompt that pre-loads the project code.