# grubby
Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See the API summary below, or browse the full documentation.
## Examples
The following example scrapes the Hacker News front page:
require "grubby"
class HackerNews < Grubby::PageScraper
scrapes(:items) do
page.search!(".athing").map{|item| HackerNewsItem.new(item) }
end
end
class HackerNewsItem < Grubby::Scraper
scrapes(:title) { @row1.at!(".storylink").text }
scrapes(:submitter) { @row2.at!(".hnuser").text }
scrapes(:story_uri) { URI.join(@base_uri, @row1.at!(".storylink")["href"]) }
scrapes(:comments_uri) { URI.join(@base_uri, @row2.at!(".age a")["href"]) }
def initialize(source)
@row1 = source
@row2 = source.next_sibling
@base_uri = source.document.url
super
end
end
```ruby
grubby = Grubby.new

# The following line will raise an exception if anything goes wrong
# during the scraping process.  For example, if the structure of the
# HTML does not match expectations, either due to a bad assumption or
# due to a site-wide change, the script will terminate immediately with
# a relevant error message.  This prevents bad values from propagating
# and causing hard-to-trace errors.
hn = HackerNews.new(grubby.get("https://news.ycombinator.com/news"))

puts hn.items.take(10).map(&:title) # your scraping logic goes here
```
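The fail-fast behavior in this example comes from the bang-style query methods (`search!`, `at!`). As a rough sketch of the difference, compare them with Nokogiri's plain `at`, which quietly returns `nil` when nothing matches. This assumes `at!` is available on the page object just as `search!` is in the example above; the exact exception class raised by the bang variants is not specified here.

```ruby
require "grubby"

grubby = Grubby.new
page = grubby.get("https://news.ycombinator.com/news")

# Nokogiri's plain `at` returns nil when a selector matches nothing,
# which lets bad values propagate silently.
page.at(".no-such-element")   # => nil

# The bang variant used in the example above raises instead, so a bad
# assumption about the page structure fails immediately.
page.at!(".no-such-element")  # raises with a relevant error message
```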
## Core API
- `Grubby`
- `Scraper`
- `PageScraper`
- `JsonScraper`
- `Nokogiri::XML::Searchable`
- `Mechanize::Page`
- `Mechanize::Page::Link`
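The `PageScraper` workflow is shown in the example above. For JSON endpoints, a `JsonScraper` subclass can be written in the same style. The sketch below is a minimal illustration only: it assumes that `Grubby::JsonScraper` exposes the parsed JSON via a `json` accessor (analogous to `PageScraper`'s `page`) and that `Grubby#get` handles JSON responses; the endpoint URL and field names are hypothetical.

```ruby
require "grubby"

# Hypothetical scraper for a JSON API response shaped like
# {"id": 1, "title": "..."} -- the field names are illustrative only.
class ItemScraper < Grubby::JsonScraper
  scrapes(:id)    { json.fetch("id") }
  scrapes(:title) { json.fetch("title") }
end

grubby = Grubby.new

# Assumes Grubby#get parses JSON responses into a source object that
# JsonScraper accepts, mirroring how the PageScraper example above is
# fed the result of `grubby.get`.
item = ItemScraper.new(grubby.get("https://example.com/api/items/1.json"))
puts item.title
```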
## Supplemental API

grubby uses several gems which extend core Ruby objects with convenience methods. These methods automatically become available when you require grubby. See each gem's own documentation for the specific methods it provides.
## Installation
Install the gem from RubyGems:

```bash
$ gem install grubby
```

Then require it in your Ruby script:

```ruby
require "grubby"
```
## Contributing
Run `rake test` to run the tests. You can also run `rake irb` for an interactive prompt that pre-loads the project code.