grubby

Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See API listing below, or browse the full documentation.

Examples

The following code scrapes stories from the Hacker News front page:

require "grubby"

class HackerNews < Grubby::PageScraper
  scrapes(:items) do
    page.search!(".athing").map{|element| Item.new(element) }
  end

  class Item < Grubby::Scraper
    scrapes(:story_link){ source.at!("a.storylink") }

    scrapes(:story_url){ expand_url(story_link["href"]) }

    scrapes(:title){ story_link.text }

    scrapes(:comments_link, optional: true) do
      source.next_sibling.search!(".subtext a").find do |link|
        link.text.match?(/comment|discuss/)
      end
    end

    scrapes(:comments_url, if: :comments_link) do
      expand_url(comments_link["href"])
    end

    scrapes(:comment_count, if: :comments_link) do
      comments_link.text.to_i
    end

    def expand_url(url)
      url.include?("://") ? url : source.document.uri.merge(url).to_s
    end
  end
end

# The following line will raise an exception if anything goes wrong
# during the scraping process.  For example, if the structure of the
# HTML does not match expectations due to a site change, the script will
# terminate immediately with a helpful error message.  This prevents bad
# data from propagating and causing hard-to-trace errors.
hn = HackerNews.scrape("https://news.ycombinator.com/news")

# Your processing logic goes here:
hn.items.take(10).each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end

Hacker News also offers a JSON API, which may be more robust for scraping purposes. grubby can scrape JSON just as well:

require "grubby"

class HackerNews < Grubby::JsonScraper
  scrapes(:items) do
    # API returns array of top 500 item IDs, so limit as necessary
    json.take(10).map do |item_id|
      Item.scrape("https://hacker-news.firebaseio.com/v0/item/#{item_id}.json")
    end
  end

  class Item < Grubby::JsonScraper
    scrapes(:story_url){ json["url"] || hn_url }

    scrapes(:title){ json["title"] }

    scrapes(:comments_url, optional: true) do
      hn_url if json["descendants"]
    end

    scrapes(:comment_count, optional: true) do
      json["descendants"]&.to_i
    end

    def hn_url
      "https://news.ycombinator.com/item?id=#{json["id"]}"
    end
  end
end

hn = HackerNews.scrape("https://hacker-news.firebaseio.com/v0/topstories.json")

# Your processing logic goes here:
hn.items.each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end

Core API

Grubby
Scraper
- .each
- .scrape
- .scrapes
- #[]
- #to_h
PageScraper
- .scrape_file
- #page
JsonScraper
- .scrape_file
- #json
Mechanize::File
- #save_to
- #save_to!
Mechanize::Page
- #at!
- #search!
Mechanize::Page::Link
- #to_absolute_uri
URI
- #basename
- #query_param

Auxiliary API

grubby loads several gems that extend Ruby objects with utility methods. Some of those methods are listed below. See each gem's documentation for a complete API listing.

Installation

Install the gem:

$ gem install grubby

Then require in your Ruby code:

require "grubby"

Contributing

Run rake test to run the tests.

License

MIT License