Sunbro

Some code that I use to crawl the web at scale with Poltergeist and PhantomJS (cf. stretched.io). Uses a bunch of code from the venerable anemone gem. Released in the spirit of jolly cooperation.

Installation

Add this line to your application's Gemfile:

gem 'sunbro'

And then execute:

$ bundle

Or install it yourself as:

$ gem install sunbro

Usage

I use sunbro to crawl the web at scale via Sidekiq on EC2. I've found that web scraping with capybara/poltergeist + phantomjs is a giant pain on JRuby (for various reasons that you'll encounter once you try it), and this gem is basically my collection of fixes that makes it actually work. And it works pretty well; I use in production to crawl 230 sites and counting.

Here's an example of a worker that looks something like what you might find in my code:

class CrawlerWorker

  def perform(opts)
    @connection = Sunbro::Connection.new
    return unless @links = opts[:links]

    links.each do |link|
      next unless page = @connection.get_page(link)
      puts "Page #{page.url} returned code #{page.code} with body size #{page.body.size}"
    end

  ensure
    @connection.close
  end

end

The above uses net-http to fetch connections, and it pools them. This is all you need most of the time. However, if you're scraping a page that is AJAX-heavy, that's where you'll get the most out of sunbro. To use phantomjs to scrape a page, you'll want to call connection.render_page(link). This renders the JS on the page, but doesn't download any images.

The one option to either get_page or render_page is :force_format, can be one of :html, :xml, or :auto. If the option is set to :html, then Nokogiri::HTML will be used to parse page.body; if it's set to :xml, then Nokogiri::XML is used. If it's set to :auto or nil, Nokogiri.parse is called.

Configuration

You can configure a few options in a config/initializers/sunbro.rb file, as follows:

Sunbro::Settings.configure do |config|
  config.user_agent = ENV['USER_AGENT_STRING1']
  config.phantomjs_user_agent = ENV['USER_AGENT_STRING2']
  config.page_format = :auto
end

PhantomJS zombie process monkey patch

I use the following monkey patch for PhantomJS, because it has zombie process issues when it comes to JRuby. This monkey patch kills some minor PhantomJS functionality that I don't use, and you can read more about what it does and why, in this blog post.

I put this in config/initializers/phantomjs.rb

require "capybara"
require "capybara/poltergeist"
require "capybara/poltergeist/utility"

module Capybara::Poltergeist
  Client.class_eval do
    def start
      @pid = Process.spawn(*command.map(&:to_s), pgroup: true)
      ObjectSpace.define_finalizer(self, self.class.process_killer(@pid))
    end

    def stop
      if pid
        kill_phantomjs
        ObjectSpace.undefine_finalizer(self)
      end
    end
  end
end

Next steps

Right now, this is more of a bag of code than a bona fide user-friendly gem. One next step would be to add some configuration options for PhantomJS that get passed via render_page to poltergeist and then on to the command line. Another would be to use net-http-persistent, which is actually included here as a dependency but isn't yet used.

Contributing

  1. Fork it ( http://github.com//sunbro/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request