Sunbro
Some code that I use to crawl the web at scale with Poltergeist and PhantomJS (cf. stretched.io). Uses a bunch of code from the venerable anemone gem. Released in the spirit of jolly cooperation.
Installation
Add this line to your application's Gemfile:
gem 'sunbro'
And then execute:
$ bundle
Or install it yourself as:
$ gem install sunbro
Usage
I use sunbro to crawl the web at scale via Sidekiq on EC2. I've found that web scraping with capybara/poltergeist + phantomjs is a giant pain on JRuby (for various reasons that you'll encounter once you try it), and this gem is basically my collection of fixes that make it actually work. And it works pretty well; I use it in production to crawl 230 sites and counting.
Here's an example of a worker that looks something like what you might find in my code:
class CrawlerWorker
  def perform(opts)
    @connection = Sunbro::Connection.new
    return unless @links = opts[:links]
    @links.each do |link|
      # Skip any link for which get_page returns nothing.
      next unless page = @connection.get_page(link)
      puts "Page #{page.url} returned code #{page.code} with body size #{page.body.size}"
    end
  ensure
    # Guard against Connection.new having raised before @connection was set.
    @connection.close if @connection
  end
end
The above uses net-http to fetch pages, and it pools its
connections. This is all you need most of the time. However, if you're scraping
a page that is AJAX-heavy, that's where you'll get the most out of sunbro.
To use phantomjs to scrape a page, you'll want to call connection.render_page(link).
This renders the JS on the page, but doesn't download any images.
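One way to combine the two calls is a small helper that only pays for PhantomJS when a page is known to need JavaScript. This is a sketch, not part of sunbro: fetch_page and its js: flag are hypothetical names, and it assumes only that the connection responds to get_page and render_page as described above.

```ruby
# Hypothetical helper (not part of sunbro): take the cheap net-http path by
# default and use PhantomJS rendering only when explicitly asked to.
def fetch_page(connection, link, js: false)
  if js
    connection.render_page(link) # runs the page's JavaScript, skips images
  else
    connection.get_page(link)    # plain net-http request
  end
end
```

In a worker like the one above, the AJAX-heavy sites could then be fetched with fetch_page(@connection, link, js: true) and everything else with the default.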
The one option to either get_page or render_page is
:force_format, which can be one of :html, :xml, or :auto. If the
option is set to :html, then Nokogiri::HTML is used to parse
page.body; if it's set to :xml, then Nokogiri::XML is used. If
it's set to :auto or nil, Nokogiri.parse is called.
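The rule can be written down as a tiny lookup. The sketch below only illustrates the rule described above and is not sunbro's actual code: parser_entry_point is a hypothetical name, it returns the name of the Nokogiri entry point rather than calling it (so it runs without Nokogiri installed), and raising on an unknown value is this sketch's choice, not documented sunbro behavior.

```ruby
# Hypothetical illustration of the :force_format rule; not sunbro's source.
def parser_entry_point(force_format = nil)
  case force_format
  when :html then "Nokogiri::HTML"
  when :xml then "Nokogiri::XML"
  when :auto, nil then "Nokogiri.parse"
  else
    # Assumption: how sunbro treats other values is not documented here.
    raise ArgumentError, "unknown force_format: #{force_format.inspect}"
  end
end
```

Assuming the option is passed as a trailing hash, a call would look something like connection.get_page(link, force_format: :xml).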
Configuration
You can configure a few options in a config/initializers/sunbro.rb
file, as follows:
Sunbro::Settings.configure do |config|
config.user_agent = ENV['USER_AGENT_STRING1']
config.phantomjs_user_agent = ENV['USER_AGENT_STRING2']
config.page_format = :auto
end
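Note that ENV[...] returns nil when a variable is unset, so the block above passes nil to the setting on a machine where those variables are missing. A minimal sketch of a fallback, assuming you'd rather use a default string in that case; env_user_agent and DEFAULT_USER_AGENT are hypothetical names and not part of sunbro.

```ruby
# Hypothetical helper: read a user agent string from the environment, falling
# back to a default when the variable is unset or blank.
DEFAULT_USER_AGENT = "sunbro-crawler (example default)".freeze

def env_user_agent(name, env = ENV, default = DEFAULT_USER_AGENT)
  value = env[name].to_s.strip
  value.empty? ? default : value
end

# In config/initializers/sunbro.rb this could then be used as:
#   Sunbro::Settings.configure do |config|
#     config.user_agent = env_user_agent('USER_AGENT_STRING1')
#     config.phantomjs_user_agent = env_user_agent('USER_AGENT_STRING2')
#   end
```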
PhantomJS zombie process monkey patch
I use the following monkey patch because PhantomJS has zombie process issues under JRuby. The patch kills some minor PhantomJS functionality that I don't use; you can read more about what it does and why in this blog post.
I put this in config/initializers/phantomjs.rb:
require "capybara"
require "capybara/poltergeist"
require "capybara/poltergeist/utility"
module Capybara::Poltergeist
  Client.class_eval do
    def start
      @pid = Process.spawn(*command.map(&:to_s), pgroup: true)
      ObjectSpace.define_finalizer(self, self.class.process_killer(@pid))
    end

    def stop
      if pid
        kill_phantomjs
        ObjectSpace.undefine_finalizer(self)
      end
    end
  end
end
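The pgroup: true option in start is a standard Process.spawn option that makes the child the leader of a new process group, which is what allows a process and anything it forks to be signalled together. The sketch below shows that mechanism in isolation on a POSIX system, with sleep standing in for PhantomJS. It is not what Poltergeist's kill_phantomjs does, and the helper names are hypothetical.

```ruby
# Hypothetical helpers showing what pgroup: true enables; POSIX only.

# Spawn a command as the leader of a new process group.
def spawn_in_own_group(*command)
  Process.spawn(*command.map(&:to_s), pgroup: true)
end

# Signal every process in the group led by +pid+, then reap the leader so it
# does not linger as a zombie. Returns the Process::Status, or nil if the
# process was already gone.
def terminate_group(pid, signal = "TERM")
  Process.kill("-#{signal}", pid) # a leading minus targets the process group
  _, status = Process.wait2(pid)
  status
rescue Errno::ESRCH, Errno::ECHILD
  nil
end
```

For example, pid = spawn_in_own_group("sleep", 30) followed by terminate_group(pid) starts a child in its own group and then tears the group down and reaps it.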
Next steps
Right now, this is more of a bag of code than a bona fide user-friendly
gem. One next step would be to add some configuration options for PhantomJS
that get passed via render_page to poltergeist and then on to the
command line. Another would be to use net-http-persistent, which is
actually included here as a dependency but isn't yet used.
Contributing
- Fork it ( http://github.com/ /sunbro/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request