Grell


Grell is a generic web crawler written in Ruby. It can be used to gather data, test pages in a given domain, and more.

Installation

Add this line to your application's Gemfile:

gem 'grell'

And then execute:

$ bundle

Or install it yourself as:

$ gem install grell

Grell uses PhantomJS; you will need to download and install it on your system. Check http://phantomjs.org/ for instructions. Grell has been tested with PhantomJS v1.9.x.

Usage

Crawling an entire site

The main entry point of the library is Grell::Crawler#start_crawling. Grell will yield to your code with each page it finds:

require 'grell'

crawler = Grell::Crawler.new
crawler.start_crawling('http://www.google.com') do |page|
  # Grell will keep yielding to this block with each unique page it finds
  puts "yes we crawled #{page.url}"
  puts "status: #{page.status}"
  puts "headers: #{page.headers}"
  puts "body: #{page.body}"
  puts "We crawled it at #{page.timestamp}"
  puts "We found #{page.links.size} links"
  puts "page id and parent_id #{page.id}, #{page.parent_id}"
end

Grell keeps a list of pages previously crawled and does not visit the same page twice. This list is indexed by the complete URL, including query parameters.
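As a consequence, two URLs that differ only in their query parameters count as two different pages. A minimal sketch (example.com and its query strings are placeholders):

require 'grell'

crawler = Grell::Crawler.new
crawler.start_crawling('http://example.com') do |page|
  # 'http://example.com/list?page=1' and 'http://example.com/list?page=2'
  # are yielded as two separate pages, while a second link to '?page=1'
  # is skipped because that complete URL was already visited.
  puts page.url
end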

Re-retrieving a page

If you want Grell to revisit a page and return its data to you again, return the symbol :retry from the block you pass to the start_crawling method. For instance:

require 'grell'
crawler = Grell::Crawler.new
crawler.start_crawling('http://www.google.com') do |current_page|
  if current_page.status == 500 && current_page.retries == 0
    crawler.restart
    :retry
  end
end

Restarting PhantomJS

If you are doing a long crawl, it is possible that PhantomJS starts failing. To avoid that, you can restart it by calling "restart" on the crawler. That will kill the PhantomJS process and start a new one. Grell will keep the status of pages already visited and of pages discovered but not yet visited, and will continue crawling with the new PhantomJS process.
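For instance, a long crawl could restart PhantomJS periodically (a sketch; the page counter and the threshold of 500 are illustrative choices, not part of Grell's API):

require 'grell'

crawler = Grell::Crawler.new
pages_seen = 0
crawler.start_crawling('http://www.google.com') do |page|
  pages_seen += 1
  # Illustrative policy: restart PhantomJS every 500 pages. Grell keeps
  # its visited and to-be-visited lists across the restart.
  crawler.restart if (pages_seen % 500).zero?
end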

By default, Grell will follow all the links it finds that point to the site you are crawling. It will never follow links pointing outside your site. If you want to further limit the number of links crawled, you can use whitelisting, blacklisting or manual filtering.

Custom URL Comparison

By default, Grell will detect new URLs to visit by comparing the full URL with the URLs of the discovered and visited links. This functionality can be changed by passing a block of code to Grell's start_crawling method. In the example below, the paths of the URLs (instead of the full URLs) will be compared.

require 'grell'

crawler = Grell::Crawler.new

add_match_block = Proc.new do |collection_page, page|
  collection_page.path == page.path
end

crawler.start_crawling('http://www.google.com', add_match_block: add_match_block) do |current_page|
...
end

Whitelisting

require 'grell'

crawler = Grell::Crawler.new
crawler.whitelist([/games\/.*/, '/fun'])
crawler.start_crawling('http://www.google.com')

Here Grell will only follow links to games and '/fun' and ignore all other links. You can provide a regexp, a string (a link is whitelisted if any part of it matches the string), or an array of regexps and/or strings.

Blacklisting

require 'grell'

crawler = Grell::Crawler.new
crawler.blacklist(/games\/.*/)
crawler.start_crawling('http://www.google.com')

Similar to whitelisting, but now Grell will follow every link on this site except those going to /games/...

If you call both whitelist and blacklist, then both apply: a link has to fulfill both conditions to survive. If you call neither, all links on this site will be crawled. Think of these methods as filters.
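For example, both filters can be combined to follow only part of a section (a sketch; the paths are illustrative):

require 'grell'

crawler = Grell::Crawler.new
crawler.whitelist(/games\/.*/)       # only links under /games/ pass...
crawler.blacklist(/games\/archive/)  # ...except those under /games/archive
crawler.start_crawling('http://www.google.com')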

If you have a more complex use case, you can modify the list of links manually. Grell yields the page to you before adding its links to the list of links to visit, so in your block you can add links to or delete links from "page.links" to instruct Grell which links to visit next.
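A minimal sketch of such manual filtering, assuming page.links holds the discovered URLs as strings (the conditions below are illustrative):

require 'grell'

crawler = Grell::Crawler.new
crawler.start_crawling('http://www.google.com') do |page|
  # Remove links we never want Grell to queue...
  page.links.delete_if { |link| link.include?('/logout') }
  # ...and add a link Grell did not discover by itself.
  page.links.push('http://www.google.com/preferences')
end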

Pages' id

Each page has a unique id, accessed by the property 'id'. Each page also stores the id of the page on which it was found, accessed by the property 'parent_id'. The page object generated by accessing the first URL passed to start_crawling (the root) has a 'parent_id' equal to 'nil' and an 'id' equal to 0. Using this information it is possible to construct a directed graph.
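A minimal sketch that uses id and parent_id to build an adjacency list of the crawl graph:

require 'grell'

crawler = Grell::Crawler.new
graph = Hash.new { |hash, key| hash[key] = [] }  # parent id => child ids
crawler.start_crawling('http://www.google.com') do |page|
  graph[page.parent_id] << page.id unless page.parent_id.nil?
end
# graph now describes a directed graph rooted at the page with id 0.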

Errors

When there is an error in the page or an internal error in the crawler (JavaScript crashed the browser, etc.), Grell will return a page with status 404 and the headers will have the following keys:

  • grellStatus: 'Error'
  • errorClass: The class of the error which broke this page.
  • errorMessage: A descriptive message with the information Grell could gather about the error.
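A sketch of detecting such errors inside the crawling block, assuming the headers hash is keyed by the strings listed above:

require 'grell'

crawler = Grell::Crawler.new
crawler.start_crawling('http://www.google.com') do |page|
  if page.headers['grellStatus'] == 'Error'
    puts "#{page.url} broke: #{page.headers['errorClass']}"
    puts page.headers['errorMessage']
  end
end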

Logging

You can pass your logger to Grell. For example in a Rails app:

crawler = Grell::Crawler.new(logger: Rails.logger)

Tests

Run the tests with

bundle exec rake ci

Contributors

Grell is (c) Medidata Solutions Worldwide and owned by its major contributors: