Scraypa

A Ruby gem to scrape web content with configuration options including:

  1. Javscript support
  2. The Onion Router (Tor)
  3. Disguise (TODO)

Scraypa is essentially a wrapper for the light-weight Rest Client (if you dont require javascript support) or Capybara (for Javascript support).

Why?

A web scraper that can be configured to support javascript and/or Tor. If javascript is not required, it will use the lighter Rest Client. Scraypa is an attempt to remove the complexities associated to web agent setup.

Installation

Install Tor (optional)

If you want to use Tor, install tor:

sudo apt-get install tor

Install Headless Chrome (optional)

If you want to use :headless_chrome with capybara, install headless chrome by following instructions here:

http://blog.faraday.io/headless-chromium/

For ubuntu I did this:

  1. Install chromium:

    git clone https://github.com/scheib/chromium-latest-linux.git
    cd chromium-latest-linux
    ./update-and-run.sh
    
  2. Install chromedriver by following the build instructions.

Install Scraypa

Add this line to your application's Gemfile:

gem 'scraypa'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install scraypa

Usage

response = Scraypa.visit(method: :get,
                         url: "http://example.com")

#the response contains the RestClient response object
response.code
#-> 200
response.to_str
#-> http://example.com content

By default Scraypa uses the rest-client gem which does not support Javascript. The #visit method wraps the
RestClient#execute method so you can pass in whatever RestClient#execute will accept, for example:

Scraypa.visit(method: :get, 
              url: 'http://example.com/resource',
              timeout: 10, 
              headers: {params: {foo: 'bar'}})

➔ GET http://example.com/resource?foo=bar

Javascript Support

Capybara is used for Javascript support:

#configure Scraypa to #use_capybara
#and choose your capybara driver, here is poltergeist:
Scraypa.configure do |config|
  config.use_capybara = true
  config.driver = :poltergeist
  config.driver_options = {
      :phantomjs => Phantomjs.path,
      :js_errors => false,
      :phantomjs_options => ["--web-security=true"]
  }

  #or you could instead use headless_chrome:
  #config.driver = :headless_chromium
  #config.driver_options = {
  #  browser: :chrome,
  #  desired_capabilities: Selenium::WebDriver::Remote::Capabilities.chrome(
  #    "chromeOptions" => {
  #      "binary" => "/home/resrev/chromium/src/out/Default/chrome",
  #      "args" => %w{headless no-sandbox disable-gpu}
  #    }
  #  )
  #}
end

#when using capybara, just the url parameter is required:
response = Scraypa.visit(url: "http://example.com")

#the response contains the capybara page object
response.status_code
#-> 200
response.text
#-> http://example.com content 

#execute some javascript:
response.execute_script(
  "document.getElementsByTagName('body')[0].innerHTML = 'changed content';")
response.text
#-> "changed content"

Tor

TODO

Disguise

TODO

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/joshweir/scraypa.

License

The gem is available as open source under the terms of the MIT License.