Scraped
Write declarative scrapers in Ruby
Installation
Add this line to your application’s Gemfile:
```ruby
gem 'scraped'
```
And then execute:
$ bundle
Or install it yourself as:
$ gem install scraped
Usage
To write a standard HTML scraper, start by creating a subclass of `Scraped::HTML` for each type of page you wish to scrape.
For example, if you were scraping a list of people, you might have a `PeopleListPage` class for the list page and a `PersonPage` class for an individual person’s page.
```ruby
require 'scraped'

class ExamplePage < Scraped::HTML
  field :title do
    noko.at_css('h1').text
  end

  field :more_information do
    noko.at_css('a')[:href]
  end
end
```
Then you can create a new instance and pass in a `Scraped::Response` instance.
```ruby
page = ExamplePage.new(response: Scraped::Request.new(url: 'http://example.com').response)

page.title
# => "Example Domain"

page.more_information
# => "http://www.iana.org/domains/reserved"

page.to_h
# => { :title => "Example Domain", :more_information => "http://www.iana.org/domains/reserved" }
```
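To make the `field` and `to_h` pattern more concrete, here is a minimal pure-Ruby sketch of how a declarative `field` macro could work. `TinyScraper` and `ExampleRecord` are hypothetical names used only for illustration, and the input is a plain hash rather than a parsed page; this is not the gem's actual implementation.

```ruby
# Hypothetical stand-in for Scraped::HTML: illustrates the `field` pattern only.
class TinyScraper
  # Each subclass keeps its own ordered list of declared field names.
  def self.fields
    @fields ||= []
  end

  # Record the field name and define an instance method from the block.
  def self.field(name, &block)
    fields << name
    define_method(name, &block)
  end

  def initialize(data)
    @data = data
  end

  # Collect every declared field into a hash keyed by field name.
  def to_h
    self.class.fields.to_h { |name| [name, public_send(name)] }
  end

  private

  attr_reader :data
end

class ExampleRecord < TinyScraper
  field :title do
    data.fetch('heading').strip
  end

  field :more_information do
    data.fetch('link')
  end
end

record = ExampleRecord.new(
  'heading' => ' Example Domain ',
  'link'    => 'http://www.iana.org/domains/reserved'
)
record.title # => "Example Domain"
record.to_h  # => { :title => "Example Domain", :more_information => "http://www.iana.org/domains/reserved" }
```

Because each field is an ordinary instance method, fields can call one another, and `to_h` stays in sync with whatever has been declared.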
Dealing with sections of a page
When writing an HTML scraper you’ll often need to deal with just a part of the page. For example you might want to scrape a table containing a list of people and some associated data.
To do this you can use the `fragment` method, passing it a hash with one entry where the key is the `noko` fragment you want to use and the value is the class that should handle that fragment.
```ruby
class MemberRow < Scraped::HTML
  field :name do
    noko.css('td')[2].text
  end

  field :party do
    noko.css('td')[3].text
  end
end

class AllMembersPage < Scraped::HTML
  field :members do
    noko.css('table.members-list tr').map do |row|
      fragment row => MemberRow
    end
  end
end
```
Passing request headers
To set request headers you can pass a `headers:` argument to `Scraped::Request.new`:
```ruby
response = Scraped::Request.new(url: 'http://example.com', headers: { 'Cookie' => 'user_id=42' }).response
page = ExamplePage.new(response: response)
```
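A `Cookie` header value is a single string of `name=value` pairs separated by `; `, so each header in the hash maps one name to one string. If your cookies live in a hash, a small helper can build that string. `cookie_header` below is a hypothetical helper for illustration, not something the gem provides.

```ruby
# Hypothetical helper: joins cookie names and values into one header string.
def cookie_header(cookies)
  cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
end

headers = { 'Cookie' => cookie_header({ 'user_id' => '42', 'theme' => 'dark' }) }
headers['Cookie'] # => "user_id=42; theme=dark"
```

The resulting `headers` hash is the kind of value you would pass as the `headers:` argument.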
Extending
There are two main ways to extend `scraped` with your own custom logic: custom requests and decorated responses. Custom requests allow you to change where the scraper gets its responses from, e.g. you might want to make requests to archive.org if the site you’re scraping has disappeared. Decorated responses allow you to manipulate the response before it’s passed to the scraper. Scraped comes with some built-in decorators for common tasks, such as making all the link URLs on the page absolute rather than relative.
Custom request strategies
To make a custom request you’ll need to create a class that subclasses `Scraped::Request::Strategy` and defines a `response` method.
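```ruby
require 'digest'
require 'uri'

class FileOnDiskRequest < Scraped::Request::Strategy
  def response
    # Read a previously saved copy of the page from disk instead of the network.
    { body: File.read(filename) }
  end

  private

  # Path of the form "<host>/<SHA1 hex digest of the URL>".
  def filename
    @filename ||= File.join(URI.parse(url).host, Digest::SHA1.hexdigest(url))
  end
end
```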
The `response` method should return a `Hash` which has at least a `body` key. You can also include `status` and `headers` keys in the hash to fill out those fields in the response. If not given, `status` will default to `200` (OK) and `headers` will default to `{}`.
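The defaulting rule described above can be sketched in plain Ruby. `normalise_response` is a hypothetical helper written for this illustration; the gem applies its defaults internally and may do so differently.

```ruby
# Hypothetical helper: applies the documented defaults to a strategy's hash.
def normalise_response(hash)
  raise ArgumentError, 'response hash needs a :body key' unless hash.key?(:body)

  # Values supplied by the strategy win over the defaults.
  { status: 200, headers: {} }.merge(hash)
end

minimal = normalise_response(body: '<h1>Hello</h1>')
minimal[:status]  # => 200
minimal[:headers] # => {}

not_found = normalise_response(body: '', status: 404, headers: { 'Content-Type' => 'text/html' })
not_found[:status] # => 404
```

In other words, a strategy that only knows the page body can return `{ body: ... }` and leave the rest out.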
To use a custom request strategy pass it to `Scraped::Request`:
```ruby
request = Scraped::Request.new(url: 'http://example.com', strategies: [FileOnDiskRequest, Scraped::Request::Strategy::LiveRequest])
page = MyPersonPage.new(response: request.response)
```
Decorated responses
To manipulate the response before it is processed by the scraper, create a class that subclasses `Scraped::Response::Decorator` and defines any of the following methods: `body`, `url`, `status`, `headers`.
```ruby
class AbsoluteLinks < Scraped::Response::Decorator
  def body
    doc = Nokogiri::HTML(super)
    # Only touch anchors that have an href; URI.join cannot resolve nil.
    doc.css('a[href]').each do |link|
      link[:href] = URI.join(url, link[:href]).to_s
    end
    doc.to_s
  end
end
```
As well as the `body` method you can also supply your own `url`, `status` and `headers` methods. You can access the current response body by calling `super` from your `body` method. You can also call `url`, `headers` or `status` to access those properties of the current response.
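The way overriding one method and calling `super` fits together can be pictured with a small pure-Ruby sketch. `SimpleResponse`, `SimpleDecorator`, `StripWhitespace` and `ForceOk` are hypothetical stand-ins, and applying decorators one after another with `reduce` is an assumption made for illustration rather than a description of the gem's internals.

```ruby
# Hypothetical stand-ins for the gem's response and decorator classes.
SimpleResponse = Struct.new(:body, :url, :status, :headers, keyword_init: true)

class SimpleDecorator
  def initialize(response)
    @response = response
  end

  # By default every property passes through unchanged.
  %i[body url status headers].each do |property|
    define_method(property) { @response.public_send(property) }
  end

  # Build a new response from whatever the (possibly overridden) methods return.
  def decorated
    SimpleResponse.new(body: body, url: url, status: status, headers: headers)
  end
end

class StripWhitespace < SimpleDecorator
  def body
    super.strip # `super` is the body of the response being decorated
  end
end

class ForceOk < SimpleDecorator
  def status
    200
  end
end

original = SimpleResponse.new(body: "  <h1>Hi</h1>\n", url: 'http://example.com', status: 203, headers: {})

# Apply each decorator in turn, feeding the result into the next one.
result = [StripWhitespace, ForceOk].reduce(original) { |response, klass| klass.new(response).decorated }
result.body   # => "<h1>Hi</h1>"
result.status # => 200
```

Each decorator only overrides the property it cares about; everything else falls through to the response it wraps.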
To use a response decorator you need to use the `decorator` class method in a `Scraped::HTML` subclass:
```ruby
class PageWithRelativeLinks < Scraped::HTML
  decorator AbsoluteLinks

  # Other fields…
end
```
Configuring requests and responses
When passing an array of request strategies or response decorators, always pass the class rather than an instance. To configure an instance, pass a two-element array where the first element is the class and the second element is the config. With the `decorator` class method, the config follows the class name as keyword arguments:
```ruby
class CustomHeader < Scraped::Response::Decorator
  def headers
    response.headers.merge('X-Greeting' => config[:greeting])
  end
end

class ExamplePage < Scraped::HTML
  decorator CustomHeader, greeting: 'Hello, world'
end
```
With the above code a custom header would be added to the response: `X-Greeting: Hello, world`.
Inheritance with decorators
When you inherit from a class that already has decorators the child class will also inherit the parent’s decorators. There’s currently no way to re-order or remove decorators in child classes, though that may be added in the future.
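One way to picture inherited decorators is a class-level list that each subclass extends. The sketch below is pure Ruby with hypothetical class names, and it uses symbols as placeholders for decorator classes. Running the parent's decorators before the child's is an assumption made for the example, since the text above does not specify the order.

```ruby
# Hypothetical sketch of class-level decorator lists with inheritance.
class BasePage
  # Decorators declared directly on this class.
  def self.own_decorators
    @own_decorators ||= []
  end

  def self.decorator(klass)
    own_decorators << klass
  end

  # Assumed order: the parent's decorators first, then this class's own.
  def self.decorators
    inherited = superclass.respond_to?(:decorators) ? superclass.decorators : []
    inherited + own_decorators
  end
end

class ParentPage < BasePage
  decorator :absolute_links # placeholder for a decorator class
end

class ChildPage < ParentPage
  decorator :strip_tracking # placeholder for a decorator class
end

ParentPage.decorators # => [:absolute_links]
ChildPage.decorators  # => [:absolute_links, :strip_tracking]
```

Declaring a decorator on the child leaves the parent's list untouched, which matches the one-way inheritance described above.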
Built in decorators
Clean link and image URLs
You will likely want to normalize link and image URLs on the page you are scraping. `Scraped::Response::Decorator::CleanUrls` ensures that each link and image URL is absolute and correctly encoded.
```ruby
require 'scraped'

class MemberPage < Scraped::HTML
  decorator Scraped::Response::Decorator::CleanUrls

  field :image do
    # Image url will be absolute and encoded correctly thanks to the decorator.
    noko.at_css('.profile-picture/@src').text
  end
end
```
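For a sense of what "absolute" means here, Ruby's standard `URI.join` resolves a relative reference against the URL of the page it appeared on. This sketch uses only the standard library and made-up example URLs; it shows the resolution rules, not how the built-in decorator is implemented.

```ruby
require 'uri'

# Hypothetical URL of the page being scraped.
page_url = 'http://example.com/members/index.html'

sibling  = URI.join(page_url, 'alice.html').to_s
rooted   = URI.join(page_url, '/images/alice.png').to_s
parent   = URI.join(page_url, '../about').to_s
external = URI.join(page_url, 'http://other.example/x').to_s

sibling  # => "http://example.com/members/alice.html"
rooted   # => "http://example.com/images/alice.png"
parent   # => "http://example.com/about"
external # => "http://other.example/x"
```

References that are already absolute come back unchanged, so it is safe to resolve every link on a page the same way.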
Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/everypolitician/scraped.
License
The gem is available as open source under the terms of the MIT License.