Scraped
Write declarative scrapers in Ruby
Installation
Add this line to your application's Gemfile:
gem 'scraped'
And then execute:
$ bundle
Or install it yourself as:
$ gem install scraped
Usage
To write a standard HTML scraper, start by creating a subclass of
Scraped::HTML
for each type of page you wish to scrape.
For example if you were scraping a list of people you might have a
PeopleListPage
class for the list page and a PersonPage
class for an
individual person's page.
require 'scraped'
class ExamplePage < Scraped::HTML
field :title do
noko.at_css('h1').text
end
field :more_information do
noko.at_css('a')[:href]
end
end
Then you can create a new instance and pass in a Scraped::Response
instance.
page = ExamplePage.new(response: Scraped::Request.new(url: 'http://example.com').response)
page.title
# => "Example Domain"
page.more_information
# => "http://www.iana.org/domains/reserved"
page.to_h
# => { :title => "Example Domain", :more_information => "http://www.iana.org/domains/reserved" }
Dealing with sections of a page
When writing an HTML scraper you'll often need to deal with just a part of the page. For example you might want to scrape a table containing a list of people and some associated data.
To do this you can use the fragment
method, passing it a hash with one entry
where the key is the noko
fragment you want to use and the value is the class
that should handle that fragment.
class MemberRow < Scraped::HTML
field :name do
noko.css('td')[2].text
end
field :party do
noko.css('td')[3].text
end
end
class AllMembersPage < Scraped::HTML
field :members do
noko.css('table.members-list tr').map do |row|
fragment row => MemberRow
end
end
end
Extending
There are two main ways to extend scraped
with your own custom logic - custom requests and decorated responses. Custom requests allow you to change where the scraper is getting its responses from, e.g. you might want to make requests to archive.org if the site you're scraping has disappeared. Decorated responses allow you to manipulate the response before it's passed to the scraper. Scraped comes with some built in decorators for common tasks such as making all the link urls on the page absolute rather than relative.
Custom request strategies
To make a custom request you'll need to create a class that subclasses Scraped::Request::Strategy
and defines a response
method.
class FileOnDiskRequest < Scraped::Request::Strategy
def response
{ body: open(filename).read }
end
private
def filename
@filename ||= File.join(URI.parse(url).host, Digest::SHA1.hexdigest(url))
end
end
The response
method should return a Hash
which has at least a body
key. You can also include status
and headers
parameters in the hash to fill out those fields in the response. If not given, status will default to 200
(OK) and headers will default to {}
.
To use a custom request strategy pass it to Scraped::Request
:
request = Scraped::Request.new(url: 'http://example.com', strategies: [FileOnDiskRequest, Scraped::Request::Strategy::LiveRequest])
page = MyPersonPage.new(response: request.response)
Decorated responses
To manipulate the response before it is processed by the scraper create a class that subclasses Scraped::Response::Decorator
and defines any of the following methods: body
, url
, status
, headers
.
class AbsoluteLinks < Scraped::Response::Decorator
def body
doc = Nokogiri::HTML(super)
doc.css('a').each do |link|
link[:href] = URI.join(url, link[:href]).to_s
end
doc.to_s
end
end
As well as the body
method you can also supply your own url
, status
and headers
methods. You can access the current request body by calling super
from your method. You can also call url
, headers
or status
to access those properties of the current response.
To use a response decorator you need to use the decorator
class method in a Scraped::HTML
subclass:
class PageWithRelativeLinks < Scraped::HTML
decorator AbsoluteLinks
# Other fields...
end
Configuring requests and responses
When passing an array of request strategies or response decorators you should always pass the class, rather than the instance. If you want to configure an instance you can pass in a two element array where the first element is the class and the second element is the config:
class CustomHeader < Scraped::Response::Decorator
def headers
response.headers.merge('X-Greeting' => config[:greeting])
end
end
class ExamplePage < Scraped::HTML
decorator CustomHeader, greeting: 'Hello, world'
end
With the above code a custom header would be added to the response: X-Greeting: Hello, world
.
Inheritance with decorators
When you inherit from a class that already has decorators the child class will also inherit the parent's decorators. There's currently no way to re-order or remove decorators in child classes, though that may be added in the future.
Built in decorators
Absolute link and image urls
Very frequently you will find that you need to make links and images on the page
you are scraping absolute rather than relative. Scraped comes with support for
this out of the box via the Scraped::Response::Decorator::AbsoluteUrls
decorator.
require 'scraped'
class MemberPage < Scraped::HTML
decorator Scraped::Response::Decorator::AbsoluteUrls
field :image do
# Image url will be absolute thanks to the decorator.
noko.at_css('.profile-picture/@src').text
end
end
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake test
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/everypolitician/scraped.
License
The gem is available as open source under the terms of the MIT License.