Scraping
A really simple HTML scraping DSL.
Installation
Add this line to your application's Gemfile:
gem 'scraping'
And then execute:
$ bundle
Usage
A simple example
class Person
  include Scraping

  element :name, 'h1'
end

person = Person.scrape('<h1>Millard Fillmore</h1>')
person.name #=> 'Millard Fillmore'
More complex data structures
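You can also scrape arrays, objects, and arrays of objects: elements returns an array of values, section returns a nested object, and sections returns an array of nested objects. section and sections blocks can be deeply nested.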
class YouCan
  include Scraping

  # Match each <span> so that every one becomes its own array entry.
  elements :scrape, '.scrape span'

  sections :also_scrape, '.also-scrape li' do
    element :name, 'a'
    element :link, 'a/@href'
    elements :numbers, 'span'
  end

  section :nested_scrape do
    element :data, '.data'
  end
end
you_can = YouCan.scrape(<<~HTML)
  <p class="scrape">
    <span>Arrays</span>
    <span>Too</span>
  </p>

  <ul class="also-scrape">
    <li>
      <a href="example.com">Meek Mill</a>
      <span>1</span>
      <span>2</span>
    </li>
    <li><a href="test.com">Drake</a></li>
  </ul>

  <p class="data">Beef</p>
HTML
you_can.scrape #=> ['Arrays', 'Too']
you_can.also_scrape.first.name #=> 'Meek Mill'
you_can.also_scrape.first.link #=> 'example.com'
you_can.also_scrape.first.numbers #=> ['1', '2']
you_can.nested_scrape.data #=> 'Beef'
Customizing extraction
A block given to #element or #elements receives the matched node and returns the value to store, so you can customize what is extracted.
Passing as: :something instead calls an instance method named #extract_something with the node.
require 'date'

class Advanced
  include Scraping

  element :first_name, '.name' do |node|
    node.text.split(' ').first
  end

  element :birthday, '.birthday', as: :date

  elements :numbers, 'span' do |node|
    node.text.to_i * 10
  end

  private

  def extract_date(node)
    Date.parse(node.text)
  end
end
advanced = Advanced.scrape(<<~HTML)
  <h1 class="name">Millard Fillmore</h1>
  <h2 class="birthday">7-1-1800</h2>
  <span>1</span>
  <span>2</span>
HTML
advanced.first_name #=> 'Millard'
advanced.birthday #=> #<Date: 1800-01-07>
advanced.numbers #=> [10, 20]
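To make the block and as: conventions above more concrete, here is a minimal, self-contained sketch of how that kind of dispatch could be written in plain Ruby. It is not the gem's implementation: TinyExtract, field, and Profile are hypothetical names, the "document" is a plain Hash rather than parsed HTML, and the rule that a block takes precedence over as: is an assumption of this sketch. It also uses Date.strptime with an explicit format, which avoids relying on Date.parse to guess the day/month order.

```ruby
require 'date'

# Hypothetical stand-in for the DSL; NOT the gem's actual implementation.
module TinyExtract
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Defines a reader that looks up a raw value and runs it through an extractor.
    def field(name, key, as: nil, &block)
      define_method(name) do
        raw = @data.fetch(key)
        if block
          block.call(raw)            # assumption: a block wins when both are given
        elsif as
          send("extract_#{as}", raw) # as: :date dispatches to #extract_date
        else
          raw
        end
      end
    end
  end

  def initialize(data)
    @data = data
  end
end

class Profile
  include TinyExtract

  field :name, 'name'

  field :first_name, 'name' do |raw|
    raw.split(' ').first
  end

  field :birthday, 'birthday', as: :date

  private

  # Explicit day-month-year format, so '7-1-1800' cannot be misread.
  def extract_date(raw)
    Date.strptime(raw, '%d-%m-%Y')
  end
end

profile = Profile.new({ 'name' => 'Millard Fillmore', 'birthday' => '7-1-1800' })
profile.first_name #=> 'Millard'
profile.birthday   # a Date for 7 January 1800
```

Because the extractor is looked up with send, it can stay private, which matches the private extract_date method in the example above.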
HTTP
Scraping is agnostic about HTTP: you fetch the HTML however you like and pass it in as a string. If you need a suggestion for a client, check out HTTParty.
class HackerNews
  include HTTParty
  include Scraping

  base_uri 'https://news.ycombinator.com'

  elements :stories, '.athing .title > a'

  # Fetch the front page, then hand its HTML to Scraping's .scrape.
  def self.scrape
    super(get('/').body)
  end
end

news = HackerNews.scrape
puts news.stories.inspect
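If you would rather not add a dependency, the same idea can be sketched with Ruby's standard library. The fetch_html helper and its client: parameter below are hypothetical names for illustration; the helper does not follow redirects and treats any status outside 2xx as an error, which may or may not suit your use case.

```ruby
require 'net/http'
require 'uri'

# Fetches a page and returns its body as a String, ready to pass to .scrape.
# `client` is injectable so the helper can be exercised without a network;
# it only needs to respond to get_response(uri) like Net::HTTP does.
def fetch_html(url, client: Net::HTTP)
  response = client.get_response(URI(url))
  code = response.code.to_i
  raise "Unexpected HTTP status #{code} for #{url}" unless (200..299).cover?(code)

  response.body
end

# Usage with any class that includes Scraping, for example:
#   html = fetch_html('https://example.com/')
#   person = Person.scrape(html)
```

Keeping the fetch separate from the scraping class means the class can be tested against static HTML strings.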
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/promptworks/scraping.
License
The gem is available as open source under the terms of the MIT License.