Scraping

A really simple HTML scraping DSL.

Installation

Add this line to your application's Gemfile:

gem 'scraping'

And then execute:

$ bundle

Usage

A simple example

class Person
  include Scraping
  element :name, 'h1'
end

person = Person.scrape('<h1>Millard Fillmore</h1>')
person.name #=> 'Millard Fillmore'

More complex data structures

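You can also scrape arrays, objects, and arrays of objects: use elements for an array of values, section for a nested object, and sections for an array of objects. section and sections blocks can be deeply nested.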

class YouCan
  include Scraping

  elements :scrape, '.scrape span'

  sections :also_scrape, '.also-scrape li' do
    element :name, 'a'
    element :link, 'a/@href'
    elements :numbers, 'span'
  end

  section :nested_scrape do
    element :data, '.data'
  end
end

you_can = YouCan.scrape(<<~HTML)
  <p class="scrape">
    <span>Arrays</span>
    <span>Too</span>
  </p>

  <ul class="also-scrape">
    <li>
      <a href="example.com">Meek Mill</a>
      <span>1</span>
      <span>2</span>
    </li>
    <li><a href="test.com">Drake</a></li>
  </ul>

  <p class="data">Beef</p>
HTML

you_can.scrape #=> ['Arrays', 'Too']

you_can.also_scrape.first.name #=> 'Meek Mill'
you_can.also_scrape.first.link #=> 'example.com'
you_can.also_scrape.first.numbers #=> ['1', '2']

you_can.nested_scrape.data #=> 'Beef'

Customizing extraction

A block passed to #element or #elements receives each matched node and returns the value to store, so you can customize what is extracted.

Passing as: :something calls a method named #extract_something with the matched node instead.

require 'date'

class Advanced
  include Scraping

  # Keep only the first word of the full name.
  element :first_name, '.name' do |node|
    node.text.split(' ').first
  end

  element :birthday, '.birthday', as: :date

  elements :numbers, 'span' do |node|
    node.text.to_i * 10
  end

  private

  def extract_date(node)
    Date.parse(node.text)
  end
end

advanced = Advanced.scrape(<<~HTML)
  <h1 class="name">Millard Fillmore</h1>
  <h2 class="birthday">7-1-1800</h2>
  <span>1</span>
  <span>2</span>
HTML

advanced.first_name #=> 'Millard'
advanced.birthday #=> #<Date: 1800-01-07>
advanced.numbers #=> [10, 20]
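
To make the as: convention concrete, here is a minimal plain-Ruby sketch of how such a lookup could work. It is not the gem's actual implementation: FakeNode, ExtractorSketch, and #extract are hypothetical names used only for illustration, and the rule that a block takes precedence over as: is an assumption.

```ruby
require 'date'

# Hypothetical stand-in for a parsed HTML node; only #text is needed here.
FakeNode = Struct.new(:text)

# Hypothetical illustration class, not part of the scraping gem.
class ExtractorSketch
  # Pick an extraction strategy for one node:
  # 1. a block, if given (assumed to take precedence),
  # 2. as: :name, which dispatches to the private method #extract_name,
  # 3. the node's plain text as the fallback.
  def extract(node, as: nil, &block)
    return block.call(node) if block
    return send("extract_#{as}", node) if as

    node.text
  end

  private

  def extract_date(node)
    Date.parse(node.text)
  end
end

sketch = ExtractorSketch.new
node = FakeNode.new('7-1-1800')

sketch.extract(node)                        # falls back to the node's text
sketch.extract(node, as: :date)             # goes through #extract_date
sketch.extract(node) { |n| n.text.length }  # the block decides the value
```

Because the lookup is by method name, adding a new as: type in this sketch only requires defining another private #extract_... method.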

HTTP

Scraping is totally agnostic of HTTP: it only parses the HTML string you give it. If you need a suggestion for an HTTP client, check out HTTParty.

class HackerNews
  include HTTParty
  include Scraping

  base_uri 'https://news.ycombinator.com'
  elements :stories, '.athing .title > a'

  def self.scrape
    super get('/').body
  end
end

news = HackerNews.scrape
puts news.stories.inspect
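
Since any HTML string will do, the page can also be fetched with Ruby's standard library. The following is a minimal sketch under that assumption; PageFetcher and its fetch method are hypothetical names, not part of Scraping or HTTParty.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper that returns a page body using only the standard library.
module PageFetcher
  # Fetch the given URL and return the response body as a String.
  # Raises if the server does not answer with a 2xx status.
  def self.fetch(url)
    response = Net::HTTP.get_response(URI(url))
    unless response.is_a?(Net::HTTPSuccess)
      raise "Request failed with status #{response.code}"
    end

    response.body
  end
end

# Usage (needs network access and a class that includes Scraping,
# such as Person from the first example):
#   person = Person.scrape(PageFetcher.fetch('https://example.com/'))
```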

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/promptworks/scraping.

License

The gem is available as open source under the terms of the MIT License.