# HtmlScraper

HtmlScraper is a Ruby gem that transforms the content of an HTML web page into a JSON document, following an HTML template that you define.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'html_scraper'
```

And then execute:

```shell
$ bundle
```

Or install it yourself as:

```shell
$ gem install html_scraper
```

## Usage

### Simple HTML parsing

Expressions surrounded by `{{ }}` are parsed as simple JSON attributes:

```ruby
template = '
  <div class="person">
    <h5>{{ surname }}</h5>
    <p>{{ name }}</p>
  </div>
'
html = '
  <html>
    <body>
      <div class="person">
        <h5>Eastwood</h5>
        <p>Clint</p>
      </div>
    </body>
  </html>
'
json = HtmlScraper::Scraper.new(template: template).parse(html)
```

The result is returned as a Ruby hash:

```ruby
{:surname=>"Eastwood", :name=>"Clint"}
```
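
Because the result shown above is a Ruby hash with symbol keys, it may need to be serialized before it can be stored or sent as JSON text. The following is a minimal sketch using only Ruby's standard `json` library. The helper name `result_to_json` is hypothetical and not part of html_scraper, and the hash is a literal copy of the result above so that the snippet runs without the gem installed.

```ruby
require 'json'

# Hypothetical helper (not part of html_scraper): serialize a parsed
# result hash into a JSON string using the standard library.
def result_to_json(result)
  JSON.generate(result)
end

# Literal copy of the hash shown above, standing in for the return
# value of HtmlScraper::Scraper#parse.
result = { surname: "Eastwood", name: "Clint" }

puts result_to_json(result)
# prints {"surname":"Eastwood","name":"Clint"}
```

Note that symbol keys become strings in the JSON output; pass `symbolize_names: true` to `JSON.parse` if you need symbols back when reading it.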

### Iterative data

To parse repeated structures, add the `hs-repeat` attribute to the HTML node that is repeated. The attribute's value becomes the key under which the repeated items are collected:

```ruby
template = '
  <div id="people-list">
    <div class="person" hs-repeat="people">
      <h5>{{ surname }}</h5>
      <p>{{ name }}</p>
    </div>
  </div>
'

html = '
  <html>
  <body>
    <div id="people-list">
      <div class="person">
        <h5>Eastwood</h5>
        <p>Clint</p>
      </div>
      <div class="person">
        <h5>Woods</h5>
        <p>James</p>
      </div>
      <div class="person">
        <h5>Kinski</h5>
        <p>Klaus</p>
      </div>
    </div>
  </body>
  </html>
'
json = HtmlScraper::Scraper.new(template: template).parse(html)
```

The result is a Ruby hash containing an array with one entry per repeated node:

```ruby
{:people=>
  [{:surname=>"Eastwood", :name=>"Clint"},
   {:surname=>"Woods", :name=>"James"},
   {:surname=>"Kinski", :name=>"Klaus"}]}
```
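
The array under `:people` can be processed with ordinary Ruby enumeration. As a rough sketch, the snippet below builds display names from the result above. The helper `full_names` is hypothetical and not provided by html_scraper, and the hash is again a literal copy of the documented result, so the gem is not required to run it.

```ruby
# Hypothetical helper (not part of html_scraper): build "Name Surname"
# strings from a result hash that contains a :people array.
def full_names(result)
  # fetch with a default keeps the helper safe when :people is absent.
  result.fetch(:people, []).map do |person|
    "#{person[:name]} #{person[:surname]}"
  end
end

# Literal copy of the result shown above.
result = {
  people: [
    { surname: "Eastwood", name: "Clint" },
    { surname: "Woods", name: "James" },
    { surname: "Kinski", name: "Klaus" }
  ]
}

puts full_names(result)
# prints, one per line: Clint Eastwood, James Woods, Klaus Kinski
```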

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/html_scraper.


## License

The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).