NewsScraper

Simple ETL news scraper in Ruby

A collection of extractors, transformers and loaders for a variety of news feeds and outlets.

Installation

Add this line to your application's Gemfile:

gem 'news_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install news_scraper

Usage

Scraping

NewsScraper::Scraper#scrape will return an array of the transformed data for all Google News RSS articles for the given query.

Optionally, you can pass in a block and it will yield the transformed data on a per-article basis.

It takes in 1 parameter query:.

Array notation

article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]

Block notation

NewsScraper::Scraper.new(query: 'Shopify').scrape do |article_hash|
  # { author: ... }
end

How the Scraper extracts and parses for the information is determined by scrape patterns (see Scrape Patterns).

Transformed Data

Calling NewsScraper::Scraper#scrape with either the array or block notation will yield transformed_data hashes. article_scrape_patterns.yml defines the data types that will be scraped for.

In addition, the uri and root_domain(hostname) of the article will be returned in the hash too.

Example

{
  author: 'Linus Torvald',
  body: 'The Linux kernel developed by Linus Torvald has become the backbone of most electronic devices we use to-date. It powers mobile phones, laptops, embedded devices, and even rockets...',
  description: 'The Linux kernel is one of the most important contributions to the world of technology.',
  keywords: 'linux,kernel,linus,torvald',
  section: 'technology',
  datetime: '1991-10-05T12:00:00+00:00',
  title: 'Linus Linux',
  uri: 'linusworld.com/the-linux-kernel',
  root_domain: 'linusworld.com'
}

Scrape Patterns

Scrape patterns are xpath or CSS patterns used by Nokogiri to extract relevant HTML elements.

Extracting each :data_type (see Example under Transformed Data) requires a scrape pattern. A few :presets are specified in article_scrape_patterns.yml.

Since each news site (identified with :root_domain) uses a different markup, scrape patterns are defined on a per-:root_domain basis.

Specifying scrape patterns for new, undefined :root_domains is called training (see Training).

Training

For each :root_domain, it is neccesary to specify a scrape pattern for each of the :data_types. A rake task was written to provide a CLI for appending new :root_domains using :preset scrape patterns.

Simply run

bundle exec rake scraper:train QUERY=<query>

where the CLI will step through the articles and :root_domains of the articles relevant to <query>.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/richardwu/news_scraper. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.