
Simple ETL news scraper in Ruby


A collection of extractors, transformers and loaders for a variety of news feeds and outlets.


Add this line to your application's Gemfile:

gem 'news_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install news_scraper



NewsScraper::Scraper#scrape will return an array of the transformed data for all Google News RSS articles for the given query.

Optionally, you can pass in a block and it will yield the transformed data on a per-article basis.

The scraper takes a single parameter, query:.

Array notation

article_hashes = NewsScraper::Scraper.new(query: 'Shopify').scrape # [ { author: ... }, { author: ... } ... ]

Block notation

NewsScraper::Scraper.new(query: 'Shopify').scrape do |article_hash|
  # { author: ... }
end

How the scraper extracts and parses this information is determined by scrape patterns (see Scrape Patterns).
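
The array and block calling styles can be illustrated with a small stand-in class. This is only a sketch: SketchScraper and its canned articles are hypothetical, it makes no network requests, and what the real NewsScraper::Scraper#scrape returns when a block is given is not stated here, so the return value in that case is an assumption.

```ruby
# Hypothetical stand-in for NewsScraper::Scraper. It only mimics the
# array-or-block calling convention and never touches the network.
class SketchScraper
  def initialize(query:)
    @query = query
  end

  # Without a block: returns an array of article hashes.
  # With a block: yields each article hash in turn.
  # (Returning the array in the block case is an assumption of this sketch.)
  def scrape
    articles = canned_articles
    return articles unless block_given?

    articles.each { |article| yield article }
  end

  private

  # Canned data standing in for scraped articles.
  def canned_articles
    [
      { title: "#{@query} article 1", author: 'A. Writer' },
      { title: "#{@query} article 2", author: 'B. Writer' }
    ]
  end
end

# Array notation: collect everything, then process.
article_hashes = SketchScraper.new(query: 'Shopify').scrape
puts article_hashes.length

# Block notation: handle each article as it is produced.
SketchScraper.new(query: 'Shopify').scrape do |article_hash|
  puts article_hash[:title]
end
```

The block form is convenient when each article can be processed independently, since nothing forces the caller to hold the full array.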

Transformed Data

Calling NewsScraper::Scraper#scrape with either the array or block notation produces transformed_data hashes. article_scrape_patterns.yml defines the data types that are scraped.

The uri and root_domain (hostname) of the article are also included in the hash.


Example

{
  author: 'Linus Torvalds',
  body: 'The Linux kernel developed by Linus Torvalds has become the backbone of most electronic devices we use today. It powers mobile phones, laptops, embedded devices, and even rockets...',
  description: 'The Linux kernel is one of the most important contributions to the world of technology.',
  keywords: 'linux,kernel,linus,torvalds',
  section: 'technology',
  datetime: '1991-10-05T12:00:00+00:00',
  title: 'Linus Linux',
  uri: '',
  root_domain: ''
}
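
Because the values arrive as plain strings, a consumer will often want to normalize them. The following is a minimal sketch that assumes keywords is comma-separated and datetime is ISO 8601, as in the example above. The summarize helper and the sample uri are hypothetical, and deriving the host from the uri assumes root_domain corresponds to the URI's hostname.

```ruby
require 'time'
require 'uri'

# Sample hash shaped like the example above. The uri is a made-up
# placeholder used only for illustration.
SAMPLE_ARTICLE = {
  keywords: 'linux,kernel,linus',
  datetime: '1991-10-05T12:00:00+00:00',
  uri: 'https://news.example.com/tech/linus-linux'
}.freeze

# Hypothetical helper: converts string fields into richer Ruby objects.
def summarize(article)
  {
    keywords: article[:keywords].to_s.split(',').map(&:strip),
    published_at: Time.iso8601(article[:datetime]),
    host: URI.parse(article[:uri]).host
  }
end

summary = summarize(SAMPLE_ARTICLE)
puts summary[:keywords].inspect
puts summary[:published_at].year
puts summary[:host]
```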

Scrape Patterns

Scrape patterns are XPath or CSS patterns used by Nokogiri to extract the relevant HTML elements.
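
To make the idea of an XPath scrape pattern concrete, here is a rough sketch. To keep it dependency-free it uses REXML rather than Nokogiri, so it is a stand-in and not how this gem does it. Unlike Nokogiri's HTML parser, REXML needs well-formed XML. The sample markup, the extract helper, and the patterns are all hypothetical.

```ruby
require 'rexml/document'

# Tiny, well-formed sample page (hypothetical markup).
SAMPLE_MARKUP = <<~XML
  <html>
    <head>
      <title>Linus Linux</title>
      <meta name="author" content="Linus Torvalds"/>
    </head>
    <body><h1 class="headline">Linus Linux</h1></body>
  </html>
XML

SAMPLE_DOC = REXML::Document.new(SAMPLE_MARKUP)

# Hypothetical helper: returns the text matched by an XPath pattern,
# whether the pattern selects an attribute or an element.
def extract(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  case node
  when REXML::Attribute then node.value
  when REXML::Element then node.text
  end
end

puts extract(SAMPLE_DOC, "//meta[@name='author']/@content")
puts extract(SAMPLE_DOC, "//h1[@class='headline']")
```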

Extracting each :data_type (see Example under Transformed Data) requires a scrape pattern. A few :presets are specified in article_scrape_patterns.yml.

Since each news site (identified by its :root_domain) uses different markup, scrape patterns are defined on a per-:root_domain basis.

Specifying scrape patterns for new, undefined :root_domains is called training (see Training).
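
The per-:root_domain lookup described above might be sketched as follows. The YAML layout, key names, and helper methods are all assumptions made for illustration. The real article_scrape_patterns.yml may be organized quite differently.

```ruby
require 'yaml'

# Hypothetical layout; not the actual contents of article_scrape_patterns.yml.
PATTERNS_YAML = <<~YAML
  presets:
    author:
      meta_author:
        method: xpath
        pattern: "//meta[@name='author']/@content"
  domains:
    news.example.com:
      author:
        method: xpath
        pattern: "//meta[@name='author']/@content"
      title:
        method: css
        pattern: "h1.headline"
YAML

PATTERNS = YAML.safe_load(PATTERNS_YAML)

# Looks up the scrape pattern for one data type on one root domain.
# Returns nil when the domain or data type has no pattern defined.
def pattern_for(root_domain, data_type)
  PATTERNS.dig('domains', root_domain, data_type.to_s)
end

# A domain with no entry has not been trained yet.
def trained?(root_domain)
  PATTERNS.fetch('domains').key?(root_domain)
end

puts pattern_for('news.example.com', :title).inspect
puts trained?('unknown.example.org')
```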


Training

For each :root_domain, it is necessary to specify a scrape pattern for each of the :data_types. A rake task provides a CLI for appending new :root_domains using :preset scrape patterns.

Simply run

bundle exec rake scraper:train QUERY=<query>

where the CLI will step through the articles and :root_domains of the articles relevant to <query>.
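
Rake copies NAME=value command-line arguments into ENV, so a task invoked this way can read the query from ENV['QUERY']. The helper below is a small sketch of that step only. training_query is a hypothetical name, and the gem's actual task may validate its input differently.

```ruby
# Hypothetical helper: reads the QUERY=<query> argument the way a Rake
# task would see it, and fails with a usage hint when it is missing.
def training_query(env = ENV)
  query = env['QUERY'].to_s.strip
  if query.empty?
    raise ArgumentError, 'Usage: bundle exec rake scraper:train QUERY=<query>'
  end

  query
end

# A plain hash stands in for ENV here so the example is repeatable.
puts training_query({ 'QUERY' => 'Shopify' })
```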


After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to


