Class: NewsScraper::Scraper

Inherits:

Object

Defined in: lib/news_scraper/scraper.rb

Instance Method Summary

#initialize(query:) ⇒ Scraper (constructor)

Initialize a Scraper object.

#scrape ⇒ Object

Fetches articles from Extraction sources and scrapes the results.

Constructor Details

#initialize(query:) ⇒ `Scraper`

Initialize a Scraper object

Params

query: a keyword argument specifying the query to scrape

# File 'lib/news_scraper/scraper.rb', line 8

def initialize(query:)
  @query = query
end
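
Because query: is a required keyword argument, omitting it raises an ArgumentError. A minimal sketch of that behavior, using a stand-in class (SketchScraper and its query reader are hypothetical and exist only for this illustration; they are not part of NewsScraper):

```ruby
# Stand-in with the same constructor signature as the listing above.
# SketchScraper is hypothetical; it is not the real NewsScraper::Scraper.
class SketchScraper
  # Reader added only for this sketch; the original class does not define one.
  attr_reader :query

  def initialize(query:)
    @query = query
  end
end

# The query must be passed by keyword.
scraper = SketchScraper.new(query: 'shopify')

# Leaving the keyword out fails at construction time.
missing_keyword = begin
  SketchScraper.new
  false
rescue ArgumentError
  true
end
```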

Instance Method Details

#scrape ⇒ `Object`

Fetches articles from Extraction sources and scrapes the results

Yields

Will yield individually extracted articles

Raises

Will raise a Transformers::ScrapePatternNotDefined if an article's root domain is not among the trained root domains
- Will yield the error instead of raising it if a block is given
- Root domains are specified by the article_scrape_patterns.yml file
- An unrecognized root domain needs to be trained; a PR adding the trained domain would be helpful
- You can train the domain by running NewsScraper::Trainer::UrlTrainer.new(URL_TO_TRAIN).train

Returns

transformed_articles: The transformed articles fetched from the extracted sources
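
The yield, raise, and return behavior described above can be sketched without network access. The example below is an assumption-laden stand-in, not the library's code: SketchPatternNotDefined, scrape_like, and the sample URLs are all hypothetical, and the real method fetches and transforms articles rather than reading them from a hash.

```ruby
# Hypothetical stand-in for Transformers::ScrapePatternNotDefined.
class SketchPatternNotDefined < StandardError; end

# Hypothetical stand-in for #scrape. `results` maps each URL to either a
# transformed article (a hash) or an error the transform step would raise.
def scrape_like(results)
  transformed = []
  results.each_value do |result|
    begin
      raise result if result.is_a?(Exception)
      transformed << result
      yield result if block_given?
    rescue SketchPatternNotDefined => e
      # Without a block the error propagates; with one it is yielded instead.
      raise e unless block_given?
      yield e
    end
  end
  transformed
end

results = {
  'https://example.com/a' => { title: 'A' },
  'https://untrained.example/b' => SketchPatternNotDefined.new('untrained.example')
}

# With a block, articles and errors both arrive through the block.
seen = []
returned = scrape_like(results) { |item| seen << item }

# A caller can separate the two kinds of yielded values.
errors, articles = seen.partition { |item| item.is_a?(SketchPatternNotDefined) }
```

In this sketch the return value contains only the successfully transformed articles, while the block also receives the errors; a caller that passes a block therefore has to check the type of each yielded value.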

# File 'lib/news_scraper/scraper.rb', line 27

def scrape
  article_urls = Extractors::GoogleNewsRss.new(query: @query).extract

  transformed_articles = []

  article_urls.each do |article_url|
    payload = Extractors::Article.new(url: article_url).extract
    article_transformer = Transformers::Article.new(url: article_url, payload: payload)

    begin
      transformed_article = article_transformer.transform
      transformed_articles << transformed_article
      yield transformed_article if block_given?
    rescue Transformers::ScrapePatternNotDefined => e
      raise e unless block_given?
      yield e
    end
  end

  transformed_articles
end
