Class: NewsScraper::Scraper

Inherits:
Object
Defined in:
lib/news_scraper/scraper.rb

Instance Method Summary

  • #initialize(query:) ⇒ Scraper (constructor)

  • #scrape ⇒ Object

Constructor Details

#initialize(query:) ⇒ Scraper

Initializes a Scraper object.

Params

  • query: a keyword argument specifying the query to scrape



# File 'lib/news_scraper/scraper.rb', line 8

def initialize(query:)
  @query = query
end

Instance Method Details

#scrape ⇒ Object

Fetches articles from the extraction sources and scrapes the results.

Yields

  • Yields each transformed article individually, as soon as it has been transformed

Raises

  • Raises Transformers::ScrapePatternNotDefined if an article's root domain is not among the supported root domains

    • If a block is given, the error is yielded to the block instead of being raised

    • Root domains are specified in the article_scrape_patterns.yml file

    • A missing root domain needs to be trained; a PR that adds the trained domain is helpful

    • To train a domain, run NewsScraper::Trainer::UrlTrainer.new(URL_TO_TRAIN).train

Returns

  • transformed_articles: The transformed articles fetched from the extracted sources
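
Because a block passed to #scrape may receive either a transformed article or an error object, the caller has to tell the two apart. The following is a minimal sketch of one way to do that. The helper partition_scrape_results and the FakeScraper class are hypothetical names introduced for illustration only; FakeScraper stands in for a real scraper so the sketch runs without the gem or network access, and the shape of an article (a Hash here) is an assumption.

```ruby
# Hypothetical helper (not part of NewsScraper): collects everything that
# #scrape yields and separates transformed articles from yielded errors.
def partition_scrape_results(scraper)
  articles = []
  errors = []

  scraper.scrape do |result|
    # Yielded errors are exception objects; anything else is an article.
    if result.is_a?(Exception)
      errors << result
    else
      articles << result
    end
  end

  [articles, errors]
end

# Stand-in for illustration only. It imitates the documented contract:
# yield each article, yield the error when a block is given, and return
# the transformed articles.
class FakeScraper
  ScrapePatternNotDefined = Class.new(StandardError)

  def scrape
    article = { title: 'Example article' } # assumed article shape
    yield article
    yield ScrapePatternNotDefined.new('untrained.example')
    [article]
  end
end

articles, errors = partition_scrape_results(FakeScraper.new)
```

With the real class, the same helper would presumably be called with NewsScraper::Scraper.new(query: ...) in place of FakeScraper.new.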



# File 'lib/news_scraper/scraper.rb', line 27

def scrape
  article_urls = Extractors::GoogleNewsRss.new(query: @query).extract

  transformed_articles = []

  article_urls.each do |article_url|
    payload = Extractors::Article.new(url: article_url).extract
    article_transformer = Transformers::Article.new(url: article_url, payload: payload)

    begin
      transformed_article = article_transformer.transform
      transformed_articles << transformed_article
      yield transformed_article if block_given?
    rescue Transformers::ScrapePatternNotDefined => e
      raise e unless block_given?
      yield e
    end
  end

  transformed_articles
end
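
One consequence of the rescue clause above is worth spelling out: with a block, an error is yielded and the loop continues, while without a block the first error is re-raised and the method exits before returning the articles collected so far. The sketch below reproduces only that control flow on plain values. PatternMissing and process are hypothetical names, and Symbols are used as an arbitrary stand-in for URLs whose domain has no scrape pattern.

```ruby
# Stand-in error class for this sketch; not the gem's class.
class PatternMissing < StandardError; end

# Mirrors the control flow of #scrape: Strings "transform" successfully,
# Symbols stand for items that have no scrape pattern.
def process(items)
  results = []

  items.each do |item|
    begin
      raise PatternMissing, "no pattern for #{item}" if item.is_a?(Symbol)
      results << item.upcase
      yield results.last if block_given?
    rescue PatternMissing => e
      # Without a block the error propagates and the loop stops here.
      raise e unless block_given?
      yield e
    end
  end

  results
end

# With a block: the error is yielded and processing continues.
seen = []
returned = process(['a', :untrained, 'b']) { |value| seen << value }

# Without a block: the first error aborts the whole call.
aborted_with = nil
begin
  process(['a', :untrained, 'b'])
rescue PatternMissing => e
  aborted_with = e.message
end
```

In the block form the return value contains only the successful results, while the block also sees the error object, so callers that want partial results should pass a block.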