Class: NewsScraper::Scraper
- Inherits:
- Object
- Object
- NewsScraper::Scraper
- Defined in:
- lib/news_scraper/scraper.rb
Instance Method Summary
- #initialize(query:) ⇒ Scraper (constructor)
  Initialize a Scraper object.
- #scrape ⇒ Object
  Fetches articles from Extraction sources and scrapes the results.
Constructor Details
#initialize(query:) ⇒ Scraper
Initialize a Scraper object
Params
- query: a keyword argument specifying the query to scrape
# File 'lib/news_scraper/scraper.rb', line 8
def initialize(query:)
  @query = query
end
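As a minimal sketch of what the keyword-only signature implies for callers, the snippet below uses a hypothetical stand-in class (`StandInScraper`, not part of the gem) that copies only the constructor shown above. The `attr_reader` is added purely for illustration and is an assumption; the real class may not expose `query`.

```ruby
# Hypothetical stand-in that mirrors only the documented constructor.
# The real class is NewsScraper::Scraper in lib/news_scraper/scraper.rb.
class StandInScraper
  # Assumption for illustration: the real class may not expose a reader.
  attr_reader :query

  def initialize(query:)
    @query = query
  end
end

# The query must be passed by keyword.
scraper = StandInScraper.new(query: 'shopify')
puts scraper.query

# Omitting the required keyword raises ArgumentError.
begin
  StandInScraper.new
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

With the gem loaded, the equivalent call would presumably be `NewsScraper::Scraper.new(query: 'shopify')`.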
Instance Method Details
#scrape ⇒ Object
Fetches articles from Extraction sources and scrapes the results
Yields
- Each extracted article, one at a time
Raises
- Transformers::ScrapePatternNotDefined if an article's root domain is not among the defined root domains
  - If a block is given, the error is yielded to the block instead of being raised
  - Root domains are specified by the article_scrape_patterns.yml file
  - An undefined root domain needs to be trained; it would be helpful to open a PR that trains the domain
  - You can train the domain by running NewsScraper::Trainer::UrlTrainer.new(URL_TO_TRAIN).train
Returns
- transformed_articles: The transformed articles fetched from the extracted sources
# File 'lib/news_scraper/scraper.rb', line 27
def scrape
  article_urls = Extractors::GoogleNewsRss.new(query: @query).extract

  transformed_articles = []
  article_urls.each do |article_url|
    payload = Extractors::Article.new(url: article_url).extract
    article_transformer = Transformers::Article.new(url: article_url, payload: payload)

    begin
      transformed_article = article_transformer.transform
      transformed_articles << transformed_article
      yield transformed_article if block_given?
    rescue Transformers::ScrapePatternNotDefined => e
      raise e unless block_given?
      yield e
    end
  end

  transformed_articles
end
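The yield-or-raise behaviour of #scrape can be tried without network access using a rough sketch like the one below. Everything in it is a hypothetical stand-in: `StandInScraper`, the top-level `ScrapePatternNotDefined` class, the `TRAINED_DOMAINS` list and the example URLs are assumptions made for illustration, and only the control flow is copied from the method listing.

```ruby
# Hypothetical stand-ins: they mimic the control flow of #scrape only.
# They are not the gem's extractors, transformers, or error class.
class ScrapePatternNotDefined < StandardError; end

class StandInScraper
  # Assumed list standing in for article_scrape_patterns.yml.
  TRAINED_DOMAINS = ['trained.example'].freeze

  def initialize(urls)
    @urls = urls
  end

  # Same shape as the documented method: results are collected and yielded;
  # errors are yielded when a block is given and raised otherwise.
  def scrape
    transformed_articles = []
    @urls.each do |url|
      begin
        transformed_article = transform(url)
        transformed_articles << transformed_article
        yield transformed_article if block_given?
      rescue ScrapePatternNotDefined => e
        raise e unless block_given?
        yield e
      end
    end
    transformed_articles
  end

  private

  # Fake transform: succeeds only for domains in TRAINED_DOMAINS.
  def transform(url)
    domain = url.split('/').first
    unless TRAINED_DOMAINS.include?(domain)
      raise ScrapePatternNotDefined, "no scrape pattern for #{domain}"
    end
    { url: url }
  end
end

scraper = StandInScraper.new(['trained.example/a', 'untrained.example/b'])

# With a block: articles and errors both arrive through the block,
# so the caller has to tell them apart.
articles = []
errors = []
returned = scraper.scrape do |result|
  if result.is_a?(ScrapePatternNotDefined)
    errors << result
  else
    articles << result
  end
end

# Without a block: the first undefined domain raises.
raised_without_block = begin
  scraper.scrape
  false
rescue ScrapePatternNotDefined
  true
end
```

One consequence worth noting: because errors are yielded rather than collected, the returned array holds only the successfully transformed articles, so a caller that wants to report untrained domains needs to gather them inside the block.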