Module: Html2rss::AutoSource::Scraper

Defined in:
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

The Scraper module groups the scrapers available for extracting articles. Each scraper should implement a call method that returns an array of article hashes, and an articles? method that returns true if the scraper can potentially extract articles from the given HTML.
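To make the interface concrete, here is a minimal sketch of a class with the two methods described above. Everything in it is hypothetical: HeadingLinkScraper is not part of html2rss, a plain String stands in for the parsed document so the example runs without Nokogiri, and the constructor signature and the :url and :title hash keys are assumptions, not the library's documented article shape.

```ruby
# Hypothetical scraper illustrating the articles?/call interface.
# NOT part of html2rss; a plain String stands in for the parsed document.
class HeadingLinkScraper
  # Deliberately naive pattern: <h2><a href="...">title</a></h2>
  PATTERN = %r{<h2>\s*<a href="([^"]+)">([^<]+)</a>\s*</h2>}

  # Class-level check, called with the body before any extraction happens.
  def self.articles?(parsed_body)
    parsed_body.match?(PATTERN)
  end

  # Assumed constructor: keeps the body for the later call.
  def initialize(parsed_body)
    @parsed_body = parsed_body
  end

  # Returns an array of article hashes (keys are assumed for illustration).
  def call
    @parsed_body.scan(PATTERN).map do |url, title|
      { url: url, title: title.strip }
    end
  end
end

html = '<h2><a href="/a">First</a></h2><h2><a href="/b">Second</a></h2>'
articles = HeadingLinkScraper.articles?(html) ? HeadingLinkScraper.new(html).call : []
```

Keeping articles? cheap and side-effect free matters in this design, since it may be evaluated for every registered scraper before any of them is asked to extract anything.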

Defined Under Namespace

Classes: Html, NoScraperFound, Schema, SemanticHtml

Constant Summary

SCRAPERS =
[
  Html,
  Schema,
  SemanticHtml
].freeze

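Class Method Summary

  • .from(parsed_body) ⇒ Array<Class>

    Returns an array of scrapers that claim to find articles in the parsed body.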

Class Method Details

.from(parsed_body) ⇒ Array<Class>

Returns an array of scrapers that claim to find articles in the parsed body.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

Returns:

  • (Array<Class>)

    An array of scraper classes that can handle the parsed body.

Raises:

  • (NoScraperFound)

    If no scraper claims to find articles in the parsed body.

# File 'lib/html2rss/auto_source/scraper.rb', line 26

def self.from(parsed_body)
  scrapers = SCRAPERS.select { |scraper| scraper.articles?(parsed_body) }
  raise NoScraperFound, 'No suitable scraper found for URL.' if scrapers.empty?

  scrapers
end
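The selection logic above can be exercised in isolation. The following sketch reproduces the same select-then-raise pattern with stand-ins so it runs without html2rss or Nokogiri: DemoScraper, ArticleTags, and JsonLd are hypothetical names, the bodies are plain Strings rather than Nokogiri documents, and NoScraperFound is assumed here to inherit from StandardError (its real superclass is not shown in this page).

```ruby
# Stand-in module mirroring the shape of .from; not the html2rss implementation.
module DemoScraper
  # Assumed superclass for this sketch only.
  class NoScraperFound < StandardError; end

  # Hypothetical scraper: claims any body containing an <article> tag.
  class ArticleTags
    def self.articles?(parsed_body)
      parsed_body.include?('<article')
    end
  end

  # Hypothetical scraper: claims any body containing a JSON-LD script type.
  class JsonLd
    def self.articles?(parsed_body)
      parsed_body.include?('application/ld+json')
    end
  end

  SCRAPERS = [ArticleTags, JsonLd].freeze

  # Same pattern as the method above: keep every scraper that claims the body.
  def self.from(parsed_body)
    scrapers = SCRAPERS.select { |scraper| scraper.articles?(parsed_body) }
    raise NoScraperFound, 'No suitable scraper found for URL.' if scrapers.empty?

    scrapers
  end
end

# A body that one scraper claims.
matched = DemoScraper.from('<article><h1>Hello</h1></article>')

# A body that no scraper claims: callers need to rescue the error.
fallback =
  begin
    DemoScraper.from('<p>nothing to see</p>')
  rescue DemoScraper::NoScraperFound => e
    warn "Skipping page: #{e.message}"
    []
  end
```

Because the method raises instead of returning an empty array, a caller that wants to skip unsupported pages has to rescue the error explicitly, as the second call does.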