Class: RubyCrawl::SiteCrawler
- Inherits: Object
- Defined in: lib/rubycrawl/site_crawler.rb
Overview
BFS crawler that follows links with deduplication.
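As a rough illustration of what "BFS with deduplication" means, the sketch below walks a small in-memory link graph breadth-first and uses a Set so each URL is queued at most once. It is not the library's implementation: the `LINKS` table, the `bfs_crawl` helper, and the choice to mark a URL as visited when it is enqueued are all assumptions made for the example. Only the `max_pages` and `max_depth` defaults are taken from the constructor.

```ruby
require 'set'

# Hypothetical in-memory "site": each URL maps to the links found on that page.
LINKS = {
  'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a' => ['https://example.com/b', 'https://example.com/'],
  'https://example.com/b' => ['https://example.com/c'],
  'https://example.com/c' => []
}.freeze

# Breadth-first traversal; the visited Set provides the deduplication.
def bfs_crawl(start_url, links, max_pages: 50, max_depth: 3)
  visited = Set.new([start_url])
  queue = [[start_url, 0]] # FIFO queue of [url, depth] pairs
  order = []

  until queue.empty? || order.size >= max_pages
    url, depth = queue.shift
    order << url
    next if depth >= max_depth # do not follow links beyond the depth limit

    links.fetch(url, []).each do |link|
      # Set#add? returns nil when the element is already present.
      queue << [link, depth + 1] if visited.add?(link)
    end
  end

  order
end

puts bfs_crawl('https://example.com/', LINKS)
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, and the back-link from `/a` to `/` is ignored instead of causing a loop.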
Defined Under Namespace
Classes: PageResult
Instance Method Summary
- #crawl(start_url, &block) ⇒ Object
- #initialize(client, options = {}) ⇒ SiteCrawler (constructor): A new instance of SiteCrawler.
Constructor Details
#initialize(client, options = {}) ⇒ SiteCrawler
Returns a new instance of SiteCrawler.
    # File 'lib/rubycrawl/site_crawler.rb', line 31
    def initialize(client, options = {})
      @client = client
      @max_pages = options.fetch(:max_pages, 50)
      @max_depth = options.fetch(:max_depth, 3)
      @same_host_only = options.fetch(:same_host_only, true)
      @wait_until = options.fetch(:wait_until, nil)
      @block_resources = options.fetch(:block_resources, nil)
      @visited = Set.new
      @queue = []
    end
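The constructor reads every setting with `Hash#fetch` and a default, so a key that is absent falls back to the default, while a key that is present keeps its value even when that value is `nil`. The sketch below isolates that behaviour. `resolve_options` is a hypothetical helper written for this example, not part of RubyCrawl; the option names and defaults are copied from the constructor above.

```ruby
# Hypothetical helper that mirrors only the option handling in #initialize.
def resolve_options(options = {})
  {
    max_pages: options.fetch(:max_pages, 50),
    max_depth: options.fetch(:max_depth, 3),
    same_host_only: options.fetch(:same_host_only, true),
    wait_until: options.fetch(:wait_until, nil),
    block_resources: options.fetch(:block_resources, nil)
  }
end

# No options given: every setting takes its default.
p resolve_options

# Overriding some keys leaves the others at their defaults.
p resolve_options(max_pages: 10, same_host_only: false)

# A key that is present but nil is NOT replaced by the default.
p resolve_options(max_pages: nil)
```

One practical consequence: passing `max_pages: nil` is not the same as omitting `max_pages`, so callers that build the options hash dynamically may want to drop nil values first (for example with `Hash#compact`).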
Instance Method Details
#crawl(start_url, &block) ⇒ Object
    # File 'lib/rubycrawl/site_crawler.rb', line 42
    def crawl(start_url, &block)
      raise ArgumentError, 'Block required for site crawl' unless block_given?

      normalized = UrlNormalizer.normalize(start_url)
      raise ConfigurationError, "Invalid start URL: #{start_url}" unless normalized

      @base_url = normalized
      enqueue(normalized, 0)
      process_queue(&block)
    end
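The method checks its preconditions in a fixed order: a missing block raises `ArgumentError` before the URL is examined, and a URL that fails normalization raises `ConfigurationError` before anything is queued. The standalone sketch below reproduces only that guard sequence. The `normalize` method is a hypothetical stand-in built on Ruby's `URI` library (the behaviour of the real `UrlNormalizer` is not shown on this page), and `ConfigurationError` is redefined locally on the assumption that it is a `StandardError` subclass.

```ruby
require 'uri'

# Local stand-in for the library's error class (assumed StandardError subclass).
class ConfigurationError < StandardError; end

# Hypothetical normalizer: returns a String for http(s) URLs, nil otherwise.
def normalize(url)
  uri = URI.parse(url.to_s.strip)
  return nil unless uri.is_a?(URI::HTTP) && uri.host

  uri.fragment = nil # drop "#section" so equivalent pages compare equal
  uri.to_s
rescue URI::InvalidURIError
  nil
end

# Same guard order as #crawl; queueing is replaced by a single yield.
def crawl(start_url)
  raise ArgumentError, 'Block required for site crawl' unless block_given?

  normalized = normalize(start_url)
  raise ConfigurationError, "Invalid start URL: #{start_url}" unless normalized

  yield normalized
end

crawl('https://example.com/docs#intro') { |url| puts url }

begin
  crawl('not a url') { |url| puts url }
rescue ConfigurationError => e
  puts e.message
end
```

Callers of the real method would typically rescue `ConfigurationError` around the call in the same way if the start URL comes from user input.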