Class: RubyCrawl::SiteCrawler

Inherits:
Object
Defined in:
lib/rubycrawl/site_crawler.rb

Overview

Breadth-first (BFS) crawler that follows links from a start URL and skips URLs it has already visited.
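
The traversal described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `sample_links` and `bfs_crawl` are hypothetical names, and an in-memory hash stands in for real page fetches. It only shows how a queue plus a visited set gives breadth-first order without revisiting URLs, bounded by page count and depth.

```ruby
require 'set'

# Hypothetical in-memory "site": each URL maps to the links found on that page.
def sample_links
  {
    'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a' => ['https://example.com/b', 'https://example.com/c'],
    'https://example.com/b' => ['https://example.com/'],
    'https://example.com/c' => []
  }
end

# Breadth-first traversal with a visited set, bounded by page count and depth.
# Returns the URLs in the order they were processed.
def bfs_crawl(start_url, links, max_pages: 50, max_depth: 3)
  visited = Set.new([start_url])
  queue = [[start_url, 0]]
  order = []

  until queue.empty? || order.size >= max_pages
    url, depth = queue.shift
    order << url
    next if depth >= max_depth # do not follow links beyond the depth limit

    links.fetch(url, []).each do |link|
      # Set#add? returns nil when the link is already present, so repeats are skipped.
      queue << [link, depth + 1] if visited.add?(link)
    end
  end

  order
end

bfs_crawl('https://example.com/', sample_links)
```

Marking a URL as visited when it is enqueued, rather than when it is processed, keeps the same URL from entering the queue twice.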

Defined Under Namespace

Classes: PageResult

Instance Method Summary

Constructor Details

#initialize(client, options = {}) ⇒ SiteCrawler

Returns a new instance of SiteCrawler.



# File 'lib/rubycrawl/site_crawler.rb', line 31

def initialize(client, options = {})
  @client = client
  @max_pages = options.fetch(:max_pages, 50)
  @max_depth = options.fetch(:max_depth, 3)
  @same_host_only = options.fetch(:same_host_only, true)
  @wait_until = options.fetch(:wait_until, nil)
  @block_resources = options.fetch(:block_resources, nil)
  @visited = Set.new
  @queue = []
end
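
The constructor reads every option with `Hash#fetch` and a default. The sketch below isolates that pattern to show its edge cases. `resolve_options` is a hypothetical helper written for illustration, not part of RubyCrawl, and it covers only three of the options.

```ruby
# Mirrors how the constructor reads its options hash.
# `resolve_options` is a hypothetical helper, not part of RubyCrawl.
def resolve_options(options = {})
  {
    max_pages: options.fetch(:max_pages, 50),
    max_depth: options.fetch(:max_depth, 3),
    same_host_only: options.fetch(:same_host_only, true)
  }
end

resolve_options                          # every key falls back to its default
resolve_options(max_pages: 10)           # one override, the rest stay default
resolve_options(same_host_only: false)   # false is kept (unlike `options[:k] || true`)
resolve_options(max_depth: nil)          # key is present, so nil is kept instead of 3
resolve_options('max_pages' => 10)       # string key is not matched, default 50 applies
```

Because `fetch` only falls back when the key is absent, passing an explicit `nil` overrides the default, and options must use symbol keys to take effect.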

Instance Method Details

#crawl(start_url, &block) ⇒ Object

Raises:

  • (ArgumentError) if no block is given
  • (ConfigurationError) if start_url cannot be normalized


# File 'lib/rubycrawl/site_crawler.rb', line 42

def crawl(start_url, &block)
  raise ArgumentError, 'Block required for site crawl' unless block_given?

  normalized = UrlNormalizer.normalize(start_url)
  raise ConfigurationError, "Invalid start URL: #{start_url}" unless normalized

  @base_url = normalized
  enqueue(normalized, 0)
  process_queue(&block)
end
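
The two guard clauses in `#crawl` can be exercised in isolation with the sketch below. Everything in it is a stand-in: `ConfigurationError` is redefined locally, `normalize_start_url` is a hypothetical replacement for `UrlNormalizer.normalize` (assumed here to return `nil` for unusable input, which the real normalizer may or may not do in the same cases), and `guarded_crawl` yields the normalized URL instead of enqueueing it.

```ruby
require 'uri'

# Stand-in for the library's error class, defined here so the sketch is self-contained.
class ConfigurationError < StandardError; end

# Hypothetical replacement for UrlNormalizer.normalize: returns a String for
# http(s) URLs that have a host, or nil otherwise.
def normalize_start_url(url)
  uri = URI.parse(url.to_s.strip)
  return nil unless uri.is_a?(URI::HTTP) && uri.host && !uri.host.empty?

  uri.fragment = nil # drop "#section" so the same page is not seen as two URLs
  uri.to_s
rescue URI::InvalidURIError
  nil
end

# Same guard order as #crawl: the block check comes first, then URL validation.
def guarded_crawl(start_url)
  raise ArgumentError, 'Block required for site crawl' unless block_given?

  normalized = normalize_start_url(start_url)
  raise ConfigurationError, "Invalid start URL: #{start_url}" unless normalized

  yield normalized # the real method enqueues the URL at depth 0 and processes the queue
end

begin
  guarded_crawl('not a url') { |url| puts url }
rescue ConfigurationError => e
  warn e.message
end
```

Since the block check runs first, a call with neither a block nor a valid URL raises `ArgumentError`, not `ConfigurationError`.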