Class: RubyCrawl::SiteCrawler

Inherits: Object
Defined in:
lib/rubycrawl/site_crawler.rb

Overview

BFS crawler that follows links with deduplication.

Defined Under Namespace

Classes: PageResult

Instance Method Summary collapse

Constructor Details

#initialize(client, options = {}) ⇒ SiteCrawler

Returns a new instance of SiteCrawler.



43
44
45
46
47
48
49
50
51
52
53
54
# File 'lib/rubycrawl/site_crawler.rb', line 43

# Builds a crawler around +client+ with bounded breadth-first behaviour.
#
# @param client [Object] page-fetching client driven for each request
# @param options [Hash] crawl tuning knobs
# @option options [Integer] :max_pages (50) hard cap on pages visited
# @option options [Integer] :max_depth (3) maximum link depth from the start URL
# @option options [Boolean] :same_host_only (true) restrict followed links to the start host
# @option options [Object, nil] :wait_until (nil) page-load condition (semantics owned by the client — not visible here)
# @option options [Object, nil] :block_resources (nil) resources to skip while loading (presumably; verify against client)
# @option options [Integer, nil] :max_attempts (nil) per-page retry budget, nil for client default
# @option options [Boolean] :respect_robots_txt (false) honour robots.txt rules during the crawl
def initialize(client, options = {})
  @client = client

  # Option name => default; each becomes an ivar of the same name.
  { max_pages: 50,
    max_depth: 3,
    same_host_only: true,
    wait_until: nil,
    block_resources: nil,
    max_attempts: nil,
    respect_robots_txt: false }.each do |key, fallback|
    instance_variable_set("@#{key}", options.fetch(key, fallback))
  end

  # Dedup set and BFS work queue start empty.
  @visited = Set.new
  @queue   = []
end

Instance Method Details

#crawl(start_url, &block) ⇒ Object

Raises:

  • (ArgumentError)


56
57
58
59
60
61
62
63
64
65
66
# File 'lib/rubycrawl/site_crawler.rb', line 56

# Runs a breadth-first crawl from +start_url+, yielding results to the
# required block via the queue processor.
#
# @param start_url [String] URL the crawl begins from
# @raise [ArgumentError] when no block is supplied
# @raise [ConfigurationError] when +start_url+ cannot be normalized
# @return [Object] whatever process_queue returns
def crawl(start_url, &block)
  raise ArgumentError, 'Block required for site crawl' unless block_given?

  start = UrlNormalizer.normalize(start_url)
  raise ConfigurationError, "Invalid start URL: #{start_url}" if start.nil?

  @base_url = start
  # robots.txt is only fetched when the crawler was configured to honour it.
  @robots = if @respect_robots_txt
              RobotsParser.fetch(@base_url)
            else
              nil
            end
  enqueue(start, 0)
  process_queue(&block)
end