Class: SiteMapper::Crawler

Inherits: Object
Defined in:
lib/site_mapper/crawler.rb

Overview

Crawls a given site.

Defined Under Namespace

Classes: CrawlQueue

Constant Summary

OPTIONS =

Default options

{
  sleep_length: 0.5,
  max_requests: Float::INFINITY
}
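
The defaults above are combined with any options passed to the constructor via OPTIONS.dup.merge(options), so user-supplied keys override the defaults. A minimal sketch of that merge (the custom values below are illustrative, not part of the gem):

defaults = { sleep_length: 0.5, max_requests: Float::INFINITY }
custom   = { sleep_length: 1, user_agent: 'MyUserAgent' } # illustrative values
merged   = defaults.dup.merge(custom)
# => { sleep_length: 1, max_requests: Float::INFINITY, user_agent: 'MyUserAgent' }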

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(url, options = {}) ⇒ Crawler

Returns a new instance of Crawler.

Examples:

Create crawler with custom User-Agent

Crawler.new('example.com', user_agent: 'MyUserAgent')

Create crawler and sleep 1 second between each request

Crawler.new('example.com', sleep_length: 1)

Create crawler and perform max 3 requests

Crawler.new('example.com', max_requests: 3)

Parameters:

  • url (String)

    base url for crawler

  • options (Hash) (defaults to: {})

    options hash; recognized keys are :user_agent, :sleep_length and :max_requests (see OPTIONS for defaults)



# File 'lib/site_mapper/crawler.rb', line 21

def initialize(url, options = {})
  @base_url    = Request.resolve_url(url)
  @options     = OPTIONS.dup.merge(options)
  @user_agent  = @options.fetch(:user_agent)
  @crawl_url   = CrawlUrl.new(@base_url)
  @fetch_queue = CrawlQueue.new
  @processed   = Set.new
  @robots      = nil
end

Class Method Details

.collect_urls(*args) ⇒ Array

See documentation for the instance variant of this method.

Returns:

  • (Array)

    with links.

See Also:

  • #collect_urls


# File 'lib/site_mapper/crawler.rb', line 34

def self.collect_urls(*args)
  new(*args).collect_urls { |url| yield(url) }
end
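
A hedged usage sketch of this class-level variant, which builds a Crawler internally and requires a block ('example.com' and the printed message are placeholders):

SiteMapper::Crawler.collect_urls('example.com', max_requests: 10) do |url|
  puts "Found: #{url}"
end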

Instance Method Details

#collect_urls ⇒ Array

Collects all links found on the given domain.

Examples:

URLs for example.com

crawler = Crawler.new('example.com')
crawler.collect_urls

URLs for example.com with block (executes in its own thread)

crawler = Crawler.new('example.com')
crawler.collect_urls do |new_url|
  puts "New URL found: #{new_url}"
end

Returns:

  • (Array)

    with links.



# File 'lib/site_mapper/crawler.rb', line 48

def collect_urls
  @fetch_queue << @crawl_url.resolved_base_url
  until @fetch_queue.empty? || @processed.length >= @options[:max_requests]
    url = @fetch_queue.pop
    yield(url)
    page_urls_for(url)
  end
  result = @processed + @fetch_queue
  Logger.log "Crawling finished:"
  Logger.log "Processed links: #{@processed.length}"
  Logger.log "Found links:     #{result.length}"
  result.to_a
rescue Interrupt, IRB::Abort
  Logger.err_log 'Crawl interrupted.'
  @fetch_queue.to_a
end
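
Since the return value is an Array containing both processed and still-queued links, a capped crawl can be used to sample a site; a short sketch assuming 'example.com' as the target:

crawler = SiteMapper::Crawler.new('example.com', max_requests: 3)
urls = crawler.collect_urls { |url| puts "Visiting: #{url}" }
puts "Collected #{urls.length} URLs in total"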