Class: SiteMapper::Crawler

Inherits: Object
Defined in:
lib/site_mapper/crawler.rb

Overview

Crawls a given site.

Defined Under Namespace

Classes: CrawlQueue

Constant Summary

OPTIONS =

Default options

{
  sleep_length: 0.5,
  max_requests: Float::INFINITY
}
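
The defaults above are combined with any options passed to the constructor via OPTIONS.dup.merge(options), so user-supplied keys override the defaults. A minimal sketch of that merge (the custom values below are illustrative, not part of the gem):

defaults = { sleep_length: 0.5, max_requests: Float::INFINITY }
custom   = { sleep_length: 1, user_agent: 'MyUserAgent' } # illustrative values
merged   = defaults.dup.merge(custom)
# => { sleep_length: 1, max_requests: Float::INFINITY, user_agent: 'MyUserAgent' }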

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(url, options = {}) ⇒ Crawler

Returns a new instance of Crawler.

Examples:

Create crawler with custom User-Agent

Crawler.new('example.com', user_agent: 'MyUserAgent')

Create crawler and sleep 1 second between each request

Crawler.new('example.com', sleep_length: 1)

Create crawler and perform max 3 requests

Crawler.new('example.com', max_requests: 3)

Parameters:

  • url (String)

    base url for crawler

  • options (Hash) (defaults to: {})

    options hash; recognized keys are :user_agent, :sleep_length and :max_requests (see OPTIONS for defaults)



# File 'lib/site_mapper/crawler.rb', line 21

def initialize(url, options = {})
  @base_url    = Request.resolve_url(url)
  @options     = OPTIONS.dup.merge(options)
  @user_agent  = @options.fetch(:user_agent)
  @crawl_url   = CrawlUrl.new(@base_url)
  @fetch_queue = CrawlQueue.new
  @processed   = Set.new
  @robots      = nil
end

Class Method Details

.collect_urls(*args) ⇒ Array

See documentation for the instance variant of this method.

Returns:

  • (Array)

    with links.

See Also:

  • #collect_urls


# File 'lib/site_mapper/crawler.rb', line 34

def self.collect_urls(*args)
  new(*args).collect_urls { |url| yield(url) }
end
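
A hedged usage sketch of this class-level variant, which builds a Crawler internally and requires a block ('example.com' and the printed message are placeholders):

SiteMapper::Crawler.collect_urls('example.com', max_requests: 10) do |url|
  puts "Found: #{url}"
end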

Instance Method Details

#collect_urls ⇒ Array

Collects all links found on the given domain.

Examples:

URLs for example.com

crawler = Crawler.new('example.com')
crawler.collect_urls

URLs for example.com with block (executes in its own thread)

crawler = Crawler.new('example.com')
crawler.collect_urls do |new_url|
  puts "New URL found: #{new_url}"
end

Returns:

  • (Array)

    with links.



# File 'lib/site_mapper/crawler.rb', line 48

def collect_urls
  @fetch_queue << @crawl_url.resolved_base_url
  until @fetch_queue.empty? || @processed.length >= @options[:max_requests]
    url = @fetch_queue.pop
    yield(url)
    page_urls_for(url)
  end
  result = @processed + @fetch_queue
  Logger.log "Crawling finished:"
  Logger.log "Processed links: #{@processed.length}"
  Logger.log "Found links:     #{result.length}"
  result.to_a
rescue Interrupt, IRB::Abort
  Logger.err_log 'Crawl interrupted.'
  @fetch_queue.to_a
end
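
Since the return value is an Array containing both processed and still-queued links, a capped crawl can be used to sample a site; a short sketch assuming 'example.com' as the target:

crawler = SiteMapper::Crawler.new('example.com', max_requests: 3)
urls = crawler.collect_urls { |url| puts "Visiting: #{url}" }
puts "Collected #{urls.length} URLs in total"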