Class: SiteMapper::Crawler
Inherits: Object
Defined in: lib/site_mapper/crawler.rb
Overview
Crawls a given site.
Defined Under Namespace
Classes: CrawlQueue
Constant Summary

OPTIONS =
Default options.
{ sleep_length: 0.5, max_requests: Float::INFINITY }
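These defaults are merged with any caller-supplied options in the constructor. A minimal sketch of that merge in plain Ruby (`user_options` here is an illustrative caller hash, not part of the gem):

```ruby
# Default options, mirroring SiteMapper::Crawler::OPTIONS.
OPTIONS = { sleep_length: 0.5, max_requests: Float::INFINITY }

# Caller-supplied options win over the defaults. Merging onto a
# copy (dup) leaves the OPTIONS constant itself untouched.
user_options = { max_requests: 100, user_agent: 'MyBot/1.0' }
merged = OPTIONS.dup.merge(user_options)

puts merged[:sleep_length]  # 0.5 (default kept)
puts merged[:max_requests]  # 100 (overridden)
```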
Class Method Summary
.collect_urls(*args) ⇒ Array
See documentation for the instance variant of this method.
Instance Method Summary
#collect_urls ⇒ Array
Collects all links found on the crawled domain.

#initialize(url, options = {}) ⇒ Crawler (constructor)
A new instance of Crawler.
Constructor Details
#initialize(url, options = {}) ⇒ Crawler
Returns a new instance of Crawler.
# File 'lib/site_mapper/crawler.rb', line 21

def initialize(url, options = {})
  @base_url    = Request.resolve_url(url)
  @options     = OPTIONS.dup.merge(options)
  @user_agent  = @options.fetch(:user_agent)
  @crawl_url   = CrawlUrl.new(@base_url)
  @fetch_queue = CrawlQueue.new
  @processed   = Set.new
  @robots      = nil
end
Class Method Details
.collect_urls(*args) ⇒ Array
See documentation for the instance variant of this method.
# File 'lib/site_mapper/crawler.rb', line 34

def self.collect_urls(*args)
  new(*args).collect_urls { |url| yield(url) }
end
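The class method is a convenience wrapper: it builds a fresh instance and delegates, forwarding the caller's block along. A generic sketch of this pattern (class and method names here are illustrative, not from SiteMapper):

```ruby
class Collector
  # Class-level convenience wrapper: construct an instance and
  # delegate, passing the caller's block through. This mirrors the
  # shape of SiteMapper::Crawler.collect_urls, which forwards its
  # block to the instance method via yield.
  def self.collect(*args, &block)
    new(*args).collect(&block)
  end

  def initialize(items)
    @items = items
  end

  def collect
    @items.each { |item| yield(item) }
    @items
  end
end

seen = []
Collector.collect([1, 2, 3]) { |i| seen << i }
# seen is now [1, 2, 3]
```

Note that, as in the original, the block is required: calling without one raises LocalJumpError at the first yield.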
Instance Method Details
#collect_urls ⇒ Array
Collects all links found on the crawled domain.
# File 'lib/site_mapper/crawler.rb', line 48

def collect_urls
  @fetch_queue << @crawl_url.resolved_base_url
  until @fetch_queue.empty? || @processed.length >= @options[:max_requests]
    url = @fetch_queue.pop
    yield(url)
    page_urls_for(url)
  end
  result = @processed + @fetch_queue
  Logger.log "Crawling finished:"
  Logger.log "Processed links: #{@processed.length}"
  Logger.log "Found links: #{result.length}"
  result.to_a
rescue Interrupt, IRB::Abort
  Logger.err_log 'Crawl interrupted.'
  @fetch_queue.to_a
end
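The loop above is a breadth-first traversal: pop a URL from the queue, yield it to the caller, enqueue the links found on that page, and stop when the queue empties or max_requests is reached. A self-contained sketch of that traversal against a hypothetical in-memory link graph (LINKS and the plain-Array queue stand in for real HTTP fetches and CrawlQueue, which are assumptions here, not gem internals):

```ruby
require 'set'

# Hypothetical link graph: each key is a URL, each value the links
# found on that page. Stands in for fetching and parsing real pages.
LINKS = {
  'a' => %w[b c],
  'b' => %w[c],
  'c' => []
}.freeze

def collect_urls(start, max_requests: Float::INFINITY)
  fetch_queue = [start]
  processed = Set.new
  until fetch_queue.empty? || processed.length >= max_requests
    url = fetch_queue.shift
    next if processed.include?(url)  # skip duplicates already visited
    processed << url
    yield(url)
    # Enqueue outgoing links we have not visited yet.
    LINKS.fetch(url, []).each do |link|
      fetch_queue << link unless processed.include?(link)
    end
  end
  # Visited pages plus anything still queued, as in the original.
  (processed + fetch_queue).to_a
end

visited = []
collect_urls('a') { |url| visited << url }
# visited == ["a", "b", "c"]
```

In the gem, duplicate suppression is presumably handled by CrawlQueue itself; the in-loop `processed.include?` check above is this sketch's substitute for that.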