Class: WaybackArchiver::URLCollector

Inherits:
Object
Defined in:
lib/wayback_archiver/url_collector.rb

Overview

Retrieve URLs from different sources.

Class Method Summary

  • .crawl(url, limit: WaybackArchiver.max_limit) ⇒ Array<String>

    Retrieve URLs by crawling.

  • .sitemap(url) ⇒ Array<String>

    Retrieve URLs from Sitemap.

Class Method Details

.crawl(url, limit: WaybackArchiver.max_limit) ⇒ Array<String>

Retrieve URLs by crawling.

Examples:

Crawl URLs defined on example.com

URLCollector.crawl('http://example.com')

Crawl URLs defined on example.com and limit the number of visited pages to 100

URLCollector.crawl('http://example.com', limit: 100)

Crawl URLs defined on example.com with no upper limit on the number of visited pages

URLCollector.crawl('http://example.com', limit: -1)
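
Crawl URLs defined on example.com and handle each URL as it is found (the method yields every page URL when a block is given, as the source below shows)

URLCollector.crawl('http://example.com') { |page_url| puts page_url }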

Parameters:

  • url (String)

    domain to crawl URLs from.

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    maximum number of pages to visit; pass -1 for no upper limit.

Yields:

  • (page_url)

    each URL (String) as it is found during the crawl, when a block is given.

Returns:

  • (Array<String>)

    URLs found during the crawl.



# File 'lib/wayback_archiver/url_collector.rb', line 28

def self.crawl(url, limit: WaybackArchiver.max_limit)
  urls = []
  start_at_url = Request.build_uri(url).to_s
  options = {
    robots: true,
    user_agent: WaybackArchiver.user_agent
  }
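  # a limit of -1 means no upper bound on the number of visited pages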
  options[:limit] = limit unless limit == -1

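  # visit the site (honoring robots.txt) and record every HTML page URL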
  Spidr.site(start_at_url, options) do |spider|
    spider.every_html_page do |page|
      page_url = page.url.to_s
      urls << page_url
      WaybackArchiver.logger.debug "Found: #{page_url}"
      yield(page_url) if block_given?
    end
  end
  urls
end
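
A minimal usage sketch (assuming the gem is installed and required as 'wayback_archiver'); it crawls a site with a page limit and streams every URL as it is discovered:

require 'wayback_archiver'

# Crawl up to 100 pages and print each URL as the crawler finds it
urls = WaybackArchiver::URLCollector.crawl('http://example.com', limit: 100) do |page_url|
  puts "Visited: #{page_url}"
end
puts "Collected #{urls.size} URLs"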

.sitemap(url) ⇒ Array<String>

Retrieve URLs from Sitemap.

Examples:

Get URLs defined in Sitemap for google.com

URLCollector.sitemap('https://google.com/sitemap.xml')

Parameters:

  • url (String)

URL to retrieve the Sitemap from.

Returns:

  • (Array<String>)

    of URLs defined in Sitemap.



# File 'lib/wayback_archiver/url_collector.rb', line 15

def self.sitemap(url)
  Sitemapper.urls(url: Request.build_uri(url))
end
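
A corresponding sketch for Sitemap-based collection (same assumption about requiring the gem); it fetches the Sitemap and prints every URL it defines:

require 'wayback_archiver'

# Fetch the Sitemap and list the URLs it defines
urls = WaybackArchiver::URLCollector.sitemap('https://example.com/sitemap.xml')
urls.each { |url| puts url }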