Class: WaybackArchiver::URLCollector

Inherits:
Object
Defined in:
lib/wayback_archiver/url_collector.rb

Overview

Retrieve URLs from different sources.

Class Method Summary

  • .crawl(url, limit: WaybackArchiver.max_limit) ⇒ Array<String>

    Retrieve URLs by crawling.

  • .sitemap(url) ⇒ Array<String>

    Retrieve URLs from Sitemap.

Class Method Details

.crawl(url, limit: WaybackArchiver.max_limit) ⇒ Array<String>

Retrieve URLs by crawling.

Examples:

Crawl URLs defined on example.com

URLCollector.crawl('http://example.com')

Crawl URLs defined on example.com and limit the number of visited pages to 100

URLCollector.crawl('http://example.com', limit: 100)

Crawl URLs defined on example.com with no upper limit on the number of visited pages

URLCollector.crawl('http://example.com', limit: -1)
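
Crawl URLs defined on example.com and handle each URL as it is found (the method yields every page URL when a block is given, as the source below shows)

URLCollector.crawl('http://example.com') { |page_url| puts page_url }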

Parameters:

  • url (String)

    domain to crawl URLs from.

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    maximum number of pages to visit; pass -1 for no upper limit.

Yields:

  • (page_url)

    each URL (String) as it is found during the crawl, when a block is given.

Returns:

  • (Array<String>)

    URLs found during the crawl.



# File 'lib/wayback_archiver/url_collector.rb', line 28

def self.crawl(url, limit: WaybackArchiver.max_limit)
  urls = []
  start_at_url = Request.build_uri(url).to_s
  options = {
    robots: true,
    user_agent: WaybackArchiver.user_agent
  }
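  # a limit of -1 means no upper bound on the number of visited pages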
  options[:limit] = limit unless limit == -1

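  # visit the site (honoring robots.txt) and record every HTML page URL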
  Spidr.site(start_at_url, options) do |spider|
    spider.every_html_page do |page|
      page_url = page.url.to_s
      urls << page_url
      WaybackArchiver.logger.debug "Found: #{page_url}"
      yield(page_url) if block_given?
    end
  end
  urls
end
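
A minimal usage sketch (assuming the gem is installed and required as 'wayback_archiver'); it crawls a site with a page limit and streams every URL as it is discovered:

require 'wayback_archiver'

# Crawl up to 100 pages and print each URL as the crawler finds it
urls = WaybackArchiver::URLCollector.crawl('http://example.com', limit: 100) do |page_url|
  puts "Visited: #{page_url}"
end
puts "Collected #{urls.size} URLs"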

.sitemap(url) ⇒ Array<String>

Retrieve URLs from Sitemap.

Examples:

Get URLs defined in Sitemap for google.com

URLCollector.sitemap('https://google.com/sitemap.xml')

Parameters:

  • url (String)

URL to retrieve the Sitemap from.

Returns:

  • (Array<String>)

    of URLs defined in Sitemap.



# File 'lib/wayback_archiver/url_collector.rb', line 15

def self.sitemap(url)
  Sitemapper.urls(url: Request.build_uri(url))
end
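
A corresponding sketch for Sitemap-based collection (same assumption about requiring the gem); it fetches the Sitemap and prints every URL it defines:

require 'wayback_archiver'

# Fetch the Sitemap and list the URLs it defines
urls = WaybackArchiver::URLCollector.sitemap('https://example.com/sitemap.xml')
urls.each { |url| puts url }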