Module: WaybackArchiver

Defined in:
lib/wayback_archiver.rb,
lib/wayback_archiver/archive.rb,
lib/wayback_archiver/request.rb,
lib/wayback_archiver/sitemap.rb,
lib/wayback_archiver/version.rb,
lib/wayback_archiver/response.rb,
lib/wayback_archiver/http_code.rb,
lib/wayback_archiver/sitemapper.rb,
lib/wayback_archiver/null_logger.rb,
lib/wayback_archiver/thread_pool.rb,
lib/wayback_archiver/url_collector.rb,
lib/wayback_archiver/archive_result.rb,
lib/wayback_archiver/adapters/wayback_machine.rb

Overview

WaybackArchiver sends URLs to the Wayback Machine. URLs can be collected by crawling a site, by reading its Sitemap, or by passing a list of URLs.

Defined Under Namespace

Classes: Archive, ArchiveResult, HTTPCode, NullLogger, Request, Response, Sitemap, Sitemapper, ThreadPool, URLCollector, WaybackMachine

Constant Summary

INFO_LINK =

'https://rubygems.org/gems/wayback_archiver'.freeze
USER_AGENT =

WaybackArchiver User-Agent

"WaybackArchiver/#{WaybackArchiver::VERSION} (+#{INFO_LINK})".freeze
DEFAULT_CONCURRENCY =

Default concurrency for archiving URLs

5
DEFAULT_MAX_LIMIT =

Maximum number of links posted (-1 is no limit)

-1
VERSION =

Gem version

'1.2.1'.freeze
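
As a small illustration of how the constants above combine, the sketch below rebuilds the User-Agent string inside a hypothetical module (`UserAgentSketch` is not part of the gem; the values are copied from this page):

```ruby
# Hypothetical stand-in module; the constant values are copied from the
# Constant Summary above.
module UserAgentSketch
  VERSION = '1.2.1'.freeze
  INFO_LINK = 'https://rubygems.org/gems/wayback_archiver'.freeze
  # Same interpolation as the documented USER_AGENT constant.
  USER_AGENT = "WaybackArchiver/#{VERSION} (+#{INFO_LINK})".freeze
end

user_agent = UserAgentSketch::USER_AGENT
# => "WaybackArchiver/1.2.1 (+https://rubygems.org/gems/wayback_archiver)"
```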

Class Method Details

.adapter ⇒ Object, #call

Returns the configured adapter

Returns:

  • (Object, #call)

    the configured or the default adapter



# File 'lib/wayback_archiver.rb', line 197

def self.adapter
  @adapter ||= WaybackMachine
end

.adapter=(adapter) ⇒ Object, #call

Sets the adapter

Parameters:

  • adapter (Object, #call)

    the adapter

Returns:

  • (Object, #call)

    the configured adapter



# File 'lib/wayback_archiver.rb', line 187

def self.adapter=(adapter)
  unless adapter.respond_to?(:call)
    raise(ArgumentError, 'adapter must implement #call')
  end

  @adapter = adapter
end
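
Because the setter only checks for `#call`, any callable object can serve as an adapter. The following is a minimal sketch using a hypothetical stand-in module (`AdapterConfig`) rather than the gem itself. It assumes the adapter is called with a single URL; the actual call signature is not shown on this page.

```ruby
# Hypothetical stand-in that mirrors the guard in WaybackArchiver.adapter=.
module AdapterConfig
  def self.adapter=(adapter)
    raise(ArgumentError, 'adapter must implement #call') unless adapter.respond_to?(:call)

    @adapter = adapter
  end

  def self.adapter
    @adapter
  end
end

# A lambda responds to #call, so it passes the check.
# Assumption: the adapter receives the URL to archive as its only argument.
recording_adapter = ->(url) { "archived #{url}" }
AdapterConfig.adapter = recording_adapter
adapter_output = AdapterConfig.adapter.call('example.com')

# A plain String does not respond to #call and is rejected.
rejected = begin
  AdapterConfig.adapter = 'not callable'
  false
rescue ArgumentError
  true
end
```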

.archive(source, legacy_strategy = nil, strategy: :auto, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Send URLs to Wayback Machine.

Examples:

Archive example.com with the default :auto strategy (look for Sitemaps, fall back to crawling)

WaybackArchiver.archive('example.com') # Default strategy is :auto
WaybackArchiver.archive('example.com', strategy: :auto)
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :auto, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :auto)

Crawl example.com and send all URLs of the same domain

WaybackArchiver.archive('example.com', strategy: :crawl)
WaybackArchiver.archive('example.com', strategy: :crawl, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :crawl, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :crawl)

Send example.com Sitemap URLs

WaybackArchiver.archive('example.com', strategy: :sitemap)
WaybackArchiver.archive('example.com', strategy: :sitemap, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :sitemap, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :sitemap)

Send only example.com

WaybackArchiver.archive('example.com', strategy: :url)
WaybackArchiver.archive('example.com', strategy: :url, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :url, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :url)

Parameters:

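  • source (String/Array<String>)

    the URL(s) to archive.

  • legacy_strategy (String/Symbol) (defaults to: nil)

    positional form of strategy; takes precedence over strategy when given.

  • strategy (String/Symbol) (defaults to: :auto)

    how to treat source. Supported strategies: crawl, sitemap, url, urls, auto.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    maximum number of URLs to send (-1 is no limit).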

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 46

def self.archive(source, legacy_strategy = nil, strategy: :auto, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  strategy = legacy_strategy || strategy

  case strategy.to_s
  when 'crawl'   then crawl(source, concurrency: concurrency, limit: limit, &block)
  when 'auto'    then auto(source, concurrency: concurrency, limit: limit, &block)
  when 'sitemap' then sitemap(source, concurrency: concurrency, limit: limit, &block)
  when 'urls'    then urls(source, concurrency: concurrency, limit: limit, &block)
  when 'url'     then urls(source, concurrency: concurrency, limit: limit, &block)
  else
    raise ArgumentError, "Unknown strategy: '#{strategy}'. Allowed strategies: sitemap, urls, url, crawl, auto"
  end
end
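
The strategy lookup above can be sketched in isolation. `resolve_strategy` and `KNOWN_STRATEGIES` are hypothetical names introduced for illustration; they only mirror how the positional (legacy) argument takes precedence and how Strings and Symbols are treated alike via `to_s`.

```ruby
# Hypothetical helper mirroring the first lines of WaybackArchiver.archive.
KNOWN_STRATEGIES = %w[crawl auto sitemap urls url].freeze

def resolve_strategy(legacy_strategy = nil, strategy: :auto)
  # The positional (legacy) argument wins over the keyword argument.
  chosen = (legacy_strategy || strategy).to_s
  return chosen if KNOWN_STRATEGIES.include?(chosen)

  raise ArgumentError, "Unknown strategy: '#{chosen}'"
end

default_choice = resolve_strategy                          # falls back to :auto
keyword_choice = resolve_strategy(strategy: 'sitemap')     # Strings work too
legacy_choice  = resolve_strategy(:crawl, strategy: :url)  # positional wins
```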

.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Look for Sitemap(s) and, if none are found, fall back to crawling. Then send the found URLs to the Wayback Machine.

Examples:

Auto archive example.com

WaybackArchiver.auto('example.com') # Default concurrency is 5

Auto archive example.com with low concurrency

WaybackArchiver.auto('example.com', concurrency: 1)

Auto archive example.com and archive max 100 URLs

WaybackArchiver.auto('example.com', limit: 100)

Parameters:

  • source (String)

    (must be a valid URL).

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 72

def self.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  urls = Sitemapper.autodiscover(source)
  return urls(urls, concurrency: concurrency, limit: limit, &block) if urls.any?

  crawl(source, concurrency: concurrency, limit: limit, &block)
end
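
The sitemap-first, crawl-second control flow can be sketched without any network access. In this hypothetical version the collaborators are injected as callables: `discover`, `post` and `crawl` stand in for `Sitemapper.autodiscover`, `WaybackArchiver.urls` and `WaybackArchiver.crawl`, and their return values are made up for the example.

```ruby
# Hypothetical sketch of the fallback logic with injected collaborators.
def auto_archive(source, discover:, post:, crawl:)
  urls = discover.call(source)
  # Prefer Sitemap URLs; crawl only when discovery finds nothing.
  return post.call(urls) if urls.any?

  crawl.call(source)
end

post  = ->(urls) { [:posted, urls] }
crawl = ->(source) { [:crawled, source] }

with_sitemap = auto_archive(
  'example.com', discover: ->(_source) { %w[example.com/a] }, post: post, crawl: crawl
)
without_sitemap = auto_archive(
  'example.com', discover: ->(_source) { [] }, post: post, crawl: crawl
)
```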

.concurrency ⇒ Integer

Returns the default concurrency

Returns:

  • (Integer)

    the configured or the default concurrency



# File 'lib/wayback_archiver.rb', line 167

def self.concurrency
  @concurrency ||= DEFAULT_CONCURRENCY
end

.concurrency=(concurrency) ⇒ Integer

Sets the default concurrency

Parameters:

  • concurrency (Integer)

    the desired default concurrency

Returns:

  • (Integer)

    the desired default concurrency



# File 'lib/wayback_archiver.rb', line 161

def self.concurrency=(concurrency)
  @concurrency = concurrency
end
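
The reader and writer pair relies on `||=`, so assigning `nil` makes the reader return the default again. A minimal sketch with a hypothetical stand-in module (`ConcurrencyConfig`) that copies the two methods:

```ruby
# Hypothetical stand-in mirroring the concurrency reader/writer pair.
module ConcurrencyConfig
  DEFAULT_CONCURRENCY = 5

  def self.concurrency
    @concurrency ||= DEFAULT_CONCURRENCY
  end

  def self.concurrency=(concurrency)
    @concurrency = concurrency
  end
end

initial = ConcurrencyConfig.concurrency     # default until something is set
ConcurrencyConfig.concurrency = 10
configured = ConcurrencyConfig.concurrency  # the explicitly set value
ConcurrencyConfig.concurrency = nil         # nil falls back to the default
reset = ConcurrencyConfig.concurrency
```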

.crawl(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Crawl site for URLs to send to the Wayback Machine.

Examples:

Crawl example.com and send all URLs of the same domain

WaybackArchiver.crawl('example.com') # Default concurrency is 5

Crawl example.com and send all URLs of the same domain with low concurrency

WaybackArchiver.crawl('example.com', concurrency: 1)

Crawl example.com and archive max 100 URLs

WaybackArchiver.crawl('example.com', limit: 100)

Parameters:

  • url (String)

    to start crawling from.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 89

def self.crawl(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Crawling #{url}"
  Archive.crawl(url, concurrency: concurrency, limit: limit, &block)
end

.default_logger! ⇒ NullLogger

Resets the logger to the default

Returns:

  • (NullLogger)



# File 'lib/wayback_archiver.rb', line 141

def self.default_logger!
  @logger = NullLogger.new
end

.logger ⇒ Object

Returns the current logger

Returns:

  • (Object)

    the current logger instance



# File 'lib/wayback_archiver.rb', line 135

def self.logger
  @logger ||= NullLogger.new
end

.logger=(logger) ⇒ Object

Set logger

Examples:

Set a logger that prints to standard out (STDOUT)

WaybackArchiver.logger = Logger.new(STDOUT)

Parameters:

  • logger (Object)

    an object that quacks like a Logger

Returns:

  • (Object)

    the set logger



# File 'lib/wayback_archiver.rb', line 129

def self.logger=(logger)
  @logger = logger
end
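
Since the setter accepts any Logger-like object, Ruby's standard `Logger` can be pointed at any IO. The sketch below uses a hypothetical stand-in module (`LoggerConfig`) and writes to a `StringIO` instead of STDOUT so the output can be inspected:

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in mirroring WaybackArchiver.logger / logger=.
module LoggerConfig
  def self.logger=(logger)
    @logger = logger
  end

  def self.logger
    @logger
  end
end

# Capture log lines in memory instead of printing them.
log_output = StringIO.new
LoggerConfig.logger = Logger.new(log_output)
LoggerConfig.logger.info 'Crawling example.com'
logged_text = log_output.string
```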

.max_limit ⇒ Integer

Returns the default max_limit

Returns:

  • (Integer)

    the configured or the default max_limit



# File 'lib/wayback_archiver.rb', line 180

def self.max_limit
  @max_limit ||= DEFAULT_MAX_LIMIT
end

.max_limit=(max_limit) ⇒ Integer

Sets the default max_limit

Parameters:

  • max_limit (Integer)

    the desired default max_limit

Returns:

  • (Integer)

    the desired default max_limit



# File 'lib/wayback_archiver.rb', line 174

def self.max_limit=(max_limit)
  @max_limit = max_limit
end
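
This page documents -1 as "no limit" but does not show how the limit is applied. The sketch below is one possible interpretation, not the gem's implementation; `apply_limit` and `NO_LIMIT` are hypothetical names.

```ruby
# Hypothetical helper: one way a caller could honour the -1 "no limit" value.
NO_LIMIT = -1

def apply_limit(urls, limit)
  # -1 means every URL is kept.
  return urls if limit == NO_LIMIT

  # Otherwise keep at most `limit` URLs, in their original order.
  urls.first(limit)
end

all_urls  = %w[example.com/a example.com/b example.com/c]
unlimited = apply_limit(all_urls, NO_LIMIT)
capped    = apply_limit(all_urls, 2)
```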

.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Get URLs from sitemap and send found URLs to the Wayback Machine.

Examples:

Get example.com sitemap and archive all found URLs

WaybackArchiver.sitemap('example.com/sitemap.xml') # Default concurrency is 5

Get example.com sitemap and archive all found URLs with low concurrency

WaybackArchiver.sitemap('example.com/sitemap.xml', concurrency: 1)

Get example.com sitemap and archive max 100 URLs

WaybackArchiver.sitemap('example.com/sitemap.xml', limit: 100)

Parameters:

  • url (String)

    to the sitemap.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 105

def self.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Fetching Sitemap"
  Archive.post(URLCollector.sitemap(url), concurrency: concurrency, limit: limit, &block)
end

.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Send URL(s) to the Wayback Machine.

Examples:

Archive example.com

WaybackArchiver.urls('example.com')

Archive example.com and google.com

WaybackArchiver.urls(%w(example.com google.com))

Archive example.com, max 100 URLs

WaybackArchiver.urls(%w(example.com www.example.com), limit: 100)

Parameters:

  • urls (Array<String>/String)

    a list of URLs or a single URL.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 120

def self.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  Archive.post(Array(urls), concurrency: concurrency, limit: limit, &block)
end
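
The method accepts either a single URL or a list because `Array(...)` (Ruby's `Kernel#Array`) normalizes both to an Array. A short illustration of that conversion on its own:

```ruby
# Kernel#Array wraps a single value and leaves an existing Array as-is.
single   = Array('example.com')                # => ["example.com"]
multiple = Array(%w[example.com google.com])   # => ["example.com", "google.com"]
nothing  = Array(nil)                          # => []
```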

.user_agent ⇒ String

Returns the configured user agent

Returns:

  • (String)

    the configured or the default user agent



# File 'lib/wayback_archiver.rb', line 154

def self.user_agent
  @user_agent ||= USER_AGENT
end

.user_agent=(user_agent) ⇒ String

Sets the user agent

Parameters:

  • user_agent (String)

    the desired user agent

Returns:

  • (String)

    the configured user agent



# File 'lib/wayback_archiver.rb', line 148

def self.user_agent=(user_agent)
  @user_agent = user_agent
end