Module: WaybackArchiver
Defined in:
- lib/wayback_archiver.rb
- lib/wayback_archiver/archive.rb
- lib/wayback_archiver/request.rb
- lib/wayback_archiver/sitemap.rb
- lib/wayback_archiver/version.rb
- lib/wayback_archiver/response.rb
- lib/wayback_archiver/http_code.rb
- lib/wayback_archiver/sitemapper.rb
- lib/wayback_archiver/null_logger.rb
- lib/wayback_archiver/thread_pool.rb
- lib/wayback_archiver/url_collector.rb
- lib/wayback_archiver/archive_result.rb
- lib/wayback_archiver/adapters/wayback_machine.rb
Overview
WaybackArchiver sends URLs to the Wayback Machine. URLs can be collected by crawling a site, by reading its sitemap, or by passing a list of URLs directly.
Defined Under Namespace
Classes: Archive, ArchiveResult, HTTPCode, NullLogger, Request, Response, Sitemap, Sitemapper, ThreadPool, URLCollector, WaybackMachine
Constant Summary
- INFO_LINK = 'https://rubygems.org/gems/wayback_archiver'.freeze
  Link to gem on rubygems.org, part of the sent User-Agent
- USER_AGENT = "WaybackArchiver/#{WaybackArchiver::VERSION} (+#{INFO_LINK})".freeze
  WaybackArchiver User-Agent
- DEFAULT_CONCURRENCY = 5
  Default concurrency for archiving URLs
- DEFAULT_MAX_LIMIT = -1
  Maximum number of links posted (-1 is no limit)
- VERSION = '1.2.1'.freeze
  Gem version
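To make the USER_AGENT interpolation concrete, the following standalone sketch rebuilds the string from local copies of the constants listed above. The module name `UserAgentSketch` is hypothetical and exists only so the snippet runs without the gem installed.

```ruby
# Standalone sketch: local copies of the documented constants.
# UserAgentSketch is a hypothetical stand-in, not part of the gem.
module UserAgentSketch
  VERSION = '1.2.1'.freeze
  INFO_LINK = 'https://rubygems.org/gems/wayback_archiver'.freeze
  USER_AGENT = "WaybackArchiver/#{VERSION} (+#{INFO_LINK})".freeze
end

puts UserAgentSketch::USER_AGENT
# prints: WaybackArchiver/1.2.1 (+https://rubygems.org/gems/wayback_archiver)
```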
Class Method Summary
- .adapter ⇒ Object, #call
  Returns the configured adapter.
- .adapter=(adapter) ⇒ Object, #call
  Sets the adapter.
- .archive(source, legacy_strategy = nil, strategy: :auto, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
  Send URLs to Wayback Machine.
- .auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
  Look for Sitemap(s) and, if none are found, fall back to crawling.
- .concurrency ⇒ Integer
  Returns the default concurrency.
- .concurrency=(concurrency) ⇒ Integer
  Sets the default concurrency.
- .crawl(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
  Crawl site for URLs to send to the Wayback Machine.
- .default_logger! ⇒ NullLogger
  Resets the logger to the default.
- .logger ⇒ Object
  Returns the current logger.
- .logger=(logger) ⇒ Object
  Sets the logger.
- .max_limit ⇒ Integer
  Returns the default max_limit.
- .max_limit=(max_limit) ⇒ Integer
  Sets the default max_limit.
- .sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
  Get URLs from sitemap and send found URLs to the Wayback Machine.
- .urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
  Send one or more URLs to the Wayback Machine.
- .user_agent ⇒ String
  Returns the configured user agent.
- .user_agent=(user_agent) ⇒ String
  Sets the user agent.
Class Method Details
.adapter ⇒ Object, #call
Returns the configured adapter
# File 'lib/wayback_archiver.rb', line 197
def self.adapter
  @adapter ||= WaybackMachine
end
.adapter=(adapter) ⇒ Object, #call
Sets the adapter
# File 'lib/wayback_archiver.rb', line 187
def self.adapter=(adapter)
  unless adapter.respond_to?(:call)
    raise(ArgumentError, 'adapter must implement #call')
  end

  @adapter = adapter
end
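The setter's only requirement is that the adapter responds to `#call`. The sketch below copies that check into a hypothetical stand-in module, `AdapterSketch`, so it runs without the gem. The documentation above does not state which arguments an adapter receives or what it should return, so the single `url` argument and the string return values are assumptions made for illustration.

```ruby
# Hypothetical stand-in that copies the getter/setter logic from the
# listings above, so the example runs without the gem installed.
module AdapterSketch
  # Assumed default; the real default is the WaybackMachine class.
  DEFAULT_ADAPTER = ->(url) { "default:#{url}" }

  def self.adapter=(adapter)
    unless adapter.respond_to?(:call)
      raise(ArgumentError, 'adapter must implement #call')
    end

    @adapter = adapter
  end

  def self.adapter
    @adapter ||= DEFAULT_ADAPTER
  end
end

# A lambda responds to #call, so it is accepted.
AdapterSketch.adapter = ->(url) { "archived:#{url}" }
puts AdapterSketch.adapter.call('http://example.com')

# A class with a singleton .call method is accepted as well.
class RecordingAdapter
  def self.call(url)
    "recorded:#{url}"
  end
end
AdapterSketch.adapter = RecordingAdapter

# A plain String does not respond to #call and is rejected,
# leaving the previously configured adapter in place.
begin
  AdapterSketch.adapter = 'not callable'
rescue ArgumentError => e
  puts e.message
end
```

Because the check is duck-typed, lambdas, `Method` objects, and classes or modules exposing `.call` all qualify.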
.archive(source, legacy_strategy = nil, strategy: :auto, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Send URLs to Wayback Machine.
# File 'lib/wayback_archiver.rb', line 46
def self.archive(source, legacy_strategy = nil, strategy: :auto, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  strategy = legacy_strategy || strategy

  case strategy.to_s
  when 'crawl' then crawl(source, concurrency: concurrency, limit: limit, &block)
  when 'auto' then auto(source, concurrency: concurrency, limit: limit, &block)
  when 'sitemap' then sitemap(source, concurrency: concurrency, limit: limit, &block)
  when 'urls' then urls(source, concurrency: concurrency, limit: limit, &block)
  when 'url' then urls(source, concurrency: concurrency, limit: limit, &block)
  else
    raise ArgumentError, "Unknown strategy: '#{strategy}'. Allowed strategies: sitemap, urls, url, crawl"
  end
end
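The strategy resolution in `.archive` can be sketched in isolation. In the listing, a positional `legacy_strategy` takes precedence over the `strategy:` keyword, and the result is compared as a string, so symbols and strings are interchangeable. The helper `resolve_strategy` and the constant `ALLOWED_STRATEGIES` below are hypothetical names introduced for this illustration; they are not part of the gem.

```ruby
# Hypothetical helper mirroring the strategy resolution in .archive:
# the positional legacy argument wins over the strategy: keyword, and
# the chosen value is normalized with #to_s before matching.
ALLOWED_STRATEGIES = %w[crawl auto sitemap urls url].freeze

def resolve_strategy(legacy_strategy = nil, strategy: :auto)
  chosen = (legacy_strategy || strategy).to_s
  unless ALLOWED_STRATEGIES.include?(chosen)
    raise ArgumentError, "Unknown strategy: '#{chosen}'"
  end

  chosen
end

puts resolve_strategy                          # prints: auto
puts resolve_strategy(strategy: 'sitemap')     # prints: sitemap
puts resolve_strategy(:crawl, strategy: :urls) # prints: crawl
```

With the gem loaded, the corresponding calls would take the form `WaybackArchiver.archive('example.com', strategy: :sitemap)` or, in the legacy positional style, `WaybackArchiver.archive('example.com', :sitemap)`. Note that the listing accepts `auto` even though its error message does not name it among the allowed strategies.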
.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Looks for Sitemap(s) and, if none are found, falls back to crawling, then sends the found URLs to the Wayback Machine. In the source shown here, limit is accepted but not forwarded to urls or crawl.
# File 'lib/wayback_archiver.rb', line 72
def self.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  urls = Sitemapper.autodiscover(source)
  return urls(urls, concurrency: concurrency, &block) if urls.any?

  crawl(source, concurrency: concurrency, &block)
end
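The "sitemap first, crawl as fallback" flow can be illustrated without network access. In this sketch the discovery and crawl steps are injected as lambdas; `collect_urls`, `discover:`, and `crawl:` are hypothetical names standing in for `Sitemapper.autodiscover` and the crawl step, and the returned tag is added only to show which branch ran.

```ruby
# Hypothetical sketch of the fallback flow in .auto. The discover and
# crawl lambdas are stand-ins; nothing here calls the gem or the network.
def collect_urls(source, discover:, crawl:)
  found = discover.call(source)
  # Use the sitemap result when it yields at least one URL.
  return [:sitemap, found] if found.any?

  # Otherwise fall back to crawling the source.
  [:crawl, crawl.call(source)]
end

with_sitemap = ->(_source) { ['http://example.com/', 'http://example.com/about'] }
no_sitemap = ->(_source) { [] }
crawler = ->(source) { ["#{source}/crawled"] }

p collect_urls('http://example.com', discover: with_sitemap, crawl: crawler)
p collect_urls('http://example.com', discover: no_sitemap, crawl: crawler)
```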
.concurrency ⇒ Integer
Returns the default concurrency
# File 'lib/wayback_archiver.rb', line 167
def self.concurrency
  @concurrency ||= DEFAULT_CONCURRENCY
end
.concurrency=(concurrency) ⇒ Integer
Sets the default concurrency
# File 'lib/wayback_archiver.rb', line 161
def self.concurrency=(concurrency)
  @concurrency = concurrency
end
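The getter and setter pair above uses a memoized default (`||=`), and the same pattern appears in `.max_limit` and `.user_agent`. One consequence is that assigning `nil` restores the default on the next read. The sketch below reproduces the pattern in a hypothetical stand-in module, `ConfigSketch`, so it runs without the gem.

```ruby
# Hypothetical stand-in reproducing the memoized-default pattern from
# the concurrency getter and setter listings.
module ConfigSketch
  DEFAULT_CONCURRENCY = 5

  def self.concurrency=(concurrency)
    @concurrency = concurrency
  end

  def self.concurrency
    # ||= assigns the default whenever the stored value is nil or false.
    @concurrency ||= DEFAULT_CONCURRENCY
  end
end

puts ConfigSketch.concurrency # prints: 5
ConfigSketch.concurrency = 10
puts ConfigSketch.concurrency # prints: 10
ConfigSketch.concurrency = nil
puts ConfigSketch.concurrency # prints: 5
```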
.crawl(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Crawl site for URLs to send to the Wayback Machine.
# File 'lib/wayback_archiver.rb', line 89
def self.crawl(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Crawling #{url}"
  Archive.crawl(url, concurrency: concurrency, limit: limit, &block)
end
.default_logger! ⇒ NullLogger
Resets the logger to the default
# File 'lib/wayback_archiver.rb', line 141
def self.default_logger!
  @logger = NullLogger.new
end
.logger ⇒ Object
Returns the current logger
# File 'lib/wayback_archiver.rb', line 135
def self.logger
  @logger ||= NullLogger.new
end
.logger=(logger) ⇒ Object
Sets the logger
# File 'lib/wayback_archiver.rb', line 129
def self.logger=(logger)
  @logger = logger
end
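The listings on this page call `logger.info`, so any object with an `info` method can plausibly serve as a logger; Ruby's standard `Logger` is the obvious candidate. The sketch below wires a `Logger` into a hypothetical stand-in module, `LoggerSketch`. `SilentLogger` is an assumed substitute for `NullLogger`, whose implementation is not shown in this section.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in mirroring the logger accessors listed on this page.
module LoggerSketch
  # Assumed substitute for NullLogger: a Logger that discards all output.
  class SilentLogger < Logger
    def initialize
      super(nil) # a nil log device means nothing is written
    end
  end

  def self.logger=(logger)
    @logger = logger
  end

  def self.logger
    @logger ||= SilentLogger.new
  end

  def self.default_logger!
    @logger = SilentLogger.new
  end
end

# Capture log output in memory instead of writing to a file or STDOUT.
buffer = StringIO.new
LoggerSketch.logger = Logger.new(buffer)
LoggerSketch.logger.info 'Crawling http://example.com'
puts buffer.string

# Resetting swaps the silent logger back in; this message is discarded.
LoggerSketch.default_logger!
LoggerSketch.logger.info 'discarded'
```

With the gem loaded, the equivalent configuration would presumably be `WaybackArchiver.logger = Logger.new($stdout)`.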
.max_limit ⇒ Integer
Returns the default max_limit
# File 'lib/wayback_archiver.rb', line 180
def self.max_limit
  @max_limit ||= DEFAULT_MAX_LIMIT
end
.max_limit=(max_limit) ⇒ Integer
Sets the default max_limit
# File 'lib/wayback_archiver.rb', line 174
def self.max_limit=(max_limit)
  @max_limit = max_limit
end
.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Get URLs from sitemap and send found URLs to the Wayback Machine.
# File 'lib/wayback_archiver.rb', line 105
def self.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Fetching Sitemap"
  Archive.post(URLCollector.sitemap(url), concurrency: concurrency, limit: limit, &block)
end
.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Send one or more URLs to the Wayback Machine. In the source shown here, limit is accepted but not forwarded to Archive.post.
# File 'lib/wayback_archiver.rb', line 120
def self.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  Archive.post(Array(urls), concurrency: concurrency, &block)
end
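The `Array(urls)` call in the listing is what lets `.urls` accept either a single URL or a list. A short illustration of that core Ruby coercion, using placeholder URLs:

```ruby
# Kernel#Array wraps a single value, leaves arrays unchanged,
# and turns nil into an empty array.
single = Array('http://example.com')
list = Array(['http://a.example', 'http://b.example'])
nothing = Array(nil)

p single  # prints: ["http://example.com"]
p list    # prints: ["http://a.example", "http://b.example"]
p nothing # prints: []
```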
.user_agent ⇒ String
Returns the configured user agent
# File 'lib/wayback_archiver.rb', line 154
def self.user_agent
  @user_agent ||= USER_AGENT
end
.user_agent=(user_agent) ⇒ String
Sets the user agent
# File 'lib/wayback_archiver.rb', line 148
def self.user_agent=(user_agent)
  @user_agent = user_agent
end