Class: Arachnid2

Inherits:
Object
Defined in:
lib/arachnid2.rb,
lib/arachnid2/version.rb

Constant Summary

MAX_CRAWL_TIME = 600

META: About the origins of this crawling approach

The crawler borrows heavily from Arachnid. Original: github.com/dchuk/Arachnid. Other iterations I've borrowed liberally from:

- https://github.com/matstc/Arachnid
- https://github.com/intrigueio/Arachnid
- https://github.com/jhulme/Arachnid

And this was originally written as part of Tellurion's bot: github.com/samnissen/tellurion_bot
BASE_CRAWL_TIME = 15
MAX_URLS = 10000
BASE_URLS = 50
DEFAULT_LANGUAGE = "en-IE, en-UK;q=0.9, en-NL;q=0.8, en-MT;q=0.7, en-LU;q=0.6, en;q=0.5, *;0.4"
DEFAULT_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15"
DEFAULT_NON_HTML_EXTENSIONS = {
  3 => ['.gz'],
  4 => ['.jpg', '.png', '.m4a', '.mp3', '.mp4', '.pdf', '.zip',
        '.wmv', '.gif', '.doc', '.xls', '.pps', '.ppt', '.tar',
        '.iso', '.dmg', '.bin', '.ics', '.exe', '.wav', '.mid'],
  5 => ['.xlsx', '.docx', '.pptx', '.tiff', '.zipx'],
  8 => ['.torrent']
}
MEMORY_USE_FILE = "/sys/fs/cgroup/memory/memory.usage_in_bytes"
MEMORY_LIMIT_FILE = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
DEFAULT_MAXIMUM_LOAD_RATE = 79.9
DEFAULT_TIMEOUT = 10_000
MINIMUM_TIMEOUT = 1
MAXIMUM_TIMEOUT = 999_999
VERSION = "0.1.4"

Instance Method Summary

Constructor Details

#initialize(url) ⇒ Arachnid2

Creates the object to execute the crawl

Examples:

url = "https://daringfireball.net"
spider = Arachnid2.new(url)

Parameters:

  • url (String)


# File 'lib/arachnid2.rb', line 57

def initialize(url)
  @url = url
  @domain = Adomain[@url]
end
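
The constructor stores the raw URL alongside the domain extracted by the Adomain gem, which is presumably what scopes the crawl to a single site. A hedged illustration of that extraction (the exact return values are an assumption based on Adomain's documentation):

require 'adomain'

Adomain["https://daringfireball.net"]   # => "daringfireball.net" (assumed)
Adomain["https://www.github.com/dchuk"] # => "www.github.com" (assumed)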

Instance Method Details

#crawl(opts = {}) ⇒ Object

Visits a URL, gathering links and visiting them in turn, until it runs out of time, memory, or attempts.

Examples:

url = "https://daringfireball.net"
spider = Arachnid2.new(url)

opts = {
  :followlocation => true,
  :timeout => 25000,
  :time_box => 30,
  :headers => {
    'Accept-Language' => "en-UK",
    'User-Agent' => "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
  },
  :memory_limit => 89.99,
  :proxy => {
    :ip => "1.2.3.4",
    :port => "1234",
    :username => "sam",
    :password => "coolcoolcool",
  },
  :non_html_extensions => {
    3 => [".abc", ".xyz"],
    4 => [".abcd"],
    6 => [".abcdef"],
    11 => [".abcdefghijk"]
  }
}
responses = []
spider.crawl(opts) { |response|
  responses << response
}
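
The block receives each Typhoeus::Response as the crawl proceeds, so pages can be processed inline instead of accumulated. A minimal sketch, assuming Nokogiri is available for parsing; the title-collection logic is illustrative and not part of the gem:

require 'arachnid2'
require 'nokogiri'

spider = Arachnid2.new("https://daringfireball.net")

titles = {}
spider.crawl(:time_box => 30) do |response|
  # response is a Typhoeus::Response; #effective_url and #body are
  # standard Typhoeus accessors.
  doc = Nokogiri::HTML(response.body)
  titles[response.effective_url] = doc.at_css("title")&.text
end

titles.each { |url, title| puts "#{title} -- #{url}" }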

Parameters:

  • opts (Hash) (defaults to: {})

Returns:

  • nil



# File 'lib/arachnid2.rb', line 101

def crawl(opts = {})
  preflight(opts)

  until @global_queue.empty?
    @max_concurrency.times do
      q = @global_queue.shift

      break if @global_visited.size >= @crawl_options[:max_urls]
      break if Time.now > @crawl_options[:time_limit]
      break if memory_danger?

      @global_visited.insert(q)

      request = Typhoeus::Request.new(q, request_options)

      request.on_complete do |response|
        links = process(response)
        next unless links

        yield response

        vacuum(links, response)
      end

      @hydra.queue(request)
    end # @max_concurrency.times do

    @hydra.run
  end # until @global_queue.empty?

ensure
  @cookie_file.close! if @cookie_file
end
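
Each pass of the outer loop shifts up to @max_concurrency URLs off the queue, registers a Typhoeus::Request for each, and then runs the whole batch concurrently with @hydra.run. The crawl stops early once the visited count reaches max_urls, the time limit passes, or memory_danger? reports the process is near its memory ceiling; otherwise vacuum feeds newly discovered links back into @global_queue (presumably after filtering them against the visited set and the non-HTML extension lists), which is what keeps the loop running until the queue empties.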