Class: SiteMapper::CrawlUrl

Inherits:
Object
  • Object
show all
Defined in:
lib/site_mapper/crawl_url.rb

Overview

Crawl URL formatter.

Constant Summary collapse

TOO_MANY_REQUEST_MSG =

Too many request error message

"You're being challenged with a 'too many requests' captcha"

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(base_url) ⇒ CrawlUrl

Initialize CrawlUrl

Examples:

Intitialize CrawlUrl with example.com as base_url

CrawlUrl.new('example.com')

Parameters:

  • base_url (String)


13
14
15
16
# File 'lib/site_mapper/crawl_url.rb', line 13

def initialize(base_url)
  @resolved_base_url = Request.resolve_url(base_url)
  @base_hostname     = URI.parse(@resolved_base_url).hostname
end

Instance Attribute Details

#resolved_base_urlObject (readonly)

Returns the value of attribute resolved_base_url.



4
5
6
# File 'lib/site_mapper/crawl_url.rb', line 4

def resolved_base_url
  @resolved_base_url
end

Instance Method Details

#absolute_url_from(page_url, current_url) ⇒ String

Given a link it constructs the absolute path, if valid URL & URL has same domain as @resolved_base_url.

Examples:

Construct absolute URL for ‘/path’, example.com

cu = CrawlUrl.new('example.com')
cu.absolute_url_from('/path', 'example.com/some/path')
# => http://example.com/some/path

Parameters:

  • page_url (String)

    url found on page

  • current_url (String)

    current page url

Returns:

  • (String)

    with absolute path to resource



27
28
29
30
31
32
# File 'lib/site_mapper/crawl_url.rb', line 27

def absolute_url_from(page_url, current_url)
  return unless eligible_url?(page_url)
  parsed_uri = URI.join(current_url, page_url) rescue return
  return unless parsed_uri.hostname == @base_hostname
  parsed_uri.to_s
end