Method: Wgit::DSL#crawl_site

Defined in:
lib/wgit/dsl.rb

#crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire site using Wgit::Crawler#crawl_site underneath. If no url is provided, then the start URL(s) set via the #start DSL method are used.

Parameters:

  • urls (*String, *Wgit::Url)

    The base URL(s) of the website(s) to be crawled. It is recommended that each URL be the index page of its site, to give a greater chance of finding all pages within that site/host. Defaults to the start URL(s).

  • follow (String) (defaults to: @dsl_follow)

    The xpath used to extract the links that are followed during the crawl; changing it changes how the site is traversed. Only links pointing to the site's domain are followed. The :default is any <a> href returning HTML. This can also be set beforehand using the #follow DSL method.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the followed links, keeping only those whose path matches (via File.fnmatch?) one of the allow_paths patterns; see the sketch after this list.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the followed links, rejecting any whose path matches (via File.fnmatch?) one of the disallow_paths patterns.
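
For reference, a minimal sketch of how Ruby's File.fnmatch? glob matching behaves when filtering link paths. The patterns and paths below are illustrative assumptions, not output from Wgit itself:

File.fnmatch?('blog/*', 'blog/post-1') # => true
File.fnmatch?('blog/*', 'about')       # => false
File.fnmatch?('*.html', 'index.html')  # => true

# e.g. crawl_site(allow_paths: 'blog/*') would follow only links whose path
# matches 'blog/*'; disallow_paths applies the same matching but rejects hits.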

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external URLs collected from all of the site's pages, or nil if the given url could not be crawled successfully.

Raises:

  • (StandardError)

    If no url is provided and no start URL has been set.
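
Examples:

A minimal usage sketch of this DSL method. The URL, xpath and path pattern below are illustrative assumptions, not part of the documented API:

require 'wgit'

include Wgit::DSL

start  'http://example.com' # assumed index URL of the site to crawl
follow '//a/@href'          # assumed link xpath (similar to the :default)

external_urls = crawl_site(allow_paths: 'blog/*') do |doc|
  next if doc.empty? # skip pages that couldn't be crawled/parsed
  puts doc.url       # each doc is a Wgit::Document for one crawled page
end

external_urls then holds the unique external URLs collected from the site (or nil, as documented above).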



# File 'lib/wgit/dsl.rb', line 130

def crawl_site(
  *urls, follow: @dsl_follow,
  allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  xpath = follow || :default
  opts  = { follow: xpath, allow_paths:, disallow_paths: }

  urls.reduce([]) do |externals, url|
    externals + get_crawler.crawl_site(Wgit::Url.parse(url), **opts, &block)
  end
end
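
For comparison, a rough sketch of the underlying (non-DSL) call that this method wraps, using Wgit::Crawler#crawl_site directly with its default options and an assumed URL:

require 'wgit'

crawler = Wgit::Crawler.new
crawler.crawl_site(Wgit::Url.new('http://example.com')) do |doc|
  # handle each crawled Wgit::Document here
end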