Method: Wgit::DSL#crawl_site

Defined in:
lib/wgit/dsl.rb

#crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire site using Wgit::Crawler#crawl_site underneath. If no url is provided, then the start URL(s) set via the #start DSL method are used.

Parameters:

  • urls (*String, *Wgit::Url)

    The base URL(s) of the website(s) to be crawled. It is recommended that each URL be the index page of its site, to give a greater chance of finding all pages within that site/host. Defaults to the start URL(s).

  • follow (String) (defaults to: @dsl_follow)

    The xpath used to extract the links that are followed during the crawl; changing it changes how the site is traversed. Only links pointing to the site's domain are followed. The :default is any <a> href returning HTML. This can also be set beforehand using the #follow DSL method.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the followed links, keeping only those whose path matches (via File.fnmatch?) one of the allow_paths patterns; see the sketch after this list.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the followed links, rejecting any whose path matches (via File.fnmatch?) one of the disallow_paths patterns.
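
For reference, a minimal sketch of how Ruby's File.fnmatch? glob matching behaves when filtering link paths. The patterns and paths below are illustrative assumptions, not output from Wgit itself:

File.fnmatch?('blog/*', 'blog/post-1') # => true
File.fnmatch?('blog/*', 'about')       # => false
File.fnmatch?('*.html', 'index.html')  # => true

# e.g. crawl_site(allow_paths: 'blog/*') would follow only links whose path
# matches 'blog/*'; disallow_paths applies the same matching but rejects hits.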

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external URLs collected from all of the site's pages, or nil if the given url could not be crawled successfully.

Raises:

  • (StandardError)

    If no url is provided and no start URL has been set.
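
Examples:

A minimal usage sketch of this DSL method. The URL, xpath and path pattern below are illustrative assumptions, not part of the documented API:

require 'wgit'

include Wgit::DSL

start  'http://example.com' # assumed index URL of the site to crawl
follow '//a/@href'          # assumed link xpath (similar to the :default)

external_urls = crawl_site(allow_paths: 'blog/*') do |doc|
  next if doc.empty? # skip pages that couldn't be crawled/parsed
  puts doc.url       # each doc is a Wgit::Document for one crawled page
end

external_urls then holds the unique external URLs collected from the site (or nil, as documented above).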



# File 'lib/wgit/dsl.rb', line 130

def crawl_site(
  *urls, follow: @dsl_follow,
  allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  xpath = follow || :default
  opts  = { follow: xpath, allow_paths:, disallow_paths: }

  urls.reduce([]) do |externals, url|
    externals + get_crawler.crawl_site(Wgit::Url.parse(url), **opts, &block)
  end
end
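
For comparison, a rough sketch of the underlying (non-DSL) call that this method wraps, using Wgit::Crawler#crawl_site directly with its default options and an assumed URL:

require 'wgit'

crawler = Wgit::Crawler.new
crawler.crawl_site(Wgit::Url.new('http://example.com')) do |doc|
  # handle each crawled Wgit::Document here
end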