Method: Wgit::DSL#index_site

Defined in:
lib/wgit/dsl.rb

#index_site(*urls, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer

Also known as: index_r

Indexes one or more websites using Wgit::Indexer#index_site underneath, crawling each given base URL in turn.

Parameters:

  • urls (*String, *Wgit::Url)

    The base URL(s) of the website(s) to crawl. If omitted, the URL(s) set using start are used.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the website's external URLs into the database.

  • follow (String) (defaults to: @dsl_follow)

    The XPath used to extract the links to follow during the crawl. This changes how a site is crawled. Only links pointing to the site's domain are followed. The :default is any <a> href returning HTML. This can also be set using follow.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links, selecting only those whose path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links, rejecting those whose path matches (via File.fnmatch?) one of disallow_paths.
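
To make the allow_paths and disallow_paths filtering concrete, here is a minimal sketch using only File.fnmatch? from Ruby's core library. The filter_paths helper and the sample paths are hypothetical and are not part of Wgit; the crawler's real matching may differ in detail (for example, in the flags it passes to File.fnmatch?).

```ruby
# Hypothetical helper (not part of Wgit): approximates the allow/disallow
# path filtering described above using plain File.fnmatch?.
def filter_paths(paths, allow_paths: nil, disallow_paths: nil)
  allow    = Array(allow_paths)    # accepts a String or an Array<String>
  disallow = Array(disallow_paths)

  paths.select do |path|
    # With no allow patterns, every path is a candidate.
    allowed  = allow.empty? || allow.any? { |pattern| File.fnmatch?(pattern, path) }
    rejected = disallow.any? { |pattern| File.fnmatch?(pattern, path) }
    allowed && !rejected
  end
end

# Made-up link paths for illustration.
paths  = ['about', 'blog/2021/post', 'blog/drafts/wip', 'contact']
result = filter_paths(paths, allow_paths: 'blog/*', disallow_paths: 'blog/drafts/*')
p result
```

Note that without the File::FNM_PATHNAME flag, a * in the pattern also matches / characters, so 'blog/*' covers nested paths in this sketch.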

Yields:

  • (doc)

    Given the Wgit::Document of each crawled webpage before it is inserted into the database, allowing it to be manipulated first.

Returns:

  • (Integer)

    The total number of pages crawled across the given website(s).

Raises:

  • (StandardError)

    If no URL is provided and no start URL has been set.



# File 'lib/wgit/dsl.rb', line 208

def index_site(
  *urls, insert_externals: false, follow: @dsl_follow,
  allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  indexer    = Wgit::Indexer.new(get_db, get_crawler)
  xpath      = follow || :default
  crawl_opts = {
    insert_externals:, follow: xpath, allow_paths:, disallow_paths:
  }

  urls.reduce(0) do |total, url|
    total + indexer.index_site(Wgit::Url.parse(url), **crawl_opts, &block)
  end
end
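
The control flow above (fall back to the start URL, raise when nothing is left to crawl, then sum the per-site page counts) can be tried out without a database or network. The following is only a sketch under stated assumptions: StubIndexer, its SITES data, and the index_sites method are hypothetical stand-ins, not Wgit classes or methods.

```ruby
# Hypothetical stand-in for Wgit::Indexer; it touches no network or database.
class StubIndexer
  # Made-up page lists keyed by site URL, for illustration only.
  SITES = {
    'http://example.com' => %w[/ /about],
    'http://example.org' => %w[/ /docs /blog]
  }.freeze

  # Yields each "page" to the caller's block, then returns the page count.
  def index_site(url)
    pages = SITES.fetch(url, [])
    pages.each { |page| yield "#{url}#{page}" if block_given? }
    pages.size
  end
end

# Mirrors the shape of the method above: use the start URL(s) when no URLs
# are given, raise if there is still nothing to crawl, then sum the totals.
def index_sites(indexer, *urls, start: nil, &block)
  urls = Array(start) if urls.empty?
  raise StandardError, 'No start URL provided' if urls.empty?

  urls.reduce(0) { |total, url| total + indexer.index_site(url, &block) }
end

seen  = []
total = index_sites(StubIndexer.new, 'http://example.com', 'http://example.org') do |doc|
  seen << doc # the block runs once per crawled page
end
puts total
```

With the stub data above, the two sites contribute 2 and 3 pages respectively, so the returned total is their sum and the block is called once for each page.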