Module: Wgit::DSL
Included in: Base
Defined in: lib/wgit/dsl.rb
Overview
DSL methods that act as a wrapper around Wgit's underlying class methods. All instance vars/constants are prefixed to avoid conflicts when included.
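For example, the DSL can be included at the top level of a script or mixed into a class of your own (a minimal sketch; MyScraper is a hypothetical name):

require 'wgit'

# Top-level include makes the DSL methods available directly in a script.
include Wgit::DSL

# Or mix it into your own class (hypothetical name).
class MyScraper
  include Wgit::DSL
end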
Constant Summary

- DSL_ERROR__NO_START_URL =
  Error message shown when there's no URL to crawl.
  "missing url, pass as parameter to this or the 'start' function".freeze
Instance Method Summary

- #clear_db!(connection_string: @dsl_conn_str) ⇒ Integer
  Deletes everything in the urls and documents collections by calling Wgit::Database#clear_db underneath.
- #connection_string(conn_str) ⇒ Object
  Defines the connection string to the database used in subsequent index* method calls.
- #crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document (also: #crawl_url)
  Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath.
- #crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? (also: #crawl_r)
  Crawls an entire site using Wgit::Crawler#crawl_site underneath.
- #crawler {|crawler| ... } ⇒ Wgit::Crawler
  Initializes a Wgit::Crawler.
- #extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
  Defines an extractor using Wgit::Document.define_extractor underneath.
- #follow(xpath) ⇒ Object
  Sets the xpath to be followed when crawl_site or index_site is subsequently called.
- #index(*urls, connection_string: @dsl_conn_str, insert_externals: false) {|doc| ... } ⇒ Object
  Indexes a single webpage using Wgit::Indexer#index_url underneath.
- #index_site(*urls, connection_string: @dsl_conn_str, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer (also: #index_r)
  Indexes a single website using Wgit::Indexer#index_site underneath.
- #index_www(connection_string: @dsl_conn_str, max_sites: -1, max_data: 1_048_576_000) ⇒ Object
  Indexes the World Wide Web using Wgit::Indexer#index_www underneath.
- #last_response ⇒ Wgit::Response
  Returns the DSL's crawler#last_response.
- #reset ⇒ Object
  Nilifies the DSL instance variables.
- #search(query, connection_string: @dsl_conn_str, stream: STDOUT, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
  Performs a search of the database's indexed documents and pretty prints the results in a search engine-esque format.
- #start(*urls) {|crawler| ... } ⇒ Object (also: #start_urls)
  Sets the URL to be crawled when a crawl* or index* method is subsequently called.
Instance Method Details
#clear_db!(connection_string: @dsl_conn_str) ⇒ Integer
Deletes everything in the urls and documents collections by calling
Wgit::Database#clear_db underneath. This will nuke the entire database
so yeah... be careful.
# File 'lib/wgit/dsl.rb', line 315

def clear_db!(connection_string: @dsl_conn_str)
  db = Wgit::Database.new(connection_string)
  db.clear_db
end
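A minimal usage sketch (the connection string is illustrative and assumes a reachable development database):

require 'wgit'
include Wgit::DSL

# Wipes the urls and documents collections - development databases only!
deleted = clear_db!(connection_string: 'mongodb://localhost:27017/test')
puts "#{deleted} records deleted" # Returns an Integer.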
#connection_string(conn_str) ⇒ Object
Defines the connection string to the database used in subsequent index*
method calls. This method is optional as the connection string can be
passed to the index method instead.
# File 'lib/wgit/dsl.rb', line 170

def connection_string(conn_str)
  @dsl_conn_str = conn_str
end
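A minimal sketch (illustrative connection string), setting the string once so later calls can omit it:

require 'wgit'
include Wgit::DSL

# Subsequent index* and search calls will use this connection string
# unless one is passed to them directly.
connection_string 'mongodb://localhost:27017/test'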
#crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_url
Crawls one or more individual urls using Wgit::Crawler#crawl_url
underneath. If no urls are provided, then the start URL is used.
# File 'lib/wgit/dsl.rb', line 99

def crawl(*urls, follow_redirects: true, &block)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  urls.map! { |url| Wgit::Url.parse(url) }

  crawler.crawl_urls(*urls, follow_redirects: follow_redirects, &block)
end
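A minimal usage sketch (illustrative URL; #title is one of Wgit::Document's default extractors):

require 'wgit'
include Wgit::DSL

# Crawl a single page; the block is yielded the crawled Wgit::Document.
crawl 'https://example.com' do |doc|
  puts doc.title
end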
#crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r
Crawls an entire site using Wgit::Crawler#crawl_site underneath. If no
url is provided, then the first start URL is used.
# File 'lib/wgit/dsl.rb', line 130

def crawl_site(
  *urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  xpath = follow || :default
  opts  = {
    follow: xpath, allow_paths: allow_paths, disallow_paths: disallow_paths
  }

  urls.reduce([]) do |externals, url|
    externals + crawler.crawl_site(Wgit::Url.parse(url), **opts, &block)
  end
end
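A minimal usage sketch (the URL and path are illustrative):

require 'wgit'
include Wgit::DSL

# Crawl every reachable internal page, skipping any 'login' paths.
externals = crawl_site 'https://example.com', disallow_paths: 'login' do |doc|
  puts doc.url
end
# externals now holds the external URLs encountered (or nil).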
#crawler {|crawler| ... } ⇒ Wgit::Crawler
Initializes a Wgit::Crawler. This crawler is then used in all crawl and
index methods used by the DSL. See the Wgit::Crawler documentation for
more details.
# File 'lib/wgit/dsl.rb', line 53

def crawler
  @dsl_crawler ||= Wgit::Crawler.new
  yield @dsl_crawler if block_given?
  @dsl_crawler
end
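A minimal configuration sketch; the timeout attribute is an assumption taken from Wgit::Crawler's documentation, not defined here:

require 'wgit'
include Wgit::DSL

# Configure the single crawler instance reused by all crawl/index calls.
crawler do |c|
  c.timeout = 10 # Assumed Wgit::Crawler accessor; check its documentation.
end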
#extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines an extractor using Wgit::Document.define_extractor underneath.
# File 'lib/wgit/dsl.rb', line 43

def extract(var, xpath, opts = {}, &block)
  Wgit::Document.define_extractor(var, xpath, opts, &block)
end
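A minimal sketch defining a hypothetical :description extractor and reading it off a crawled document (the xpath and URL are illustrative):

require 'wgit'
include Wgit::DSL

# Every Wgit::Document crawled from here on responds to #description.
extract :description, '//meta[@name="description"]/@content'

crawl 'https://example.com' do |doc|
  puts doc.description
end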
#follow(xpath) ⇒ Object
Sets the xpath to be followed when crawl_site or index_site is
subsequently called. Calling this method is optional as the default is to
follow all <a> hrefs that point to the site's domain. You can also pass
follow: to the crawl/index methods directly.
# File 'lib/wgit/dsl.rb', line 80

def follow(xpath)
  @dsl_follow = xpath
end
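A minimal sketch (the xpath is illustrative), restricting a site crawl to links inside the page's main content:

require 'wgit'
include Wgit::DSL

follow "//main//a/@href" # Only these links are followed across the site.

crawl_site 'https://example.com' do |doc|
  puts doc.url
end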
#index(*urls, connection_string: @dsl_conn_str, insert_externals: false) {|doc| ... } ⇒ Object
Indexes a single webpage using Wgit::Indexer#index_url underneath.
# File 'lib/wgit/dsl.rb', line 253

def index(
  *urls, connection_string: @dsl_conn_str, insert_externals: false, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  db      = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db, crawler)

  urls.map! { |url| Wgit::Url.parse(url) }

  indexer.index_urls(*urls, insert_externals: insert_externals, &block)
end
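A minimal usage sketch (illustrative URL and connection string):

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

# Crawl the page and save it to the database; the block sees each document.
index 'https://example.com' do |doc|
  puts "Indexed #{doc.url}"
end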
#index_site(*urls, connection_string: @dsl_conn_str, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r
Indexes a single website using Wgit::Indexer#index_site underneath.
# File 'lib/wgit/dsl.rb', line 217

def index_site(
  *urls, connection_string: @dsl_conn_str, insert_externals: false,
  follow: @dsl_follow, allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  db         = Wgit::Database.new(connection_string)
  indexer    = Wgit::Indexer.new(db, crawler)
  xpath      = follow || :default
  crawl_opts = {
    insert_externals: insert_externals, follow: xpath,
    allow_paths: allow_paths, disallow_paths: disallow_paths
  }

  urls.reduce(0) do |total, url|
    total + indexer.index_site(Wgit::Url.parse(url), **crawl_opts, &block)
  end
end
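A minimal usage sketch (illustrative URL, path and connection string):

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

# Crawl and save the whole site; returns the total number of pages indexed.
total = index_site 'https://example.com', disallow_paths: 'login'
puts "#{total} pages indexed"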
#index_www(connection_string: @dsl_conn_str, max_sites: -1, max_data: 1_048_576_000) ⇒ Object
Indexes the World Wide Web using Wgit::Indexer#index_www underneath.
# File 'lib/wgit/dsl.rb', line 186

def index_www(
  connection_string: @dsl_conn_str, max_sites: -1, max_data: 1_048_576_000
)
  db      = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db, crawler)

  indexer.index_www(max_sites: max_sites, max_data: max_data)
end
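A minimal usage sketch (illustrative connection string), capping the crawl at ten sites rather than the unlimited default of -1:

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

index_www max_sites: 10 # max_data defaults to 1_048_576_000 bytes (~1GB).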
#last_response ⇒ Wgit::Response
Returns the DSL's crawler#last_response.
# File 'lib/wgit/dsl.rb', line 150

def last_response
  crawler.last_response
end
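A minimal sketch (illustrative URL; #status is assumed from Wgit::Response's documentation):

require 'wgit'
include Wgit::DSL

crawl 'https://example.com'
puts last_response.status # HTTP status code of the most recent crawl.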
#reset ⇒ Object
Nilifies the DSL instance variables.
# File 'lib/wgit/dsl.rb', line 155

def reset
  @dsl_crawler  = nil
  @dsl_start    = nil
  @dsl_follow   = nil
  @dsl_conn_str = nil
end
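A minimal sketch showing a reset between two unrelated crawl configurations:

require 'wgit'
include Wgit::DSL

start 'https://example.com'
crawl { |doc| puts doc.title }

reset # Clears the start URL(s), follow xpath, crawler and connection string.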
#search(query, connection_string: @dsl_conn_str, stream: STDOUT, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Performs a search of the database's indexed documents and pretty prints
the results in a search engine-esque format. See Wgit::Database#search!
and Wgit::Document#search! for details of how the search works.
# File 'lib/wgit/dsl.rb', line 287

def search(
  query, connection_string: @dsl_conn_str, stream: STDOUT,
  case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, &block
)
  stream ||= File.open(File::NULL, 'w')
  db = Wgit::Database.new(connection_string)

  results = db.search!(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip,
    sentence_limit: sentence_limit,
    &block
  )

  Wgit::Utils.printf_search_results(results, stream: stream)

  results
end
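A minimal usage sketch (illustrative query and connection string). Passing stream: nil suppresses the pretty printing, since the method swaps nil for a File::NULL stream:

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

results = search 'ruby web crawler', limit: 5, stream: nil
results.each { |doc| puts doc.url }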
#start(*urls) {|crawler| ... } ⇒ Object Also known as: start_urls
Sets the URL to be crawled when a crawl* or index* method is
subsequently called. Calling this is optional as the URL can be
passed to the method instead. You can also omit the url param and just
use the block to configure the crawler instead.
# File 'lib/wgit/dsl.rb', line 68

def start(*urls, &block)
  crawler(&block)
  @dsl_start = urls
end
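A minimal usage sketch (illustrative URL); the subsequent crawl call needs no URL of its own:

require 'wgit'
include Wgit::DSL

start 'https://example.com'

crawl do |doc| # Crawls the start URL.
  puts doc.title
end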