Module: Wgit::DSL
Included in: Base
Defined in: lib/wgit/dsl.rb
Overview
DSL methods that act as a wrapper around Wgit's underlying class methods. All instance vars/constants are prefixed to avoid conflicts when included.
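For example, the DSL can be included at the top level of a script or mixed into a class of your own (a minimal sketch; MyScraper is a hypothetical name):

require 'wgit'

# Top-level include makes the DSL methods available directly in a script.
include Wgit::DSL

# Or mix it into your own class (hypothetical name).
class MyScraper
  include Wgit::DSL
end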
Constant Summary

- DSL_ERROR__NO_START_URL =
  Error message shown when there's no URL to crawl.
  "missing url, pass as parameter to this or the 'start' function".freeze
Instance Method Summary

- #clear_db!(connection_string: @dsl_conn_str) ⇒ Integer
  Deletes everything in the urls and documents collections by calling Wgit::Database#clear_db underneath.
- #connection_string(conn_str) ⇒ Object
  Defines the connection string to the database used in subsequent index* method calls.
- #crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document (also: #crawl_url)
  Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath.
- #crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? (also: #crawl_r)
  Crawls an entire site using Wgit::Crawler#crawl_site underneath.
- #crawler {|crawler| ... } ⇒ Wgit::Crawler
  Initializes a Wgit::Crawler.
- #extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
  Defines an extractor using Wgit::Document.define_extractor underneath.
- #follow(xpath) ⇒ Object
  Sets the xpath to be followed when crawl_site or index_site is subsequently called.
- #index(*urls, connection_string: @dsl_conn_str, insert_externals: false) {|doc| ... } ⇒ Object
  Indexes a single webpage using Wgit::Indexer#index_url underneath.
- #index_site(*urls, connection_string: @dsl_conn_str, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer (also: #index_r)
  Indexes a single website using Wgit::Indexer#index_site underneath.
- #index_www(connection_string: @dsl_conn_str, max_sites: -1, max_data: 1_048_576_000) ⇒ Object
  Indexes the World Wide Web using Wgit::Indexer#index_www underneath.
- #last_response ⇒ Wgit::Response
  Returns the DSL's crawler#last_response.
- #reset ⇒ Object
  Nilifies the DSL instance variables.
- #search(query, connection_string: @dsl_conn_str, stream: STDOUT, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
  Performs a search of the database's indexed documents and pretty prints the results in a search engine-esque format.
- #start(*urls) {|crawler| ... } ⇒ Object (also: #start_urls)
  Sets the URL to be crawled when a crawl* or index* method is subsequently called.
Instance Method Details
#clear_db!(connection_string: @dsl_conn_str) ⇒ Integer
Deletes everything in the urls and documents collections by calling
Wgit::Database#clear_db underneath. This will nuke the entire database
so yeah... be careful.
# File 'lib/wgit/dsl.rb', line 315

def clear_db!(connection_string: @dsl_conn_str)
  db = Wgit::Database.new(connection_string)
  db.clear_db
end
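A minimal usage sketch (the connection string is illustrative and assumes a reachable development database):

require 'wgit'
include Wgit::DSL

# Wipes the urls and documents collections - development databases only!
deleted = clear_db!(connection_string: 'mongodb://localhost:27017/test')
puts "#{deleted} records deleted" # Returns an Integer.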
#connection_string(conn_str) ⇒ Object
Defines the connection string to the database used in subsequent index*
method calls. This method is optional as the connection string can be
passed to the index method instead.
# File 'lib/wgit/dsl.rb', line 170

def connection_string(conn_str)
  @dsl_conn_str = conn_str
end
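A minimal sketch (illustrative connection string), setting the string once so later calls can omit it:

require 'wgit'
include Wgit::DSL

# Subsequent index* and search calls will use this connection string
# unless one is passed to them directly.
connection_string 'mongodb://localhost:27017/test'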
#crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_url
Crawls one or more individual urls using Wgit::Crawler#crawl_url
underneath. If no urls are provided, then the start URL is used.
# File 'lib/wgit/dsl.rb', line 99

def crawl(*urls, follow_redirects: true, &block)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  urls.map! { |url| Wgit::Url.parse(url) }

  crawler.crawl_urls(*urls, follow_redirects: follow_redirects, &block)
end
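A minimal usage sketch (illustrative URL; #title is one of Wgit::Document's default extractors):

require 'wgit'
include Wgit::DSL

# Crawl a single page; the block is yielded the crawled Wgit::Document.
crawl 'https://example.com' do |doc|
  puts doc.title
end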
#crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r
Crawls an entire site using Wgit::Crawler#crawl_site underneath. If no
url is provided, then the first start URL is used.
# File 'lib/wgit/dsl.rb', line 130

def crawl_site(
  *urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  xpath = follow || :default
  opts  = {
    follow: xpath, allow_paths: allow_paths, disallow_paths: disallow_paths
  }

  urls.reduce([]) do |externals, url|
    externals + crawler.crawl_site(Wgit::Url.parse(url), **opts, &block)
  end
end
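A minimal usage sketch (the URL and path are illustrative):

require 'wgit'
include Wgit::DSL

# Crawl every reachable internal page, skipping any 'login' paths.
externals = crawl_site 'https://example.com', disallow_paths: 'login' do |doc|
  puts doc.url
end
# externals now holds the external URLs encountered (or nil).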
#crawler {|crawler| ... } ⇒ Wgit::Crawler
Initializes a Wgit::Crawler. This crawler is then used in all crawl and
index methods used by the DSL. See the Wgit::Crawler documentation for
more details.
# File 'lib/wgit/dsl.rb', line 53

def crawler
  @dsl_crawler ||= Wgit::Crawler.new
  yield @dsl_crawler if block_given?
  @dsl_crawler
end
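A minimal configuration sketch; the timeout attribute is an assumption taken from Wgit::Crawler's documentation, not defined here:

require 'wgit'
include Wgit::DSL

# Configure the single crawler instance reused by all crawl/index calls.
crawler do |c|
  c.timeout = 10 # Assumed Wgit::Crawler accessor; check its documentation.
end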
#extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines an extractor using Wgit::Document.define_extractor underneath.
# File 'lib/wgit/dsl.rb', line 43

def extract(var, xpath, opts = {}, &block)
  Wgit::Document.define_extractor(var, xpath, opts, &block)
end
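A minimal sketch defining a hypothetical :description extractor and reading it off a crawled document (the xpath and URL are illustrative):

require 'wgit'
include Wgit::DSL

# Every Wgit::Document crawled from here on responds to #description.
extract :description, '//meta[@name="description"]/@content'

crawl 'https://example.com' do |doc|
  puts doc.description
end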
#follow(xpath) ⇒ Object
Sets the xpath to be followed when crawl_site or index_site is
subsequently called. Calling this method is optional as the default is to
follow all <a> hrefs that point to the site's domain. You can also pass
follow: to the crawl/index methods directly.
# File 'lib/wgit/dsl.rb', line 80

def follow(xpath)
  @dsl_follow = xpath
end
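A minimal sketch (the xpath is illustrative), restricting a site crawl to links inside the page's main content:

require 'wgit'
include Wgit::DSL

follow "//main//a/@href" # Only these links are followed across the site.

crawl_site 'https://example.com' do |doc|
  puts doc.url
end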
#index(*urls, connection_string: @dsl_conn_str, insert_externals: false) {|doc| ... } ⇒ Object
Indexes a single webpage using Wgit::Indexer#index_url underneath.
# File 'lib/wgit/dsl.rb', line 253

def index(
  *urls, connection_string: @dsl_conn_str, insert_externals: false, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  db      = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db, crawler)

  urls.map! { |url| Wgit::Url.parse(url) }

  indexer.index_urls(*urls, insert_externals: insert_externals, &block)
end
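A minimal usage sketch (illustrative URL and connection string):

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

# Crawl the page and save it to the database; the block sees each document.
index 'https://example.com' do |doc|
  puts "Indexed #{doc.url}"
end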
#index_site(*urls, connection_string: @dsl_conn_str, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r
Indexes a single website using Wgit::Indexer#index_site underneath.
# File 'lib/wgit/dsl.rb', line 217

def index_site(
  *urls, connection_string: @dsl_conn_str, insert_externals: false,
  follow: @dsl_follow, allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  db         = Wgit::Database.new(connection_string)
  indexer    = Wgit::Indexer.new(db, crawler)
  xpath      = follow || :default
  crawl_opts = {
    insert_externals: insert_externals, follow: xpath,
    allow_paths: allow_paths, disallow_paths: disallow_paths
  }

  urls.reduce(0) do |total, url|
    total + indexer.index_site(Wgit::Url.parse(url), **crawl_opts, &block)
  end
end
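A minimal usage sketch (illustrative URL, path and connection string):

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

# Crawl and save the whole site; returns the total number of pages indexed.
total = index_site 'https://example.com', disallow_paths: 'login'
puts "#{total} pages indexed"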
#index_www(connection_string: @dsl_conn_str, max_sites: -1, max_data: 1_048_576_000) ⇒ Object
Indexes the World Wide Web using Wgit::Indexer#index_www underneath.
# File 'lib/wgit/dsl.rb', line 186

def index_www(
  connection_string: @dsl_conn_str, max_sites: -1, max_data: 1_048_576_000
)
  db      = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db, crawler)

  indexer.index_www(max_sites: max_sites, max_data: max_data)
end
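A minimal usage sketch (illustrative connection string), capping the crawl at ten sites rather than the unlimited default of -1:

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

index_www max_sites: 10 # max_data defaults to 1_048_576_000 bytes (~1GB).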
#last_response ⇒ Wgit::Response
Returns the DSL's crawler#last_response.
# File 'lib/wgit/dsl.rb', line 150

def last_response
  crawler.last_response
end
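A minimal sketch (illustrative URL; #status is assumed from Wgit::Response's documentation):

require 'wgit'
include Wgit::DSL

crawl 'https://example.com'
puts last_response.status # HTTP status code of the most recent crawl.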
#reset ⇒ Object
Nilifies the DSL instance variables.
# File 'lib/wgit/dsl.rb', line 155

def reset
  @dsl_crawler  = nil
  @dsl_start    = nil
  @dsl_follow   = nil
  @dsl_conn_str = nil
end
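A minimal sketch showing a reset between two unrelated crawl configurations:

require 'wgit'
include Wgit::DSL

start 'https://example.com'
crawl { |doc| puts doc.title }

reset # Clears the start URL(s), follow xpath, crawler and connection string.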
#search(query, connection_string: @dsl_conn_str, stream: STDOUT, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Performs a search of the database's indexed documents and pretty prints
the results in a search engine-esque format. See Wgit::Database#search!
and Wgit::Document#search! for details of how the search works.
# File 'lib/wgit/dsl.rb', line 287

def search(
  query, connection_string: @dsl_conn_str, stream: STDOUT,
  case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, &block
)
  stream ||= File.open(File::NULL, 'w')
  db = Wgit::Database.new(connection_string)

  results = db.search!(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip,
    sentence_limit: sentence_limit,
    &block
  )

  Wgit::Utils.printf_search_results(results, stream: stream)

  results
end
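A minimal usage sketch (illustrative query and connection string). Passing stream: nil suppresses the pretty printing, since the method swaps nil for a File::NULL stream:

require 'wgit'
include Wgit::DSL

connection_string 'mongodb://localhost:27017/test'

results = search 'ruby web crawler', limit: 5, stream: nil
results.each { |doc| puts doc.url }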
#start(*urls) {|crawler| ... } ⇒ Object Also known as: start_urls
Sets the URL to be crawled when a crawl* or index* method is
subsequently called. Calling this is optional as the URL can be
passed to the method instead. You can also omit the url param and just
use the block to configure the crawler instead.
# File 'lib/wgit/dsl.rb', line 68

def start(*urls, &block)
  crawler(&block)
  @dsl_start = urls
end
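A minimal usage sketch (illustrative URL); the subsequent crawl call needs no URL of its own:

require 'wgit'
include Wgit::DSL

start 'https://example.com'

crawl do |doc| # Crawls the start URL.
  puts doc.title
end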