Class: Wgit::Indexer
- Inherits: Object
- Defined in: lib/wgit/indexer.rb
Overview
Class which crawls URLs and saves the resulting Documents to a database. It can be thought of as a combination of Wgit::Crawler and Wgit::Database.
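A minimal usage sketch, assuming the wgit gem is installed and that Wgit::Database.new can reach a running MongoDB instance (typically configured via a connection string in the environment; see the Wgit::Database docs):

require 'wgit'

# Build an indexer using the default database and crawler.
indexer = Wgit::Indexer.new

# Crawl an entire site and store each crawled page as a Document.
indexer.index_site(Wgit::Url.new('https://example.com'))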
Instance Attribute Summary
- #crawler ⇒ Object (readonly)
  The crawler used to index the WWW.
- #db ⇒ Object (readonly) (also: #database)
  The database instance used to store Urls and Documents.
Instance Method Summary
- #index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer (also: #index_r)
  Crawls a single website's pages and stores them in the database.
- #index_url(url, insert_externals: false) {|doc| ... } ⇒ Object
  Crawls a single webpage and stores it in the database.
- #index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object (also: #index)
  Crawls one or more webpages and stores them in the database.
- #index_www(max_sites: -1, max_data: 1_048_576_000) ⇒ Object
  Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages in the database and adding their external urls to be crawled later on.
- #initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer (constructor)
  Initialize the Indexer.
- #keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)
  Returns whether or not to keep crawling based on the DB size and current loop iteration.
- #write_doc_to_db(doc) ⇒ Object (protected)
  Write the doc to the DB.
- #write_urls_to_db(urls) ⇒ Integer (protected)
  Write the urls to the DB.
Constructor Details
#initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer
Initialize the Indexer.
# File 'lib/wgit/indexer.rb', line 21

def initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new)
  @db      = database
  @crawler = crawler
end
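Both arguments have defaults, so Wgit::Indexer.new is usually sufficient. As a sketch, assuming Wgit::Database.new accepts an explicit connection string (the string below is purely hypothetical; check the Wgit::Database docs for your version):

db      = Wgit::Database.new('mongodb://localhost:27017/wgit_dev') # hypothetical connection string
crawler = Wgit::Crawler.new
indexer = Wgit::Indexer.new(db, crawler)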
Instance Attribute Details
#crawler ⇒ Object (readonly)
The crawler used to index the WWW.
# File 'lib/wgit/indexer.rb', line 11

def crawler
  @crawler
end
#db ⇒ Object (readonly) Also known as: database
The database instance used to store Urls and Documents.
# File 'lib/wgit/indexer.rb', line 14

def db
  @db
end
Instance Method Details
#index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r
Crawls a single website's pages and stores them in the database. There is no max download limit, so be careful which sites you index. Logs info on the crawl using Wgit.logger as it goes along.
# File 'lib/wgit/indexer.rb', line 112

def index_site(
  url, insert_externals: false, follow: :default,
  allow_paths: nil, disallow_paths: nil
)
  crawl_opts = {
    follow: follow,
    allow_paths: allow_paths,
    disallow_paths: disallow_paths
  }
  total_pages_indexed = 0

  ext_urls = @crawler.crawl_site(url, **crawl_opts) do |doc|
    result = block_given? ? yield(doc) : true

    if result && !doc.empty?
      write_doc_to_db(doc)
      total_pages_indexed += 1
    end
  end

  @db.upsert(url)

  if insert_externals && ext_urls
    num_inserted_urls = write_urls_to_db(ext_urls)
    Wgit.logger.info("Found and saved #{num_inserted_urls} external url(s)")
  end

  Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
for the site: #{url}")

  total_pages_indexed
end
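A hedged example call, assuming an initialised indexer and the default Wgit::Document#title extractor; returning false from the block skips saving that particular page:

count = indexer.index_site(
  Wgit::Url.new('https://example.com'), insert_externals: true
) do |doc|
  # Only index pages whose title mentions Ruby; others are crawled but not saved.
  doc.title.to_s.downcase.include?('ruby')
end
puts "Indexed #{count} page(s)"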
#index_url(url, insert_externals: false) {|doc| ... } ⇒ Object
Crawls a single webpage and stores it in the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along.
# File 'lib/wgit/indexer.rb', line 177

def index_url(url, insert_externals: false)
  document = @crawler.crawl_url(url) do |doc|
    result = block_given? ? yield(doc) : true
    write_doc_to_db(doc) if result && !doc.empty?
  end

  @db.upsert(url)

  ext_urls = document&.external_links
  if insert_externals && ext_urls
    num_inserted_urls = write_urls_to_db(ext_urls)
    Wgit.logger.info("Found and saved #{num_inserted_urls} external url(s)")
  end

  nil
end
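For example (a sketch assuming an initialised indexer; the block behaves as in #index_site):

url = Wgit::Url.new('https://example.com/about')
indexer.index_url(url, insert_externals: false) do |doc|
  puts "Crawled #{doc.url}: #{doc.title}"
  true # save the document
end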
#index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index
Crawls one or more webpages and stores them in the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along.
# File 'lib/wgit/indexer.rb', line 157

def index_urls(*urls, insert_externals: false, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { insert_externals: insert_externals }
  Wgit::Utils.each(urls) { |url| index_url(url, **opts, &block) }

  nil
end
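For example, indexing several known pages in one call (a sketch assuming an initialised indexer):

pages = [
  Wgit::Url.new('https://example.com/'),
  Wgit::Url.new('https://example.com/contact')
]
indexer.index_urls(*pages) # equivalent to indexer.index(*pages) via the alias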
#index_www(max_sites: -1, max_data: 1_048_576_000) ⇒ Object
Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages in the database and adding their external urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along.
# File 'lib/wgit/indexer.rb', line 38

def index_www(max_sites: -1, max_data: 1_048_576_000)
  if max_sites.negative?
    Wgit.logger.info("Indexing until the database has been filled or it \
runs out of urls to crawl (which might be never).")
  end
  site_count = 0

  while keep_crawling?(site_count, max_sites, max_data)
    Wgit.logger.info("Current database size: #{@db.size}")

    uncrawled_urls = @db.uncrawled_urls(limit: 100)

    if uncrawled_urls.empty?
      Wgit.logger.info('No urls to crawl, exiting.')

      return
    end
    Wgit.logger.info("Starting crawl loop for: #{uncrawled_urls}")

    docs_count = 0
    urls_count = 0

    uncrawled_urls.each do |url|
      unless keep_crawling?(site_count, max_sites, max_data)
        Wgit.logger.info("Reached max number of sites to crawl or \
database capacity, exiting.")

        return
      end
      site_count += 1

      site_docs_count = 0
      ext_links = @crawler.crawl_site(url) do |doc|
        unless doc.empty?
          write_doc_to_db(doc)
          docs_count += 1
          site_docs_count += 1
        end
      end

      raise 'Error updating url' unless @db.update(url) == 1

      urls_count += write_urls_to_db(ext_links)
    end

    Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
url(s) overall for this iteration.")
    Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
the next iteration.")

    nil
  end
end
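A bounded run might look like the following sketch; with the defaults (max_sites: -1, max_data: roughly 1 GB) the loop only stops when the database fills up or no uncrawled urls remain:

# Crawl at most 5 uncrawled sites from the database, stopping early if
# the database grows beyond roughly 100 MB.
indexer.index_www(max_sites: 5, max_data: 104_857_600)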
#keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)
Returns whether or not to keep crawling based on the DB size and current loop iteration.
# File 'lib/wgit/indexer.rb', line 205

def keep_crawling?(site_count, max_sites, max_data)
  return false if @db.size >= max_data
  return true if max_sites.negative?

  site_count < max_sites
end
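To illustrate the decision order with concrete values (the method is protected, so this is only a sketch of its semantics, assuming the database size is below max_data):

keep_crawling?(3, -1, 1_048_576_000) # => true  (negative max_sites means no site limit)
keep_crawling?(4, 5,  1_048_576_000) # => true  (4 < 5 sites crawled so far)
keep_crawling?(5, 5,  1_048_576_000) # => false (site limit reached)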
#write_doc_to_db(doc) ⇒ Object (protected)
Write the doc to the DB. Note that the unique url index on the documents collection deliberately prevents duplicate inserts.
# File 'lib/wgit/indexer.rb', line 216

def write_doc_to_db(doc)
  if @db.upsert(doc)
    Wgit.logger.info("Saved document for url: #{doc.url}")
  else
    Wgit.logger.info("Updated document for url: #{doc.url}")
  end
end
#write_urls_to_db(urls) ⇒ Integer (protected)
Write the urls to the DB. Note that the unique url index on the urls collection deliberately prevents duplicate inserts.
# File 'lib/wgit/indexer.rb', line 229

def write_urls_to_db(urls)
  count = 0

  return count unless urls.respond_to?(:each)

  urls.each do |url|
    if url.invalid?
      Wgit.logger.info("Ignoring invalid external url: #{url}")

      next
    end

    @db.insert(url)
    count += 1

    Wgit.logger.info("Inserted external url: #{url}")
  rescue Mongo::Error::OperationFailure
    Wgit.logger.info("External url already exists: #{url}")
  end

  count
end