Class: Wgit::Indexer
Inherits: Object
Defined in: lib/wgit/indexer.rb
Overview
Class which crawls and saves the Documents to a database. Can be thought of as a combination of Wgit::Crawler and Wgit::Database.
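For example, a newly constructed Indexer can crawl and store a single page. The sketch below is an assumption-laden illustration, not part of the documented API surface: it assumes the connection details for Wgit::Database are already configured elsewhere (e.g. via your environment), and the url is a placeholder.

require 'wgit'

# A minimal sketch: build an Indexer from a Database and a Crawler,
# then index a single page. Database connection details are assumed
# to be configured elsewhere.
indexer = Wgit::Indexer.new(Wgit::Database.new, Wgit::Crawler.new)
indexer.index_url(Wgit::Url.new('http://example.com'))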
Instance Attribute Summary

- #crawler ⇒ Object (readonly)
  The crawler used to index the WWW.
- #db ⇒ Object (readonly) (also: #database)
  The database instance used to store Urls and Documents in.
Instance Method Summary

- #index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer (also: #index_r)
  Crawls a single website's pages and stores them into the database.
- #index_url(url, insert_externals: false) {|doc| ... } ⇒ Object
  Crawls a single webpage and stores it into the database.
- #index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object (also: #index)
  Crawls one or more webpages and stores them into the database.
- #index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10) ⇒ Object
  Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages into the database and adding their external urls to be crawled later on.
- #initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer (constructor)
  Initialize the Indexer.
- #keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)
  Returns whether or not to keep crawling based on the DB size and current loop iteration.
- #upsert_doc(doc) ⇒ Object (protected)
  Write the doc to the DB.
- #upsert_external_urls(urls) ⇒ Integer (protected)
  Write the external urls to the DB.
- #upsert_url_and_redirects(url) ⇒ Integer (protected)
  Upsert the url and its redirects, setting all to crawled = true.
Constructor Details

#initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer

Initialize the Indexer with the given database and crawler instances.
Instance Attribute Details
#crawler ⇒ Object (readonly)
The crawler used to index the WWW.
# File 'lib/wgit/indexer.rb', line 11

def crawler
  @crawler
end
#db ⇒ Object (readonly) Also known as: database
The database instance used to store Urls and Documents in.
# File 'lib/wgit/indexer.rb', line 14

def db
  @db
end
Instance Method Details
#index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r
Crawls a single website's pages and stores them into the database. There is no max download limit, so be careful which sites you index. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests.
# File 'lib/wgit/indexer.rb', line 127

def index_site(
  url, insert_externals: false, follow: :default,
  allow_paths: nil, disallow_paths: nil
)
  parser = parse_robots_txt(url)
  if parser&.no_index?
    upsert_url_and_redirects(url)
    return 0
  end

  allow_paths, disallow_paths = merge_paths(parser, allow_paths, disallow_paths)
  crawl_opts = { follow:, allow_paths:, disallow_paths: }
  total_pages_indexed = 0

  ext_urls = @crawler.crawl_site(url, **crawl_opts) do |doc|
    next if no_index?(@crawler.last_response, doc)

    result = block_given? ? yield(doc) : true
    if result && !doc.empty?
      upsert_doc(doc)
      total_pages_indexed += 1
    end
  end

  upsert_url_and_redirects(url)
  upsert_external_urls(ext_urls) if insert_externals && ext_urls

  Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
for the site: #{url}")

  total_pages_indexed
end
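As a usage sketch (assuming an indexer built as in the Overview above; example.com is a placeholder), the block's return value decides whether each crawled page is saved:

# Index a whole site and also save its external urls for later crawling.
# Returning false/nil from the block skips saving that page.
indexed = indexer.index_site(
  Wgit::Url.new('http://example.com'), insert_externals: true
) do |doc|
  doc.title&.include?('Example') # only save pages whose title matches
end
Wgit.logger.info("Indexed #{indexed} page(s)")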
#index_url(url, insert_externals: false) {|doc| ... } ⇒ Object
Crawls a single webpage and stores it into the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests in relation to the given url.
# File 'lib/wgit/indexer.rb', line 198

def index_url(url, insert_externals: false)
  parser = parse_robots_txt(url)
  if parser && (parser.no_index? || contains_path?(parser.disallow_paths, url))
    upsert_url_and_redirects(url)
    return
  end

  document = @crawler.crawl_url(url) do |doc|
    break if no_index?(@crawler.last_response, doc)

    result = block_given? ? yield(doc) : true
    upsert_doc(doc) if result && !doc.empty?
  end

  upsert_url_and_redirects(url)

  ext_urls = document&.external_links
  upsert_external_urls(ext_urls) if insert_externals && ext_urls

  nil
end
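For example (a sketch assuming the indexer from the Overview; the url is a placeholder), the block can filter on the crawled Document before it gets saved:

# Crawl one page; only save it if its extracted text mentions 'Ruby'.
indexer.index_url(Wgit::Url.new('http://example.com/about')) do |doc|
  doc.text.any? { |t| t.include?('Ruby') }
end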
#index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index
Crawls one or more webpages and stores them into the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests in relation to the given urls.
# File 'lib/wgit/indexer.rb', line 176

def index_urls(*urls, insert_externals: false, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { insert_externals: }
  Wgit::Utils.each(urls) { |url| index_url(url, **opts, &block) }

  nil
end
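For example (a sketch assuming the indexer from the Overview; both urls are placeholders):

# Index several pages in one call; the block applies to each crawled page.
urls = %w[http://example.com http://example.org].map { |u| Wgit::Url.new(u) }
indexer.index_urls(*urls) do |doc|
  Wgit.logger.info("Crawled #{doc.url}")
  true # save every page
end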
#index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10) ⇒ Object
Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages into the database and adding their external urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along. This method will honour all sites' robots.txt and 'noindex' requests.
# File 'lib/wgit/indexer.rb', line 43

def index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10)
  if max_sites.negative?
    Wgit.logger.info("Indexing until the database has been filled or it \
runs out of urls to crawl (which might be never)")
  end
  site_count = 0

  while keep_crawling?(site_count, max_sites, max_data)
    Wgit.logger.info("Current database size: #{@db.size}")

    uncrawled_urls = @db.uncrawled_urls(limit: max_urls_per_iteration)

    if uncrawled_urls.empty?
      Wgit.logger.info('No urls to crawl, exiting')
      return
    end
    Wgit.logger.info("Starting indexing loop for: #{uncrawled_urls.map(&:to_s)}")

    docs_count = 0
    urls_count = 0

    uncrawled_urls.each do |url|
      unless keep_crawling?(site_count, max_sites, max_data)
        Wgit.logger.info("Reached max number of sites to crawl or \
database capacity, exiting")
        return
      end
      site_count += 1

      parser = parse_robots_txt(url)
      if parser&.no_index?
        upsert_url_and_redirects(url)
        next
      end

      site_docs_count = 0
      ext_links = @crawler.crawl_site(
        url, allow_paths: parser&.allow_paths, disallow_paths: parser&.disallow_paths
      ) do |doc|
        next if doc.empty? || no_index?(@crawler.last_response, doc)

        upsert_doc(doc)
        docs_count += 1
        site_docs_count += 1
      end

      upsert_url_and_redirects(url)

      urls_count += upsert_external_urls(ext_links)
    end

    Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
url(s) during this iteration")
    Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
future iterations")
  end

  nil
end
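For example, a bounded run (a sketch assuming the indexer from the Overview and a database already seeded with at least one uncrawled url):

# Index up to 3 sites, stopping early if the database exceeds ~500 MB.
indexer.index_www(max_sites: 3, max_data: 500_000_000)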
#keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)
Returns whether or not to keep crawling based on the DB size and current loop iteration.
# File 'lib/wgit/indexer.rb', line 232

def keep_crawling?(site_count, max_sites, max_data)
  return false if @db.size >= max_data
  return true if max_sites.negative?

  site_count < max_sites
end
#upsert_doc(doc) ⇒ Object (protected)
Write the doc to the DB. Note that the unique url index on the documents collection deliberately prevents duplicate inserts. If the document already exists, then it will be updated in the DB.
# File 'lib/wgit/indexer.rb', line 244

def upsert_doc(doc)
  if @db.upsert(doc)
    Wgit.logger.info("Saved document for url: #{doc.url}")
  else
    Wgit.logger.info("Updated document for url: #{doc.url}")
  end
end
#upsert_external_urls(urls) ⇒ Integer (protected)
Write the external urls to the DB. For any external url, its origin will be inserted e.g. if the external url is http://example.com/contact then http://example.com will be inserted into the database. Note that the unique url index on the urls collection deliberately prevents duplicate inserts.
# File 'lib/wgit/indexer.rb', line 271

def upsert_external_urls(urls)
  urls = urls
         .reject(&:invalid?)
         .map(&:to_origin)
         .uniq
  return 0 if urls.empty?

  count = @db.bulk_upsert(urls)
  Wgit.logger.info("Saved #{count} external urls")

  count
end
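The origin mapping described above can be illustrated directly with Wgit::Url (example.com is a placeholder):

# Only the url's origin is queued for a future crawl.
Wgit::Url.new('http://example.com/contact').to_origin # => "http://example.com"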
#upsert_url_and_redirects(url) ⇒ Integer (protected)
Upsert the url and its redirects, setting all to crawled = true.
# File 'lib/wgit/indexer.rb', line 256

def upsert_url_and_redirects(url)
  url.crawled = true unless url.crawled?

  # Upsert the url and any url redirects, setting them as crawled also.
  @db.bulk_upsert(url.redirects_journey)
end