Class: Wgit::WebCrawler

Inherits: Object
Defined in: lib/wgit/web_crawler.rb
Overview
Class which sets up a web crawler and saves the indexed docs to a database. It will crawl the web forever if you let it :-)
Instance Attribute Summary

- #crawler ⇒ Object (readonly)
  Returns the value of attribute crawler.
- #db ⇒ Object (readonly)
  Returns the value of attribute db.
- #max_data_size ⇒ Object
  Returns the value of attribute max_data_size.
- #max_sites_to_crawl ⇒ Object
  Returns the value of attribute max_sites_to_crawl.
Instance Method Summary

- #crawl_the_web ⇒ Object
  Retrieves URLs from the database and recursively crawls each site, storing its internal pages into the database and adding its external URLs to be crawled at a later date.
- #initialize(database, max_sites_to_crawl = -1, max_data_size = 1048576000) ⇒ WebCrawler (constructor)
  A new instance of WebCrawler.
Constructor Details
#initialize(database, max_sites_to_crawl = -1, max_data_size = 1048576000) ⇒ WebCrawler
Returns a new instance of WebCrawler.
# File 'lib/wgit/web_crawler.rb', line 24

def initialize(database, max_sites_to_crawl = -1, max_data_size = 1048576000)
  @crawler = Wgit::Crawler.new
  @db = database
  @max_sites_to_crawl = max_sites_to_crawl
  @max_data_size = max_data_size
end
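The defaults are worth noting: a negative max_sites_to_crawl means "no site limit", and max_data_size is 1048576000 bytes (1000 * 1024 * 1024, roughly 1 GB). For illustration, a minimal stand-in class mirroring the constructor above (FakeWebCrawler is hypothetical; it exists only so the defaults can be exercised without a real Wgit::Crawler or database connection):

```ruby
# Hypothetical stand-in that mirrors WebCrawler#initialize, so its
# default arguments can be shown without a crawler or database.
class FakeWebCrawler
  attr_reader :db, :max_sites_to_crawl, :max_data_size

  def initialize(database, max_sites_to_crawl = -1, max_data_size = 1048576000)
    @db = database
    @max_sites_to_crawl = max_sites_to_crawl
    @max_data_size = max_data_size
  end
end

wc = FakeWebCrawler.new(:database_handle)
puts wc.max_sites_to_crawl            # -1 => crawl indefinitely
puts wc.max_data_size                 # 1048576000 bytes
puts 1000 * 1024 * 1024 == wc.max_data_size
```

With the real class, the first argument would be a connected database instance rather than the placeholder symbol used here.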
Instance Attribute Details
#crawler ⇒ Object (readonly)
Returns the value of attribute crawler.
# File 'lib/wgit/web_crawler.rb', line 22

def crawler
  @crawler
end
#db ⇒ Object (readonly)
Returns the value of attribute db.
# File 'lib/wgit/web_crawler.rb', line 22

def db
  @db
end
#max_data_size ⇒ Object
Returns the value of attribute max_data_size.
# File 'lib/wgit/web_crawler.rb', line 21

def max_data_size
  @max_data_size
end
#max_sites_to_crawl ⇒ Object
Returns the value of attribute max_sites_to_crawl.
# File 'lib/wgit/web_crawler.rb', line 21

def max_sites_to_crawl
  @max_sites_to_crawl
end
Instance Method Details
#crawl_the_web ⇒ Object
Retrieves URLs from the database and recursively crawls each site, storing its internal pages into the database and adding its external URLs to be crawled at a later date.
# File 'lib/wgit/web_crawler.rb', line 36

def crawl_the_web
  if max_sites_to_crawl < 0
    puts "Crawling until the database has been filled or it runs out of \
urls to crawl (which might be never)."
  end
  loop_count = 0

  while keep_crawling?(loop_count) do
    puts "Current database size: #{db.size}"
    crawler.urls = db.uncrawled_urls

    if crawler.urls.empty?
      puts "No urls to crawl, exiting."
      break
    end
    puts "Starting crawl loop for: #{crawler.urls}"

    docs_count = 0
    urls_count = 0

    crawler.urls.each do |url|
      unless keep_crawling?(loop_count)
        puts "Reached max number of sites to crawl or database \
capacity, exiting."
        return
      end
      loop_count += 1

      url.crawled = true
      raise unless db.update(url) == 1

      site_docs_count = 0
      ext_links = crawler.crawl_site(url) do |doc|
        unless doc.empty?
          if write_doc_to_db(doc)
            docs_count += 1
            site_docs_count += 1
          end
        end
      end

      urls_count += write_urls_to_db(ext_links)
      puts "Crawled and saved #{site_docs_count} docs for the \
site: #{url}"
    end

    puts "Crawled and saved docs for #{docs_count} url(s) overall for \
this iteration."
    puts "Found and saved #{urls_count} external url(s) for the next \
iteration."
  end
end
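Each iteration of the loop above is gated on keep_crawling?, a private method whose body is not shown on this page. A plausible sketch of its logic, consistent with how max_sites_to_crawl and max_data_size are described here (an assumption, not the gem's actual code):

```ruby
# Hypothetical reconstruction of the private keep_crawling? predicate.
# A negative max_sites_to_crawl disables the site limit entirely; the
# database size cap (max_data_size, in bytes) always applies.
def keep_crawling?(loop_count, max_sites_to_crawl:, db_size:, max_data_size:)
  return false if db_size >= max_data_size      # database capacity reached
  return true  if max_sites_to_crawl.negative?  # no site limit configured
  loop_count < max_sites_to_crawl               # still under the site limit
end

puts keep_crawling?(99, max_sites_to_crawl: -1, db_size: 0, max_data_size: 1048576000)  # true: no site limit
puts keep_crawling?(10, max_sites_to_crawl: 10, db_size: 0, max_data_size: 1048576000)  # false: site limit hit
```

The real method takes only loop_count, reading the other values from the instance's attributes; keyword arguments are used here purely to keep the sketch self-contained.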