Class: Wgit::WebCrawler

Inherits:
Object
Defined in:
lib/wgit/web_crawler.rb

Overview

Class that sets up a Wgit::Crawler and saves the indexed docs to a database. Will crawl the web forever if you let it :-)

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(database, max_sites_to_crawl = -1, max_data_size = 1048576000) ⇒ WebCrawler

Returns a new instance of WebCrawler.



# File 'lib/wgit/web_crawler.rb', line 24

def initialize(database,
               max_sites_to_crawl = -1,
               max_data_size = 1048576000)
  @crawler = Wgit::Crawler.new
  @db = database
  @max_sites_to_crawl = max_sites_to_crawl
  @max_data_size = max_data_size
end
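The default `max_data_size` of 1048576000 bytes is 1000 MiB, i.e. just under one binary gigabyte. A quick sanity check of that arithmetic:

```ruby
# The default max_data_size expressed in bytes and converted to MiB.
default_size = 1_048_576_000
puts default_size == 1000 * 1024 * 1024  # => true
puts default_size / (1024.0 * 1024.0)    # => 1000.0
```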

Instance Attribute Details

#crawler ⇒ Object (readonly)

Returns the value of attribute crawler.



# File 'lib/wgit/web_crawler.rb', line 22

def crawler
  @crawler
end

#db ⇒ Object (readonly)

Returns the value of attribute db.



# File 'lib/wgit/web_crawler.rb', line 22

def db
  @db
end

#max_data_size ⇒ Object

Returns the value of attribute max_data_size.



# File 'lib/wgit/web_crawler.rb', line 21

def max_data_size
  @max_data_size
end

#max_sites_to_crawl ⇒ Object

Returns the value of attribute max_sites_to_crawl.



# File 'lib/wgit/web_crawler.rb', line 21

def max_sites_to_crawl
  @max_sites_to_crawl
end

Instance Method Details

#crawl_the_web ⇒ Object

Retrieves URLs from the database and recursively crawls each site, storing its internal pages in the database and adding its external URLs to be crawled at a later date.



# File 'lib/wgit/web_crawler.rb', line 36

def crawl_the_web
  if max_sites_to_crawl < 0
    puts "Crawling until the database has been filled or it runs out of \
urls to crawl (which might be never)."
  end
  loop_count = 0

  while keep_crawling?(loop_count)
    puts "Current database size: #{db.size}"
    crawler.urls = db.uncrawled_urls

    if crawler.urls.empty?
      puts "No urls to crawl, exiting."
      break
    end
    puts "Starting crawl loop for: #{crawler.urls}"

    docs_count = 0
    urls_count = 0

    crawler.urls.each do |url|
      unless keep_crawling?(loop_count)
        puts "Reached max number of sites to crawl or database \
capacity, exiting."
        return
      end
      loop_count += 1

      url.crawled = true
      raise unless db.update(url) == 1

      site_docs_count = 0
      ext_links = crawler.crawl_site(url) do |doc|
        unless doc.empty?
          if write_doc_to_db(doc)
            docs_count += 1
            site_docs_count += 1
          end
        end
      end

      urls_count += write_urls_to_db(ext_links)
      puts "Crawled and saved #{site_docs_count} docs for the \
site: #{url}"
    end

    puts "Crawled and saved docs for #{docs_count} url(s) overall for \
this iteration."
    puts "Found and saved #{urls_count} external url(s) for the next \
iteration."
  end
end
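The `keep_crawling?` predicate used above is private and not documented on this page. A minimal, self-contained sketch of what such a check might look like, assuming a negative `max_sites_to_crawl` means "no site limit" and that crawling stops once the database reaches `max_data_size` (the parameter names here are hypothetical, passed explicitly rather than read from instance state):

```ruby
# Hypothetical sketch of a keep_crawling? style check: returns true while
# both the site-count limit and the database-size limit are unreached.
# A negative max_sites_to_crawl is assumed to mean "crawl without limit".
def keep_crawling?(loop_count, max_sites_to_crawl:, max_data_size:, db_size:)
  under_site_limit = max_sites_to_crawl.negative? || loop_count < max_sites_to_crawl
  under_size_limit = db_size < max_data_size
  under_site_limit && under_size_limit
end

puts keep_crawling?(5,  max_sites_to_crawl: 10, max_data_size: 1_048_576_000, db_size: 0)  # => true
puts keep_crawling?(10, max_sites_to_crawl: 10, max_data_size: 1_048_576_000, db_size: 0)  # => false
puts keep_crawling?(99, max_sites_to_crawl: -1, max_data_size: 1_048_576_000, db_size: 0)  # => true
```

Checking the actual implementation in `lib/wgit/web_crawler.rb` is recommended, since the real method reads these limits from the attributes documented above.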