Class: Wgit::WebCrawler

Inherits:
Object
Defined in:
lib/wgit/web_crawler.rb

Overview

Class that sets up a Wgit::Crawler and saves the indexed docs to a database. Will crawl the web forever if you let it :-)

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(database, max_sites_to_crawl = -1, max_data_size = 1048576000) ⇒ WebCrawler

Returns a new instance of WebCrawler.



# File 'lib/wgit/web_crawler.rb', line 24

def initialize(database,
               max_sites_to_crawl = -1,
               max_data_size = 1048576000)
  @crawler = Wgit::Crawler.new
  @db = database
  @max_sites_to_crawl = max_sites_to_crawl
  @max_data_size = max_data_size
end
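The default `max_data_size` of 1048576000 bytes is 1000 MiB, i.e. just under one binary gigabyte. A quick sanity check of that arithmetic:

```ruby
# The default max_data_size expressed in bytes and converted to MiB.
default_size = 1_048_576_000
puts default_size == 1000 * 1024 * 1024  # => true
puts default_size / (1024.0 * 1024.0)    # => 1000.0
```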

Instance Attribute Details

#crawler ⇒ Object (readonly)

Returns the value of attribute crawler.



# File 'lib/wgit/web_crawler.rb', line 22

def crawler
  @crawler
end

#db ⇒ Object (readonly)

Returns the value of attribute db.



# File 'lib/wgit/web_crawler.rb', line 22

def db
  @db
end

#max_data_size ⇒ Object

Returns the value of attribute max_data_size.



# File 'lib/wgit/web_crawler.rb', line 21

def max_data_size
  @max_data_size
end

#max_sites_to_crawl ⇒ Object

Returns the value of attribute max_sites_to_crawl.



# File 'lib/wgit/web_crawler.rb', line 21

def max_sites_to_crawl
  @max_sites_to_crawl
end

Instance Method Details

#crawl_the_web ⇒ Object

Retrieves URLs from the database and recursively crawls each site, storing its internal pages in the database and adding its external URLs to be crawled at a later date.



# File 'lib/wgit/web_crawler.rb', line 36

def crawl_the_web
  if max_sites_to_crawl < 0
    puts "Crawling until the database has been filled or it runs out of \
urls to crawl (which might be never)."
  end
  loop_count = 0

  while keep_crawling?(loop_count)
    puts "Current database size: #{db.size}"
    crawler.urls = db.uncrawled_urls

    if crawler.urls.empty?
      puts "No urls to crawl, exiting."
      break
    end
    puts "Starting crawl loop for: #{crawler.urls}"

    docs_count = 0
    urls_count = 0

    crawler.urls.each do |url|
      unless keep_crawling?(loop_count)
        puts "Reached max number of sites to crawl or database \
capacity, exiting."
        return
      end
      loop_count += 1

      url.crawled = true
      raise unless db.update(url) == 1

      site_docs_count = 0
      ext_links = crawler.crawl_site(url) do |doc|
        unless doc.empty?
          if write_doc_to_db(doc)
            docs_count += 1
            site_docs_count += 1
          end
        end
      end

      urls_count += write_urls_to_db(ext_links)
      puts "Crawled and saved #{site_docs_count} docs for the \
site: #{url}"
    end

    puts "Crawled and saved docs for #{docs_count} url(s) overall for \
this iteration."
    puts "Found and saved #{urls_count} external url(s) for the next \
iteration."
  end
end
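The `keep_crawling?` predicate used above is private and not documented on this page. A minimal, self-contained sketch of what such a check might look like, assuming a negative `max_sites_to_crawl` means "no site limit" and that crawling stops once the database reaches `max_data_size` (the parameter names here are hypothetical, passed explicitly rather than read from instance state):

```ruby
# Hypothetical sketch of a keep_crawling? style check: returns true while
# both the site-count limit and the database-size limit are unreached.
# A negative max_sites_to_crawl is assumed to mean "crawl without limit".
def keep_crawling?(loop_count, max_sites_to_crawl:, max_data_size:, db_size:)
  under_site_limit = max_sites_to_crawl.negative? || loop_count < max_sites_to_crawl
  under_size_limit = db_size < max_data_size
  under_site_limit && under_size_limit
end

puts keep_crawling?(5,  max_sites_to_crawl: 10, max_data_size: 1_048_576_000, db_size: 0)  # => true
puts keep_crawling?(10, max_sites_to_crawl: 10, max_data_size: 1_048_576_000, db_size: 0)  # => false
puts keep_crawling?(99, max_sites_to_crawl: -1, max_data_size: 1_048_576_000, db_size: 0)  # => true
```

Checking the actual implementation in `lib/wgit/web_crawler.rb` is recommended, since the real method reads these limits from the attributes documented above.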