Class: Grell::Crawler

Inherits:
Object
Defined in:
lib/grell/crawler.rb

Overview

This is the class that starts and controls the crawling.

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options) ⇒ Crawler

Creates a crawler.

evaluate_in_each_page: JavaScript block to evaluate in each page we crawl.
add_match_block: block evaluated to decide whether a page is part of the collection.
manager_options: options passed to the manager class.
whitelist: sets up a whitelist filter; accepts a regexp, a string, or an array of either to be matched.
blacklist: sets up a blacklist filter; accepts a regexp, a string, or an array of either to be matched.



# File 'lib/grell/crawler.rb', line 12

def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
  @collection = nil
  @manager = CrawlerManager.new(manager_options)
  @evaluate_in_each_page = evaluate_in_each_page
  @add_match_block = add_match_block
  @whitelist_regexp = Regexp.union(whitelist)
  @blacklist_regexp = Regexp.union(blacklist)
end
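
Both filters are passed through Regexp.union, which is why a regexp, a string, or an array of either is accepted. The following is a minimal sketch in plain Ruby of what that call produces. The URLs and the allowed? helper are made up for illustration; how Grell itself applies the two regexps is not shown on this page.

```ruby
# Regexp.union accepts a single pattern or an array of patterns.
# Strings are escaped and matched literally; regexps are kept as they are.
WHITELIST = Regexp.union(['example.com/docs', %r{/blog/\d+}])
BLACKLIST = Regexp.union('logout')

# The defaults from the constructor signature:
MATCH_ALL  = Regexp.union(/.*/) # matches every string
MATCH_NONE = Regexp.union(/a^/) # an "a" followed by a line start: never matches

# Hypothetical helper (not part of Grell): one plausible way to combine both filters.
def allowed?(url, whitelist, blacklist)
  url.match?(whitelist) && !url.match?(blacklist)
end

puts allowed?('http://example.com/docs/intro', WHITELIST, BLACKLIST)
puts allowed?('http://example.com/docs/logout', WHITELIST, BLACKLIST)
```

Because strings are escaped, the dot in 'example.com/docs' only matches a literal dot. Pass a regexp when pattern syntax is wanted.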

Instance Attribute Details

#collection ⇒ Object (readonly)

Returns the value of attribute collection.



# File 'lib/grell/crawler.rb', line 4

def collection
  @collection
end

#manager ⇒ Object (readonly)

Returns the value of attribute manager.



# File 'lib/grell/crawler.rb', line 4

def manager
  @manager
end

Instance Method Details

#crawl(site, block) ⇒ Object



# File 'lib/grell/crawler.rb', line 35

def crawl(site, block)
  Grell.logger.info "Visiting #{site.url}, visited_links: #{@collection.visited_pages.size}, discovered #{@collection.discovered_pages.size}"
  crawl_site(site)

  if block # The user of this block can send us a :retry to retry accessing the page
    while crawl_block(block, site) == :retry
      Grell.logger.info "Retrying our visit to #{site.url}"
      crawl_site(site)
    end
  end

  site.links.each do |url|
    @collection.create_page(url, site.id)
  end
end
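
The retry loop in #crawl revisits the page for as long as the user's block returns :retry. The sketch below reproduces only that control flow with stand-ins: ToyPage, visit_with_retry and the fake fetcher are hypothetical names, not part of Grell, and the block is invoked with yield rather than through the crawl_block helper, whose body is not shown on this page.

```ruby
# Hypothetical stand-in for a page; Grell's real page object is not shown here.
ToyPage = Struct.new(:url, :status, :visits)

# Mirrors the control flow of #crawl: visit once, then revisit
# for as long as the block returns :retry.
def visit_with_retry(page, fetch)
  fetch.call(page)
  while yield(page) == :retry
    fetch.call(page)
  end
  page
end

# Fake fetcher: answers 503 on the first visit and 200 afterwards.
flaky_fetch = lambda do |page|
  page.visits += 1
  page.status = page.visits < 2 ? 503 : 200
end

page = visit_with_retry(ToyPage.new('http://example.com/', nil, 0), flaky_fetch) do |p|
  # Cap the retries so a permanently failing page cannot loop forever.
  :retry if p.status == 503 && p.visits < 5
end

puts "#{page.url} => #{page.status} after #{page.visits} visits"
```

In this sketch nothing bounds the number of retries, so the cap lives in the block.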

#start_crawling(url, &block) ⇒ Object

Main method. It starts crawling at the given URL and calls the block for each page found.



# File 'lib/grell/crawler.rb', line 22

def start_crawling(url, &block)
  Grell.logger.info "GRELL Started crawling"
  @collection = PageCollection.new(@add_match_block)
  @collection.create_page(url, nil)

  while !@collection.discovered_pages.empty?
    crawl(@collection.next_page, block)
    @manager.check_periodic_restart(@collection)
  end

  Grell.logger.info "GRELL finished crawling"
end
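
The loop in #start_crawling seeds the collection with the start URL and then drains it, with each crawled page adding its links back to the collection. The sketch below models that shape with an in-memory link table instead of a browser. TOY_LINKS, ToyCollection and toy_crawl are hypothetical names, and the rule that a URL is queued only once is an assumption of this sketch; the real PageCollection is not shown on this page.

```ruby
require 'set'

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_LINKS = {
  '/'  => ['/a', '/b'],
  '/a' => ['/b', '/'],
  '/b' => []
}.freeze

# Simplified stand-in for the page collection: a queue of discovered URLs
# plus a set of every URL seen so far, so no page is queued twice.
class ToyCollection
  attr_reader :visited

  def initialize
    @discovered = []
    @visited = []
    @seen = Set.new
  end

  # Set#add? returns nil when the URL is already known.
  def create_page(url)
    @discovered << url if @seen.add?(url)
  end

  def discovered_empty?
    @discovered.empty?
  end

  def next_page
    url = @discovered.shift
    @visited << url
    url
  end
end

# Same shape as #start_crawling: seed the collection, then drain it.
def toy_crawl(start_url)
  collection = ToyCollection.new
  collection.create_page(start_url)
  until collection.discovered_empty?
    url = collection.next_page
    yield url if block_given?
    TOY_LINKS.fetch(url, []).each { |link| collection.create_page(link) }
  end
  collection.visited
end

toy_crawl('/') { |url| puts "crawled #{url}" }
```

The loop ends once no discovered page is left unvisited, which is the same exit condition as the while loop above.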