Class: Grell::CrawlerManager

Inherits:
Object
Defined in:
lib/grell/crawler_manager.rb

Overview

Manages the state of the crawling process. It does not care about individual pages, only about logging, restarting, and quitting the crawler correctly.

Defined Under Namespace

Classes: PhantomJSManager


Constructor Details

#initialize(logger: nil, on_periodic_restart: {}, driver: nil) ⇒ CrawlerManager

logger: logger to use for Grell’s messages; when nil, a Logger writing to STDOUT is used.

on_periodic_restart: if set, the driver restarts every :each visits (100 by default) and then executes the :do block.

driver: the Capybara driver to use; when nil, one is created with CapybaraDriver.new.setup_capybara.



# File 'lib/grell/crawler_manager.rb', line 8

def initialize(logger: nil, on_periodic_restart: {}, driver: nil)
  Grell.logger = logger ? logger : Logger.new(STDOUT)
  @periodic_restart_block = on_periodic_restart[:do]
  @periodic_restart_period = on_periodic_restart[:each] || PAGES_TO_RESTART
  @driver = driver || CapybaraDriver.new.setup_capybara
  if @periodic_restart_period <= 0
    Grell.logger.warn "GRELL. Restart option misconfigured with a negative period. Ignoring option."
  end
end
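To make the shape of the on_periodic_restart hash concrete, here is a minimal sketch of how the constructor above reads its two keys. The helper name read_restart_options and the constant DEFAULT_PERIOD are hypothetical and not part of Grell; the value 100 is taken from the "100 default" note in the parameter description and is assumed to match PAGES_TO_RESTART.

```ruby
# Hypothetical helper, not part of Grell. It mirrors the two hash lookups
# performed in CrawlerManager#initialize.
DEFAULT_PERIOD = 100 # assumed value of PAGES_TO_RESTART

def read_restart_options(on_periodic_restart = {})
  {
    block: on_periodic_restart[:do],                      # nil when no block is given
    period: on_periodic_restart[:each] || DEFAULT_PERIOD  # falls back to the default
  }
end

# No hash at all: no block, default period.
no_restart = read_restart_options

# Both keys supplied: restart every 50 visits, then run the lambda.
custom = read_restart_options({ do: -> { puts "driver restarted" }, each: 50 })

# Only :do supplied: the period falls back to the default.
default_period = read_restart_options({ do: -> {} })

puts no_restart[:block].inspect
puts custom[:period]
puts default_period[:period]
```

With an empty hash the block is nil, which per check_periodic_restart means no periodic restart ever takes place.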

Class Method Details

.cleanup_all_processes ⇒ Object



# File 'lib/grell/crawler_manager.rb', line 41

def self.cleanup_all_processes
  PhantomJSManager.new.cleanup_all_processes
end

Instance Method Details

#check_periodic_restart(collection) ⇒ Object

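PhantomJS seems to consume more and more memory as it crawls. Periodic restart counters this by restarting the driver at a fixed interval. The restart only happens when a :do block was supplied, the period is positive, and the number of visited pages is a multiple of the period; the block is called after the driver has restarted.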



# File 'lib/grell/crawler_manager.rb', line 33

def check_periodic_restart(collection)
  return unless @periodic_restart_block
  return unless @periodic_restart_period > 0
  return unless (collection.visited_pages.size % @periodic_restart_period).zero?
  restart
  @periodic_restart_block.call
end
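The timing of the restart can be illustrated with a small, self-contained simulation. This is only a sketch: FakeDriver, FakeCollection, and RestartSketch are hypothetical stand-ins, not Grell classes. The check method copies the three guards of check_periodic_restart shown above.

```ruby
# Hypothetical stand-ins; none of these classes belong to Grell.
FakeDriver = Struct.new(:restarts) do
  def restart
    self.restarts += 1 # count restarts instead of touching a real browser
  end
end

FakeCollection = Struct.new(:visited_pages)

class RestartSketch
  def initialize(driver:, period:, block:)
    @driver = driver
    @period = period
    @block = block
  end

  # Same three guards as check_periodic_restart.
  def check(collection)
    return unless @block
    return unless @period > 0
    return unless (collection.visited_pages.size % @period).zero?
    @driver.restart
    @block.call
  end
end

driver = FakeDriver.new(0)
callbacks = []
sketch = RestartSketch.new(driver: driver, period: 3, block: -> { callbacks << :restarted })

collection = FakeCollection.new([])
7.times do |i|
  collection.visited_pages << "page-#{i + 1}" # simulate visiting one page
  sketch.check(collection)
end

puts driver.restarts # restarts fire when 3 and 6 pages have been visited
puts callbacks.size
```

Because the test is a plain modulo, a collection with zero visited pages would also pass it, so in this sketch the check is only called after a page has been added.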

#quit ⇒ Object

Quits the Poltergeist driver.



# File 'lib/grell/crawler_manager.rb', line 26

def quit
  Grell.logger.info "GRELL. Driver quitting"
  @driver.quit
end

#restart ⇒ Object

Restarts the PhantomJS process without modifying the state of visited and discovered pages.



# File 'lib/grell/crawler_manager.rb', line 19

def restart
  Grell.logger.info "GRELL. Driver restarting"
  @driver.restart
  Grell.logger.info "GRELL. Driver restarted"
end