Class: Grell::Crawler
- Inherits: Object
- Hierarchy: Object, Grell::Crawler
- Defined in: lib/grell/crawler.rb
Overview
This class starts and controls the crawling.
Instance Attribute Summary
- #collection ⇒ Object (readonly)
  Returns the value of attribute collection.
- #manager ⇒ Object (readonly)
  Returns the value of attribute manager.
Instance Method Summary
- #crawl(site, block) ⇒ Object
- #initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options) ⇒ Crawler (constructor)
  Creates a crawler.
- #start_crawling(url, &block) ⇒ Object
  Main method: starts crawling at the given URL and calls the block for each page found.
Constructor Details
#initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options) ⇒ Crawler
Creates a crawler.
- evaluate_in_each_page: JavaScript block to evaluate in each page we crawl.
- add_match_block: block evaluated to decide whether a page is part of the collection.
- manager_options: options passed to the manager class.
- whitelist: sets up a whitelist filter; accepts a regexp, a string, or an array of either to be matched.
- blacklist: sets up a blacklist filter; accepts a regexp, a string, or an array of either to be matched.
# File 'lib/grell/crawler.rb', line 12
def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
  @collection = nil
  @manager = CrawlerManager.new(manager_options)
  @evaluate_in_each_page = evaluate_in_each_page
  @add_match_block = add_match_block
  @whitelist_regexp = Regexp.union(whitelist)
  @blacklist_regexp = Regexp.union(blacklist)
end
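
The constructor passes both filters through Regexp.union, which is why a regexp, a string, or an array of either is accepted. The sketch below illustrates that behaviour in isolation. The `allowed?` helper and the example URLs are hypothetical and not part of Grell, and the rule it applies (a URL must match the whitelist and must not match the blacklist) is an assumption about how the two regexps are combined, since that code is not shown on this page.

```ruby
# Defaults copied from the constructor signature.
DEFAULT_WHITELIST = /.*/  # matches every URL
DEFAULT_BLACKLIST = /a^/  # never matches: a line start cannot directly follow "a"

# Hypothetical helper (not a Grell method): applies both filters to one URL.
# Assumption: a URL passes when it matches the whitelist and not the blacklist.
def allowed?(url, whitelist: DEFAULT_WHITELIST, blacklist: DEFAULT_BLACKLIST)
  whitelist_regexp = Regexp.union(whitelist) # Regexp, String, or Array of either
  blacklist_regexp = Regexp.union(blacklist)
  url.match?(whitelist_regexp) && !url.match?(blacklist_regexp)
end

allowed?("http://example.com/blog/1")                                   # => true
allowed?("http://example.com/blog/1", whitelist: "/blog/")              # => true
allowed?("http://example.com/about", whitelist: ["/blog/", /\/news\//]) # => false
allowed?("http://example.com/blog/logout", blacklist: /logout/)         # => false
```

Strings given to Regexp.union are escaped, so a whitelist of "/blog/" matches that literal text rather than being read as a pattern.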
Instance Attribute Details
#collection ⇒ Object (readonly)
Returns the value of attribute collection.
# File 'lib/grell/crawler.rb', line 4
def collection
  @collection
end
#manager ⇒ Object (readonly)
Returns the value of attribute manager.
# File 'lib/grell/crawler.rb', line 4
def manager
  @manager
end
Instance Method Details
#crawl(site, block) ⇒ Object
# File 'lib/grell/crawler.rb', line 35
def crawl(site, block)
  Grell.logger.info "Visiting #{site.url}, visited_links: #{@collection.visited_pages.size}, discovered #{@collection.discovered_pages.size}"
  crawl_site(site)

  if block # The user of this block can send us a :retry to retry accessing the page
    while crawl_block(block, site) == :retry
      Grell.logger.info "Retrying our visit to #{site.url}"
      crawl_site(site)
    end
  end

  site.links.each do |url|
    @collection.create_page(url, site.id)
  end
end
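
The retry protocol in #crawl can be illustrated with a minimal, self-contained sketch. The `visit_with_retry` method and the `handler` lambda are hypothetical stand-ins: the real crawl_site and crawl_block helpers are not shown on this page, so the sketch assumes that crawl_block simply returns whatever the caller's block returns.

```ruby
# Hypothetical stand-in for the retry loop in #crawl: the page is visited
# again for as long as the caller's block returns :retry.
def visit_with_retry(page, block)
  visits = 1 # the first visit always happens
  while block.call(page) == :retry
    visits += 1 # stands in for the repeated crawl_site(site) call
  end
  visits
end

attempts = 0
# Ask for a retry on the first two calls, then accept the page.
handler = lambda do |_page|
  attempts += 1
  attempts < 3 ? :retry : :ok
end

visits = visit_with_retry("http://example.com/", handler)
puts "visited #{visits} times" # prints "visited 3 times"
```

The loop in the listing has no retry limit, so a block that always returns :retry would keep revisiting the same page. A caller probably wants to keep its own counter, as the lambda here does.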
#start_crawling(url, &block) ⇒ Object
Main method: starts crawling at the given URL and calls the block for each page found.
# File 'lib/grell/crawler.rb', line 22
def start_crawling(url, &block)
  Grell.logger.info "GRELL Started crawling"
  @collection = PageCollection.new(@add_match_block)
  @collection.create_page(url, nil)

  while !@collection.discovered_pages.empty?
    crawl(@collection.next_page, block)
    @manager.check_periodic_restart(@collection)
  end

  Grell.logger.info "GRELL finished crawling"
end
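
The overall shape of this loop (seed one page, visit pages until none remain discovered, register every link found) can be sketched without Grell. Everything below is hypothetical: `LINKS` is an in-memory stand-in for a website, and the first-in-first-out ordering and duplicate check are assumptions about what PageCollection does, since its code is not shown on this page.

```ruby
# Hypothetical in-memory "site": each URL maps to the links found on it.
LINKS = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/"],
  "/b" => ["/c"],
  "/c" => []
}.freeze

# Visits every reachable URL once, yielding each one to the caller's block.
def crawl_all(start_url)
  discovered = [start_url] # pages waiting to be visited
  visited = []
  until discovered.empty?
    url = discovered.shift
    visited << url
    yield url if block_given?
    LINKS.fetch(url, []).each do |link|
      # Assumed deduplication: skip links already visited or already queued.
      discovered << link unless visited.include?(link) || discovered.include?(link)
    end
  end
  visited
end

order = crawl_all("/") { |url| puts "visited #{url}" }
# order => ["/", "/a", "/b", "/c"]
```

Without some duplicate check the loop would never end on a site whose pages link back to each other, as "/" and "/a" do here.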