Class: Scruber::Core::Crawler
Overview
Crawler class. The main runner class for scrapers.
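For orientation, a minimal usage sketch in Ruby. `Scruber.run :name do ... end` (the form referenced by the constructor's error message) builds a Crawler and runs the given block; the `get` seed helper, the `format: :html` option, and the parsed `doc` argument are assumptions about typical Scruber extensions, not part of this class's documented API.

require 'scruber'

Scruber.run :example_scraper do
  # `get` is assumed to be a dynamic helper (registered through
  # register_method_missing by an extension) that enqueues a seed page.
  get 'https://example.com/'

  # Parse downloaded pages of type :seed; format: :html and the parsed
  # `doc` second argument are assumptions.
  parser :seed, format: :html do |page, doc|
    puts doc.at('title')&.text
  end
end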
Instance Attribute Summary
- #fetcher ⇒ Object (readonly)
  Returns the value of attribute fetcher.
- #queue ⇒ Object (readonly)
  Returns the value of attribute queue.
- #scraper_name ⇒ Object (readonly)
  Returns the value of attribute scraper_name.
Class Method Summary
- ._registered_method_missings ⇒ Hash
  Registered method-missing callbacks dictionary.
- .register_method_missing(pattern, &block) ⇒ void
  Register a method-missing callback.
Instance Method Summary
- #initialize(*args) ⇒ Scruber::Core::Crawler (constructor)
  Initialize the crawler with a scraper name and/or options.
- #method_missing(method_sym, *arguments, &block) ⇒ type
  Method missing callback.
- #on_complete(priority = 1, &block) ⇒ void
  Register a callback that will be executed when downloading and parsing are complete.
- #on_page_error(&block) ⇒ void
  Register a callback that will be executed for error pages, such as 404 or 500. Attention! To prevent an infinite loop, the callback must call one of these methods on the page: page.processed!, page.delete, or page.redownload!(0).
- #parser(page_type, options = {}, &block) ⇒ void
  Register a parser.
- #respond_to?(method_sym, include_private = false) ⇒ Boolean
- #run(&block) ⇒ Object
  Crawling engine.
Constructor Details
#initialize(*args) ⇒ Scruber::Core::Crawler
# File 'lib/scruber/core/crawler.rb', line 31

def initialize(*args)
  if args.first.is_a?(Hash)
    scraper_name = nil
    @options = args.first
  else
    scraper_name, @options = args
    @options ||= {}
  end
  @scraper_name = scraper_name.present? ? scraper_name : ENV['SCRUBER_SCRAPER_NAME']
  raise Scruber::ArgumentError.new("Scraper name is empty. Pass it to `Scruber.run :name do` or through ENV['SCRUBER_SCRAPER_NAME']") if @scraper_name.blank?
  @scraper_name = @scraper_name.to_sym
  @callbacks_options = {}
  @callbacks = {}
  @on_page_error_callback = nil
  @on_complete_callbacks = []
  Scruber.configuration.merge_options(@options)
  ActiveSupport::Dependencies.autoload_paths = Scruber.configuration.autoload_paths
  @queue = Scruber::Queue.new(scraper_name: @scraper_name)
  @fetcher = Scruber::Fetcher.new
  load_extenstions
end
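Both documented calling styles, sketched below; the :silent option key is only illustrative:

# Explicit scraper name plus an options hash:
crawler = Scruber::Core::Crawler.new(:example_scraper, silent: true)

# Options hash only; the name is then taken from ENV['SCRUBER_SCRAPER_NAME']:
ENV['SCRUBER_SCRAPER_NAME'] = 'example_scraper'
crawler = Scruber::Core::Crawler.new(silent: true)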
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(method_sym, *arguments, &block) ⇒ type
Method missing callback. Scruber allows you to register a regexp pattern together with a proc that will process matching calls.
# File 'lib/scruber/core/crawler.rb', line 106

def method_missing(method_sym, *arguments, &block)
  Scruber::Core::Crawler._registered_method_missings.each do |(pattern, func)|
    if (scan_results = method_sym.to_s.scan(pattern)).present?
      return instance_exec(method_sym, scan_results, arguments+[block], &(func))
    end
  end
  super
end
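A sketch of the mechanism: a regexp registered on the class turns otherwise-undefined calls into real behavior. The parse_* pattern and the #parser shortcut it generates here are illustrative, not the gem's built-in registrations.

# The registered block receives the called method name, the regexp scan
# results, and the original arguments with the caller's block appended.
Scruber::Core::Crawler.register_method_missing(/\Aparse_(\w+)\z/) do |method_sym, scan_results, args_and_block|
  *args, block = args_and_block
  page_type = scan_results.first.first      # first capture of the first match
  parser(page_type.to_sym, *args, &block)   # delegate to the documented #parser
end

crawler = Scruber::Core::Crawler.new(:example_scraper)
crawler.respond_to?(:parse_product)   # => true, the pattern matches

# Dispatched through method_missing; equivalent to crawler.parser(:product) { |page| page }
crawler.parse_product do |page|
  page
end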
Instance Attribute Details
#fetcher ⇒ Object (readonly)
Returns the value of attribute fetcher.
# File 'lib/scruber/core/crawler.rb', line 19

def fetcher
  @fetcher
end
#queue ⇒ Object (readonly)
Returns the value of attribute queue.
# File 'lib/scruber/core/crawler.rb', line 19

def queue
  @queue
end
#scraper_name ⇒ Object (readonly)
Returns the value of attribute scraper_name.
# File 'lib/scruber/core/crawler.rb', line 19

def scraper_name
  @scraper_name
end
Class Method Details
._registered_method_missings ⇒ Hash
Registered method-missing callbacks dictionary.
# File 'lib/scruber/core/crawler.rb', line 142

def _registered_method_missings
  @registered_method_missings ||= {}
end
.register_method_missing(pattern, &block) ⇒ void
This method returns an undefined value.
Register a method-missing callback.
# File 'lib/scruber/core/crawler.rb', line 134

def register_method_missing(pattern, &block)
  _registered_method_missings[pattern] = block
end
Instance Method Details
#on_complete(priority = 1, &block) ⇒ void
This method returns an undefined value.
Register a callback that will be executed when downloading and parsing are complete, for example when you need to write results to a file or to close open files.
# File 'lib/scruber/core/crawler.rb', line 161

def on_complete(priority=1, &block)
  @on_complete_callbacks.push [priority, block]
end
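A sketch of the use case described above; callbacks are instance_exec'd on the crawler and higher priority values run first. The @results variable and file name are illustrative.

crawler.on_complete(10) do
  # Higher priority runs first: flush collected rows to disk.
  File.write('results.txt', (@results || []).join("\n"))
end

crawler.on_complete do
  # Default priority 1 runs afterwards, e.g. for cleanup or notification.
  puts 'crawl finished'
end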
#on_page_error(&block) ⇒ void
This method returns an undefined value.
Register a callback that will be executed for error pages, such as 404 or 500 responses. Attention! To prevent an infinite loop, the callback must call one of these methods on the page: page.processed!, page.delete, or page.redownload!(0).
# File 'lib/scruber/core/crawler.rb', line 185

def on_page_error(&block)
  @on_page_error_callback = block
end
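A sketch that always settles the page, as required above. The page.response_code attribute used to decide between retrying and dropping is an assumption.

crawler.on_page_error do |page|
  if page.response_code == 500   # assumed attribute: retry transient server errors
    page.redownload!(0)
  else
    page.delete                  # give up, e.g. on 404, so #run does not loop forever
  end
end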
#parser(page_type, options = {}, &block) ⇒ void
This method returns an undefined value.
Register a parser for pages of the given page_type.
# File 'lib/scruber/core/crawler.rb', line 93

def parser(page_type, options={}, &block)
  register_callback(page_type, options, &block)
end
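A registration sketch. #run passes the raw page and the processed page to the block; the format: :html option and the Nokogiri-style document it is assumed to produce are not documented on this page.

crawler.parser :product, format: :html do |page, doc|
  # page is the queued page object; doc is assumed to be a parsed HTML document.
  name = doc.at('h1')&.text
  puts "#{page.url}: #{name}"    # page.url is an assumed attribute
end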
#respond_to?(method_sym, include_private = false) ⇒ Boolean
# File 'lib/scruber/core/crawler.rb', line 115

def respond_to?(method_sym, include_private = false)
  !Scruber::Core::Crawler._registered_method_missings.find do |(pattern, block)|
    if method_sym.to_s =~ pattern
      true
    else
      false
    end
  end.nil? || super(method_sym, include_private)
end
#run(&block) ⇒ Object
Crawling engine. Evaluates the given block in the crawler's context, downloads queued pages and dispatches them to registered parsers until the queue has no more work, then runs the on_complete callbacks and returns the value of the highest-priority one.
# File 'lib/scruber/core/crawler.rb', line 60

def run(&block)
  instance_eval &block
  while @queue.has_work? do
    @fetcher.run @queue
    show_progress
    while page = @queue.fetch_downloaded do
      if @callbacks[page.page_type.to_sym]
        processed_page = process_page(page, page.page_type.to_sym)
        instance_exec page, processed_page, &(@callbacks[page.page_type.to_sym])
        page.processed! unless page.sent_to_redownload?
      end
    end
    if @on_page_error_callback
      while page = @queue.fetch_error do
        instance_exec page, &(@on_page_error_callback)
      end
    end
  end
  @on_complete_callbacks.sort_by{|c| -c[0] }.map do |(_,callback)|
    instance_exec &(callback)
  end.first
end
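Putting it together: #run instance_evals the block, so crawler methods can be called without a receiver, and it returns the value of the highest-priority on_complete callback. queue.add and page.response_body below are assumptions about the queue and page APIs.

crawler = Scruber::Core::Crawler.new(:example_scraper)

titles = crawler.run do
  queue.add 'https://example.com/', page_type: :seed   # assumed queue API

  parser :seed do |page|
    (@titles ||= []) << page.response_body[/<title>(.*?)<\/title>/m, 1]   # assumed attribute
  end

  on_complete do
    @titles   # returned by #run as the highest-priority callback's value
  end
end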