Class: Scruber::Core::Crawler

Inherits:
Object
Defined in:
lib/scruber/core/crawler.rb

Overview

Crawler class

Main class-runner for scrapers.

Examples:

Simple scraper

Scruber::Core::Crawler.new(:simple).run do
  get 'http://example.com'
  parse :html do |page,html|
    puts html.at('title').text
  end
end

Author:

  • Ivan Goncharov

Constructor Details

#initialize(*args) ⇒ Scruber::Core::Crawler

Initializes the crawler with a scraper name, with configuration options, or with both.

Crawler.new(:sample, fetcher_adapter: :custom)
Crawler.new(:sample)
Crawler.new(fetcher_adapter: :custom)

Parameters:

  • args (Array)

    If the first argument is a Symbol, it is used as scraper_name; a Hash is used as configuration options (see Scruber::Core::Configuration). If no name is passed, ENV['SCRUBER_SCRAPER_NAME'] is used instead.

Raises:

  • (Scruber::ArgumentError)

    if no scraper name is passed and ENV['SCRUBER_SCRAPER_NAME'] is empty


# File 'lib/scruber/core/crawler.rb', line 31

def initialize(*args)
  if args.first.is_a?(Hash)
    scraper_name = nil
    @options = args.first
  else
    scraper_name, @options = args
    @options ||= {}
  end
  @scraper_name = scraper_name.present? ? scraper_name : ENV['SCRUBER_SCRAPER_NAME']
  raise Scruber::ArgumentError.new("Scraper name is empty. Pass it to `Scruber.run :name do` or through ENV['SCRUBER_SCRAPER_NAME']") if @scraper_name.blank?
  @scraper_name = @scraper_name.to_sym
  @callbacks_options = {}
  @callbacks = {}
  @on_page_error_callback = nil
  @on_complete_callbacks = []

  Scruber.configuration.merge_options(@options)
  ActiveSupport::Dependencies.autoload_paths = Scruber.configuration.autoload_paths

  @queue = Scruber::Queue.new(scraper_name: @scraper_name)
  @fetcher = Scruber::Fetcher.new
  initialize_progressbar
  load_extenstions
end
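
The argument handling in the constructor can be tried in isolation. The following is a minimal sketch, not Scruber code: `split_crawler_args` is a hypothetical helper, it takes the arguments as an Array plus an explicit environment Hash, and it uses plain Ruby checks in place of ActiveSupport's `present?`/`blank?` and a plain `ArgumentError` in place of `Scruber::ArgumentError`.

```ruby
# Hypothetical helper (not part of Scruber) that mirrors how
# Crawler#initialize splits its arguments into a name and options.
def split_crawler_args(args, env = ENV)
  if args.first.is_a?(Hash)
    # Only options were passed; the name must come from the environment.
    name = nil
    options = args.first
  else
    name, options = args
    options ||= {}
  end

  # Fall back to the environment variable when no name was given.
  name = env['SCRUBER_SCRAPER_NAME'] if name.nil? || name.to_s.empty?
  raise ArgumentError, 'Scraper name is empty' if name.nil? || name.to_s.empty?

  [name.to_sym, options]
end

# The three call shapes from the documentation above.
p split_crawler_args([:sample, { fetcher_adapter: :custom }], {})
p split_crawler_args([:sample], {})
p split_crawler_args([{ fetcher_adapter: :custom }], { 'SCRUBER_SCRAPER_NAME' => 'from_env' })
```

Passing the environment as a parameter is only a convenience for experimenting; the real constructor reads `ENV` directly.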

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(method_sym, *arguments, &block) ⇒ Object

Method-missing hook. Scruber lets you register a Regexp together with a Proc; a call to an undefined method whose name matches the Regexp is handled by that Proc.

Parameters:

  • method_sym (Symbol)

    missing method name

  • arguments (Array)

    arguments

  • block (Proc)

    block (if passed)

Returns:

  • (Object)

    return value of the first registered callback whose pattern matches the method name


# File 'lib/scruber/core/crawler.rb', line 106

def method_missing(method_sym, *arguments, &block)
  Scruber::Core::Crawler._registered_method_missings.each do |(pattern, func)|
    if (scan_results = method_sym.to_s.scan(pattern)).present?
      return instance_exec(method_sym, scan_results, arguments+[block], &(func))
    end
  end
  super
end
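
The dispatch above can be illustrated with a small stand-alone class. This is a sketch under assumptions: `MiniCrawler`, its `registry` method, and the `get_*` pattern are hypothetical names chosen for the example; only the argument layout passed to the callback (method name, scan results, arguments plus block) follows the listing.

```ruby
# Toy stand-in for the crawler; class name, registry name, and the
# pattern registered below are hypothetical.
class MiniCrawler
  def self.registry
    @registry ||= {}
  end

  def self.register_method_missing(pattern, &block)
    registry[pattern] = block
  end

  def method_missing(method_sym, *arguments, &block)
    self.class.registry.each do |pattern, func|
      scan_results = method_sym.to_s.scan(pattern)
      next if scan_results.empty?

      # Same argument layout as the listing: name, scan results, args + [block].
      return instance_exec(method_sym, scan_results, arguments + [block], &func)
    end
    super
  end
end

# Handle any call named get_<something>.
MiniCrawler.register_method_missing(/\Aget_(\w+)\z/) do |_name, scan_results, args|
  page_type = scan_results.first.first # first capture group of the first match
  "queued #{args.first} as #{page_type}"
end

puts MiniCrawler.new.get_product('http://example.com/p/1')
```

A call that matches no registered pattern falls through to `super` and raises `NoMethodError` as usual.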

Instance Attribute Details

#fetcher ⇒ Object (readonly)

Returns the value of attribute fetcher.



# File 'lib/scruber/core/crawler.rb', line 19

def fetcher
  @fetcher
end

#queue ⇒ Object (readonly)

Returns the value of attribute queue.



# File 'lib/scruber/core/crawler.rb', line 19

def queue
  @queue
end

#scraper_name ⇒ Object (readonly)

Returns the value of attribute scraper_name.



# File 'lib/scruber/core/crawler.rb', line 19

def scraper_name
  @scraper_name
end

Class Method Details

._registered_method_missings ⇒ Hash

Registered method missing callbacks dictionary

Returns:

  • (Hash)

    callbacks



# File 'lib/scruber/core/crawler.rb', line 142

def _registered_method_missings
  @registered_method_missings ||= {}
end

.register_method_missing(pattern, &block) ⇒ void

This method returns an undefined value.

Register method missing callback

Parameters:

  • pattern (Regexp)

    Regexp to match missing name

  • block (Proc)

    Body to process missing method



# File 'lib/scruber/core/crawler.rb', line 134

def register_method_missing(pattern, &block)
  _registered_method_missings[pattern] = block
end

Instance Method Details

#on_complete(priority = 1, &block) ⇒ void

This method returns an undefined value.

Registers a callback that is executed once downloading and parsing are complete, for example to write results to a file or to close open files.

Examples:

Close file descriptors

on_complete -1 do
  Scruber::Core::Extensions::CsvOutput.close_all
end

Parameters:

  • priority (Integer) (defaults to: 1)

    priority of this callback; callbacks with a higher priority run first

  • block (Proc)

    body of callback



# File 'lib/scruber/core/crawler.rb', line 161

def on_complete(priority=1, &block)
  @on_complete_callbacks.push [priority,block]
end
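
The effect of `priority` follows from how `#run` sorts the registered callbacks (by negated priority). A minimal sketch of that ordering, using lambdas with made-up labels in place of real callbacks:

```ruby
# [priority, callback] pairs, stored the same way on_complete stores them.
callbacks = []
callbacks.push [1,  -> { 'write results' }]
callbacks.push [-1, -> { 'close files' }]
callbacks.push [5,  -> { 'flush buffers' }]

# Same ordering rule as in #run: highest priority first.
order = callbacks.sort_by { |priority, _| -priority }.map { |_, callback| callback.call }
p order
```

This is why the example above registers the file-closing callback with priority -1: it runs after callbacks that keep the default priority of 1.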

#on_page_error(&block) ⇒ void

This method returns an undefined value.

Registers a callback that is executed for error pages, such as responses with status 404 or 500.

Attention! To prevent an infinite loop, the callback must call one of these methods on the page: page.processed!, page.delete, or page.redownload!(0).

Examples:

Processing error page

on_page_error do |page|
  if page.response_body =~ /distil/
    page.redownload!(0)
  elsif page.response_code == 404
    get page.at('a.moved_to').attr('href')
    page.processed!
  else
    page.delete
  end
end

Parameters:

  • block (Proc)

    body of callback



# File 'lib/scruber/core/crawler.rb', line 185

def on_page_error(&block)
  @on_page_error_callback = block
end

#parser(page_type, options = {}, &block) ⇒ void

This method returns an undefined value.

Register parser

Parameters:

  • page_type (Symbol)

    type of page

  • options (Hash) (defaults to: {})

    options for parser

  • block (Proc)

    body of parser

Options Hash (options):

  • :format (Symbol)

    format of the page, for example :json or :html. Scruber automatically processes the page body according to this format.



# File 'lib/scruber/core/crawler.rb', line 93

def parser(page_type, options={}, &block)
  register_callback(page_type, options, &block)
end

#respond_to?(method_sym, include_private = false) ⇒ Boolean

Returns:

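  • (Boolean)

    true if the method name matches a registered method-missing pattern or if the object responds to the method in the usual way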


# File 'lib/scruber/core/crawler.rb', line 115

def respond_to?(method_sym, include_private = false)
  !Scruber::Core::Crawler._registered_method_missings.find do |(pattern, block)|
    if method_sym.to_s =~ pattern
      true
    else
      false
    end
  end.nil? || super(method_sym, include_private)
end
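
The listing overrides `respond_to?` directly. Ruby's own hook for objects that answer calls through `method_missing` is `respond_to_missing?`, which `respond_to?` consults and which also lets `Object#method` return a Method object for such names. The sketch below shows that alternative technique; it is not Scruber's implementation, and `PatternResponder` and its pattern list are hypothetical.

```ruby
# Hypothetical class showing respond_to_missing? paired with method_missing.
class PatternResponder
  PATTERNS = [/\Aget_\w+\z/].freeze # made-up pattern for the example

  def method_missing(name, *args)
    return "handled #{name}" if PATTERNS.any? { |pattern| name.to_s =~ pattern }

    super
  end

  # Consulted by respond_to? and by Object#method for undefined methods.
  def respond_to_missing?(name, include_private = false)
    PATTERNS.any? { |pattern| name.to_s =~ pattern } || super
  end
end

obj = PatternResponder.new
puts obj.respond_to?(:get_product)
puts obj.respond_to?(:unknown)
puts obj.method(:get_product).call
```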

#run(&block) ⇒ Object

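Crawling engine. Evaluates the given block in the context of the crawler, then downloads and processes queued pages until the queue has no work left, and finally runs the on_complete callbacks.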

Parameters:

  • block (Proc)

    crawler body



# File 'lib/scruber/core/crawler.rb', line 60

def run(&block)
  instance_eval &block
  while @queue.has_work? do
    @fetcher.run @queue
    show_progress
    while page = @queue.fetch_downloaded do
      if @callbacks[page.page_type.to_sym]
        processed_page = process_page(page, page.page_type.to_sym)
        instance_exec page, processed_page, &(@callbacks[page.page_type.to_sym])
        page.processed! unless page.sent_to_redownload?
      end
    end
    if @on_page_error_callback
      while page = @queue.fetch_error do
        instance_exec page, &(@on_page_error_callback)
      end
    end
  end
  @on_complete_callbacks.sort_by{|c| -c[0] }.map do |(_,callback)|
    instance_exec &(callback)
  end.first
end
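
The control flow of `#run` can be followed with in-memory stand-ins. This is a simplified sketch, not Scruber code: `ToyPage`, `ToyQueue`, and `ToyCrawler` are hypothetical, the "download" step only fills in a fake body, error pages and the progress bar are left out, and, unlike the listing, pages without a registered parser are also marked as processed so the toy loop always terminates.

```ruby
# Hypothetical in-memory stand-ins; none of these classes exist in Scruber.
ToyPage = Struct.new(:url, :page_type, :state, :body)

class ToyQueue
  def initialize
    @pages = []
  end

  def add(url, page_type)
    @pages << ToyPage.new(url, page_type, :pending, nil)
  end

  def has_work?
    @pages.any? { |page| %i[pending downloaded].include?(page.state) }
  end

  def pending
    @pages.select { |page| page.state == :pending }
  end

  def fetch_downloaded
    @pages.find { |page| page.state == :downloaded }
  end
end

class ToyCrawler
  attr_reader :log

  def initialize
    @queue = ToyQueue.new
    @callbacks = {}
    @on_complete_callbacks = []
    @log = []
  end

  def get(url, page_type: :seed)
    @queue.add(url, page_type)
  end

  def parser(page_type, &block)
    @callbacks[page_type] = block
  end

  def on_complete(priority = 1, &block)
    @on_complete_callbacks.push [priority, block]
  end

  def run(&block)
    instance_eval(&block) # the block registers parsers and seeds the queue
    while @queue.has_work?
      # Fake download step; a real fetcher would perform HTTP requests here.
      @queue.pending.each do |pending_page|
        pending_page.body = "<body of #{pending_page.url}>"
        pending_page.state = :downloaded
      end
      while (page = @queue.fetch_downloaded)
        callback = @callbacks[page.page_type]
        instance_exec(page, page.body, &callback) if callback
        page.state = :processed
      end
    end
    @on_complete_callbacks.sort_by { |priority, _| -priority }
                          .each { |_, callback| instance_exec(&callback) }
  end
end

crawler = ToyCrawler.new
crawler.run do
  get 'http://example.com'
  parser :seed do |page, _body|
    log << "parsed #{page.url}"
    get 'http://example.com/next', page_type: :detail
  end
  parser :detail do |page, _body|
    log << "detail #{page.url}"
  end
  on_complete { log << 'done' }
end
puts crawler.log
```

Because parsers run inside the loop, a parser that enqueues further pages keeps the outer loop going until every page has been processed.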