Class: ExtraLoop::ScraperBase

Inherits:
Object
Includes:
Hookable, Loggable, Utils::Support
Defined in:
lib/extraloop/loggable.rb,
lib/extraloop/scraper_base.rb

Overview

Monkey patches ScraperBase.

Direct Known Subclasses

IterativeScraper

Instance Attribute Summary

Instance Method Summary

Methods included from Utils::Support

#symbolize_keys

Methods included from Hookable

#run_hook, #set_hook

Constructor Details

#initialize(urls, options = {}, arguments = {}) ⇒ ScraperBase

Public: Initializes a web scraper.

urls      - One or several URLs.
options   - Hash of scraper options:
  async        : Whether the scraper should issue HTTP requests in parallel (true) or in series (false).
  log          : Logging options (set to false to suppress logging completely).
    appenders    : Specifies where log messages are appended (defaults to standard error).
    log_level    : Specifies the log level (defaults to info).
arguments - Hash of arguments to be passed to the Typhoeus HTTP client (optional).

Returns itself.



# File 'lib/extraloop/loggable.rb', line 59

def initialize(*args)
  base_initialize(*args)
  init_log!
  self
end

Instance Attribute Details

#options ⇒ Object (readonly)

Returns the value of attribute options.



# File 'lib/extraloop/scraper_base.rb', line 6

def options
  @options
end

#results ⇒ Object (readonly)

Returns the value of attribute results.



# File 'lib/extraloop/scraper_base.rb', line 6

def results
  @results
end

Instance Method Details

#base_initialize ⇒ Object



# File 'lib/extraloop/loggable.rb', line 50

alias_method :base_initialize, :initialize
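
The alias above is what lets loggable.rb wrap the original constructor: the old #initialize stays reachable as #base_initialize, and the redefined #initialize calls it before setting up logging. A minimal sketch of that pattern follows. DemoScraper and the body of init_log! are hypothetical stand-ins, not the real ExtraLoop code.

```ruby
# Hypothetical stand-in for ScraperBase; not the real ExtraLoop class.
class DemoScraper
  attr_reader :urls, :log_ready

  def initialize(urls)
    @urls = Array(urls)
  end
end

# Reopen the class, as loggable.rb does for ScraperBase.
class DemoScraper
  # Keep a handle on the original constructor before redefining it.
  alias_method :base_initialize, :initialize

  def initialize(*args)
    base_initialize(*args) # run the original constructor first
    init_log!              # then add the logging setup
    self
  end

  private

  # Placeholder for the real logger setup (assumption).
  def init_log!
    @log_ready = true
  end
end

scraper = DemoScraper.new("http://example.com")
p scraper.urls
p scraper.log_ready
```

The alias has to be created before #initialize is redefined; otherwise both names would point at the new method and the call would recurse.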

#extract(*args, &block) ⇒ Object

Public: Registers a new extractor to be added to the loop.

Delegates to Extractor and raises an exception unless a selector, a block, or an attribute name is provided.

selector  - The CSS3 selector identifying the node list to iterate over (optional).
callback  - A block of code (optional).
attribute - An attribute name (optional).

Returns itself.



# File 'lib/extraloop/scraper_base.rb', line 81

def extract(*args, &block)
  args << block if block
  @extractor_args << args
  self
end
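
Because #extract only records its arguments and returns self, calls can be chained, and nothing is extracted until the scraper runs. The sketch below reproduces just that bookkeeping. DemoExtractorQueue and the argument values passed to it are made up for illustration and are not part of ExtraLoop.

```ruby
# Hypothetical stand-in that mirrors only the argument bookkeeping of #extract.
class DemoExtractorQueue
  attr_reader :extractor_args

  def initialize
    @extractor_args = []
  end

  def extract(*args, &block)
    args << block if block  # the block travels as the last argument
    @extractor_args << args # one entry per registered extractor
    self                    # returning self enables chaining
  end
end

queue = DemoExtractorQueue.new
queue.extract(:title, "h1")
     .extract(:link, "a", :href)
     .extract(:length) { |node| node.to_s.size }

# Print each recorded argument list, showing blocks as :block.
queue.extractor_args.each do |args|
  p args.map { |arg| arg.is_a?(Proc) ? :block : arg }
end
```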

#loop_on(*args, &block) ⇒ Object

Public: Sets the scraper extraction loop.

Delegates to Extractor and raises an exception unless a selector, a block, or an attribute name is provided.

selector  - The CSS3 selector identifying the node list to iterate over (optional).
attribute - An attribute name (optional).
callback  - A block of code (optional).

Returns itself.



# File 'lib/extraloop/scraper_base.rb', line 62

def loop_on(*args, &block)
  args << block if block
  # prepend placeholder values for loop name and extraction environment
  @loop_extractor_args = args.insert(0, nil, nil)
  self
end
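
To make the placeholder step concrete, here is what the argument list looks like after args.insert(0, nil, nil). The selector string and the callback are illustrative values only.

```ruby
# Illustrative arguments, as #loop_on would receive them.
args = ["div.item"]
callback = proc { |nodes| nodes }
args << callback

# Array#insert mutates the receiver and returns it, so both
# names below refer to the same array object.
loop_extractor_args = args.insert(0, nil, nil)

p loop_extractor_args.first(3)
p loop_extractor_args.equal?(args)
```

Unlike #extract, which appends to a list, #loop_on assigns the instance variable, so a second call replaces the loop definition rather than adding another one.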

#run ⇒ Object

Public: Runs the main scraping loop.

Returns itself.



# File 'lib/extraloop/scraper_base.rb', line 92

def run
  @urls.each do |url|
    issue_request(url)

    # If the scraper is asynchronous, start processing the Hydra HTTP queue
    # only after the last URL has been appended to the queue (see #issue_request).
    if @options[:async]
      if url == @urls.last
        @hydra.run
      end
    else
      @hydra.run
    end
  end
  self
end
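
The effect of the async branch can be seen in a small simulation. FakeHydra and demo_run are hypothetical stand-ins. The sketch assumes that #issue_request only queues the request, as the comment in #run suggests, and it does not use the real Typhoeus API.

```ruby
# Hypothetical stand-in for the Hydra queue: it only records
# which URLs were pending each time the queue was run.
class FakeHydra
  attr_reader :batches

  def initialize
    @pending = []
    @batches = []
  end

  def queue(url)
    @pending << url
  end

  def run
    @batches << @pending.dup
    @pending.clear
  end
end

# Same control flow as #run, with issue_request reduced to queueing the URL.
def demo_run(urls, async)
  hydra = FakeHydra.new
  urls.each do |url|
    hydra.queue(url)
    if async
      hydra.run if url == urls.last
    else
      hydra.run
    end
  end
  hydra.batches
end

urls = %w[http://a.example http://b.example http://c.example]
p demo_run(urls, true)  # one batch holding all three URLs
p demo_run(urls, false) # three batches of one URL each
```

In this sketch the last-URL check compares by value, as the original does, so a URL list that repeats its final entry earlier on would start the queue early in async mode.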