Class: ExtraLoop::IterativeScraper

Inherits:
ScraperBase
Defined in:
lib/extraloop/iterative_scraper.rb

Defined Under Namespace

Modules: Exceptions

Instance Attribute Summary

Attributes inherited from ScraperBase

#options, #results

Instance Method Summary

Methods inherited from ScraperBase

#base_initialize, #extract, #loop_on

Methods included from Utils::Support

#symbolize_keys

Methods included from Hookable

#run_hook, #set_hook

Constructor Details

#initialize(urls, options = {}, arguments = {}) ⇒ IterativeScraper

Public

Initializes an iterative scraper (i.e. a scraper which can extract data from a list of several web pages).

urls - A single URL, or an array of several URLs. options - A hash of scraper options (optional).

async : Whether the scraper should issue HTTP requests asynchronously rather than synchronously (defaults to false).
log   : Logging options (set to false to suppress logging entirely).
hydra : A list of arguments to be passed in when initializing the HTTP queue (see Typhoeus::Hydra).

arguments - Hash of arguments to be passed to the Typhoeus HTTP client (optional).

Examples:

# Iterates over the first 10 pages of Google News search results for the query 'Egypt'.

IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
  :appenders => ['example.log', :stderr],
  :log_level => :debug
}).set_iteration(:start, (1..101).step(10))

# Iterates over the first 10 pages of Google News search results for the query 'Egypt' first,
# and then for the query 'Syria', issuing HTTP requests asynchronously and skipping SSL
# certificate verification.

IterativeScraper.new([
  "https://www.google.com/search?tbm=nws&q=Egypt",
  "https://www.google.com/search?tbm=nws&q=Syria"
], { :async => true }, { :disable_ssl_peer_verification => true }
).set_iteration(:start, (1..101).step(10))

Returns itself.



# File 'lib/extraloop/iterative_scraper.rb', line 43

def initialize(urls, options = {}, arguments = {})
  super([], options, arguments)

  @base_urls = Array(urls)
  @iteration_set = []
  @iteration_extractor = nil
  @iteration_extractor_args = nil
  @iteration_count = 0
  @iteration_param = nil
  @iteration_param_value = nil
  @continue_clause_args = nil
  self
end

Instance Method Details

#continue_with(param, *extractor_args, &block) ⇒ Object

Public

Builds an extractor and uses it to set the value of the next iteration’s offset parameter. If the extractor returns nil, the iteration stops.

param - A symbol identifying the iteration parameter name. extractor_args - Arguments to be passed to the extractor used to evaluate the continuation value.

Returns itself.



# File 'lib/extraloop/iterative_scraper.rb', line 113

def continue_with(param, *extractor_args, &block)
  extractor_args << block if block
  raise Exceptions::NonGetAsyncRequestNotYetImplemented.new "the #continue_with method currently requires the 'async' option to be set to false" if @options[:async]

  @continue_clause_args = extractor_args
  set_iteration_param(param)
  self
end

#run ⇒ Object



# File 'lib/extraloop/iterative_scraper.rb', line 122

def run
  @base_urls.each do |base_url|

    # run an extra iteration to determine the value of the next offset parameter (if #continue_with is used)
    # or the entire iteration set (if #set_iteration is used).
    (run_iteration(base_url); @iteration_count += 1 ) if @iteration_extractor_args || @continue_clause_args

    while @iteration_set.at(@iteration_count)
      method = @options[:async] ? :run_iteration_async : :run_iteration
      send(method, base_url)
      @iteration_count += 1
    end

    #reset all counts
    @queued_count = 0
    @response_count = 0
    @iteration_count = 0
  end
  self
end
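The control flow of #run amounts to a nested loop over the base URLs and the iteration set. A simplified model of that flow (the URLs and the :start parameter name here are made up for illustration):

```ruby
# Simplified model of #run's control flow: one request per
# (base_url, iteration value) pair, in order.
base_urls = ["http://site-a.example/search", "http://site-b.example/search"]
iteration_set = (1..21).step(10).map(&:to_s)  # => ["1", "11", "21"]

requests = base_urls.flat_map do |url|
  iteration_set.map { |value| "#{url}?start=#{value}" }
end

requests.size  # => 6
```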

#set_iteration(param, *args, &block) ⇒ Object

Public

Specifies the collection of values over which the scraper should iterate. At each iteration, the current value in the iteration set will be included as part of the request parameters.

param - The name of the iteration parameter. args - Either an array of values, or a set of arguments used to initialize an Extractor object.

Examples:

# Explicitly specify the iteration set (can be either a range or an array).

 IterativeScraper.new("http://my-site.com/events").
   set_iteration(:p, 1..10)

# Pass in a code block to dynamically extract the iteration set from the document.
# The code block is used to generate an Extractor that runs on the first
# iteration. The iteration will not continue unless the proc returns a
# non-empty set of values.

fetch_page_numbers = proc { |elements|
  elements.map { |a|
    a.attr(:href).match(/p=(\d+)/)
    $1
  }.reject { |p| p == "1" }
}

IterativeScraper.new("http://my-site.com/events").
  set_iteration(:p, "div#pagination a", fetch_page_numbers)

Returns itself.



# File 'lib/extraloop/iterative_scraper.rb', line 92

def set_iteration(param, *args, &block)
  args << block if block
  if args.first.respond_to?(:map)
    @iteration_set = Array(args.first).map &:to_s
  else
    @iteration_extractor_args = [:pagination, *args]
  end
  set_iteration_param(param)
  self
end
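The first branch above normalizes an explicitly supplied iteration set: anything that responds to #map (a Range or an Array) is converted to an array of string values. A minimal reproduction of that normalization step:

```ruby
# Mirrors the `Array(args.first).map(&:to_s)` call in #set_iteration:
# ranges and arrays both come out as arrays of string values.
def normalize_iteration_set(values)
  Array(values).map(&:to_s)
end

normalize_iteration_set(1..3)     # => ["1", "2", "3"]
normalize_iteration_set([5, 10])  # => ["5", "10"]
```

Storing the values as strings means they can be interpolated directly into request parameters regardless of whether the caller supplied integers or strings.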