Class: ExtraLoop::IterativeScraper

Inherits:
ScraperBase
Defined in:
lib/extraloop/iterative_scraper.rb

Defined Under Namespace

Modules: Exceptions

Instance Attribute Summary

Attributes inherited from ScraperBase

#options, #results

Instance Method Summary

Methods inherited from ScraperBase

#base_initialize, #extract, #loop_on

Methods included from Utils::Support

#symbolize_keys

Methods included from Hookable

#run_hook, #set_hook

Constructor Details

#initialize(urls, options = {}, arguments = {}) ⇒ IterativeScraper

Public

Initializes an iterative scraper (i.e. a scraper which can extract data from a list of several web pages).

urls - A single URL, or an array of several URLs. options - A hash of scraper options (optional).

async : Whether the scraper should issue HTTP requests asynchronously rather than synchronously (defaults to false).
log   : Logging options (set to false to suppress logging entirely).
hydra : A list of arguments to be passed in when initializing the HTTP queue (see Typhoeus::Hydra).

arguments - Hash of arguments to be passed to the Typhoeus HTTP client (optional).

Examples:

# Iterates over the first 10 pages of Google News search results for the query 'Egypt'.

IterativeScraper.new("https://www.google.com/search?tbm=nws&q=Egypt", :log => {
  :appenders => ['example.log', :stderr],
  :log_level => :debug
}).set_iteration(:start, (1..101).step(10))

# Iterates over the first 10 pages of Google News search results for the query 'Egypt' first,
# and then for the query 'Syria', issuing HTTP requests asynchronously and skipping SSL
# certificate verification.

IterativeScraper.new([
  "https://www.google.com/search?tbm=nws&q=Egypt",
  "https://www.google.com/search?tbm=nws&q=Syria"
], { :async => true }, { :disable_ssl_peer_verification => true }
).set_iteration(:start, (1..101).step(10))

Returns itself.



# File 'lib/extraloop/iterative_scraper.rb', line 43

def initialize(urls, options = {}, arguments = {})
  super([], options, arguments)

  @base_urls = Array(urls)
  @iteration_set = []
  @iteration_extractor = nil
  @iteration_extractor_args = nil
  @iteration_count = 0
  @iteration_param = nil
  @iteration_param_value = nil
  @continue_clause_args = nil
  self
end

Instance Method Details

#continue_with(param, *extractor_args, &block) ⇒ Object

Public

Builds an extractor and uses it to set the value of the next iteration’s offset parameter. If the extractor returns nil, the iteration stops.

param - A symbol identifying the iteration parameter name. extractor_args - Arguments to be passed to the extractor used to evaluate the continuation value.

Returns itself.



# File 'lib/extraloop/iterative_scraper.rb', line 113

def continue_with(param, *extractor_args, &block)
  extractor_args << block if block
  raise Exceptions::NonGetAsyncRequestNotYetImplemented.new "the #continue_with method currently requires the 'async' option to be set to false" if @options[:async]

  @continue_clause_args = extractor_args
  set_iteration_param(param)
  self
end

#run ⇒ Object



# File 'lib/extraloop/iterative_scraper.rb', line 122

def run
  @base_urls.each do |base_url|

    # run an extra iteration to determine the value of the next offset parameter (if #continue_with is used)
    # or the entire iteration set (if #set_iteration is used).
    (run_iteration(base_url); @iteration_count += 1 ) if @iteration_extractor_args || @continue_clause_args

    while @iteration_set.at(@iteration_count)
      method = @options[:async] ? :run_iteration_async : :run_iteration
      send(method, base_url)
      @iteration_count += 1
    end

    #reset all counts
    @queued_count = 0
    @response_count = 0
    @iteration_count = 0
  end
  self
end
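The control flow of #run amounts to a nested loop over the base URLs and the iteration set. A simplified model of that flow (the URLs and the :start parameter name here are made up for illustration):

```ruby
# Simplified model of #run's control flow: one request per
# (base_url, iteration value) pair, in order.
base_urls = ["http://site-a.example/search", "http://site-b.example/search"]
iteration_set = (1..21).step(10).map(&:to_s)  # => ["1", "11", "21"]

requests = base_urls.flat_map do |url|
  iteration_set.map { |value| "#{url}?start=#{value}" }
end

requests.size  # => 6
```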

#set_iteration(param, *args, &block) ⇒ Object

Public

Specifies the collection of values over which the scraper should iterate. At each iteration, the current value in the iteration set will be included as part of the request parameters.

param - The name of the iteration parameter. args - Either an array of values, or a set of arguments used to initialize an Extractor object.

Examples:

# Explicitly specify the iteration set (can be either a range or an array).

 IterativeScraper.new("http://my-site.com/events").
   set_iteration(:p, 1..10)

# Pass in a code block to dynamically extract the iteration set from the document.
# The code block is used to generate an Extractor that runs on the first
# iteration. The iteration will not continue unless the proc returns a
# non-empty set of values.

fetch_page_numbers = proc { |elements|
  elements.map { |a|
    a.attr(:href).match(/p=(\d+)/)
    $1
  }.reject { |p| p == "1" }
}

IterativeScraper.new("http://my-site.com/events").
  set_iteration(:p, "div#pagination a", fetch_page_numbers)

Returns itself.



# File 'lib/extraloop/iterative_scraper.rb', line 92

def set_iteration(param, *args, &block)
  args << block if block
  if args.first.respond_to?(:map)
    @iteration_set = Array(args.first).map &:to_s
  else
    @iteration_extractor_args = [:pagination, *args]
  end
  set_iteration_param(param)
  self
end
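The first branch above normalizes an explicitly supplied iteration set: anything that responds to #map (a Range or an Array) is converted to an array of string values. A minimal reproduction of that normalization step:

```ruby
# Mirrors the `Array(args.first).map(&:to_s)` call in #set_iteration:
# ranges and arrays both come out as arrays of string values.
def normalize_iteration_set(values)
  Array(values).map(&:to_s)
end

normalize_iteration_set(1..3)     # => ["1", "2", "3"]
normalize_iteration_set([5, 10])  # => ["5", "10"]
```

Storing the values as strings means they can be interpolated directly into request parameters regardless of whether the caller supplied integers or strings.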