Class: Upton::Scraper

Inherits:
Object
  • Object
Defined in:
lib/upton.rb,
lib/upton/scraper.rb

Overview

For basic use cases, Upton::Scraper can be used as-is by:

  1. specifying the pages to be scraped in `new`, either as an index page
     or as an Array of URLs.

  2. supplying a block to `scrape` or `scrape_to_csv`, or using a pre-built
     block from Upton::Utils.

For more complicated cases, subclass Upton::Scraper
(e.g. `MyScraper < Upton::Scraper`) and override various methods.

Constant Summary

EMPTY_STRING = ''

Constructor Details

#initialize(index_url_or_array, selector = "") ⇒ Scraper

index_url_or_array: A list of string URLs, OR the URL of the page containing the list of instances.

selector: The XPath expression or CSS selector that specifies the anchor elements within the page, if a URL is specified for the previous argument.

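These options are a shortcut. If you plan to override get_index, you do not need to set them. If the first argument is a list (anything that responds to each_with_index, such as an Array), it is treated as a list of URLs and the selector is ignored; otherwise it is treated as the index URL.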



# File 'lib/upton.rb', line 65

def initialize(index_url_or_array, selector="")

  #if first arg is a valid URL, do already-written stuff;
  #if it's not (or if it's a list?) don't bother with get_index, etc.
  #e.g. Scraper.new(["http://jeremybmerrill.com"])

  #TODO: rewrite this, because it's a little silly. (i.e. should be a more sensical division of how these arguments work)
  if index_url_or_array.respond_to? :each_with_index
    @url_array = index_url_or_array
  else
    @index_url = index_url_or_array
    @index_selector = selector
  end

  # If true, then Upton prints information about when it gets
  # files from the internet and when it gets them from its stash.
  @verbose = false

  # If true, then Upton fetches each instance page only once;
  # future requests for that file are answered with the locally stashed
  # version.
  # You may want to set @debug to false for production (but maybe not).
  # You can also control stashing behavior on a per-call basis with the
  # optional second argument to get_page, if, for instance, you want to
  # stash certain instance pages, e.g. based on their modification date.
  @debug = true
  # Index debug does the same, but for index pages.
  @index_debug = false

  # In order to not hammer servers, Upton waits for, by default, 30
  # seconds between requests to the remote server.
  @sleep_time_between_requests = 30 #seconds

  # If true, then Upton will attempt to scrape paginated index pages
  @paginated = false
  # Default query string parameter used to specify the current page
  @pagination_param = 'page'
  # Default number of paginated pages to scrape
  @pagination_max_pages = 2
  # Default starting number for pagination (second page is this plus 1).
  @pagination_start_index = 1
  # Default value to increment page number by
  @pagination_interval = 1
 
  # Folder name for stashes, if you want them to be stored somewhere else,
  # e.g. under /tmp.
  if @stash_folder
    # Dir.exists? was removed in Ruby 3.2; Dir.exist? works on all versions.
    FileUtils.mkdir_p(@stash_folder) unless Dir.exist?(@stash_folder)
  end
end
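
The argument handling at the top of the constructor can be sketched in isolation. This is a minimal illustration, not part of Upton: `classify_scraper_argument` is a hypothetical helper that only mirrors the `respond_to? :each_with_index` check above, and the URLs and selector are made up.

```ruby
# Hypothetical helper (not part of Upton) mirroring the constructor's check:
# anything list-like becomes the URL list, anything else is the index URL.
def classify_scraper_argument(index_url_or_array, selector = "")
  if index_url_or_array.respond_to?(:each_with_index)
    # List-like input: the selector is not used in this branch.
    { url_array: index_url_or_array }
  else
    { index_url: index_url_or_array, index_selector: selector }
  end
end

# A list of URLs, as in Scraper.new(["http://jeremybmerrill.com"])
p classify_scraper_argument(["http://jeremybmerrill.com"])

# An index page plus a selector for the links on it (both values assumed)
p classify_scraper_argument("http://example.com/articles", "a.headline")
```

Because a String does not respond to `each_with_index`, a single URL always takes the index-page branch, with or without a selector.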

Instance Attribute Details

#debug ⇒ Object

Returns the value of attribute debug.

# File 'lib/upton.rb', line 37

def debug
  @debug
end

#index_debug ⇒ Object

Returns the value of attribute index_debug.

# File 'lib/upton.rb', line 37

def index_debug
  @index_debug
end

#paginated ⇒ Object

Returns the value of attribute paginated.

# File 'lib/upton.rb', line 37

def paginated
  @paginated
end

#pagination_interval ⇒ Object

Returns the value of attribute pagination_interval.

# File 'lib/upton.rb', line 37

def pagination_interval
  @pagination_interval
end

#pagination_max_pages ⇒ Object

Returns the value of attribute pagination_max_pages.

# File 'lib/upton.rb', line 37

def pagination_max_pages
  @pagination_max_pages
end

#pagination_param ⇒ Object

Returns the value of attribute pagination_param.

# File 'lib/upton.rb', line 37

def pagination_param
  @pagination_param
end

#pagination_start_index ⇒ Object

Returns the value of attribute pagination_start_index.

# File 'lib/upton.rb', line 37

def pagination_start_index
  @pagination_start_index
end

#readable_filenames ⇒ Object

Returns the value of attribute readable_filenames.

# File 'lib/upton.rb', line 37

def readable_filenames
  @readable_filenames
end

#sleep_time_between_requests ⇒ Object

Returns the value of attribute sleep_time_between_requests.

# File 'lib/upton.rb', line 37

def sleep_time_between_requests
  @sleep_time_between_requests
end

#stash_folder ⇒ Object

Returns the value of attribute stash_folder.

# File 'lib/upton.rb', line 37

def stash_folder
  @stash_folder
end

#url_array ⇒ Object

Returns the value of attribute url_array.

# File 'lib/upton.rb', line 37

def url_array
  @url_array
end

#verbose ⇒ Object

Returns the value of attribute verbose.

# File 'lib/upton.rb', line 37

def verbose
  @verbose
end

Instance Method Details

#next_index_page_url(url, pagination_index) ⇒ Object

Return the next URL to scrape, given the current URL and its index.

Recursion stops if fetching the URL returns an empty string or an error.

If @paginated is not set (the default), no pagination is attempted. Note that the source shown below returns the given URL unchanged in this case, not an empty string.

If @paginated is set, this method returns the next pagination URL to scrape, built from @pagination_param and the pagination_index.

If the pagination_index is greater than @pagination_max_pages, the method returns an empty string.

Override this method to handle pagination in an alternative way, e.g. next_index_page_url("whatever.com/articles?page=1", 2) ought to return "whatever.com/articles?page=2".



# File 'lib/upton.rb', line 149

def next_index_page_url(url, pagination_index)
  return url unless @paginated

  if pagination_index > @pagination_max_pages
    puts "Exceeded pagination limit of #{@pagination_max_pages}" if @verbose
    EMPTY_STRING
  else
    uri = URI.parse(url)
    query = uri.query ? Hash[URI.decode_www_form(uri.query)] : {}
    # update the pagination query string parameter
    query[@pagination_param] = pagination_index
    uri.query = URI.encode_www_form(query)
    puts "Next index pagination url is #{uri}" if @verbose
    uri.to_s
  end
end
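
The query-string rewrite above can be tried on its own with Ruby's standard `uri` library. The following is a sketch under assumptions: `paged_index_url` and `index_urls` are hypothetical stand-alone functions, their keyword arguments stand in for @pagination_param, @pagination_max_pages, @pagination_start_index and @pagination_interval, and the loop in `index_urls` is only one plausible way a caller could walk the pages (this page does not show how Upton drives it internally).

```ruby
require 'uri'

# Hypothetical stand-alone version of the paginated branch above.
# Returns '' once the index passes max_pages, which signals "stop".
def paged_index_url(url, pagination_index, param: 'page', max_pages: 2)
  return '' if pagination_index > max_pages

  uri = URI.parse(url)
  query = uri.query ? URI.decode_www_form(uri.query).to_h : {}
  query[param] = pagination_index # overwrite or add the page parameter
  uri.query = URI.encode_www_form(query)
  uri.to_s
end

# Collect index URLs until paged_index_url returns an empty string.
def index_urls(start_url, start_index: 1, interval: 1, **opts)
  urls = []
  index = start_index
  loop do
    next_url = paged_index_url(start_url, index, **opts)
    break if next_url.empty?

    urls << next_url
    index += interval
  end
  urls
end

puts index_urls("http://whatever.com/articles", max_pages: 3)
```

Other query parameters already present on the URL are preserved, since only the pagination key is overwritten in the decoded hash.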

#next_instance_page_url(url, pagination_index) ⇒ Object

If instance pages are paginated, you must override this method to return the next URL, given the current URL and its index.

If instance pages aren’t paginated, there’s no need to override this.

Recursion stops if fetching the URL returns an empty string or an error.

e.g. next_instance_page_url("whatever.com/article/upton-sinclairs-the-jungle?page=1", 2) ought to return "whatever.com/article/upton-sinclairs-the-jungle?page=2".



# File 'lib/upton.rb', line 127

def next_instance_page_url(url, pagination_index)
  EMPTY_STRING
end
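
An override for paginated instance pages might look like the sketch below. To keep it runnable without the gem, it subclasses a stand-in `BaseScraper` (hypothetical) that copies the default shown above; in real use the subclass would inherit from Upton::Scraper instead. The class name `ArticleScraper`, the `page` parameter and the three-page limit are all assumptions for illustration.

```ruby
require 'uri'

# Stand-in for Upton::Scraper so the sketch runs on its own (hypothetical).
class BaseScraper
  EMPTY_STRING = ''

  def next_instance_page_url(url, pagination_index)
    EMPTY_STRING
  end
end

class ArticleScraper < BaseScraper
  MAX_INSTANCE_PAGES = 3 # assumed limit, not an Upton setting

  # Return the URL of page `pagination_index` of the same article,
  # or an empty string to stop following pages.
  def next_instance_page_url(url, pagination_index)
    return EMPTY_STRING if pagination_index > MAX_INSTANCE_PAGES

    uri = URI.parse(url)
    params = uri.query ? URI.decode_www_form(uri.query).to_h : {}
    params['page'] = pagination_index
    uri.query = URI.encode_www_form(params)
    uri.to_s
  end
end

scraper = ArticleScraper.new
puts scraper.next_instance_page_url(
  "http://whatever.com/article/upton-sinclairs-the-jungle?page=1", 2
)
```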

#scrape(&blk) ⇒ Object

This is the main user-facing method for a basic scraper. Call scrape with a block; the block is called with the text of each instance page and, optionally, its URL and its index in the list of instance URLs returned by get_index. If no block is given, the page text is returned unchanged.



# File 'lib/upton.rb', line 47

def scrape(&blk)
  self.url_array = self.get_index unless self.url_array
  blk = Proc.new{|x| x} if blk.nil?
  self.scrape_from_list(self.url_array, blk)
end
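
The block contract described above (page text first, then optionally URL and index) can be illustrated without any network access. This is a sketch only: `run_blocks` and the `PAGES` hash are hypothetical stand-ins for Upton's fetching and for scrape_from_list, whose implementation is not shown on this page.

```ruby
# Hypothetical stand-in for fetched instance pages: URL => page text.
PAGES = {
  "http://example.com/a" => "<h1>First</h1>",
  "http://example.com/b" => "<h1>Second</h1>"
}

# Calls the block with (text, url, index) for each page. Blocks are procs,
# so a block that declares only |text| simply ignores the extra arguments.
def run_blocks(pages, &blk)
  blk ||= proc { |text| text } # identity default, as in `scrape` above
  pages.each_with_index.map { |(url, text), index| blk.call(text, url, index) }
end

# One-argument block: pull the heading out of each page.
titles = run_blocks(PAGES) { |html| html[/<h1>(.*?)<\/h1>/, 1] }
p titles

# Three-argument block: also use the URL and the position in the list.
labeled = run_blocks(PAGES) { |html, url, i| "#{i}: #{url} (#{html.length} chars)" }
p labeled
```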

#scrape_to_csv(filename, &blk) ⇒ Object

Writes the scraped result to a CSV at the given filename.



# File 'lib/upton.rb', line 169

def scrape_to_csv filename, &blk
  require 'csv'
  self.url_array = self.get_index unless self.url_array
  CSV.open filename, 'wb' do |csv|
    #this is a conscious choice: each document is a list of things, either single elements or rows (as lists).
    self.scrape_from_list(self.url_array, blk).compact.each do |document|
      if document[0].respond_to? :map
        document.each{|row| csv << row }
      else
        csv << document
      end
    end
    #self.scrape_from_list(self.url_array, blk).compact.each{|document| csv << document }
  end
end
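
The "conscious choice" noted in the comment above means a block may return either one row (a flat list) or several rows (a list of lists) per page. The sketch below reproduces just that flattening step with the standard `csv` library and `CSV.generate`, so the result is a string instead of a file; `rows_to_csv` and the sample data are hypothetical.

```ruby
require 'csv'

# Hypothetical helper mirroring the row handling in scrape_to_csv:
# each document is either a single row or a list of rows.
def rows_to_csv(documents, col_sep: ',')
  CSV.generate(col_sep: col_sep) do |csv|
    documents.compact.each do |document| # nil results are skipped
      if document[0].respond_to?(:map)
        document.each { |row| csv << row } # list of rows
      else
        csv << document                    # single row
      end
    end
  end
end

documents = [
  ["The Jungle", 1906],                     # one row from one page
  [["Oil!", 1927], ["King Coal", 1917]],    # two rows from one page
  nil                                       # a page that yielded nothing
]

puts rows_to_csv(documents)
puts rows_to_csv(documents, col_sep: "\t") # tab-separated variant
```

Passing `col_sep: "\t"` is the only difference between the CSV and TSV forms, which matches how the two methods on this page differ.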

#scrape_to_tsv(filename, &blk) ⇒ Object

Writes the scraped result to a tab-separated (TSV) file at the given filename.



# File 'lib/upton.rb', line 185

def scrape_to_tsv filename, &blk
  require 'csv'
  self.url_array = self.get_index unless self.url_array
  CSV.open filename, 'wb', :col_sep => "\t" do |csv|
    #this is a conscious choice: each document is a list of things, either single elements or rows (as lists).
    self.scrape_from_list(self.url_array, blk).compact.each do |document|
      if document[0].respond_to? :map
        document.each{|row| csv << row }
      else
        csv << document
      end
    end
    #self.scrape_from_list(self.url_array, blk).compact.each{|document| csv << document }
  end
end