Class: Scrapes::Page

Inherits:
  Object

Includes:
  Hpricot::Extractors, RuleParser

Defined in:
  lib/scrapes/page.rb

Overview

The Page class is used as a base class for scraping data out of one web page. To use it, inherit from it and set up some rules. You can also use validators to ensure that the page was scraped correctly.

Setup

class MyPageScraper < Scrapes::Page
  rule :rule_name, blah
end

Scrapes::RuleParser explains the use of rules.

Auto Loading

Scrapes will automatically ‘require’ ruby files placed in a special ‘pages’ directory. The idea is to place one Scrapes::Page derived class per file in the pages directory, and have it required for you.
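The convention can be sketched as follows. The pages directory name comes from the text above; the glob-based loading is an assumption about what Scrapes does on your behalf, not the gem's actual implementation:

```ruby
# Roughly what Scrapes' auto-loading amounts to (sketch): require
# every Ruby file found in the 'pages' directory.
Dir.glob(File.join('pages', '*.rb')).sort.each do |file|
  require File.expand_path(file)
end
```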

Validations

There are a few class methods that you can use to validate the contents you scraped from a given web page.
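For example, a hypothetical page class might declare validations like this. The stubbed Scrapes::Page below is only a stand-in so the sketch is self-contained; with the real gem you would simply inherit from Scrapes::Page, and the rule name and selector are illustrative:

```ruby
# Stand-ins for the real Scrapes::Page class methods, so this
# sketch runs without the gem installed.
module Scrapes
  class Page
    def self.rule_1(*args); end
    def self.validates_presence_of(*attrs); end
    def self.validates_format_of(*attrs); end
  end
end

# A hypothetical product page (names are illustrative).
class ProductPage < Scrapes::Page
  rule_1(:price, '.price')

  validates_presence_of :price
  validates_format_of :price, :with => /\A\d+\.\d{2}\z/
end
```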

Constant Summary

XSLTPROC = 'xsltproc'

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from RuleParser

#extractor, #parse, #rule, #rule_1, #rules, #selector

Methods included from Hpricot::Extractors

content, contents, text, text_process, texts, word, words, #xml

Instance Attribute Details

#hpricot ⇒ Object

Access the Hpricot object that the selectors are passed to



# File 'lib/scrapes/page.rb', line 72

def hpricot
  @hpricot
end

#session ⇒ Object

Access the session object that was used to fetch this page’s data



# File 'lib/scrapes/page.rb', line 68

def session
  @session
end

#uri ⇒ Object

Access the URI where this page’s data came from



# File 'lib/scrapes/page.rb', line 64

def uri
  @uri
end

Class Method Details

.acts_as_array(method_to_call) ⇒ Object

Make Page.extract return an array by calling the given method. This can be very useful when your class does nothing more than collect a set of links for some other page to process. It causes Session#page to call the given block once for each object returned from method_to_call.



# File 'lib/scrapes/page.rb', line 119

def self.acts_as_array (method_to_call)
  meta_eval { @as_array = method_to_call }
end
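The effect can be sketched without the gem: extract ends up returning whatever the named method returns, and a block given to Session#page is then called once per element. All names below are illustrative, not Scrapes API:

```ruby
# A page class whose only job is to collect links; acts_as_array(:links)
# would make extract return this array instead of the page object.
class LinkListPage
  def links
    ['http://example.com/a', 'http://example.com/b']
  end
end

collected = []
result = LinkListPage.new.links           # what extract would return
result.each { |link| collected << link }  # the block runs once per element
```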

.extract(data, uri, session, &block) ⇒ Object

Called by the crawler to process a web page



# File 'lib/scrapes/page.rb', line 203

def self.extract (data, uri, session, &block)
  obj = process_page(data, uri, session)

  if meta_eval {@paginated}
    if obj.respond_to?(:next_page)
      sister = obj

      while sister_uri = sister.next_page
        sister = extract_sister(session, obj, sister_uri)
      end
    elsif obj.respond_to?(:link_for_page)
      (2 .. obj.pages).each do |page|
        sister_uri = obj.link_for_page(page)
        extract_sister(session, obj, sister_uri)
      end
    end
  end

  as_array = meta_eval {@as_array}
  obj = obj.send(as_array) if as_array

  return obj unless block
  obj.respond_to?(:each) ? obj.each {|o| yield(o)} : yield(obj)
end

.paginatedObject

If the page that you are parsing is paginated (one page among many containing similar data) you can use this class method to automatically fetch all pages. In order for this to work, you need to provide a few special methods:

Next Page

If you know the URL to the next page, provide an instance method called next_page. It should return the URL for the next page, or nil when the current page is the last page.

class NextPageExample < Scrapes::Page
  rule(:next_page, 'a[href~=next]', '@href', 1)
end

Link for Page

Alternatively, you can provide an instance method link_for_page and another one called pages. The pages method should return the number of pages in this paginated set. The link_for_page method should take a page number, and return a URL to fetch that page.

class LinkForPageExample < Scrapes::Page
  rule_1(:pages) {|e| m = e.text.match(/Page\s+\d+\s+of\s+(\d+)/) and m[1].to_i}

  def link_for_page (page)
    uri.sub(/page=\d+/, "page=#{page}")
  end
end
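Independent of Scrapes, the URL rewriting used above is plain String#sub; assuming a hypothetical page=N query parameter:

```ruby
uri  = 'http://example.com/list?page=1'  # hypothetical paginated URL
link = uri.sub(/page=\d+/, 'page=3')     # => 'http://example.com/list?page=3'
```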

Append to Page

Finally, you must provide an append_page method. It takes an instance of your Scrapes::Page derived class as an argument. Its job is to add the data found on that page to this instance's variables. This is necessary because when you use paginated, extract returns only one instance of your class.
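A minimal sketch of the merging an append_page implementation typically does; the items accessor and the class itself are assumptions for illustration, not part of the Scrapes API:

```ruby
class ResultsPage
  attr_accessor :items

  def initialize(items)
    @items = items
  end

  # Merge the data scraped from a sister (later) page into this
  # first-page instance, which is the one extract returns.
  def append_page(sister)
    @items.concat(sister.items)
  end
end

first = ResultsPage.new([:a, :b])
first.append_page(ResultsPage.new([:c]))
first.items  # => [:a, :b, :c]
```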



# File 'lib/scrapes/page.rb', line 110

def self.paginated
  meta_eval { @paginated = true }
end

.to(other_class) ⇒ Object

If acts_as_array returns links, send them to another class for processing



# File 'lib/scrapes/page.rb', line 197

def self.to (other_class)
  ToProxy.new(self, other_class)
end

.validates_format_of(*attrs) ⇒ Object

Ensure that the given attributes have the correct format



# File 'lib/scrapes/page.rb', line 155

def self.validates_format_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message  => 'did not match regular expression',
    :with     => /.*/,
  })

  validates_from(attrs, options, lambda {|a| a.to_s.match(options[:with])})
end

.validates_inclusion_of(*attrs) ⇒ Object

Ensure that the given attributes have values in the given list



# File 'lib/scrapes/page.rb', line 166

def self.validates_inclusion_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message  => 'is not in the list of accepted values',
    :in       => [],
  })

  validates_from(attrs, options, lambda {|a| options[:in].include?(a)})
end

.validates_not_blank(*attrs) ⇒ Object

Ensure that the given attributes are not #blank?



# File 'lib/scrapes/page.rb', line 145

def self.validates_not_blank (*attrs)
  attrs, options = attrs_options(attrs, {
    :message => 'rule never matched',
  })

  validates_from(attrs, options, lambda {|a| !a.blank?})
end

.validates_numericality_of(*attrs) ⇒ Object

Ensure that the given attributes are numeric



# File 'lib/scrapes/page.rb', line 177

def self.validates_numericality_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message  => 'is not a number',
  })

  closure = lambda do |a|
    begin
      Kernel.Float(a.to_s)
    rescue ArgumentError, TypeError
      false
    else
      true
    end
  end

  validates_from(attrs, options, closure)
end
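The Kernel.Float check above accepts anything Ruby can parse as a float and rejects the rest; extracted as a standalone method, it behaves like this:

```ruby
# Same check as the closure above, extracted for illustration.
def numeric?(value)
  Kernel.Float(value.to_s)
  true
rescue ArgumentError, TypeError
  false
end

numeric?('19.95')  # => true
numeric?('1e3')    # => true  (scientific notation parses)
numeric?('abc')    # => false
numeric?(nil)      # => false (nil.to_s is '', which fails to parse)
```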

.validates_presence_of(*attrs) ⇒ Object

Ensure that the given attributes have been set by matching rules



# File 'lib/scrapes/page.rb', line 135

def self.validates_presence_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message => 'rule never matched',
  })

  validates_from(attrs, options, lambda {|a| !a.nil?})
end

.with_xslt(filename) ⇒ Object

Preprocess the HTML by sending it through an XSLT stylesheet. The stylesheet should return a document that can then be processed using your rules. Using this feature requires that you have the xsltproc utility in your PATH. You can get xsltproc from libxslt: xmlsoft.org/XSLT/



# File 'lib/scrapes/page.rb', line 128

def self.with_xslt (filename)
  raise "#{XSLTPROC} could not be found" unless `#{XSLTPROC} --version 2>&1`.match(/libxslt/)
  meta_eval { @with_xslt = filename }
end
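You can check for xsltproc ahead of time the same way with_xslt does; the rescue clause is a defensive addition of ours, not part of the original check:

```ruby
# True when xsltproc is on the PATH and reports a libxslt version.
available =
  begin
    !!`xsltproc --version 2>&1`.match(/libxslt/)
  rescue StandardError
    false
  end
```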

Instance Method Details

#after_parseObject

A hook that runs after parsing, but before validation



# File 'lib/scrapes/page.rb', line 230

def after_parse
end

#validateObject

Called by the extract method to validate scraped data. If you override this method, you should call super. This method will probably be changed in the future so that you don’t have to call super.



# File 'lib/scrapes/page.rb', line 237

def validate
  validations = self.class.meta_eval { @validations }

  validations.each do |v|
    raise "#{self.class}.#{v[:name]} #{v[:options][:message]}" unless
      v[:proc].call(send(v[:name]))
  end

  self
end