Class: Scrapes::Page

Inherits:
  Object

Includes:
  Hpricot::Extractors, RuleParser

Defined in:
  lib/scrapes/page.rb

Overview

The Page class is used as a base class for scraping data out of one web page. To use it, inherit from it and set up some rules. You can also use validators to ensure that the page was scraped correctly.

Setup

class MyPageScraper < Scrapes::Page
  rule :rule_name, blah
end

Scrapes::RuleParser explains the use of rules.

Auto Loading

Scrapes will automatically ‘require’ ruby files placed in a special ‘pages’ directory. The idea is to place one Scrapes::Page derived class per file in the pages directory, and have it required for you.
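The convention can be sketched as follows. The pages directory name comes from the text above; the glob-based loading is an assumption about what Scrapes does on your behalf, not the gem's actual implementation:

```ruby
# Roughly what Scrapes' auto-loading amounts to (sketch): require
# every Ruby file found in the 'pages' directory.
Dir.glob(File.join('pages', '*.rb')).sort.each do |file|
  require File.expand_path(file)
end
```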

Validations

There are a few class methods that you can use to validate the contents you scraped from a given web page.
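For example, a hypothetical page class might declare validations like this. The stubbed Scrapes::Page below is only a stand-in so the sketch is self-contained; with the real gem you would simply inherit from Scrapes::Page, and the rule name and selector are illustrative:

```ruby
# Stand-ins for the real Scrapes::Page class methods, so this
# sketch runs without the gem installed.
module Scrapes
  class Page
    def self.rule_1(*args); end
    def self.validates_presence_of(*attrs); end
    def self.validates_format_of(*attrs); end
  end
end

# A hypothetical product page (names are illustrative).
class ProductPage < Scrapes::Page
  rule_1(:price, '.price')

  validates_presence_of :price
  validates_format_of :price, :with => /\A\d+\.\d{2}\z/
end
```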

Constant Summary

XSLTPROC = 'xsltproc'

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from RuleParser

#extractor, #parse, #rule, #rule_1, #rules, #selector

Methods included from Hpricot::Extractors

content, contents, text, text_process, texts, word, words, #xml

Instance Attribute Details

#hpricot ⇒ Object

Access the Hpricot object that the selectors are passed to



# File 'lib/scrapes/page.rb', line 72

def hpricot
  @hpricot
end

#session ⇒ Object

Access the session object that was used to fetch this page’s data



# File 'lib/scrapes/page.rb', line 68

def session
  @session
end

#uri ⇒ Object

Access the URI where this page’s data came from



# File 'lib/scrapes/page.rb', line 64

def uri
  @uri
end

Class Method Details

.acts_as_array(method_to_call) ⇒ Object

Make Page.extract return an array by calling the given method. This can be very useful when your class does nothing more than collect a set of links for some other page to process. It causes Session#page to call the given block once for each object returned from method_to_call.



# File 'lib/scrapes/page.rb', line 119

def self.acts_as_array (method_to_call)
  meta_eval { @as_array = method_to_call }
end
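The effect can be sketched without the gem: extract ends up returning whatever the named method returns, and a block given to Session#page is then called once per element. All names below are illustrative, not Scrapes API:

```ruby
# A page class whose only job is to collect links; acts_as_array(:links)
# would make extract return this array instead of the page object.
class LinkListPage
  def links
    ['http://example.com/a', 'http://example.com/b']
  end
end

collected = []
result = LinkListPage.new.links           # what extract would return
result.each { |link| collected << link }  # the block runs once per element
```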

.extract(data, uri, session, &block) ⇒ Object

Called by the crawler to process a web page



# File 'lib/scrapes/page.rb', line 203

def self.extract (data, uri, session, &block)
  obj = process_page(data, uri, session)

  if meta_eval {@paginated}
    if obj.respond_to?(:next_page)
      sister = obj

      while sister_uri = sister.next_page
        sister = extract_sister(session, obj, sister_uri)
      end
    elsif obj.respond_to?(:link_for_page)
      (2 .. obj.pages).each do |page|
        sister_uri = obj.link_for_page(page)
        extract_sister(session, obj, sister_uri)
      end
    end
  end

  as_array = meta_eval {@as_array}
  obj = obj.send(as_array) if as_array

  return obj unless block
  obj.respond_to?(:each) ? obj.each {|o| yield(o)} : yield(obj)
end

.paginatedObject

If the page that you are parsing is paginated (one page among many containing similar data) you can use this class method to automatically fetch all pages. In order for this to work, you need to provide a few special methods:

Next Page

If you know the URL to the next page, provide an instance method called next_page. It should return the URL for the next page, or nil when the current page is the last page.

class NextPageExample < Scrapes::Page
  rule(:next_page, 'a[href~=next]', '@href', 1)
end

Link for Page

Alternatively, you can provide an instance method link_for_page and another one called pages. The pages method should return the number of pages in this paginated set. The link_for_page method should take a page number, and return a URL to fetch that page.

class LinkForPageExample < Scrapes::Page
  rule_1(:pages) {|e| m = e.text.match(/Page\s+\d+\s+of\s+(\d+)/) and m[1].to_i}

  def link_for_page (page)
    uri.sub(/page=\d+/, "page=#{page}")
  end
end
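Independent of Scrapes, the URL rewriting used above is plain String#sub; assuming a hypothetical page=N query parameter:

```ruby
uri  = 'http://example.com/list?page=1'  # hypothetical paginated URL
link = uri.sub(/page=\d+/, 'page=3')     # => 'http://example.com/list?page=3'
```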

Append to Page

Finally, you must provide an append_page method. It takes an instance of your Scrapes::Page derived class as an argument. Its job is to add the data found on that page to this instance's variables. This is necessary because when you use paginated, extract returns only one instance of your class.
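A minimal sketch of the merging an append_page implementation typically does; the items accessor and the class itself are assumptions for illustration, not part of the Scrapes API:

```ruby
class ResultsPage
  attr_accessor :items

  def initialize(items)
    @items = items
  end

  # Merge the data scraped from a sister (later) page into this
  # first-page instance, which is the one extract returns.
  def append_page(sister)
    @items.concat(sister.items)
  end
end

first = ResultsPage.new([:a, :b])
first.append_page(ResultsPage.new([:c]))
first.items  # => [:a, :b, :c]
```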



# File 'lib/scrapes/page.rb', line 110

def self.paginated
  meta_eval { @paginated = true }
end

.to(other_class) ⇒ Object

If acts_as_array returns links, send them to another class for processing



# File 'lib/scrapes/page.rb', line 197

def self.to (other_class)
  ToProxy.new(self, other_class)
end

.validates_format_of(*attrs) ⇒ Object

Ensure that the given attributes have the correct format



# File 'lib/scrapes/page.rb', line 155

def self.validates_format_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message  => 'did not match regular expression',
    :with     => /.*/,
  })

  validates_from(attrs, options, lambda {|a| a.to_s.match(options[:with])})
end

.validates_inclusion_of(*attrs) ⇒ Object

Ensure that the given attributes have values in the given list



# File 'lib/scrapes/page.rb', line 166

def self.validates_inclusion_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message  => 'is not in the list of accepted values',
    :in       => [],
  })

  validates_from(attrs, options, lambda {|a| options[:in].include?(a)})
end

.validates_not_blank(*attrs) ⇒ Object

Ensure that the given attributes are not #blank?



# File 'lib/scrapes/page.rb', line 145

def self.validates_not_blank (*attrs)
  attrs, options = attrs_options(attrs, {
    :message => 'rule never matched',
  })

  validates_from(attrs, options, lambda {|a| !a.blank?})
end

.validates_numericality_of(*attrs) ⇒ Object

Ensure that the given attributes are numeric



# File 'lib/scrapes/page.rb', line 177

def self.validates_numericality_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message  => 'is not a number',
  })

  closure = lambda do |a|
    begin
      Kernel.Float(a.to_s)
    rescue ArgumentError, TypeError
      false
    else
      true
    end
  end

  validates_from(attrs, options, closure)
end
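The Kernel.Float check above accepts anything Ruby can parse as a float and rejects the rest; extracted as a standalone method, it behaves like this:

```ruby
# Same check as the closure above, extracted for illustration.
def numeric?(value)
  Kernel.Float(value.to_s)
  true
rescue ArgumentError, TypeError
  false
end

numeric?('19.95')  # => true
numeric?('1e3')    # => true  (scientific notation parses)
numeric?('abc')    # => false
numeric?(nil)      # => false (nil.to_s is '', which fails to parse)
```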

.validates_presence_of(*attrs) ⇒ Object

Ensure that the given attributes have been set by matching rules



# File 'lib/scrapes/page.rb', line 135

def self.validates_presence_of (*attrs)
  attrs, options = attrs_options(attrs, {
    :message => 'rule never matched',
  })

  validates_from(attrs, options, lambda {|a| !a.nil?})
end

.with_xslt(filename) ⇒ Object

Preprocess the HTML by sending it through an XSLT stylesheet. The stylesheet should return a document that can then be processed using your rules. Using this feature requires that you have the xsltproc utility in your PATH. You can get xsltproc from libxslt: xmlsoft.org/XSLT/



# File 'lib/scrapes/page.rb', line 128

def self.with_xslt (filename)
  raise "#{XSLTPROC} could not be found" unless `#{XSLTPROC} --version 2>&1`.match(/libxslt/)
  meta_eval { @with_xslt = filename }
end
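You can check for xsltproc ahead of time the same way with_xslt does; the rescue clause is a defensive addition of ours, not part of the original check:

```ruby
# True when xsltproc is on the PATH and reports a libxslt version.
available =
  begin
    !!`xsltproc --version 2>&1`.match(/libxslt/)
  rescue StandardError
    false
  end
```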

Instance Method Details

#after_parseObject

A hook that runs after parsing, but before validation



# File 'lib/scrapes/page.rb', line 230

def after_parse
end

#validateObject

Called by the extract method to validate scraped data. If you override this method, you should call super. This method will probably be changed in the future so that you don’t have to call super.



# File 'lib/scrapes/page.rb', line 237

def validate
  validations = self.class.meta_eval { @validations }

  validations.each do |v|
    raise "#{self.class}.#{v[:name]} #{v[:options][:message]}" unless
      v[:proc].call(send(v[:name]))
  end

  self
end