Class: Scraper::Base

Inherits:

Object

Defined in: lib/scraper/base.rb

Direct Known Subclasses

Microformats::HAtom, Microformats::HAtom::Entry, Microformats::HAtom::Feed, Microformats::HCard

Defined Under Namespace

Classes: PageInfo

Constant Summary

READER_OPTIONS = [:last_modified, :etag, :redirect_limit, :user_agent, :timeout]

Instance Attribute Summary

#extracted ⇒ Object

Set to true when the first extractor returns true.
#options ⇒ Object

Returns the options for this object.
#page_info ⇒ Object

Information about the HTML page scraped.

Class Method Summary

.array(*symbols) ⇒ Object

Declares which accessors are arrays.
.element(element) ⇒ Object

Returns the element itself.
.extractor(map) ⇒ Object

Creates an extractor that will extract values from the selected element and place them in instance variables of the scraper.
.options ⇒ Object

Returns the options for this class.
.parser(name = :tidy) ⇒ Object

Specifies which parser to use.
.parser_options(options) ⇒ Object

Options to pass to the parser.
.process(*selector, &block) ⇒ Object

Defines a processing rule.
.process_first(*selector, &block) ⇒ Object

Similar to #process, but only extracts from the first selected element.
.result(*symbols) ⇒ Object

Modifies this scraper to return a single value or a structure.
.root_element(name) ⇒ Object

The root element to scrape.
.rules ⇒ Object

Returns an array of rules defined for this class.
.scrape(source, options = nil) ⇒ Object

Scrapes the document and returns the result.
.selector(symbol, *selector, &block) ⇒ Object

Creates a selector method.
.text(element) ⇒ Object

Returns the text of the element.

Instance Method Summary

#collect ⇒ Object

Called by #scrape after scraping the document, and before calling #result.
#document ⇒ Object

Returns the document being processed.
#initialize(source, options = nil) ⇒ Base constructor

Create a new scraper instance.
#option(symbol) ⇒ Object

Returns the value of an option.
#prepare(document) ⇒ Object

Called by #scrape after creating the document, but before running any processing rules.
#request(url, options) ⇒ Object
#result ⇒ Object

Returns the result of a successful scrape.
#scrape ⇒ Object

Scrapes the document and returns the result.
#skip(elements = nil) ⇒ Object

Skips processing the specified element(s).
#stop ⇒ Object

Stops processing this page.

Constructor Details

#initialize(source, options = nil) ⇒ `Base`

Create a new scraper instance.

The argument source is a URL, string containing HTML, or HTML::Node. The optional argument options are options passed to the scraper. See Base#scrape for more details.

For example:

# The page we want to scrape
url = URI.parse("http://example.com")
# Skip the header
scraper = MyScraper.new(url, :root_element=>"body")
result = scraper.scrape

# File 'lib/scraper/base.rb', line 715

def initialize(source, options = nil)
  @page_info = PageInfo[]
  @options = options || {}
  case source
  when URI
    @document = source
  when String, HTML::Node
    @document = source
    # TODO: document and test case these two.
    @page_info.url = @page_info.original_url = @options[:url]
    @page_info.encoding = @options[:encoding]
  else
    raise ArgumentError, "Can only scrape URI, String or HTML::Node"
  end
end

Instance Attribute Details

#extracted ⇒ `Object`

Set to true when the first extractor returns true.



# File 'lib/scraper/base.rb', line 692

def extracted
  @extracted
end

#options ⇒ `Object`

Returns the options for this object.



# File 'lib/scraper/base.rb', line 700

def options
  @options
end

#page_info ⇒ `Object`

Information about the HTML page scraped. See PageInfo.



# File 'lib/scraper/base.rb', line 696

def page_info
  @page_info
end

Class Method Details

.array(*symbols) ⇒ `Object`

Declares which accessors are arrays. You can declare the accessor here, or use “symbol[]” as the target.

For example:

array :urls
process "a[href]", :urls=>"@href"

Is equivalent to:

process "a[href]", "urls[]"=>"@href"

# File 'lib/scraper/base.rb', line 473

def array(*symbols)
  @arrays ||= []
  symbols.each do |symbol|
    symbol = symbol.to_sym
    @arrays << symbol
    begin
      self.instance_method(symbol)
    rescue NameError
      attr_accessor symbol
    end
  end
end
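
The source above only adds an accessor when no instance method with that name exists. A minimal standalone sketch of that check follows; `ArrayDeclaring` and `Demo` are hypothetical names used for illustration and are not part of the library.

```ruby
# Hypothetical stand-in for the class-level `array` declaration; not the library itself.
module ArrayDeclaring
  # Symbols declared as arrays, in declaration order.
  def arrays
    @arrays ||= []
  end

  def array(*symbols)
    symbols.each do |symbol|
      symbol = symbol.to_sym
      arrays << symbol
      # Define an accessor only if no instance method with this name exists yet.
      next if method_defined?(symbol) || private_method_defined?(symbol)
      attr_accessor symbol
    end
  end
end

class Demo
  extend ArrayDeclaring

  # An existing reader that the declaration must not overwrite.
  def urls
    "custom reader"
  end

  array :urls, :titles
end

demo = Demo.new
demo.titles = ["a"]
puts demo.titles.inspect  # prints ["a"]
puts demo.urls            # prints custom reader
```

Because `urls` was already defined, only `titles` gets a generated reader and writer.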

.element(element) ⇒ `Object`

Returns the element itself.

You can use this method from an extractor, e.g.:

process "h1", :header=>:element



# File 'lib/scraper/base.rb', line 373

def element(element)
  element
end

.extractor(map) ⇒ `Object`

Creates an extractor that will extract values from the selected element and place them in instance variables of the scraper. You can pass the result to #process.

Example

This example processes a document looking for an element with the class name article. It extracts the attribute id and stores it in the instance variable @id. It extracts the article node itself and puts it in the instance variable @article.

class ArticleScraper < Scraper::Base
  process ".article", extractor(:id=>"@id", :article=>:element)
  attr_reader :id, :article
end
result = ArticleScraper.scrape(html)
puts result.id
puts result.article

Sources

Extractors operate on the selected element, and can extract the following values:

"elem_name" – Extracts the element itself if it matches the element name (e.g. “h2” will extract only level 2 header elements).
"@attr_name" – Extracts the attribute value from the element if specified (e.g. “@id” will extract the id attribute).
"elem_name@attr_name" – Extracts the attribute value from the element if specified, but only if the element has the specified name (e.g. “h2@id”).
:element – Extracts the element itself.
:text – Extracts the text value of the node.
Scraper – Using this class creates a scraper to process the current element and extract the result. This can be used for handling complex structure.

If you use an array of sources, the first source that matches anything is used. For example, ["abbr@title", :text] extracts the value of the title attribute if the element is abbr, otherwise the text value of the element.

If you use a hash, you can extract multiple values at the same time. For example, {:id=>"@id", :class=>"@class"} extracts the id and class attribute values.

:element and :text are special cases of symbols. You can pass any symbol that matches a class method and that class method will be called to extract a value from the selected element. You can also pass a Proc or Method directly.

And it’s always possible to pass a static value, quite useful for processing an element with more than one rule (:skip=>false).
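
The "first source that matches" behavior of an array of sources can be sketched without the library. This is an illustration only: `DemoElement`, `first_matching`, `ABBR_TITLE`, `TEXT` and `DATE_SOURCE` are hypothetical names, and the library itself operates on HTML::Node objects.

```ruby
# Hypothetical element type for illustration; not the library's HTML::Node.
DemoElement = Struct.new(:name, :attributes, :text)

# Combines several candidate sources into one: each is tried in order and
# the first non-nil value is returned (nil if none match).
def first_matching(*sources)
  lambda do |element|
    value = nil
    sources.each do |source|
      value = source.call(element)
      break unless value.nil?
    end
    value
  end
end

# Rough equivalent of "abbr@title": the title attribute, but only on abbr elements.
ABBR_TITLE = lambda { |el| el.attributes["title"] if el.name == "abbr" }
# Rough equivalent of :text.
TEXT = lambda { |el| el.text }

DATE_SOURCE = first_matching(ABBR_TITLE, TEXT)

abbr = DemoElement.new("abbr", { "title" => "2007-01-01" }, "Jan 1")
span = DemoElement.new("span", {}, "Jan 1")
puts DATE_SOURCE.call(abbr)  # prints 2007-01-01
puts DATE_SOURCE.call(span)  # prints Jan 1
```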

Targets

Extractors assign the extracted value to an instance variable of the scraper. The instance variable contains the last value extracted.

An accessor for that instance variable is also created, unless a method with that name already exists. For example, :title=>:text creates an accessor for title. However, :id=>"@id" does not create an accessor since each object already has a method called id.

If you want to extract multiple values into the same variable, use #array to declare that accessor as an array.

Alternatively, you can append [] to the variable name. For example:

process "*", "ids[]"=>"@id"
result :ids

The special target :skip allows you to control whether other rules can apply to the same element. By default a processing rule without a block (or a block that returns true) will skip that element so no other processing rule sees it.

You can change this with :skip=>false.

# File 'lib/scraper/base.rb', line 283

def extractor(map)
  extracts = []
  map.each_pair do |target, source|
    source = extract_value_from(source)
    target = extract_value_to(target)
    define_method :__extractor do |element|
      value = source.call(element)
      target.call(self, value) if !value.nil?
    end
    extracts << instance_method(:__extractor)
    remove_method :__extractor
  end
  lambda do |element|
    extracts.each do |extract|
      extract.bind(self).call(element)
    end
    true
  end
end
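
The source above turns each mapping into an unbound method with define_method, instance_method and remove_method, then binds it to the scraper when the extractor runs. The same technique in isolation looks roughly like this; `MiniScraper`, `capture` and `SET_TITLE` are hypothetical names, not library API.

```ruby
class MiniScraper
  # Turns a block into an UnboundMethod that can later be bound to any
  # instance of this class, without leaving a named method behind.
  def self.capture(&block)
    define_method(:__tmp, &block)
    unbound = instance_method(:__tmp)
    remove_method(:__tmp)
    unbound
  end

  # Inside the block, self is whichever instance the method is bound to.
  SET_TITLE = capture { |value| @title = value }

  attr_reader :title
end

scraper = MiniScraper.new
MiniScraper::SET_TITLE.bind(scraper).call("Hello")
puts scraper.title  # prints Hello
```

Binding at call time is what lets a block written at class level assign instance variables on each scraper object.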

.options ⇒ `Object`

Returns the options for this class.



# File 'lib/scraper/base.rb', line 412

def options()
  @options ||= {}
end

.parser(name = :tidy) ⇒ `Object`

Specifies which parser to use. The default is :tidy.



# File 'lib/scraper/base.rb', line 379

def parser(name = :tidy)
  self.options[:parser] = name
end

.parser_options(options) ⇒ `Object`

Options to pass to the parser.

For example, when using Tidy, you can use these options to tell Tidy how to clean up the HTML.

This method sets the option for the class. Classes inherit options from their parents. You can also pass options to the scraper object itself using the :parser_options option.



# File 'lib/scraper/base.rb', line 392

def parser_options(options)
  self.options[:parser_options] = options
end

.process(*selector, &block) ⇒ `Object`

Call sequences:

process(symbol?, selector, values?, extractor)
process(symbol?, selector, values?) { |element| ... }

Defines a processing rule. A processing rule consists of a selector that matches elements, and an extractor that does something interesting with their value.

Symbol

Rules are processed in the order in which they are defined. Use #rules if you need to change the order of processing.

Rules can be named or anonymous. If the first argument is a symbol, it is used as the rule name. You can use the rule name to position, remove or replace it.

Selector

The selector is the first argument, or the second when the rule is named. It selects elements from the document that are potential candidates for extraction. Each selected element is passed to the extractor.

The selector argument may be a string, an HTML::Selector object or any object that responds to the select method. Passing an Array (responds to select) will not do anything useful.

String selectors support value substitution, replacing question marks (?) in the selector expression with values from the method arguments. See HTML::Selector for more information.

Extractor

The last argument or block is the extractor. The extractor does something interesting with the selected element, typically assigning it to an instance variable of the scraper.

Since the extractor is called on the scraper, it can also use the scraper to maintain state, e.g. this extractor counts how many div elements appear in the document:

process("div") { |element| @count = (@count || 0) + 1 }

The extractor returns true if the element was processed and should not be passed to any other extractor (including any child elements).

The default implementation of #result returns self only if at least one extractor returned true. However, you can override #result and use extractors that return false.

A block extractor is called with a single element.

You can also use the #extractor method to create extractors that assign elements, attributes and text values to instance variables, or pass a Hash as the last argument to #process. See #extractor for more information.

When using a block, the last statement is the response. Do not use return, use next if you want to return a value before the last statement. return does not do what you expect it to.
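
As a small illustration of the advice above, this standalone block uses next to leave early with a value. It is plain Ruby and makes no assumption about how the library invokes extractor blocks; `LONG_TEXT` is a hypothetical name.

```ruby
# A standalone block, called directly; illustrative only.
LONG_TEXT = proc do |text|
  next false if text.nil?   # leave the block early with a value
  text.length > 10          # otherwise the last statement is the response
end

puts LONG_TEXT.call(nil)                      # prints false
puts LONG_TEXT.call("a fairly long heading")  # prints true
```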

Example

class ScrapePosts < Scraper::Base
  # Select the title of a post
  selector :select_title, "h2"

  # Select the body of a post
  selector :select_body, ".body"

  # All elements with class name post.
  process ".post" do |element|
    title = select_title(element)
    body = select_body(element)
    (@posts ||= []) << Post.new(title, body)
    true
  end

  attr_reader :posts
end

posts = ScrapePosts.scrape(html).posts

To process only a single element:

class ScrapeTitle < Scraper::Base
  process "html>head>title", :title=>:text
  result :title
end

puts ScrapeTitle.scrape(html)



# File 'lib/scraper/base.rb', line 123

def process(*selector, &block)
  create_process(false, *selector, &block)
end

.process_first(*selector, &block) ⇒ `Object`

Similar to #process, but only extracts from the first selected element. Faster if you know the document contains only one applicable element, or if you are only interested in processing the first one.



# File 'lib/scraper/base.rb', line 132

def process_first(*selector, &block)
  create_process(true, *selector, &block)
end

.result(*symbols) ⇒ `Object`

Modifies this scraper to return a single value or a structure. Use in combination with accessors.

When called with one symbol, scraping returns the result of calling that method (typically an accessor). When called with two or more symbols, scraping returns a structure of values, one for each symbol.

For example:

class ScrapeTitle < Scraper::Base
  process_first "html>head>title", :title=>:text
  result :title
end

puts "Title: " + ScrapeTitle.scrape(html)

class ScrapeDts < Scraper::Base
  process ".dtstart", :dtstart=>["abbr@title", :text]
  process ".dtend", :dtend=>["abbr@title", :text]
  result :dtstart, :dtend
end

dts = ScrapeDts.scrape(html)
puts "Starts: #{dts.dtstart}"
puts "Ends: #{dts.dtend}"

Raises:

(ArgumentError)

# File 'lib/scraper/base.rb', line 449

def result(*symbols)
  raise ArgumentError, "Use one symbol to return the value of this accessor, multiple symbols to returns a structure" if symbols.empty?
  symbols = symbols.map {|s| s.to_sym}
  if symbols.size == 1
    define_method :result do
      return self.send(symbols[0])
    end
  else
    struct = Struct.new(*symbols)
    define_method :result do
      return struct.new(*symbols.collect {|s| self.send(s) })
    end
  end
end
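
To see the two return shapes without scraping anything, the same define_method approach can be exercised on plain objects. This is a sketch: `ResultDemo`, `Single` and `Pair` are hypothetical classes that mirror the source above but are not part of the library.

```ruby
class ResultDemo
  # One symbol: #result returns that accessor's value.
  # Several symbols: #result returns a Struct with one member per symbol.
  def self.result(*symbols)
    raise ArgumentError, "expected at least one symbol" if symbols.empty?
    symbols = symbols.map { |s| s.to_sym }
    if symbols.size == 1
      define_method(:result) { send(symbols.first) }
    else
      struct = Struct.new(*symbols)
      define_method(:result) { struct.new(*symbols.map { |s| send(s) }) }
    end
  end
end

class Single < ResultDemo
  attr_accessor :title
  result :title
end

class Pair < ResultDemo
  attr_accessor :dtstart, :dtend
  result :dtstart, :dtend
end

single = Single.new
single.title = "Home"
puts single.result          # prints Home

pair = Pair.new
pair.dtstart = "2007-01-01"
pair.dtend = "2007-01-02"
puts pair.result.dtstart    # prints 2007-01-01
puts pair.result.dtend      # prints 2007-01-02
```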

.root_element(name) ⇒ `Object`

The root element to scrape.

The root element for an HTML document is html. However, if you want to scrape only the header or body, you can set the root_element to head or body.

This method sets the root element for the class. Classes inherit this option from their parents. You can also pass a root element to the scraper object itself using the :root_element option.



# File 'lib/scraper/base.rb', line 406

def root_element(name)
  self.options[:root_element] = name ? name.to_s : nil
end

.rules ⇒ `Object`

Returns an array of rules defined for this class. You can use this array to change the order of rules.



# File 'lib/scraper/base.rb', line 419

def rules()
  @rules ||= []
end

.scrape(source, options = nil) ⇒ `Object`

Scrapes the document and returns the result.

The first argument provides the input document. It can be one of:

URI – Retrieve an HTML page from this URL and scrape it.
String – The HTML page as a string.
HTML::Node – An HTML node, can be a document or element.

You can specify options for the scraper class, or override these by passing options in the second argument. Some options only make sense in the constructor.

The following options are supported for reading HTML pages:

:last_modified – Last-Modified header used for caching.
:etag – ETag header used for caching.
:redirect_limit – Limits number of redirects to follow.
:user_agent – Value for User-Agent header.
:timeout – HTTP open connection/read timeouts (in seconds).

The following options are supported for parsing the HTML:

:root_element – The root element to scrape, see also #root_element.
:parser – Specifies which parser to use. (Typically, you set this for the class).
:parser_options – Options to pass to the parser.

The result is returned by calling the #result method. The default implementation returns self if any extractor returned true, nil otherwise.

For example:

result = MyScraper.scrape(url, :root_element=>"body")

The method may raise any number of exceptions. HTTPError indicates it failed to retrieve the HTML page, and HTMLParseError that it failed to parse the page. Other exceptions come from extractors and the #result method.

# File 'lib/scraper/base.rb', line 345

def scrape(source, options = nil)
  scraper = self.new(source, options);
  return scraper.scrape
end

.selector(symbol, *selector, &block) ⇒ `Object`

Call sequences:

selector(symbol, selector, values?)
selector(symbol, selector, values?) { |elements| ... }

Create a selector method. You can call a selector method directly to select elements.

For example, define a selector:

selector(:five_divs, "div") { |elems| elems[0..4] }

And call it to retrieve the first five div elements:

divs = five_divs(element)

Call a selector method with an element and it returns an array of elements that match the selector, beginning with the element argument itself. It returns an empty array if nothing matches.

If the selector is defined with a block, all selected elements are passed to the block and the result of the block is returned.

For convenience, a first_ method is also created that returns (and yields) only the first selected element. For example:

selector :post, "#post"
@post = first_post(element)

When the selector is defined with a block, both methods call that block with an array of elements.

The selector argument may be a string, an HTML::Selector object or any object that responds to the select method. Passing an Array (responds to select) will not do anything useful.

String selectors support value substitution, replacing question marks (?) in the selector expression with values from the method arguments. See HTML::Selector for more information.

When using a block, the last statement is the response. Do not use return, use next if you want to return a value before the last statement. return does not do what you expect it to.

Raises:

(ArgumentError)

# File 'lib/scraper/base.rb', line 175

def selector(symbol, *selector, &block)
  raise ArgumentError, "Missing selector: the first argument tells us what to select" if selector.empty?
  if selector[0].is_a?(String)
    selector = HTML::Selector.new(*selector)
  else
    raise ArgumentError, "Selector must respond to select() method" unless selector.respond_to?(:select)
    selector = selector[0]
  end
  if block
    define_method symbol do |element|
      selected = selector.select(element)
      return block.call(selected) unless selected.empty?
    end
    define_method "first_#{symbol}" do |element|
      selected = selector.select_first(element)
      return block.call([selected]) if selected
    end
  else
    define_method symbol do |element|
      return selector.select(element)
    end
    define_method "first_#{symbol}" do |element|
      return selector.select_first(element)
    end
  end
end
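
The pair of generated methods can be illustrated with any object that responds to select and select_first. In this sketch `PrefixSelector` and `SelectorDemo` are hypothetical and work on arrays of strings; the library uses HTML::Selector over HTML nodes instead.

```ruby
# Hypothetical selector: picks strings with a given prefix out of an array.
class PrefixSelector
  def initialize(prefix)
    @prefix = prefix
  end

  # All matching items, or an empty array.
  def select(items)
    items.select { |item| item.start_with?(@prefix) }
  end

  # The first matching item, or nil.
  def select_first(items)
    items.find { |item| item.start_with?(@prefix) }
  end
end

class SelectorDemo
  # Defines <symbol> and first_<symbol> instance methods backed by the selector.
  def self.selector(symbol, selector)
    define_method(symbol) { |root| selector.select(root) }
    define_method("first_#{symbol}") { |root| selector.select_first(root) }
  end

  selector :posts, PrefixSelector.new("post")
end

demo = SelectorDemo.new
items = ["post-1", "nav", "post-2"]
p demo.posts(items)        # prints ["post-1", "post-2"]
p demo.first_posts(items)  # prints "post-1"
p demo.posts(["nav"])      # prints []
```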

.text(element) ⇒ `Object`

Returns the text of the element.

You can use this method from an extractor, e.g.:

process "title", :title=>:text

# File 'lib/scraper/base.rb', line 355

def text(element)
  text = ""
  stack = element.children.reverse
  while node = stack.pop
    if node.tag?
      stack.concat node.children.reverse
    else
      text << node.content
    end
  end
  return text
end
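
The stack-based walk in the source above visits text nodes in document order. A standalone sketch of the same traversal, using a hypothetical `DemoNode` struct in place of HTML::Node (`collect_text` and `SAMPLE_TREE` are also illustrative names):

```ruby
# Hypothetical node type: text nodes have content and no children,
# tag nodes have children and no content.
DemoNode = Struct.new(:content, :children) do
  def tag?
    !children.nil?
  end
end

# Concatenates the content of all descendant text nodes, in document order.
def collect_text(element)
  text = "".dup
  # Children are pushed in reverse so that pop yields them left to right.
  stack = element.children.reverse
  while (node = stack.pop)
    if node.tag?
      stack.concat(node.children.reverse)
    else
      text << node.content
    end
  end
  text
end

# Roughly: <p>Hello <b>big <i>wide</i></b> world</p>
SAMPLE_TREE = DemoNode.new(nil, [
  DemoNode.new("Hello ", nil),
  DemoNode.new(nil, [
    DemoNode.new("big ", nil),
    DemoNode.new(nil, [DemoNode.new("wide", nil)])
  ]),
  DemoNode.new(" world", nil)
])

puts collect_text(SAMPLE_TREE)  # prints Hello big wide world
```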

Instance Method Details

#collect ⇒ `Object`

Called by #scrape after scraping the document, and before calling #result. Typically used to run any validation, post-processing steps, resolving referenced elements, etc.



# File 'lib/scraper/base.rb', line 939

def collect()
end

#document ⇒ `Object`

Returns the document being processed.

If the scraper was created with a URL, this method will attempt to retrieve the page and parse it.

If the scraper was created with a string, this method will attempt to parse the page.

Be advised that calling this method may raise an exception (HTTPError or HTMLParseError).

The document is parsed only the first time this method is called.

Raises:

(RuntimeError)

# File 'lib/scraper/base.rb', line 856

def document
  if @document.is_a?(URI)
    # Attempt to read page. May raise HTTPError.
    options = {}
    READER_OPTIONS.each { |key| options[key] = option(key) }
    request(@document, options)
  end
  if @document.is_a?(String)
    # Parse the page. May raise HTMLParseError.
    parsed = Reader.parse_page(@document, @page_info.encoding,
                               option(:parser_options), option(:parser))
    @document = parsed.document
    @page_info.encoding = parsed.encoding
  end
  return @document if @document.is_a?(HTML::Node)
  raise RuntimeError, "No document to process"
end

#option(symbol) ⇒ `Object`

Returns the value of an option.

Returns the value of an option passed to the scraper on creation. If not specified, returns the value of the option set for this scraper class. Options are inherited from the parent class.



# File 'lib/scraper/base.rb', line 967

def option(symbol)
  return options.has_key?(symbol) ? options[symbol] : self.class.options[symbol]
end
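
Because the lookup uses has_key?, an option passed explicitly to the instance wins even when its value is nil. A minimal sketch of that instance-over-class fallback follows; `OptionDemo` and the `:other_parser` value are hypothetical, and inheritance between classes is not modeled here.

```ruby
class OptionDemo
  # Class-level defaults.
  def self.options
    @options ||= {}
  end

  attr_reader :options

  def initialize(options = nil)
    @options = options || {}
  end

  # An explicitly passed key wins, even when its value is nil or false;
  # otherwise the class-level value is used.
  def option(symbol)
    options.key?(symbol) ? options[symbol] : self.class.options[symbol]
  end
end

OptionDemo.options[:parser] = :tidy
OptionDemo.options[:root_element] = "html"

p OptionDemo.new.option(:parser)                           # prints :tidy
p OptionDemo.new(:parser => :other_parser).option(:parser) # prints :other_parser
p OptionDemo.new(:root_element => nil).option(:root_element) # prints nil
```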

#prepare(document) ⇒ `Object`

Called by #scrape after creating the document, but before running any processing rules.

You can override this method to do any preparation work.



# File 'lib/scraper/base.rb', line 932

def prepare(document)
end

#request(url, options) ⇒ `Object`

# File 'lib/scraper/base.rb', line 875

def request(url, options)
  if page = Reader.read_page(@document, options)
    @page_info.url = page.url
    @page_info.original_url = @document
    @page_info.last_modified = page.last_modified
    @page_info.etag = page.etag
    @page_info.encoding = page.encoding
    @document = page.content
  end
end

#result ⇒ `Object`

Returns the result of a successful scrape.

This method is called by #scrape after running all the rules on the document. You can also call it directly.

Override this method to return a specific object, perform post-scraping processing, validation, etc.

The default implementation returns self if any extractor returned true, nil otherwise.

If you override this method, implement your own logic to determine if anything was extracted and return nil otherwise. Also, make sure calling this method multiple times returns the same result.



# File 'lib/scraper/base.rb', line 957

def result()
  return self if @extracted
end
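
One way to satisfy the advice above when overriding #result is to return nil when nothing was collected and to memoize the computed value. The sketch below uses a hypothetical stand-in class, `PostsResult`, rather than a real Scraper::Base subclass.

```ruby
# Hypothetical stand-in; a real override would live in a Scraper::Base subclass.
class PostsResult
  def initialize(titles)
    @titles = titles
  end

  # Returns nil when nothing was extracted; otherwise computes the value
  # once and returns the same object on every later call.
  def result
    return nil if @titles.empty?
    @result ||= @titles.map { |title| title.strip }.freeze
  end
end

posts = PostsResult.new(["  First ", "Second  "])
p posts.result                       # prints ["First", "Second"]
p posts.result.equal?(posts.result)  # prints true
p PostsResult.new([]).result         # prints nil
```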

#scrape ⇒ `Object`

Scrapes the document and returns the result.

If the scraper was created with a URL, retrieve the page and parse it. If the scraper was created with a string, parse the page.

The result is returned by calling the #result method. The default implementation returns self if any extractor returned true, nil otherwise.

#skip(elements = nil) ⇒ `Object`

Call sequences:

skip() => true
skip(element) => true
skip([element ...]) => true

Skips processing the specified element(s).

If called with a single element, that element will not be processed.

If called with an array of elements, all the elements in the array are skipped.

If called with no element, skips processing the current element. This has the same effect as returning true.

For convenience this method always returns true. For example:

process "h1" do |element|
  @header = element
  skip
end

# File 'lib/scraper/base.rb', line 907

def skip(elements = nil)
  case elements
  when Array then @skip.concat elements
  when HTML::Node then @skip << elements
  when nil then @skip << true
  when true, false then @skip << elements
  end
  # Calling skip(element) as the last statement is
  # redundant by design.
  return true
end

#stop ⇒ `Object`

Stops processing this page. You can call this early on if you discover there is no interesting information on the page, or when you are done extracting all useful information.



# File 'lib/scraper/base.rb', line 923

def stop()
  @stop = true
end
