Class: Grubby::Scraper

Inherits:

Object

Defined in: lib/grubby/scraper.rb

Direct Known Subclasses

JsonScraper, PageScraper

Defined Under Namespace

Classes: Error, FieldValueRequiredError

Instance Attribute Summary

#errors ⇒ Hash<Symbol, StandardError> (readonly)

Collected errors raised during #initialize by blocks passed to Scraper.scrapes, indexed by field name.

#source ⇒ Object (readonly)

The object being scraped.

Class Method Summary

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void

Iterates a series of pages, starting at start.

.fields ⇒ Array<Symbol>

Fields defined by Scraper.scrapes.

.scrape(url, agent = $grubby) ⇒ Grubby::Scraper

Instantiates the Scraper class with the resource specified by url.

.scrapes(field, **options) { ... } ⇒ void

Defines an attribute reader method named by field.

Instance Method Summary

#[](field) ⇒ Object

Returns the scraped value named by field.

#initialize(source) ⇒ Scraper (constructor)

A new instance of Scraper.

#to_h ⇒ Hash<Symbol, Object>

Returns all scraped values as a Hash.

Constructor Details

#initialize(source) ⇒ `Scraper`

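Returns a new instance of Scraper. Every field defined by Scraper.scrapes is evaluated during construction; errors raised along the way are collected in #errors, and a single Grubby::Scraper::Error is raised afterwards if any were collected.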

Parameters:

source

Raises:

(Grubby::Scraper::Error) if any scraped value results in an error

# File 'lib/grubby/scraper.rb', line 206

def initialize(source)
  @source = source
  @scraped = {}
  @errors = {}

  self.class.fields.each do |field|
    begin
      self.send(field)
    rescue FieldScrapeFailedError
    end
  end

  raise Error.new(self) unless @errors.empty?
end
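
The constructor above evaluates every field before deciding whether to raise, so a single failed construction reports all failing fields at once. The following is a minimal stand-in sketch of that collect-then-raise pattern, not Grubby's implementation. MiniScraper, its FIELDS table, and the scraper reader on its Error class are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in; MiniScraper and MiniScraper::Error are not Grubby classes.
class MiniScraper
  class Error < RuntimeError
    attr_reader :scraper  # assumed reader, so callers can inspect the failed instance

    def initialize(scraper)
      @scraper = scraper
      super("failed to scrape: #{scraper.errors.keys.join(', ')}")
    end
  end

  # Each field maps to a lambda that extracts its value from the source Hash.
  FIELDS = {
    title:  ->(src) { src.fetch(:title) },
    author: ->(src) { src.fetch(:author) },
    date:   ->(src) { src.fetch(:date) },
  }

  attr_reader :errors

  def initialize(source)
    @scraped = {}
    @errors = {}

    FIELDS.each do |field, extractor|
      begin
        @scraped[field] = extractor.call(source)
      rescue RuntimeError, IndexError => e
        @errors[field] = e  # record the failure and keep evaluating other fields
      end
    end

    raise Error.new(self) unless @errors.empty?
  end
end

failed = nil
begin
  MiniScraper.new({ title: "Hello" })  # :author and :date are missing
rescue MiniScraper::Error => e
  failed = e.scraper.errors.keys
end

p failed
```

Because the raise happens only after the loop finishes, the caller sees both missing fields in one pass instead of fixing them one at a time.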

Instance Attribute Details

#errors ⇒ `Hash<Symbol, StandardError>` (readonly)

Collected errors raised during #initialize by blocks passed to scrapes, indexed by field name. If #initialize did not raise Grubby::Scraper::Error, this Hash will be empty.

Returns:

(Hash<Symbol, StandardError>)



# File 'lib/grubby/scraper.rb', line 201

def errors
  @errors
end

#source ⇒ `Object` (readonly)

The object being scraped. Typically a Mechanize pluggable parser such as Mechanize::Page.

Returns:

(Object)



# File 'lib/grubby/scraper.rb', line 194

def source
  @source
end

Class Method Details

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ `void`

This method returns an undefined value.

Iterates a series of pages, starting at start. The Scraper class is instantiated with each page, and each instance is passed to the given block. Each subsequent page is determined by calling the method named by next_method on the previous scraper instance.

Iteration stops when that method returns nil. If it returns a String or URI, the value is treated as the URL of the next page and is fetched with agent. Any other value is treated as the page itself.

Examples:

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:page_param){ page.uri.query_param("page") }

  def next
    page.link_with(text: "Next >")&.click
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page_param  # == "1", "2", "3", ...
end

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:page_param){ page.uri.query_param("page") }

  scrapes(:next_uri, optional: true) do
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
  scraper.page_param  # == "1", "2", "3", ...
end

Parameters:

start (String, URI, Mechanize::Page, Mechanize::File)
agent (Mechanize) (defaults to: $grubby)
next_method (Symbol) (defaults to: :next)

Yields:

(scraper)

Yield Parameters:

scraper (Grubby::Scraper)

Raises:

(NoMethodError) if the Scraper class does not implement next_method

# File 'lib/grubby/scraper.rb', line 174

def self.each(start, agent = $grubby, next_method: :next)
  unless self.method_defined?(next_method)
    raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
  end

  return to_enum(:each, start, agent, next_method: next_method) unless block_given?

  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    scraper = self.new(current)
    yield scraper
    current = scraper.send(next_method)
  end
end
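
The loop in the source above treats the value returned by next_method in three ways: a String or URI is fetched, nil stops iteration, and anything else is used as the page itself. The sketch below reproduces only that control flow with stand-in objects so it can run without a network. FakePage, FakeAgent, and walk are hypothetical names, not part of Grubby or Mechanize.

```ruby
require "uri"

# Hypothetical stand-ins for illustration; neither is part of Grubby or Mechanize.
FakePage = Struct.new(:url, :next_value)

class FakeAgent
  attr_reader :fetched

  def initialize
    @fetched = []
  end

  # Records each fetch and returns a page whose "next" value varies by type.
  def get(url)
    @fetched << url.to_s
    case url.to_s
    when "https://example.com/posts?page=1"
      FakePage.new(url.to_s, "https://example.com/posts?page=2")       # String: fetched
    when "https://example.com/posts?page=2"
      FakePage.new(url.to_s, URI("https://example.com/posts?page=3"))  # URI: fetched
    else
      last = FakePage.new("https://example.com/posts?page=4", nil)     # nil: iteration stops
      FakePage.new(url.to_s, last)                                     # page object: used as-is
    end
  end
end

# Same control flow as the while loop in Scraper.each, without a scraper class.
def walk(start, agent)
  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    yield current
    current = current.next_value
  end
end

agent = FakeAgent.new
visited = []
walk("https://example.com/posts?page=1", agent) { |page| visited << page.url }

p visited.length        # four pages are yielded
p agent.fetched.length  # only three of them go through the agent
```

The fourth page is never passed to the agent because it was handed over as a page object rather than as a URL.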

.fields ⇒ `Array<Symbol>`

Fields defined by scrapes.

Returns:

(Array<Symbol>)



# File 'lib/grubby/scraper.rb', line 94

def self.fields
  @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
end
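
The memoized expression above gives each subclass its own copy of the parent's field list, so fields added in a subclass do not leak back into the parent. The sketch below shows that behavior with stand-in classes. BaseScraper and its simplified scrapes (which only records the field name) are hypothetical and stand in for Grubby::Scraper.

```ruby
# Hypothetical stand-in for Grubby::Scraper; only the field bookkeeping is reproduced.
class BaseScraper
  def self.fields
    @fields ||= self == BaseScraper ? [] : superclass.fields.dup
  end

  # Simplified: records the field name without defining a reader method.
  def self.scrapes(field)
    (fields << field.to_sym).uniq!
  end
end

class TitleScraper < BaseScraper
  scrapes :title
end

class ArticleScraper < TitleScraper
  scrapes :body
  scrapes "title"  # already inherited; uniq! drops the duplicate
end

p BaseScraper.fields     # => []
p TitleScraper.fields    # => [:title]
p ArticleScraper.fields  # => [:title, :body]
```

One consequence of copying with dup at first access is that a field added to a parent class after a subclass has already read its list will not appear in that subclass.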

.scrape(url, agent = $grubby) ⇒ `Grubby::Scraper`

Instantiates the Scraper class with the resource specified by url. This method acts as a default factory method, and provides a standard interface for specialized overrides.

Examples:

Default factory method

class PostPageScraper < Grubby::PageScraper
  # ...
end

PostPageScraper.scrape("https://example.com/posts/42")
  # == PostPageScraper.new($grubby.get("https://example.com/posts/42"))

Specialized factory method

class PostApiScraper < Grubby::JsonScraper
  # ...

  def self.scrape(url, agent = $grubby)
    api_url = url.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
    super(api_url, agent)
  end
end

PostApiScraper.scrape("https://example.com/posts/42")
  # == PostApiScraper.new($grubby.get("https://api.example.com/posts/42.json"))

Parameters:

url (String, URI)
agent (Mechanize) (defaults to: $grubby)

Returns:

(Grubby::Scraper)



# File 'lib/grubby/scraper.rb', line 126

def self.scrape(url, agent = $grubby)
  self.new(agent.get(url))
end

.scrapes(field, **options) { ... } ⇒ `void`

This method returns an undefined value.

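Defines an attribute reader method named by field. During #initialize, the given block is called, and the attribute is set to the block’s return value.

By default, if the block’s return value is nil, the field is treated as failed and #initialize raises Grubby::Scraper::Error. To allow nil values, specify optional: true.

The block may also be evaluated conditionally, based on another method’s return value, using the :if or :unless options. Each option takes a method name. The block is skipped, and the attribute is set to nil, when the :if method returns a falsy value or the :unless method returns a truthy value.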

Examples:

class GreetingScraper < Grubby::Scraper
  scrapes(:salutation) do
    source[/\A(hello|good morning)\b/i]
  end

  scrapes(:recipient, optional: true) do
    source[/\A#{salutation} ([a-z ]+)/i, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.salutation  # == "Hello"
scraper.recipient   # == "World"

scraper = GreetingScraper.new("Good morning!")
scraper.salutation  # == "Good morning"
scraper.recipient   # == nil

scraper = GreetingScraper.new("Hey!")  # raises Grubby::Scraper::Error

class EmbeddedUrlScraper < Grubby::Scraper
  scrapes(:url, optional: true){ source[%r"\bhttps?://\S+"] }

  scrapes(:domain, if: :url){ url[%r"://([^/]+)/", 1] }
end

scraper = EmbeddedUrlScraper.new("visit https://example.com/foo for details")
scraper.url     # == "https://example.com/foo"
scraper.domain  # == "example.com"

scraper = EmbeddedUrlScraper.new("visit our website for details")
scraper.url     # == nil
scraper.domain  # == nil

Parameters:

field (Symbol, String)
options (Hash)

Options Hash (**options):

:optional (Boolean)
:if (Symbol)
:unless (Symbol)

Yields:

Yield Returns:

(Object)

# File 'lib/grubby/scraper.rb', line 57

def self.scrapes(field, **options, &block)
  field = field.to_sym
  (self.fields << field).uniq!

  define_method(field) do
    raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)

    if !@scraped.key?(field) && !@errors.key?(field)
      begin
        skip = (options[:if] && !self.send(options[:if])) ||
          (options[:unless] && self.send(options[:unless]))

        if skip
          @scraped[field] = nil
        else
          @scraped[field] = instance_eval(&block)
          if @scraped[field].nil?
            raise FieldValueRequiredError.new(field) unless options[:optional]
            $log.debug("#{self.class}##{field} is nil")
          end
        end
      rescue RuntimeError, IndexError => e
        @errors[field] = e
      end
    end

    if @errors.key?(field)
      raise FieldScrapeFailedError.new(field, @errors[field])
    else
      @scraped[field]
    end
  end
end
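
The skip expression in the source above combines :if and :unless with a logical OR, so either condition alone is enough to skip the block. The sketch below isolates that expression to show each combination. The skip? helper and the Probe struct are hypothetical names for illustration, not part of Grubby.

```ruby
# Hypothetical helper that isolates the skip expression from the source above.
def skip?(receiver, options)
  !!((options[:if] && !receiver.send(options[:if])) ||
    (options[:unless] && receiver.send(options[:unless])))
end

# Hypothetical receiver whose members play the role of other scraped fields.
Probe = Struct.new(:url, :cached)

with_url    = Probe.new("https://example.com/foo", nil)
without_url = Probe.new(nil, nil)
cached      = Probe.new("https://example.com/foo", true)

results = [
  skip?(with_url, {}),                           # no conditions: block runs
  skip?(with_url, if: :url),                     # :if method is truthy: block runs
  skip?(without_url, if: :url),                  # :if method is nil: skipped
  skip?(cached, unless: :cached),                # :unless method is truthy: skipped
  skip?(with_url, unless: :cached),              # :unless method is nil: block runs
  skip?(cached, if: :url, unless: :cached),      # :if passes, :unless still skips
]

p results  # => [false, false, true, true, false, true]
```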

Instance Method Details

#[](field) ⇒ `Object`

Returns the scraped value named by field.

Parameters:

field (Symbol, String)

Returns:

(Object)

Raises:

(KeyError) if field is not a valid field name



# File 'lib/grubby/scraper.rb', line 227

def [](field)
  @scraped.fetch(field.to_sym)
end
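
Going by the source above, #[] converts the name with to_sym and delegates to Hash#fetch, which distinguishes a field whose value is nil from a name that was never defined. The sketch below shows that distinction on a plain Hash. The scraped variable and lookup lambda are hypothetical stand-ins for the scraper's internal state.

```ruby
# Hypothetical stand-in for the scraper's internal Hash of scraped values.
scraped = { title: "Hello", subtitle: nil }

# Mirrors the body of #[]: accept a Symbol or String, then fetch strictly.
lookup = ->(field) { scraped.fetch(field.to_sym) }

p lookup.call(:title)     # => "Hello"
p lookup.call("title")    # => "Hello"
p lookup.call(:subtitle)  # => nil (the key exists; its value is nil)

error = begin
  lookup.call(:missing)
rescue KeyError => e
  e
end

p error.class                   # => KeyError
p error.is_a?(IndexError)       # => true
p error.is_a?(RuntimeError)     # => false
```

A caller that wants a fallback for unknown names would therefore rescue KeyError (or its parent IndexError) rather than RuntimeError.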

#to_h ⇒ `Hash<Symbol, Object>`

Returns all scraped values as a Hash.

Returns:

(Hash<Symbol, Object>)



# File 'lib/grubby/scraper.rb', line 234

def to_h
  @scraped.dup
end
