Class: Grubby::Scraper

Inherits:
Object
Defined in:
lib/grubby/scraper.rb

Direct Known Subclasses

JsonScraper, PageScraper

Defined Under Namespace

Classes: Error, FieldValueRequiredError

Constructor Details

#initialize(source) ⇒ Scraper

Returns a new instance of Scraper.

Parameters:

  • source

Raises:

  • (Grubby::Scraper::Error)

    if any scrapes block fails

# File 'lib/grubby/scraper.rb', line 206

def initialize(source)
  @source = source
  @scraped = {}
  @errors = {}

  self.class.fields.each do |field|
    begin
      self.send(field)
    rescue FieldScrapeFailedError
    end
  end

  raise Error.new(self) unless @errors.empty?
end
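
The collect-then-raise flow of the constructor can be illustrated without Grubby itself. The following is a minimal sketch, not the library's implementation: MiniScraper, FieldFailed, AggregateError, and fetch_field are hypothetical names, and the source is assumed to be a plain Hash.

```ruby
# Hypothetical stand-ins, not part of Grubby.
class FieldFailed < StandardError; end
class AggregateError < StandardError; end

class MiniScraper
  FIELDS = [:title, :author].freeze

  attr_reader :errors

  def initialize(source)
    @source = source
    @errors = {}

    # Evaluate every field eagerly; one failing field must not stop the others.
    FIELDS.each do |field|
      begin
        send(field)
      rescue FieldFailed
        # The underlying error was already recorded in @errors by fetch_field.
      end
    end

    # Report all failures at once, after every field has been attempted.
    raise AggregateError, "failed fields: #{@errors.keys.join(', ')}" unless @errors.empty?
  end

  def title
    fetch_field(:title)
  end

  def author
    fetch_field(:author)
  end

  private

  # Records the underlying error by field name, then signals the failure.
  def fetch_field(field)
    @source.fetch(field)
  rescue KeyError => e
    @errors[field] = e
    raise FieldFailed, field.to_s
  end
end

ok = MiniScraper.new({ title: "Hello", author: "Ann" })

message =
  begin
    MiniScraper.new({})
    nil
  rescue AggregateError => e
    e.message
  end
```

In this sketch, constructing from an empty Hash attempts both fields before raising, so the message names both of them rather than only the first failure.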

Instance Attribute Details

#errors ⇒ Hash<Symbol, StandardError> (readonly)

Collected errors raised during #initialize by blocks passed to scrapes, indexed by field name. If #initialize did not raise Grubby::Scraper::Error, this Hash will be empty.

Returns:

  • (Hash<Symbol, StandardError>)


# File 'lib/grubby/scraper.rb', line 201

def errors
  @errors
end

#source ⇒ Object (readonly)

The object being scraped. Typically a Mechanize pluggable parser such as Mechanize::Page.

Returns:

  • (Object)


# File 'lib/grubby/scraper.rb', line 194

def source
  @source
end

Class Method Details

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void

This method returns an undefined value.

Iterates a series of pages, starting at start. The Scraper class is instantiated with each page, and each instance is passed to the given block. Subsequent pages in the series are determined by invoking the next_method method on each previous scraper instance.

Iteration stops when the next_method method returns nil. If the next_method method returns a String or URI, that value will be treated as the URL of the next page. Otherwise that value will be treated as the page itself.

Examples:

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:page_param){ page.uri.query_param("page") }

  def next
    page.link_with(text: "Next >")&.click
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page_param  # == "1", "2", "3", ...
end

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:page_param){ page.uri.query_param("page") }

  scrapes(:next_uri, optional: true) do
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
  scraper.page_param  # == "1", "2", "3", ...
end

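Parameters:

  • start (String, URI, Object)
  • agent (Mechanize) (defaults to: $grubby)
  • next_method (Symbol) (defaults to: :next)

Yields:

  • (scraper)

Yield Parameters:

  • scraper (Grubby::Scraper)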

Raises:

  • (NoMethodError)

    if Scraper class does not implement next_method



# File 'lib/grubby/scraper.rb', line 174

def self.each(start, agent = $grubby, next_method: :next)
  unless self.method_defined?(next_method)
    raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
  end

  return to_enum(:each, start, agent, next_method: next_method) unless block_given?

  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    scraper = self.new(current)
    yield scraper
    current = scraper.send(next_method)
  end
end
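
The iteration loop can be exercised without a network connection. The sketch below assumes a hypothetical FakeAgent that maps URLs to Hash "pages" and a hypothetical PageWalker class; neither is part of Grubby or Mechanize, and only String URLs are fetched here (the real method also accepts URI values).

```ruby
# Hypothetical stand-in for an agent: maps URLs to in-memory "pages".
class FakeAgent
  PAGES = {
    "/posts?page=1" => { number: 1, next_url: "/posts?page=2" },
    "/posts?page=2" => { number: 2, next_url: "/posts?page=3" },
    "/posts?page=3" => { number: 3, next_url: nil },
  }.freeze

  def get(url)
    PAGES.fetch(url)
  end
end

# Hypothetical scraper: wraps one page and reports where the next one is.
class PageWalker
  attr_reader :page

  def initialize(page)
    @page = page
  end

  def next_url
    page[:next_url]
  end

  # Same loop shape as .each: fetch when given a URL, stop when the
  # next_method returns nil.
  def self.each(start, agent, next_method: :next_url)
    current = start
    while current
      current = agent.get(current) if current.is_a?(String)
      scraper = new(current)
      yield scraper
      current = scraper.send(next_method)
    end
  end
end

numbers = []
PageWalker.each("/posts?page=1", FakeAgent.new) { |scraper| numbers << scraper.page[:number] }
numbers  # => [1, 2, 3]
```

Because the third page returns nil from next_url, the loop ends after yielding three scrapers.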

.fields ⇒ Array<Symbol>

Fields defined by scrapes.

Returns:

  • (Array<Symbol>)


# File 'lib/grubby/scraper.rb', line 94

def self.fields
  @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
end
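
The memoized, inherited list can be reproduced in isolation. This is a minimal sketch of the same `@fields ||= ... superclass.fields.dup` idiom, using hypothetical Base, Parent, and Child classes rather than Grubby's own.

```ruby
# Hypothetical classes illustrating a per-class list seeded from the parent.
class Base
  def self.fields
    # The root class starts empty; each subclass starts from a copy of
    # its parent's list, so later additions do not leak upward.
    @fields ||= self == Base ? [] : superclass.fields.dup
  end
end

class Parent < Base
  self.fields << :title
end

class Child < Parent
  self.fields << :author
end

Base.fields    # => []
Parent.fields  # => [:title]
Child.fields   # => [:title, :author]
```

One consequence of the copy in this sketch: a field added to Parent after Child has first called fields would not appear in Child's list.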

.scrape(url, agent = $grubby) ⇒ Grubby::Scraper

Instantiates the Scraper class with the resource specified by url. This method acts as a default factory method, and provides a standard interface for specialized overrides.

Examples:

Default factory method

class PostPageScraper < Grubby::PageScraper
  # ...
end

PostPageScraper.scrape("https://example.com/posts/42")
  # == PostPageScraper.new($grubby.get("https://example.com/posts/42"))

Specialized factory method

class PostApiScraper < Grubby::JsonScraper
  # ...

  def self.scrape(url, agent = $grubby)
    api_url = url.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
    super(api_url, agent)
  end
end

PostApiScraper.scrape("https://example.com/posts/42")
  # == PostApiScraper.new($grubby.get("https://api.example.com/posts/42.json"))

Parameters:

  • url (String, URI)
  • agent (Mechanize) (defaults to: $grubby)

Returns:

  • (Grubby::Scraper)



# File 'lib/grubby/scraper.rb', line 126

def self.scrape(url, agent = $grubby)
  self.new(agent.get(url))
end
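
The override pattern from the "Specialized factory method" example can be checked offline. In this sketch, BaseFetcher, ApiFetcher, and echo_agent are hypothetical stand-ins: the agent simply echoes the URL it was asked for, so the rewritten URL becomes visible.

```ruby
# Hypothetical base class with a default factory; `agent` is any object
# that responds to #get.
class BaseFetcher
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def self.scrape(url, agent)
    new(agent.get(url))
  end
end

class ApiFetcher < BaseFetcher
  # Rewrite the public URL to its API equivalent, then defer to the
  # default factory via super.
  def self.scrape(url, agent)
    api_url = url.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
    super(api_url, agent)
  end
end

# Hypothetical agent that echoes the requested URL instead of fetching it.
echo_agent = Object.new
def echo_agent.get(url)
  "fetched #{url}"
end

result = ApiFetcher.scrape("https://example.com/posts/42", echo_agent)
result.source  # => "fetched https://api.example.com/posts/42.json"
```

Since new is called on the receiving class, the factory returns an ApiFetcher even though the instantiation happens in the base class method.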

.scrapes(field, **options) { ... } ⇒ void

This method returns an undefined value.

Defines an attribute reader method named by field. During #initialize, the given block is called, and the attribute is set to the block’s return value.

By default, if the block’s return value is nil, a FieldValueRequiredError will be raised for that field. To prevent this behavior, specify optional: true.

The block may also be evaluated conditionally, based on another method’s return value, using the :if or :unless options.

Examples:

class GreetingScraper < Grubby::Scraper
  scrapes(:salutation) do
    source[/\A(hello|good morning)\b/i]
  end

  scrapes(:recipient, optional: true) do
    source[/\A#{salutation} ([a-z ]+)/i, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.salutation  # == "Hello"
scraper.recipient   # == "World"

scraper = GreetingScraper.new("Good morning!")
scraper.salutation  # == "Good morning"
scraper.recipient   # == nil

scraper = GreetingScraper.new("Hey!")  # raises Grubby::Scraper::Error

class EmbeddedUrlScraper < Grubby::Scraper
  scrapes(:url, optional: true){ source[%r"\bhttps?://\S+"] }

  scrapes(:domain, if: :url){ url[%r"://([^/]+)/", 1] }
end

scraper = EmbeddedUrlScraper.new("visit https://example.com/foo for details")
scraper.url     # == "https://example.com/foo"
scraper.domain  # == "example.com"

scraper = EmbeddedUrlScraper.new("visit our website for details")
scraper.url     # == nil
scraper.domain  # == nil

Parameters:

  • field (Symbol, String)
  • options (Hash)

Options Hash (**options):

  • :optional (Boolean)
  • :if (Symbol)
  • :unless (Symbol)

Yields:

Yield Returns:

  • (Object)


# File 'lib/grubby/scraper.rb', line 57

def self.scrapes(field, **options, &block)
  field = field.to_sym
  (self.fields << field).uniq!

  define_method(field) do
    raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)

    if !@scraped.key?(field) && !@errors.key?(field)
      begin
        skip = (options[:if] && !self.send(options[:if])) ||
          (options[:unless] && self.send(options[:unless]))

        if skip
          @scraped[field] = nil
        else
          @scraped[field] = instance_eval(&block)
          if @scraped[field].nil?
            raise FieldValueRequiredError.new(field) unless options[:optional]
            $log.debug("#{self.class}##{field} is nil")
          end
        end
      rescue RuntimeError, IndexError => e
        @errors[field] = e
      end
    end

    if @errors.key?(field)
      raise FieldScrapeFailedError.new(field, @errors[field])
    else
      @scraped[field]
    end
  end
end
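
Two mechanics of the generated reader, evaluate-once memoization and the conditional skip, can be sketched on their own. MiniFields, its field macro, and the if_method option are hypothetical simplifications; error handling and the :optional and :unless options are deliberately left out.

```ruby
# Hypothetical, simplified take on a field-defining class macro.
class MiniFields
  def self.field(name, if_method: nil, &block)
    define_method(name) do
      @values ||= {}
      # Evaluate the block at most once per instance, even when it returns nil.
      unless @values.key?(name)
        skip = if_method && !send(if_method)
        @values[name] = skip ? nil : instance_eval(&block)
      end
      @values[name]
    end
  end

  attr_reader :calls

  def initialize(text)
    @text = text
    @calls = 0
  end

  # @calls counts how many times the :url block actually runs.
  field(:url) { @calls += 1; @text[%r"\bhttps?://\S+"] }
  field(:domain, if_method: :url) { url[%r"://([^/]+)/", 1] }
end

with_url = MiniFields.new("visit https://example.com/foo for details")
with_url.domain  # => "example.com"
with_url.url     # => "https://example.com/foo"
with_url.calls   # => 1

without_url = MiniFields.new("visit our website for details")
without_url.domain  # => nil
```

Checking key? rather than the stored value is what lets a nil result count as "already evaluated", which mirrors the `@scraped.key?(field)` test in the listing above.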

Instance Method Details

#[](field) ⇒ Object

Returns the scraped value named by field.

Parameters:

  • field (Symbol, String)

Returns:

  • (Object)

Raises:

  • (KeyError)

    if field is not a valid name



# File 'lib/grubby/scraper.rb', line 227

def [](field)
  @scraped.fetch(field.to_sym)
end
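
The lookup relies on Hash#fetch after converting the name with to_sym. A minimal sketch of that behavior, using a plain Hash named scraped as a hypothetical stand-in for the scraper's internal state:

```ruby
# Hypothetical stand-in for a scraper's internal Hash of scraped values.
scraped = { title: "Hello" }

# String names work because they are converted to Symbols first.
by_string = scraped.fetch("title".to_sym)  # => "Hello"

# Hash#fetch raises KeyError for an unknown key instead of returning nil.
missing =
  begin
    scraped.fetch(:nope)
  rescue KeyError => e
    e.class
  end

missing                        # => KeyError
KeyError.ancestors.include?(IndexError)  # => true
```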

#to_h ⇒ Hash<Symbol, Object>

Returns all scraped values as a Hash.

Returns:

  • (Hash<Symbol, Object>)


# File 'lib/grubby/scraper.rb', line 234

def to_h
  @scraped.dup
end
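
Because the method returns a dup, callers receive a shallow copy. The sketch below shows what that implies, using a plain Hash named internal as a hypothetical stand-in for the scraper's state:

```ruby
# Hypothetical stand-in for a scraper's internal Hash of scraped values.
internal = { title: "Hello", tags: ["a"] }
copy = internal.dup

# Reassigning a key in the copy leaves the original untouched.
copy[:title] = "Changed"
internal[:title]  # => "Hello"

# The copy is shallow: nested objects are shared, so mutating one in place
# is visible through both Hashes.
copy[:tags] << "b"
internal[:tags]   # => ["a", "b"]
```

Callers that intend to mutate nested values in place may therefore want to copy those values themselves.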