Class: Grubby::Scraper
- Inherits: Object
  - Object
    - Grubby::Scraper
- Defined in: lib/grubby/scraper.rb
Direct Known Subclasses
Defined Under Namespace
Classes: Error, FieldValueRequiredError
Instance Attribute Summary
- #errors ⇒ Hash<Symbol, StandardError> (readonly)
  Collected errors raised during #initialize by blocks passed to Scraper.scrapes, indexed by field name.
- #source ⇒ Object (readonly)
  The object being scraped.
Class Method Summary
- .each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void
  Iterates a series of pages, starting at start.
- .fields ⇒ Array<Symbol>
  Fields defined by Scraper.scrapes.
- .scrape(url, agent = $grubby) ⇒ Grubby::Scraper
  Instantiates the Scraper class with the resource specified by url.
- .scrapes(field, **options) { ... } ⇒ void
  Defines an attribute reader method named by field.
Instance Method Summary
- #[](field) ⇒ Object
  Returns the scraped value named by field.
- #initialize(source) ⇒ Scraper (constructor)
  A new instance of Scraper.
- #to_h ⇒ Hash<Symbol, Object>
  Returns all scraped values as a Hash.
Constructor Details
#initialize(source) ⇒ Scraper
Returns a new instance of Scraper.
    # File 'lib/grubby/scraper.rb', line 206
    def initialize(source)
      @source = source
      @scraped = {}
      @errors = {}

      self.class.fields.each do |field|
        begin
          self.send(field)
        rescue FieldScrapeFailedError
        end
      end

      raise Error.new(self) unless @errors.empty?
    end
Instance Attribute Details
#errors ⇒ Hash<Symbol, StandardError> (readonly)
Collected errors raised during #initialize by blocks passed to scrapes, indexed by field name. If #initialize did not raise Grubby::Scraper::Error, this Hash will be empty.
    # File 'lib/grubby/scraper.rb', line 201
    def errors
      @errors
    end
#source ⇒ Object (readonly)
The object being scraped. Typically a Mechanize pluggable parser such as Mechanize::Page.
    # File 'lib/grubby/scraper.rb', line 194
    def source
      @source
    end
Class Method Details
.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void
This method returns an undefined value.
Iterates a series of pages, starting at start. The Scraper class is instantiated with each page, and each instance is passed to the given block. Subsequent pages in the series are determined by invoking the next_method method on each previous scraper instance.
Iteration stops when the next_method method returns nil. If the next_method method returns a String or URI, that value will be treated as the URL of the next page. Otherwise that value will be treated as the page itself.
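The three cases described above (a URL is fetched, a page object is used directly, nil stops the loop) can be illustrated with a minimal sketch. It does not load Grubby or Mechanize: Page, FakeAgent, and PageScraper are hypothetical stand-ins, and PageScraper.each reproduces only the String branch of the loop. The agent is passed explicitly rather than defaulting to $grubby.

```ruby
# Hypothetical stand-ins: Page and FakeAgent are NOT part of Grubby or Mechanize.
Page = Struct.new(:title, :next_ref)

class FakeAgent
  def initialize(pages)
    @pages = pages
  end

  # Mimics agent.get(url) by looking a page up by its URL string.
  def get(url)
    @pages.fetch(url.to_s)
  end
end

# Simplified stand-in for a Scraper subclass; only the iteration loop
# is reproduced (no URI check, no Enumerator fallback).
class PageScraper
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # The default next_method: returns a URL String, a page object, or nil.
  def next
    source.next_ref
  end

  def self.each(start, agent, next_method: :next)
    current = start
    while current
      current = agent.get(current) if current.is_a?(String)
      scraper = new(current)
      yield scraper
      current = scraper.send(next_method)
    end
  end
end

page3 = Page.new("third", nil)                     # nil ends the series
page2 = Page.new("second", page3)                  # a page object is used as-is
page1 = Page.new("first", "http://example.com/2")  # a String is fetched via the agent
agent = FakeAgent.new(
  "http://example.com/1" => page1,
  "http://example.com/2" => page2
)

titles = []
PageScraper.each("http://example.com/1", agent) { |scraper| titles << scraper.source.title }
puts titles.inspect # => ["first", "second", "third"]
```

In a real subclass, the method named by next_method would typically derive its value from the scraped page, for example from a "next" link.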
    # File 'lib/grubby/scraper.rb', line 174
    def self.each(start, agent = $grubby, next_method: :next)
      unless self.method_defined?(next_method)
        raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
      end

      return to_enum(:each, start, agent, next_method: next_method) unless block_given?

      current = start
      while current
        current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
        scraper = self.new(current)
        yield scraper
        current = scraper.send(next_method)
      end
    end
.fields ⇒ Array<Symbol>
Fields defined by scrapes.
    # File 'lib/grubby/scraper.rb', line 94
    def self.fields
      @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
    end
.scrape(url, agent = $grubby) ⇒ Grubby::Scraper
Instantiates the Scraper class with the resource specified by url. This method acts as a default factory method, and provides a standard interface for specialized overrides.
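One way a specialized override might look is sketched below. The example is self-contained and does not load Grubby: FakeAgent, BaseScraper, and ItemScraper are hypothetical stand-ins, the URL pattern is invented for illustration, and the agent is passed explicitly instead of defaulting to $grubby.

```ruby
# Hypothetical stand-ins; none of these classes are part of Grubby.
class FakeAgent
  attr_reader :requested

  def initialize
    @requested = []
  end

  # Records each requested URL and returns a placeholder "page".
  def get(url)
    @requested << url
    "<page for #{url}>"
  end
end

class BaseScraper
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Same shape as the default factory: fetch the resource, then instantiate.
  def self.scrape(url, agent)
    new(agent.get(url))
  end
end

# A specialized override that accepts a bare item ID and builds the URL,
# then delegates to the default factory via super.
class ItemScraper < BaseScraper
  def self.scrape(id, agent)
    super("http://example.com/items/#{id}", agent)
  end
end

agent = FakeAgent.new
item = ItemScraper.scrape(42, agent)
puts item.class              # => ItemScraper
puts agent.requested.inspect # => ["http://example.com/items/42"]
```

Because callers only ever use scrape, the subclass can change how its source is located without changing how it is invoked.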
    # File 'lib/grubby/scraper.rb', line 126
    def self.scrape(url, agent = $grubby)
      self.new(agent.get(url))
    end
.scrapes(field, **options) { ... } ⇒ void
This method returns an undefined value.
Defines an attribute reader method named by field. During #initialize, the given block is evaluated in the context of the scraper instance, and the attribute is set to the block’s return value.
By default, a nil return value from the block is treated as an error (FieldValueRequiredError). To prevent this behavior, specify optional: true.
The block may also be evaluated conditionally, based on another method’s return value, using the :if or :unless options.
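The options described above can be sketched as follows. To keep the example runnable without the gem, MiniScraper is a hypothetical, trimmed-down stand-in written for this illustration. It records errors but, unlike the real constructor, does not raise at the end of initialize, and its FieldValueRequiredError is assumed to descend from RuntimeError. ProductScraper and the Hash used as a source are also invented.

```ruby
# MiniScraper is a hypothetical stand-in for Grubby::Scraper, not the gem's code.
class MiniScraper
  # Assumption: descends from RuntimeError so the rescue below collects it.
  class FieldValueRequiredError < RuntimeError; end

  attr_reader :source, :errors

  def self.fields
    @fields ||= []
  end

  def self.scrapes(field, **options, &block)
    field = field.to_sym
    (fields << field).uniq!

    define_method(field) do
      unless @scraped.key?(field) || @errors.key?(field)
        begin
          skip = (options[:if] && !send(options[:if])) ||
                 (options[:unless] && send(options[:unless]))
          @scraped[field] = skip ? nil : instance_eval(&block)
          if @scraped[field].nil? && !skip && !options[:optional]
            raise FieldValueRequiredError, "#{field} is required"
          end
        rescue RuntimeError, IndexError => e
          @errors[field] = e
        end
      end
      @scraped[field]
    end
  end

  # Simplification: errors are recorded, but nothing is raised here.
  def initialize(source)
    @source = source
    @scraped = {}
    @errors = {}
    self.class.fields.each { |field| send(field) }
  end
end

class ProductScraper < MiniScraper
  scrapes(:name) { source[:name] }                            # required field
  scrapes(:subtitle, optional: true) { source[:subtitle] }    # nil is allowed
  scrapes(:sale_price, if: :on_sale?) { source[:sale_price] } # conditional

  def on_sale?
    source.key?(:sale_price)
  end
end

product = ProductScraper.new({ name: "Widget" })
puts product.name.inspect       # => "Widget"
puts product.subtitle.inspect   # => nil (optional)
puts product.sale_price.inspect # => nil (block skipped, on_sale? is false)

on_sale = ProductScraper.new({ name: "Widget", sale_price: 5 })
puts on_sale.sale_price.inspect # => 5

broken = ProductScraper.new({})
puts broken.errors.keys.inspect # => [:name]
```

Only the missing required field is recorded as an error; the optional and skipped fields are simply nil.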
    # File 'lib/grubby/scraper.rb', line 57
    def self.scrapes(field, **options, &block)
      field = field.to_sym
      (self.fields << field).uniq!

      define_method(field) do
        raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)

        if !@scraped.key?(field) && !@errors.key?(field)
          begin
            skip = (options[:if] && !self.send(options[:if])) ||
              (options[:unless] && self.send(options[:unless]))

            if skip
              @scraped[field] = nil
            else
              @scraped[field] = instance_eval(&block)
              if @scraped[field].nil?
                raise FieldValueRequiredError.new(field) unless options[:optional]
                $log.debug("#{self.class}##{field} is nil")
              end
            end
          rescue RuntimeError, IndexError => e
            @errors[field] = e
          end
        end

        if @errors.key?(field)
          raise FieldScrapeFailedError.new(field, @errors[field])
        else
          @scraped[field]
        end
      end
    end
Instance Method Details
#[](field) ⇒ Object
Returns the scraped value named by field.
    # File 'lib/grubby/scraper.rb', line 227
    def [](field)
      @scraped.fetch(field.to_sym)
    end
#to_h ⇒ Hash<Symbol, Object>
Returns all scraped values as a Hash.
    # File 'lib/grubby/scraper.rb', line 234
    def to_h
      @scraped.dup
    end
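The lookup behavior implied by the #[] and #to_h listings can be tried in isolation. The sketch below wraps a plain Hash in a hypothetical ScrapedValues class (not part of Grubby) that copies the two method bodies, so it runs without the gem. The field names are invented.

```ruby
# Hypothetical stand-in holding already-scraped values; not the gem's class.
class ScrapedValues
  def initialize(values)
    @scraped = values
  end

  # Same body as the #[] listing: fetch by Symbol key.
  def [](field)
    @scraped.fetch(field.to_sym)
  end

  # Same body as the #to_h listing: a shallow copy of the values.
  def to_h
    @scraped.dup
  end
end

values = ScrapedValues.new({ title: "Hello", author: nil })

puts values[:title].inspect   # => "Hello"
puts values["title"].inspect  # => "Hello" (String names are converted with to_sym)
puts values[:author].inspect  # => nil (the field exists, its value is nil)

# Hash#fetch raises KeyError for a name that was never scraped.
missing_error = nil
begin
  values[:missing]
rescue KeyError => e
  missing_error = e
end
puts missing_error.class      # => KeyError

# to_h returns a copy, so reassigning a key does not affect the original.
copy = values.to_h
copy[:title] = "changed"
puts values[:title].inspect   # => "Hello"
```

Since dup is a shallow copy, mutating a value object in place (rather than reassigning its key) would still be visible through both hashes.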