Class: Grubby
- Inherits: Mechanize (ancestry: Object → Mechanize → Grubby)
- Defined in: lib/grubby.rb
Defined Under Namespace
Classes: JsonParser, JsonScraper, PageScraper, Scraper
Constant Summary
- VERSION = GRUBBY_VERSION
Instance Attribute Summary
- #journal ⇒ Pathname?
  Journal file used to ensure only-once processing of resources by #singleton across multiple program runs.
- #time_between_requests ⇒ Integer, ...
  The enforced minimum amount of time to wait between requests, in seconds.
Instance Method Summary
- #get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ Mechanize::Page, ...
  Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result.
- #initialize(journal = nil) ⇒ Grubby (constructor)
  A new instance of Grubby.
- #ok?(uri, query_params = {}, headers = {}) ⇒ Boolean
  Calls #head and returns true if the result has response code “200”.
- #singleton(uri, purpose = "") {|resource| ... } ⇒ Boolean
  Ensures only-once processing of the resource indicated by uri for the specified purpose.
Constructor Details
#initialize(journal = nil) ⇒ Grubby
Returns a new instance of Grubby.
# File 'lib/grubby.rb', line 42

def initialize(journal = nil)
  super()

  # Prevent "memory leaks", and prevent mistakenly blank urls from
  # resolving. (Blank urls resolve as a path relative to the last
  # history entry. Without this setting, an erroneous `agent.get("")`
  # could sometimes successfully fetch a page.)
  self.max_history = 0

  # Prevent files of unforeseen content type from being buffered into
  # memory by default, in case they are very large. However, increase
  # the threshold for what is considered "large", to prevent
  # unnecessary writes to disk.
  #
  # References:
  # - http://docs.seattlerb.org/mechanize/Mechanize/PluggableParser.html
  # - http://docs.seattlerb.org/mechanize/Mechanize/Download.html
  # - http://docs.seattlerb.org/mechanize/Mechanize/File.html
  self.max_file_buffer = 1_000_000 # only applies to Mechanize::Download
  self.pluggable_parser.default = Mechanize::Download
  self.pluggable_parser["text/plain"] = Mechanize::File
  self.pluggable_parser["application/json"] = Grubby::JsonParser

  # Set up configurable rate limiting, and choose a reasonable default
  # rate limit.
  self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
  self.post_connect_hooks << Proc.new do |agent, uri, response, body|
    self.send(:mark_last_request_time, (Time.now unless response.code.to_s.start_with?("3")))
  end
  self.time_between_requests = 1.0

  self.journal = journal
end
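The rate-limiting setup above (pre- and post-connect hooks plus time_between_requests) can be sketched standalone. TinyAgent and its fetch method below are hypothetical stand-ins for Mechanize and #get; only the hook/sleep pattern mirrors the real code.

```ruby
# Minimal stand-in agent (TinyAgent and #fetch are hypothetical names)
# illustrating hook-based rate limiting: a pre-connect hook waits out
# the minimum delay, and a post-connect hook records the request time.
class TinyAgent
  attr_accessor :time_between_requests

  def initialize
    @last_request_time = nil
    @time_between_requests = 0.01
    @pre_connect_hooks = [proc { sleep_between_requests }]
    @post_connect_hooks = [proc { @last_request_time = Time.now }]
  end

  def fetch(uri)
    @pre_connect_hooks.each(&:call)
    response = "200 OK: #{uri}" # stand-in for a real HTTP request
    @post_connect_hooks.each(&:call)
    response
  end

  private

  # Wait out the remainder of the minimum delay. A Range value yields
  # a random delay per request.
  def sleep_between_requests
    return unless @last_request_time
    delay = @time_between_requests.is_a?(Range) ? rand(@time_between_requests) : @time_between_requests
    elapsed = Time.now - @last_request_time
    sleep(delay - elapsed) if elapsed < delay
  end
end

agent = TinyAgent.new
agent.time_between_requests = 0.05
agent.fetch("https://example.com/a")
agent.fetch("https://example.com/b") # waits ~0.05s after the first request
```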
Instance Attribute Details
#journal ⇒ Pathname?
Journal file used to ensure only-once processing of resources by #singleton across multiple program runs.
# File 'lib/grubby.rb', line 37

def journal
  @journal
end
#time_between_requests ⇒ Integer, ...
The enforced minimum amount of time to wait between requests, in seconds. If the value is a Range, a random number within the Range is chosen for each request.
# File 'lib/grubby.rb', line 31

def time_between_requests
  @time_between_requests
end
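As an illustration of the Range form of this setting: Ruby's Kernel#rand accepts a Range directly and returns a value within it, which is how a per-request random delay can be drawn. The bounds below are arbitrary.

```ruby
# Drawing a random per-request delay from a Range (illustrative bounds).
delay_setting = 1.0..3.0
delay = rand(delay_setting)
# delay is a Float between 1.0 and 3.0
```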
Instance Method Details
#get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ Mechanize::Page, ...
Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result. Rescues and logs Mechanize::ResponseCodeError failures for all but the last mirror.
# File 'lib/grubby.rb', line 131

def get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {})
  i = 0
  begin
    get(mirror_uris[i], parameters, referer, headers)
  rescue Mechanize::ResponseCodeError => e
    i += 1
    if i >= mirror_uris.length
      raise
    else
      $log.debug("Mirror failed (code #{e.response_code}): #{mirror_uris[i - 1]}")
      $log.debug("Try mirror: #{mirror_uris[i]}")
      retry
    end
  end
end
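The fallback loop can be sketched without Mechanize. fetch_mirrored and FetchError below are hypothetical stand-ins for #get_mirrored and Mechanize::ResponseCodeError; the try/rescue/retry shape is the same.

```ruby
# Sketch of the fallback pattern: try each source in order, re-raising
# only when the last one fails. FetchError and fetch_mirrored are
# hypothetical stand-ins, not the gem's code.
FetchError = Class.new(StandardError)

def fetch_mirrored(mirror_uris)
  i = 0
  begin
    yield mirror_uris[i]
  rescue FetchError
    i += 1
    raise if i >= mirror_uris.length # last mirror failed: re-raise
    retry # otherwise try the next mirror
  end
end

# First mirror fails, second succeeds:
result = fetch_mirrored(%w[http://bad.example http://good.example]) do |uri|
  raise FetchError if uri.include?("bad")
  "page from #{uri}"
end
# result == "page from http://good.example"
```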
#ok?(uri, query_params = {}, headers = {}) ⇒ Boolean
Calls #head and returns true if the result has response code “200”. Unlike #head, error response codes (e.g. “404”, “500”) do not cause a Mechanize::ResponseCodeError to be raised.
# File 'lib/grubby.rb', line 100

def ok?(uri, query_params = {}, headers = {})
  begin
    head(uri, query_params, headers).code == "200"
  rescue Mechanize::ResponseCodeError
    false
  end
end
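The rescue-to-boolean pattern can be shown standalone. head_stub, Response, and FetchError below are hypothetical stand-ins for #head, a Mechanize response object, and Mechanize::ResponseCodeError.

```ruby
# Converting an exception-raising status check into a boolean
# (all names here are illustrative stand-ins).
FetchError = Class.new(StandardError)
Response = Struct.new(:code)

def head_stub(uri)
  raise FetchError if uri.include?("missing") # error response codes raise
  Response.new("200")
end

def ok?(uri)
  head_stub(uri).code == "200"
rescue FetchError
  false
end

ok?("http://example.com/page")    # => true
ok?("http://example.com/missing") # => false
```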
#singleton(uri, purpose = "") {|resource| ... } ⇒ Boolean
Ensures only-once processing of the resource indicated by uri for the specified purpose. A list of previously-processed resource URIs and content hashes is maintained in the Grubby instance. The given block is called with the fetched resource only if the resource’s URI and the resource’s content hash have not been previously processed under the specified purpose.
# File 'lib/grubby.rb', line 177

def singleton(uri, purpose = "")
  series = []

  uri = uri.to_absolute_uri
  return if try_skip_singleton(uri, purpose, series)

  normalized_uri = normalize_uri(uri)
  return if try_skip_singleton(normalized_uri, purpose, series)

  $log.info("Fetch #{normalized_uri}")
  resource = get(normalized_uri)
  skip = try_skip_singleton(resource.uri, purpose, series) |
    try_skip_singleton("content hash: #{resource.content_hash}", purpose, series)

  yield resource unless skip

  CSV.open(journal, "a") do |csv|
    series.each{|singleton_key| csv << singleton_key }
  end if journal

  !skip
end
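The journal bookkeeping behind this behavior can be sketched standalone: each processed key (URI or content hash) is appended to a CSV journal along with its purpose, and a key already journaled is skipped, even across program runs. The Journal class and its singleton method below are illustrative names, not the gem's internals.

```ruby
require "csv"
require "tmpdir"

# Illustrative only-once bookkeeping (hypothetical Journal class):
# keys already journaled for a purpose are skipped on later calls,
# including calls made by a fresh instance reading the same file.
class Journal
  def initialize(path)
    @path = path
    # CSV reads empty fields back as nil, so normalize to strings.
    @seen = File.exist?(path) ? CSV.read(path).map { |row| row.map(&:to_s) } : []
  end

  # Yields only if (key, purpose) has not been journaled; returns true
  # when the block ran, false when the key was skipped.
  def singleton(key, purpose = "")
    entry = [key, purpose]
    return false if @seen.include?(entry)
    yield key
    @seen << entry
    CSV.open(@path, "a") { |csv| csv << entry }
    true
  end
end

path = File.join(Dir.mktmpdir, "journal.csv")
journal = Journal.new(path)
runs = []
journal.singleton("http://example.com/a") { |key| runs << key } # block runs
journal.singleton("http://example.com/a") { |key| runs << key } # skipped
# runs == ["http://example.com/a"]
```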