Class: Grubby

Inherits:
Mechanize
  • Object
show all
Defined in:
lib/grubby.rb

Defined Under Namespace

Classes: JsonParser, JsonScraper, PageScraper, Scraper

Constant Summary collapse

VERSION =
GRUBBY_VERSION

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(journal = nil) ⇒ Grubby

Returns a new instance of Grubby.

Parameters:

  • journal (Pathname, String) (defaults to: nil)

    Optional journal file used to ensure only-once processing of resources by #singleton across multiple program runs.



42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
# File 'lib/grubby.rb', line 42

# Configures the underlying Mechanize agent with Grubby's defaults:
# disabled history, disk-backed downloads for unrecognized content
# types, JSON parsing, and rate limiting between requests.
#
# @param journal [Pathname, String, nil]
#   optional journal file used to ensure only-once processing of
#   resources by #singleton across multiple program runs
def initialize(journal = nil)
  super()

  # Prevent "memory leaks", and prevent mistakenly blank urls from
  # resolving.  (Blank urls resolve as a path relative to the last
  # history entry.  Without this setting, an erroneous `agent.get("")`
  # could sometimes successfully fetch a page.)
  self.max_history = 0

  # Prevent files of unforeseen content type from being buffered into
  # memory by default, in case they are very large.  However, increase
  # the threshold for what is considered "large", to prevent
  # unnecessary writes to disk.
  #
  # References:
  #   - http://docs.seattlerb.org/mechanize/Mechanize/PluggableParser.html
  #   - http://docs.seattlerb.org/mechanize/Mechanize/Download.html
  #   - http://docs.seattlerb.org/mechanize/Mechanize/File.html
  self.max_file_buffer = 1_000_000 # only applies to Mechanize::Download
  self.pluggable_parser.default = Mechanize::Download
  self.pluggable_parser["text/plain"] = Mechanize::File
  self.pluggable_parser["application/json"] = Grubby::JsonParser

  # Set up configurable rate limiting, and choose a reasonable default
  # rate limit.  Procs (not lambdas) are used so the hooks tolerate
  # however many arguments Mechanize passes when invoking them.
  self.pre_connect_hooks << Proc.new{ self.send(:sleep_between_requests) }
  self.post_connect_hooks << Proc.new do |agent, uri, response, body|
    # For a 3xx (redirect) response, nil is passed instead of the
    # current time -- NOTE(review): presumably so the follow-up request
    # to the redirect target is not additionally delayed; confirm
    # against the private mark_last_request_time implementation.
    self.send(:mark_last_request_time, (Time.now unless response.code.to_s.start_with?("3")))
  end
  self.time_between_requests = 1.0

  self.journal = journal
end

Instance Attribute Details

#journal ⇒ Pathname?

Journal file used to ensure only-once processing of resources by #singleton across multiple program runs.

Returns:

  • (Pathname, nil)


37
38
39
# File 'lib/grubby.rb', line 37

# Journal file used to ensure only-once processing of resources by
# #singleton across multiple program runs.
#
# @return [Pathname, nil]
def journal
  @journal
end

#time_between_requests ⇒ Integer, ...

The enforced minimum amount of time to wait between requests, in seconds. If the value is a Range, a random number within the Range is chosen for each request.

Returns:

  • (Integer, Float, Range<Integer>, Range<Float>)


31
32
33
# File 'lib/grubby.rb', line 31

# The enforced minimum amount of time to wait between requests, in
# seconds.  If the value is a Range, a random number within the Range
# is chosen for each request.
#
# @return [Integer, Float, Range<Integer>, Range<Float>]
def time_between_requests
  @time_between_requests
end

Instance Method Details

#get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {}) ⇒ Mechanize::Page, ...

Calls #get with each of mirror_uris until a successful (“200 OK”) response is received, and returns that #get result. Rescues and logs Mechanize::ResponseCodeError failures for all but the last mirror.

Examples:

grubby = Grubby.new

urls = [
  "http://httpstat.us/404",
  "http://httpstat.us/500",
  "http://httpstat.us/200#foo",
  "http://httpstat.us/200#bar",
]

grubby.get_mirrored(urls).uri  # == URI("http://httpstat.us/200#foo")

grubby.get_mirrored(urls.take(2))  # raise Mechanize::ResponseCodeError

Parameters:

Returns:

Raises:

  • (Mechanize::ResponseCodeError)

    if all mirror_uris fail



131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
# File 'lib/grubby.rb', line 131

# Calls #get with each of +mirror_uris+ in turn until one responds
# successfully, and returns that #get result.  A failure
# (Mechanize::ResponseCodeError) on any mirror except the last is
# logged at debug level and the next mirror is tried.
#
# @param mirror_uris [Array<URI>, Array<String>] candidate locations of the resource
# @param parameters [Array] forwarded to #get
# @param referer [Object, nil] forwarded to #get
# @param headers [Hash] forwarded to #get
# @return [Mechanize::Page, Mechanize::File, Mechanize::Download]
#   whatever #get returns for the first successful mirror
# @raise [Mechanize::ResponseCodeError] if every mirror fails
def get_mirrored(mirror_uris, parameters = [], referer = nil, headers = {})
  index = 0
  begin
    get(mirror_uris[index], parameters, referer, headers)
  rescue Mechanize::ResponseCodeError => error
    index += 1
    # Out of mirrors: re-raise the last error to the caller.
    raise if index >= mirror_uris.length
    $log.debug("Mirror failed (code #{error.response_code}): #{mirror_uris[index - 1]}")
    $log.debug("Try mirror: #{mirror_uris[index]}")
    retry
  end
end

#ok?(uri, query_params = {}, headers = {}) ⇒ Boolean

Calls #head and returns true if the result has response code “200”. Unlike #head, error response codes (e.g. “404”, “500”) do not cause a Mechanize::ResponseCodeError to be raised.

Parameters:

Returns:

  • (Boolean)


100
101
102
103
104
105
106
# File 'lib/grubby.rb', line 100

# Calls #head and reports whether the response code was exactly "200".
# Unlike #head, error response codes (e.g. "404", "500") do not raise
# Mechanize::ResponseCodeError -- they simply yield +false+.
#
# @param uri [URI, String] forwarded to #head
# @param query_params [Hash] forwarded to #head
# @param headers [Hash] forwarded to #head
# @return [Boolean]
def ok?(uri, query_params = {}, headers = {})
  response = head(uri, query_params, headers)
  response.code == "200"
rescue Mechanize::ResponseCodeError
  false
end

#singleton(uri, purpose = "") {|resource| ... } ⇒ Boolean

Ensures only-once processing of the resource indicated by uri for the specified purpose. A list of previously-processed resource URIs and content hashes is maintained in the Grubby instance. The given block is called with the fetched resource only if the resource’s URI and the resource’s content hash have not been previously processed under the specified purpose.

Examples:

grubby = Grubby.new

grubby.singleton("https://example.com/foo") do |page|
  # will be executed (first time "/foo")
end

grubby.singleton("https://example.com/foo#bar") do |page|
  # will be skipped (already seen "/foo")
end

grubby.singleton("https://example.com/foo", "again!") do |page|
  # will be executed (new purpose for "/foo")
end

Parameters:

Yields:

  • (resource)

Yield Parameters:

Returns:

  • (Boolean)

    whether the given block was called

Raises:

  • (Mechanize::ResponseCodeError)

    if fetching the resource results in error (see Mechanize#get)



177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
# File 'lib/grubby.rb', line 177

# Ensures only-once processing of the resource indicated by +uri+ for
# the given +purpose+: the block is called with the fetched resource
# only if neither the resource's URI nor its content hash has been
# seen before under that purpose.  When a journal file is set, the
# newly-recorded keys are appended to it so the check also holds
# across program runs.
#
# @param uri [URI, String] resource to fetch -- anything responding to
#   +to_absolute_uri+ (helper not visible here; TODO confirm accepted types)
# @param purpose [String] namespace for the only-once check
# @yieldparam resource [Object] the result of #get for the normalized URI
# @return [Boolean, nil] truthy iff the block was called: +nil+ when
#   skipped before fetching, +false+ when skipped after fetching,
#   +true+ when processed
# @raise [Mechanize::ResponseCodeError] if fetching the resource fails
def singleton(uri, purpose = "")
  # Keys newly recorded during this call -- NOTE(review): presumably
  # appended by try_skip_singleton (not visible here); these are the
  # rows written to the journal below.
  series = []

  # Cheap checks first: skip without fetching if the raw or normalized
  # URI was already processed.  (These early returns yield nil.)
  uri = uri.to_absolute_uri
  return if try_skip_singleton(uri, purpose, series)

  normalized_uri = normalize_uri(uri)
  return if try_skip_singleton(normalized_uri, purpose, series)

  $log.info("Fetch #{normalized_uri}")
  resource = get(normalized_uri)
  # Non-short-circuiting `|` is deliberate: both the final (possibly
  # redirected) URI and the content hash must be checked -- and thereby
  # recorded in `series` -- even if the first check already matched.
  skip = try_skip_singleton(resource.uri, purpose, series) |
    try_skip_singleton("content hash: #{resource.content_hash}", purpose, series)

  yield resource unless skip

  # Persist the newly-seen keys (one CSV row each) so future runs that
  # share this journal file will skip the resource too.
  CSV.open(journal, "a") do |csv|
    series.each{|singleton_key| csv << singleton_key }
  end if journal

  !skip
end