Module: Unbreakable

Defined in:
lib/unbreakable.rb,
lib/unbreakable/scraper.rb,
lib/unbreakable/version.rb,
lib/unbreakable/observers/log.rb,
lib/unbreakable/decorators/timeout.rb,
lib/unbreakable/observers/observer.rb,
lib/unbreakable/processors/transform.rb,
lib/unbreakable/data_storage/file_data_store.rb

Overview

When using this gem, you start by defining a Scraper with methods for retrieving and processing data. Retrieved data is stored in a DataStorage backend; the gem currently provides only a FileDataStore. You may enhance a datastore with Decorators and Observers: for example, a Timeout decorator that retries on timeout with exponential backoff, and a Log observer that logs retrieval progress. You must also define a Processor to turn the raw data into machine-readable data.
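To make "retry on timeout with exponential backoff" concrete, here is a minimal standalone sketch of the idea. It is not the gem's Timeout decorator: the method name with_backoff and the parameters max_attempts, base_delay and sleeper are hypothetical, chosen only for illustration.

```ruby
require 'timeout'

# Hypothetical helper, NOT part of Unbreakable's API.
# Runs the block, and if it raises Timeout::Error, waits and tries again.
# The wait doubles after every failed attempt: base, 2*base, 4*base, ...
def with_backoff(max_attempts: 4, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Timeout::Error
    # Give up and re-raise once the attempt budget is used up.
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: simulate two timeouts followed by a successful retrieval.
attempts_seen = []
result = with_backoff(base_delay: 0.01) do |attempt|
  attempts_seen << attempt
  raise Timeout::Error if attempt < 3
  "document body"
end
```

The sleeper parameter exists only so the waiting can be stubbed out in tests; a real decorator would presumably wrap the datastore's read and write calls rather than an arbitrary block.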

A simple skeleton scraper:

require 'unbreakable'

class MyScraper < Unbreakable::Scraper
  def retrieve(args)
    # download all the documents
  end
  def processable
    # return a list of documents to process
  end
end

class MyProcessor < Unbreakable::Processors::Transform
  def perform
    # return the transformed record as a hash, array, etc.
  end
  def persist(arg)
    # store the hash/array/etc. in Mongo, MySQL, YAML, etc.
  end
end

scraper = MyScraper.new
scraper.processor.register MyProcessor
scraper.configure do |c|
  # configure the scraper
end
scraper.run(ARGV)

Every scraper script can run as a command-line script. Try it!

$ ruby myscraper.rb
usage: irb [options] <command> [<args>]

The most commonly used commands are:
    retrieve  Cache remote files to the datastore for later processing
    process   Process cached files into machine-readable data
    config    Print the current configuration

Specific options:
        --root_path ARG              default "/var/tmp/unbreakable"
        --[no-]store_meta            default true
        --cache_duration ARG         default 31536000
        --fallback_mime_type ARG     default "application/octet-stream"
        --secret ARG                 default "secret yo"
        --[no-]trust_file_extensions default true

General options:
    -h, --help                       Display this screen
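
The option listing above resembles what Ruby's standard OptionParser accepts, including the --[no-] prefix for boolean switches. The sketch below assumes that style of parsing; it is not taken from the gem's source, and the parse_options method and its defaults hash are hypothetical names that copy three of the defaults shown above (31536000 seconds is 365 days).

```ruby
require 'optparse'

# Hypothetical helper, NOT the gem's own option handling.
# Returns [options, remaining_args], where remaining_args holds the
# non-option words such as the command name.
def parse_options(argv)
  # Defaults copied from the usage listing above.
  options = {
    root_path: '/var/tmp/unbreakable',
    store_meta: true,
    cache_duration: 31_536_000
  }

  parser = OptionParser.new do |opts|
    opts.on('--root_path ARG', String) { |v| options[:root_path] = v }
    # A --[no-] switch yields true for --store_meta, false for --no-store_meta.
    opts.on('--[no-]store_meta') { |v| options[:store_meta] = v }
    # Integer makes OptionParser convert the argument before yielding it.
    opts.on('--cache_duration ARG', Integer) { |v| options[:cache_duration] = v }
  end

  # permute allows options and commands to appear in any order.
  rest = parser.permute(argv)
  [options, rest]
end

options, rest = parse_options(%w[retrieve --no-store_meta --cache_duration 60])
```

Here rest holds the command word (retrieve), while options carries the overridden store_meta and cache_duration values alongside the untouched root_path default.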

Defined Under Namespace

Modules: DataStorage, Decorators, Observers, Processors
Classes: InvalidRemoteFile, Scraper, UnbreakableError

Constant Summary

VERSION = "0.0.5"