Extraloop Redis Storage
Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library that makes it easy to store structured data in Redis. It includes a convenient command line tool for listing, filtering, and deleting harvested datasets, as well as exporting them to local files or remote data stores (i.e. Google Fusion Tables).
gem install extraloop-redis-storage
ExtraLoop's Redis storage module decorates
ExtraLoop::IterativeScraper instances with the
set_storage method, a helper that specifies how
the scraped data should be stored.
require "extraloop/redis-storage"

class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  def validate
    assert (0..5).include?(rank.to_i), "Rank not in range"
  end
end

scraper = AmazonReviewScraper.new("0262560992")
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")
  .run()
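The range check inside `validate` relies on `rank.to_i`, which coerces any non-numeric value to 0. The following is a minimal plain-Ruby sketch of that behaviour. It does not use ExtraLoop or Ohm, and both `rank_in_range?` and `strict_rank_in_range?` are hypothetical helpers introduced only for illustration.

```ruby
# Plain-Ruby stand-in for the range check used in AmazonReview#validate.
# Both helpers are hypothetical; neither is part of ExtraLoop or Ohm.

# Mirrors the check in the example: coerce with to_i, then test the range.
def rank_in_range?(rank)
  (0..5).include?(rank.to_i)
end

# Stricter variant: values that are not base-10 integers are rejected
# instead of being silently coerced to 0.
def strict_rank_in_range?(rank)
  value = Integer(rank.to_s, 10, exception: false)
  !value.nil? && (0..5).include?(value)
end

puts rank_in_range?("4")           # rank inside 0..5
puts rank_in_range?("7")           # rank outside 0..5
puts rank_in_range?("n/a")         # "n/a".to_i is 0, so this also passes
puts strict_rank_in_range?("n/a")  # rejected by the stricter variant
```

If scraped ranks may contain placeholder text, a stricter check along these lines could be dropped into `validate` in place of the `to_i` comparison.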
At each scraper run, the ExtraLoop storage module internally instantiates a
scraping session and associates the extracted records with it. The `AmazonReview`
records just created can then be accessed by calling the `#records` method on
the scraper's session:
reviews = scraper.session.records
The set_storage method accepts the following arguments:
model A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist and the storage module dynamically generates one by subclassing
session_title A human readable title for the extracted dataset (optional).
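The symbol-versus-constant handling of the model argument can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the gem's actual implementation: `BaseRecord`, the `Models` namespace, and `resolve_model` are hypothetical names, and reusing an already generated class on a second call is a choice made for this sketch.

```ruby
# Hypothetical stand-in for the record base class; not the real ExtraLoop API.
class BaseRecord; end

# Hypothetical namespace that holds dynamically generated models.
module Models; end

# Returns a model class. A class is used as is; a symbol yields a new
# subclass of BaseRecord registered under Models (or the class generated
# by an earlier call with the same symbol).
def resolve_model(model, namespace = Models)
  return model if model.is_a?(Class)

  name = model.to_s
  return namespace.const_get(name, false) if namespace.const_defined?(name, false)

  namespace.const_set(name, Class.new(BaseRecord))
end

class ExistingReview < BaseRecord; end

generated = resolve_model(:GoogleNewsStory)

puts resolve_model(ExistingReview).name  # the class that was passed in
puts generated.name                      # constant created from the symbol
puts generated.superclass.name           # generated class inherits from the base
```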
Command line interface
Once installed, the gem also adds the extraloop executable to your
system path: a command line interface to the datasets
harvested through ExtraLoop. A list of datasets can be obtained by running:
extraloop datastore list
This will generate a table like the following one:
id | title                              | model           | records
--------------------------------------------------------------------
48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110
49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0
51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110
52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10
Datasets can be removed using the delete command:
extraloop datastore delete [id]
id is either a single scraping session id, or a session
id range (e.g. 48..52).
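As a rough sketch of how an id argument of this shape might be interpreted, the plain-Ruby helper below accepts either a single id or a `first..last` range and returns the matching session ids. `parse_session_ids` is a hypothetical name, and normalizing a reversed range such as 51..48 is an assumption of this sketch rather than documented behaviour of the extraloop executable.

```ruby
# Hypothetical helper: turns a command line id argument into session ids.
# Accepts "48" (single id) or "48..52" (inclusive range).
def parse_session_ids(arg)
  text = arg.to_s.strip

  if (match = text.match(/\A(\d+)\.\.(\d+)\z/))
    # Assumption: a reversed range is normalized rather than rejected.
    first, last = [match[1].to_i, match[2].to_i].minmax
    (first..last).to_a
  elsif text.match?(/\A\d+\z/)
    [text.to_i]
  else
    raise ArgumentError, "invalid session id or range: #{arg.inspect}"
  end
end

p parse_session_ids("48..52")  # => [48, 49, 50, 51, 52]
p parse_session_ids("51")      # => [51]
```

Note that a range may cover ids that no longer exist (50 is missing from the table above), so any code consuming the list would need to skip unknown sessions.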
From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:
extraloop datastore export 51..52 -f csv
Similarly, stored datasets can be uploaded to a remote datastore:
extraloop datastore push 51..48 fusion_tables -c google_username:password