Description

Feedtosis fetches RSS and Atom feeds with an easy-to-use interface. It uses FeedNormalizer for parsing, and Curb for fetching. It helps by automatically using conditional HTTP GET requests as well as by reliably pointing out which entries are new in any given feed.

Feedtosis is designed to help you with book-keeping about feed fetching details so that things like using HTTP conditional GET are trivial. It has a simple interface, and remains a lightweight component that delegates to FeedNormalizer for parsing feeds and the fantastic taf2-curb library for fetching feeds.

Installation

Assuming that you've followed the directions on gems.github.com to allow your computer to install gems from GitHub, the following command will install the Feedtosis library:

sudo gem install jsl-feedtosis

Usage

Feedtosis is easy to use. Just create a client object, and invoke the “fetch” method:

require 'feedtosis' 
client = Feedtosis::Client.new('http://feeds.feedburner.com/wooster') 
result = client.fetch

result will be a Feedtosis::Result object which delegates methods to the FeedNormalizer::Feed object as well as the Curl::Easy object used to fetch the feed. Useful methods on this object include entries, new_entries and response_code among many others (basically all of the methods that FeedNormalizer::Feed and Curl::Easy objects respond to are implemented and can be called directly, minus the setter methods for these objects).

Note that since Feedtosis uses HTTP conditional GET, it may not actually have received a full XML response from the server suitable for being parsed into entries. In this case, methods such as entries on the Feedtosis::Result will return nil. Depending on your application logic, you may want to inspect the methods that are delegated to the Curl::Easy object, such as response_code, for more information on what happened in these cases.

Remember that a response code of 304 means “Not Modified”. In this case, you should expect “entries” and “new_entries” to be nil, since the resource wasn't downloaded according to the logic of HTTP conditional GET.

On subsequent requests of a particular resource, Feedtosis will update new_entries to contain the feed entries that we haven't seen yet. In most applications, your program will probably call the same batch of URLS multiple times, and process the elements in new_entries.

You will most likely want to allow Feedtosis to remember details about the last retrieval of a feed after the client is removed from memory. Feedtosis uses Moneta, a unified interface to key-value storage systems to remember “summaries” of feeds that it has seen in the past. See the document section on Customization for more details on how to configure this system.

Customization

Feedtosis stores summaries of feeds in a key-value storage system. If no options are included when creating a new Feedtosis::Client object, the default is to use a “memory” storage system. The memory system is just a basic ruby Hash, so it won't keep track of feeds after a particular Client is removed from memory. To configure a different backend, pass an options hash to the Feedtosis client initialization:

url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/south_asia/rss.xml" 
f = Feedtosis::Client.new(url, :backend => Moneta::Memcache.new(:server => 'localhost:1978')) 
res = f.fetch

This example sets up a Memcache backend, which in this case points to Tokyo Tyrant on port 1978.

Generally, Feedtosis supports all systems supported by Moneta, and any one of the supported systems can be given to the moneta_klass parameter. Other options following backend are passed directly to Moneta for configuration.

Implementation

Feedtosis helps to identify new feed entries and to figure out when conditional GET can be used in retrieving resources. In order to accomplish this without having to require that the user store information such as etags and dates of the last retrieved entry, Feedtosis stores a summary structure in the configured key-value store (backed by Moneta). In order to do conditional GET requests, Feedtosis stores the Last-Modified date, as well as the ETag of the last request in the summary structure, which is put in a namespaced element consisting of the term 'Feedtosis' (bet you won't have to worry about name collisions on that one!) and the MD5 of the URL retrieved.

It can also be a bit tricky to decipher which feed entries are new since many feed sources don't include unique ids with their feeds. Feedtosis reliably keeps track of which entries in a feed are new by storing (in the summary hash mentioned above) an MD5 signature of each entry in a feed. It takes elements such as the published-at date, title and content and generates the MD5 of these elements. This allows Feedtosis to cheaply compute (both in terms of computation and storage) which feed entries should be presented to the user as “new”. Below is an example of a summary structure:

{ 
  :etag => "4c8f-46ac09fbbe940", 
  :last_modified => "Mon, 25 May 2009 18:17:33 GMT",
  :digests => [["f2993783ded928637ce5f2dc2d837f10", "da64efa6dd9ce34e5699b9efe73a37a7"]]
}

The data stored by Feedtosis in the summary structure allows it to be helpful to the user without storing lots of data that are unnecessary for efficient functioning.

The summary structure keeps an Array of Arrays containing digests of feeds. The reason for this is that some feeds, such as the Google blog search feeds, contain slightly different but often-recurring results in the result set. Feedtosis keeps complete sets of entry digests for previous feed retrievals. The number of digest sets that will be kept is configurable by setting the option :retained_digest_size on Feedtosis client initialization.

HTML cleaning/sanitizing

Feedtosis doesn't do anything about feed sanitizing, as other libraries have been built for this purpose. FeedNormalizer has methods for escaping entries, but to strip HTML I suggest that you look at the Ruby gem “sanitize”.

Credits

Thanks to Sander Hartlage (GitHub: Sander6) for useful feedback early in the development of Feedtosis.

Feedback

Please let me know if you have any problems with or questions about Feedtosis.

Author

Justin S. Leitgeb, justin@phq.org