Feedzirra

Description

Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb gem for faster http gets, and libxml through nokogiri and sax-machine for faster parsing.

It allows for easy customization of feed parsing options through the definition of custom parsing classes, and allows you to take as little or as much control as you want in updating feeds. Feedzirra makes it easy to figure out which content in feeds is new by storing the previous retrieval of a feed in a key-value store. Feedzirra uses the “moneta” gem, a unified interface to key-value storage systems, to provide access to many different types of stores depending on your requirements.

Installation

For now Feedzirra exists only on github. It also has a few gem requirements that are only on github. Before you start you need to have libcurl and libxml installed. If you’re on Leopard you have both. Otherwise, you’ll need to grab them. Once you’ve got those libraries, you should be able to get up and running with the standard github gem install routine:

gem sources -a http://gems.github.com # if you haven't already
gem install jsl-feedzirra

Usage

This experimental branch offers a new interface to feed fetching with persistent back-end stores. This allows you to easily run a script retrieving the feeds once per hour or once per day, and it will remember which feeds have been seen before and which are new. This feature uses the Feedzirra::Reader interface.

After loading the library with require 'feedzirra', you can create a Feedzirra::Reader object as follows:

reader = Feedzirra::Reader.new('www.woostercollective.com/rss/index.xml')
feed = reader.fetch

The Reader object can take a single URL or a list of URLs followed by a Hash of options. The options hash allows configuration of the backend store, as well as fetching options for the list of URLs. The following example configures the Memcache store connected to Tokyo Tyrant (the front-end for Tokyo Cabinet):

reader = Feedzirra::Reader.new('www.pauldix.net/atom.xml',
  :backend => { :moneta_klass => 'Moneta::Memcache', :server => 'localhost:1978' })

Other options that may be put in the options hash follow the original API described below.

Running reader.fetch first checks the back-end store to see whether this feed was fetched previously. If it was, Feedzirra uses the stored etag to avoid downloading the whole body again when the feed has not changed. If the feed has been updated, new_entries is populated based on the results of the last fetch. The back-end store is updated with the results of every fetch, so Feedzirra maintains state between executions. Feedzirra currently supports filesystem, memcache, and Ruby Hash back ends; the Hash-based back end does not attempt to persist any information.
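
To make the idea of remembering the previous fetch concrete, here is a minimal sketch of how new entries could be detected with a key-value store. It is not Feedzirra's implementation: SeenTracker, Entry, and the example.com URLs are hypothetical names invented for illustration, and a plain Ruby Hash stands in for the back end.

```ruby
# Hypothetical sketch, NOT Feedzirra's internals.
# Entry and SeenTracker are illustrative names; a Hash stands in for the
# key-value back end.
Entry = Struct.new(:url, :title)

class SeenTracker
  def initialize(store = {})
    @store = store
  end

  # Returns the entries whose URLs were not recorded by the previous call
  # for this feed, then records the current set of entry URLs.
  def new_entries(feed_url, entries)
    seen = @store[feed_url] || []
    fresh = entries.reject { |entry| seen.include?(entry.url) }
    @store[feed_url] = entries.map(&:url)
    fresh
  end
end

tracker  = SeenTracker.new
feed_url = "http://example.com/feed.xml" # placeholder URL

first_fetch  = [Entry.new("http://example.com/a", "A")]
second_fetch = first_fetch + [Entry.new("http://example.com/b", "B")]

first_new  = tracker.new_entries(feed_url, first_fetch)
second_new = tracker.new_entries(feed_url, second_fetch)

puts first_new.map(&:title).inspect  # => ["A"]
puts second_new.map(&:title).inspect # => ["B"]
```

Because the tracker only reads and writes through [] and []=, any store with a Hash-like interface could be passed to the constructor in place of the Hash.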

Once you’ve retrieved a single feed, you can use the accessors below to query the results.

# feed and entries accessors
feed.title          # => "Paul Dix Explains Nothing"
feed.url            # => "http://www.pauldix.net"
feed.feed_url       # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
feed.etag           # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
feed.last_modified  # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object

entry = feed.entries.first
entry.title      # => "Ruby Http Client Library Performance"
entry.url        # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
entry.author     # => "Paul Dix"
entry.summary    # => "..."
entry.content    # => "..."
entry.published  # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
entry.categories # => ["...", "..."]

# sanitizing an entry's content
entry.title.sanitize   # => returns the title with harmful stuff escaped
entry.author.sanitize  # => returns the author with harmful stuff escaped
entry.content.sanitize # => returns the content with harmful stuff escaped
entry.content.sanitize! # => returns content with harmful stuff escaped and replaces original (also exists for author and title)
entry.sanitize!         # => sanitizes the entry's title, author, and content in place (as in, it changes the value to clean versions)
feed.sanitize_entries!  # => sanitizes all entries in place

# updating a single feed
updated_feed = Feedzirra::Feed.update(feed)

# an updated feed has the following extra accessors
updated_feed.updated?     # returns true if any of the feed attributes have been modified. returns false if only new entries were added
updated_feed.new_entries  # a collection of the entry objects that are newer than the latest in the feed before update

# fetching multiple feeds
feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
feeds = Feedzirra::Reader.new(feed_urls).fetch

# feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error was thrown
# there will be a Fixnum of the http response code instead of a feed object

# updating multiple feeds.  if you're using a persistent back-end, Feedzirra uses that to determine which entries are ones that you haven't seen before
updated_feeds = Feedzirra::Reader.new(feed_urls).fetch

# defining custom behavior on failure or success. note that a return status of 304 (not updated) will call the on_success handler
feed = Feedzirra::Reader.new("http://feeds.feedburner.com/PaulDixExplainsNothing",
    :on_success => lambda {|feed| puts feed.title },
    :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body }).fetch

# if a collection was passed into the initializer, the handlers will be called for each one
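
The handler behavior described above (a 304 counts as success, and the handlers run once per feed) can be sketched in plain Ruby. This is an illustration only, not Feedzirra's code: Response and dispatch are hypothetical names, the responses are hand-built rather than fetched, and the handler arguments simply mirror the example above.

```ruby
# Hypothetical sketch, NOT Feedzirra's internals.
# Response and dispatch are illustrative names; no network access happens.
Response = Struct.new(:url, :code, :header, :body)

# Calls :on_success for 2xx and 304 responses, :on_failure for anything else.
def dispatch(response, handlers)
  if (200..299).cover?(response.code) || response.code == 304
    handlers[:on_success].call(response.body) if handlers[:on_success]
  elsif handlers[:on_failure]
    handlers[:on_failure].call(response.url, response.code,
                               response.header, response.body)
  end
end

log = []
handlers = {
  :on_success => lambda { |feed| log << [:ok, feed] },
  :on_failure => lambda { |url, code, header, body| log << [:failed, url, code] }
}

responses = [
  Response.new("http://example.com/a.xml", 200, "", "feed a"),
  Response.new("http://example.com/b.xml", 304, "", "feed b (not modified)"),
  Response.new("http://example.com/c.xml", 404, "", "not found")
]

# With a collection, each response triggers exactly one handler.
responses.each { |response| dispatch(response, handlers) }

puts log.inspect
```

In this sketch the success handler receives only the body as a stand-in for a parsed feed object.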

Discussion

I’d like feedback on the api and any bugs encountered on feeds in the wild. I’ve set up a google group here.

Troubleshooting Installation

NOTE: Some people have been reporting a few issues related to installation. First, the Ruby Forge version of curb is not what you want. It will not work. Nor will the curl-multi gem that lives on Ruby Forge. You have to get the taf2-curb fork installed.

If you see this error when doing a require:

/Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- curb_core (LoadError)

It means that the taf2-curb gem didn’t build correctly. To resolve this you can do a git clone git://github.com/taf2/curb.git then run rake gem in the curb directory, then sudo gem install pkg/curb-0.2.4.0.gem. After that you should be good.

If you see something like this when trying to run it:

NoMethodError: undefined method `on_success' for #<Curl::Easy:0x1182724>
    from ./lib/feedzirra/feed.rb:88:in `add_url_to_multi'

This means that you are requiring curl-multi or the Ruby Forge version of Curb somewhere. You can’t use those and need to get the taf2 version up and running.

If you’re on Debian or Ubuntu and getting errors while trying to install the taf2-curb gem, it could be because you don’t have the latest version of libcurl installed. Do this to fix:

sudo apt-get install libcurl4-gnutls-dev

Another problem could be if you are running MacPorts and you have libcurl installed through there. You need to uninstall it for curb to work! The version in MacPorts is old and doesn’t play nice with curb. If you’re running Leopard, you can just uninstall and you should be golden. If you’re on an older version of OS X, you’ll then need to download curl and build from source. Then you’ll have to install the taf2-curb gem again. You might have to perform the step above.

If you’re still having issues, please let me know on the mailing list. Also, Todd Fisher (taf2) is working on fixing the gem install. Please send him a full error report.

TODO

This thing needs to hammer on many different feeds in the wild. I’m sure there will be bugs. I want to find them and crush them. I didn’t bother using the test suite for feedparser. I wanted to start fresh.

Here are some more specific TODOs.

Make a feedzirra-rails gem to integrate feedzirra seamlessly with Rails and ActiveRecord.
Add support for authenticated feeds.
Create a super sweet DSL for defining new parsers.
Test against Ruby 1.9.1 and fix any bugs.
I’m not keeping track of modified on entries. Should I add this?
Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?

LICENSE

This library is provided under the MIT License. See the complete LICENSE in LICENSE.rdoc for details.