space-saver-redis

This gem is a pure Ruby implementation of Metwally, Agrawal, and Abbadi's SpaceSaver algorithm for estimating the top K elements in a data stream. A Redis instance is used for storage.

Here's an example:

require 'redis'
require 'space-saver-redis'

# Estimate the top 10 most frequent items seen in a data stream
space_saver = SpaceSaver.new(Redis.new, 10)

urls_visited.each { |url| space_saver.increment("urls", url) }

After the above code executes, you can query space_saver.leaders("urls") to get an estimate of the top 10 most frequent URLs visited along with their estimated counts. The SpaceSaver instance uses only a Redis sorted set with at most K elements (10, in this case) at any time to make this estimation.

Obviously, since the data structure uses only a small, fixed amount of space, there are some data distributions that can cause the top K elements returned to be completely incorrect, but for a lot of data distributions the results are worth the savings in space. In particular, for a SpaceSaver instance initialized with parameter K that observes a data stream of N items, any item that occurs more than N/K times is guaranteed to be in the list of estimated leaders.

One way to cope with the error involved in this kind of estimation is to use a K bigger than you actually need and then truncate the number of leaders returned at query time. You can pass an additional parameter to the call to leaders to do this, for example space_saver.leaders("urls", 3) will return only the top 3 of the 10 estimated most frequent items.

Installation

gem install space-saver-redis