frequent-algorithm

Web site usage, social network behavior and Internet traffic are examples of systems that appear to follow the Power law, where most of the events are due to the actions of a very small few. Knowing at any given point in time which items are trending is valuable in understanding the system.

frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows. Please refer to Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, by Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).

License Build Status Gem Version

Introduction

Challenges

Challenges for Real-time processing of data streams for frequent item queries include:

  • data may be of unknown and possibly unbound length
  • data may be arriving a very fast rate
  • it might not be possible to go back and re-read the data
  • too large a window of observation may include stale data

Therefore, a solution should have the following characteristics:

  • uses limited memory
  • can process events in the stream in Ο(1) constant time
  • requires only a single-pass over the data

The algorithm

LOOP

  1. For each element e in the next b elements:
        If a local counter exists for the type of element e:
          Increment the local counter.
        Otherwise:
          Create a new local counter for this element type
          and set it equal to 1.
  2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.
  3. Delete all local counters
  4. For each type named in S:
        If a global counter exists for this type:
          Add to it the count recorded in S.
        Otherwise:
          Create a new global counter for this element type
          and set it equal to the count recorded in S.
  5. Add the count of the kth largest type in S to δ.
  6. If sizeOf(Q) > N/b:
        (a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.
        (b) For all element types named in S':
          Subtract from their global counters the counts
          recorded in S'
          If a counter is decremented to zero:
            Delete it.
        (c) Output the identity and value of each global counter > δ.

Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003

Usage

require 'frequent-algorithm'

# data is pi to 1000 digits
pi = File.read('test/frequent/test_data_pi').strip
data = pi.scan(/./).each_slice(b)

N = 100  # size of main window
b =  20  # size of basic window
k =   3  # we are interested in top-3 numerals in pi

alg = Frequent::Algorithm.new(N, b, k) 

# read in and process the 1st basic window
alg.process(data.next)

# and the top-3 numerals are?
top3 = alg.statistics.report
puts top3

# lather, rinse and repeat
alg.process(data.next)

Development

The development of this gem requires the following:

Building, testing and release of this rubygem uses the following rake commands:

rake clean    # Remove any temporary products
rake clobber  # Remove any generated file
rake test     # Execute unit tests
rake build    # Build frequent-algorithm-n.n.n.gem into the pkg directory
rake install  # Build and install frequent-algorithm-n.n.n.gem into system gems
rake release  # Create tag vn.n.n and build and push
              # frequent-algorithm-n.n.n.gem to Rubygems

Documentation

frequent-algorithm uses yard and rdiscount for Markdown documentation. Check out Getting Started with Yard.

Unit Testing

frequent-algorithm uses MiniTest::Unit for unit testing.

Releasing

Please refer to Publishing To Rubygems.org in the Rubygems Guide.

Contributing

  1. Fork it
  2. Begin work on dev-branch (git fetch && git checkout dev-branch)
  3. Create your feature branch (git branch my-new-feature && git checkout my-new-feature)
  4. Commit your changes (git commit -am 'Add some feature')
  5. Push to the branch (git push origin my-new-feature:dev-branch)
  6. Create new Pull Request

You may wish to read the Git book online.

Changelog

Please see the CHANGELOG for this gem's release history.

License

frequent-algorithm is provided under the terms of the MIT license.

Copyright © 2015, Willie Tong & Brooke M. Fujita. All rights reserved. Please see the LICENSE file for further details.