frequent-algorithm
Web site usage, social network behavior and Internet traffic are examples of systems that appear to follow the Power law, where most of the events are due to the actions of a very small few. Knowing at any given point in time which items are trending is valuable in understanding the system.
frequent-algorithm
is a Ruby implementation of the FREQUENT algorithm
for identifying frequent items in a data stream in sliding windows.
Please refer to Identifying Frequent Items in Sliding Windows over On-Line
Packet Streams, by
Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).
Introduction
Challenges
Challenges for Real-time processing of data streams for frequent item queries include:
- data may be of unknown and possibly unbound length
- data may be arriving a very fast rate
- it might not be possible to go back and re-read the data
- too large a window of observation may include stale data
Therefore, a solution should have the following characteristics:
- uses limited memory
- can process events in the stream in Ο(1) constant time
- requires only a single-pass over the data
The algorithm
LOOP
- For each element e in the next b elements:
If a local counter exists for the type of element e:
Increment the local counter.
Otherwise:
Create a new local counter for this element type
and set it equal to 1.- Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.
- Delete all local counters
- For each type named in S:
If a global counter exists for this type:
Add to it the count recorded in S.
Otherwise:
Create a new global counter for this element type
and set it equal to the count recorded in S.- Add the count of the kth largest type in S to δ.
- If sizeOf(Q) > N/b:
(a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.
(b) For all element types named in S':
Subtract from their global counters the counts
recorded in S'
If a counter is decremented to zero:
Delete it.
(c) Output the identity and value of each global counter > δ.— Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003
Usage
require 'frequent-algorithm'
# data is pi to 1000 digits
pi = File.read('test/frequent/test_data_pi').strip
data = pi.scan(/./).each_slice(b)
N = 100 # size of main window
b = 20 # size of basic window
k = 3 # we are interested in top-3 numerals in pi
alg = Frequent::Algorithm.new(N, b, k)
# read in and process the 1st basic window
alg.process(data.next)
# and the top-3 numerals are?
top3 = alg.statistics.report
puts top3
# lather, rinse and repeat
alg.process(data.next)
Development
The development of this gem requires the following:
- Ruby 1.9.3 or greater
- rubygems
bundler
rake
minitest
(unit testing)yard
(documentation)rdiscount
(Markdown)
Building, testing and release of this rubygem uses the following
rake
commands:
rake clean # Remove any temporary products
rake clobber # Remove any generated file
rake test # Execute unit tests
rake build # Build frequent-algorithm-n.n.n.gem into the pkg directory
rake install # Build and install frequent-algorithm-n.n.n.gem into system gems
rake release # Create tag vn.n.n and build and push
# frequent-algorithm-n.n.n.gem to Rubygems
Documentation
frequent-algorithm
uses yard
and
rdiscount
for Markdown documentation.
Check out Getting Started with
Yard.
Unit Testing
frequent-algorithm
uses
MiniTest::Unit
for
unit testing.
Releasing
Please refer to Publishing To Rubygems.org in the Rubygems Guide.
Contributing
- Fork it
- Begin work on
dev-branch
(git fetch && git checkout dev-branch
) - Create your feature branch (
git branch my-new-feature && git checkout my-new-feature
) - Commit your changes (
git commit -am 'Add some feature'
) - Push to the branch (
git push origin my-new-feature:dev-branch
) - Create new Pull Request
You may wish to read the Git book online.
Changelog
Please see the CHANGELOG for this gem's release history.
License
frequent-algorithm is provided under the terms of the MIT license.
Copyright © 2015, Willie Tong & Brooke M. Fujita. All rights reserved. Please see the LICENSE file for further details.