Bloomer: A pure-ruby bloom filter with no extra fluff

Bloom filters are great for quickly checking to see if a given string has been seen before--in constant time, and using a fixed amount of RAM.

Note that false positives with bloom filters are possible, but false negatives are not. In other words,

  • if include? returns false, that string has certainly not been added
  • if include? returns true, it might mean that string was added (depending on the false_positive_probability parameter provided to the constructor).

This implementation is the Nth bloom filter gem written in ruby -- but, at the time of conception, the only one that

  • uses a robust set of hashing functions
  • can marshal state quickly
  • does not require EM or Redis or something else unrelated to simply implementing a bloom filter

Usage

expected_size = 10_000
false_positive_probability = 0.01
b = Bloomer.new(expected_size, false_positive_probability)
b.add "cat"
b.include? "cat"
#=> true
bf.include? "dog"
#=> false

Serialization is through Marshal:

b = Bloomer.new(10)
b.add("a")
s = Marshal.dump(b)
new_b = Marshal.load(s)
new_b.include? "a"
#=> true

History

  • 0.0.1 Bloom, there it is.

  • 0.0.2 Switch to triple hash chaining, which resulted in better, faster hashing (!!):

md5 (v0.0.2): 66 sec, false positive rate = 1.116%, expected 1.0% multihash (0.0.1): 92 sec, false positive rate = 1.27%, expected 1.0%