I was recently looking up something in Knuth’s TAOCP Vol. III, 2nd ed. when I came across a mention of Burton H. Bloom and the data structure for set-membership queries that now carries his name: Bloom filters. This structure has seen something of a resurgence in the past few years, so I thought I’d contribute my own explanation and implementation.

Superimposed encoding techniques on secondary keys (attributes) had been around for quite some time (manual card-filing systems, for example) when Bloom, in 1970, published an article that basically turned the idea on its side by considering an entire dataset as one record and the primary keys as the attributes. This provides the basis for a relatively compact structure (an m-bit field) that allows fast membership queries on the primary keys with a probabilistically controlled rate of false positives.

What does that mean? Let’s start with something a little more
concrete. We can simulate the structure with a hole punch, a strip of paper,
and a Ruby `irb` shell open (to calculate a few numbers for us).

Let’s assume I want to make a set consisting of the names of myself and my three brothers: Andrew, Bradford, Gregory, and John. I can take a strip of paper with 12 small circles drawn on it numbered from 0 to 11 (this is a 12-bit field).

For each name, we will calculate two identifying numbers between 0 and 11
— for the purposes of this little simulation it will suffice to use
Ruby’s `String#sum` and `String#hash` methods and use
the modulo operator to reduce these to numbers between 0 and 11. For
example, the two identifying numbers for my name in this situation are:

"Andrew".sum % 12 # => 9 "Andrew".hash % 12 # => 7

We generally refer to such operations as *hashing* operations or
functions — so our example is using two hashing functions. To add my
name to the set, I just use my hole punch to punch out those two circles in
the strip of paper:

And continuing by hashing Gregory, Bradford, and John we get

"Bradford".sum % 12 # => 0 "Bradford".hash % 12 # => 5 "Gregory".sum % 12 # => 3 "Gregory".hash % 12 # => 9 "John".sum % 12 # => 3 "John".hash % 12 # => 4

And after a bit of hole-punching (some holes will have already been punched and that’s expected) we arrive at this strip of paper which represents our loaded Bloom filter:
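For reference, the punched strip can be tallied in Ruby as well. Since `String#hash` is seeded per process in modern Ruby (so the values above won’t reproduce exactly), this sketch just reuses the positions computed above rather than re-hashing:

```ruby
# positions taken from the sums/hashes listed above
positions = {
  "Andrew"   => [9, 7],
  "Bradford" => [0, 5],
  "Gregory"  => [3, 9],
  "John"     => [3, 4],
}

field = Array.new(12, 0)  # the 12-bit strip, all holes intact
positions.each_value {|a, b| field[a] = field[b] = 1}

field  # => [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
```

Note that Gregory and John share position 3 — overlapping punches are exactly why the structure is compact, and also where false positives come from.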

Now, if I want to see whether any old Tom, Dick, or Harry is in the Johnson Gang, I just calculate the two hashes for that name, punch those two holes in a new strip of paper, then line it up with my master strip (the Bloom filter) and see if both holes align with existing holes in the master strip.

As we can see, neither Tom nor Harry is in the gang, because at least one hole in each of their strips fails to match up with a corresponding hole in the Bloom filter strip. However, Dick’s name just happens to hash to two holes already punched in my Bloom filter. This is the false-positive problem mentioned earlier. In fact, using a 12-bit field to store 4 items by setting two bits per item results in a false-positive rate of approximately 25%. Reducing this false-positive rate is accomplished by increasing the size of the bit-field and using more bits per item. For example, to get the false-positive rate down to 10% we could use a 20-bit field with 3 hashes, and to reach 1% we’d use a 39-bit field and 7 hashes.
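Those figures can be checked with the standard approximation for a Bloom filter’s false-positive rate, p ≈ (1 - e^(-kn/m))^k. A quick sketch (the helper name is mine, not from the article):

```ruby
# Standard approximation of a Bloom filter's false-positive rate:
#   p ~ (1 - e^(-k*n/m))**k
# where m = bits in the field, k = hashes per key, n = keys stored.
def false_positive_rate(m, k, n)
  (1 - Math.exp(-k.to_f * n / m)) ** k
end

false_positive_rate(12, 2, 4)  # ~0.24, the "approximately 25%" above
false_positive_rate(20, 3, 4)  # ~0.09
false_positive_rate(39, 7, 4)  # ~0.01
```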

The way the math falls out for optimum performance with a given set size *n* and an acceptable rate *err* of false positives is to calculate the size of the bit-field *m* and the number of hashing functions *k* as follows:

```ruby
m = (n * log(err) / log(1.0 / 2**log(2))).ceil
k = (log(2) * m / n).round
```

Where *m* is the size of the bit-field and *k* is the number of hashes to use. So, for example, given *n* = 1000, here are a few data points for various error rates just to get a general idea of the relationships:

| err    | m     | k  |
|--------|-------|----|
| 0.25   | 2886  | 2  |
| 0.10   | 4793  | 3  |
| 0.01   | 9586  | 7  |
| 0.001  | 14378 | 10 |
| 0.0001 | 19171 | 13 |
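Wrapping those two formulas in methods makes the table easy to reproduce in `irb` (the helper names are mine; `Math` is mixed in so the bare `log` calls read exactly as in the formulas above):

```ruby
include Math  # so bare log(...) works as written in the formulas

# helper names are mine, not from the article
def optimal_m(n, err)
  (n * log(err) / log(1.0 / 2**log(2))).ceil
end

def optimal_k(n, m)
  (log(2) * m / n).round
end

m = optimal_m(1000, 0.01)  # => 9586
k = optimal_k(1000, m)     # => 7
```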

One of the assumptions is a good set of hashing functions — and MD5 or SHA1 (with various salts) are common choices — for a good random distribution of the k bits per key. However, a recent paper demonstrates that distributing the k bits using only 2 hashes and a simple cycling algorithm can achieve results comparable to using *k* high-quality hashing functions. This can lead to definite speedups as *k* grows and numerous keys are checked against the filter.

My Ruby implementation utilizes the above-mentioned algorithm — the core of the implementation resides in the private `bits_on` method, which takes a key, computes two hashes (salted MD5 digests), and then cycles through k bit positions, yielding each position in turn. The `add` and `has?` methods call `bits_on` and supply blocks to set or test the bit positions respectively:

```ruby
def add(*keys)
  keys.flatten.each do |key|
    raise RangeError, "Too many keys" if (self.nkeys += 1) > capacity
    bits_on(key) {|pos| self.bv[pos] = 1}
  end
end
alias :insert :add

def has?(key)
  bits_on(key) {|pos| self.bv[pos] == 1 or return false}
  true
end

# yields each bit-position for a given key
def bits_on(key)
  pos1, pos2 = salts.map {|s| Digest[s + key.to_s].hex % bits}
  hashes.times do |i|
    yield pos1
    pos1 = (pos1 + pos2) % bits
    pos2 = (pos2 + i) % bits
  end
end
private :bits_on
```
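To experiment with the idea in isolation, here is a minimal self-contained class built around the same two-hash cycling scheme. The class name, instance variables, salt strings, and the plain-Array bit vector are my own stand-ins, so this is a sketch of the technique rather than the article’s actual implementation (which, among other things, enforces a capacity and uses a real bit vector):

```ruby
require 'digest/md5'

class BloomFilter
  def initialize(capacity, err = 0.01)
    @capacity = capacity
    # sizing formulas from the article
    @bits   = (capacity * Math.log(err) / Math.log(1.0 / 2**Math.log(2))).ceil
    @hashes = (Math.log(2) * @bits / @capacity).round
    @bv     = Array.new(@bits, 0)       # stand-in for a real bit vector
    @salts  = ["salt-a:", "salt-b:"]    # arbitrary example salts
  end

  def add(*keys)
    keys.flatten.each {|key| bits_on(key) {|pos| @bv[pos] = 1}}
  end
  alias :insert :add

  def has?(key)
    bits_on(key) {|pos| @bv[pos] == 1 or return false}
    true
  end

  private

  # two salted MD5-derived positions, cycled to yield k bit positions
  def bits_on(key)
    pos1, pos2 = @salts.map {|s| Digest::MD5.hexdigest(s + key.to_s).hex % @bits}
    @hashes.times do |i|
      yield pos1
      pos1 = (pos1 + pos2) % @bits
      pos2 = (pos2 + i) % @bits
    end
  end
end

bf = BloomFilter.new(4, 0.01)
%w[Andrew Bradford Gregory John].each {|name| bf.add(name)}
bf.has?("Andrew")  # => true
```

Added keys are always found — Bloom filters never produce false negatives; only lookups of absent keys can (rarely) lie.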

On a simple test using a smallish dictionary of *n* = 38619 words and a false-positive rate of *err* = 0.0001, this implementation was better than an order of magnitude faster than the pre-existing Ruby version or the Perl version from which it was derived. I’ve run a number of ad-hoc tests and confirmed that the false-positive rate does not appear to be affected by the dependency introduced in the k bit positions.

Traditionally, Bloom filters were used in memory-constrained applications, such as a spell-checker when available RAM couldn’t hold a dictionary of appreciable size. More recently, Bloom filters have found uses in network routing, cache summaries, and peer-to-peer applications (see this survey article, or read about loaf for a more specific example).

### Further Resources

Here are a few further resources — my implementation drew heavily from the C# version given in the literate programming example at the end of the list below:

- wikipedia link
- some math and tips
- more math
- Bloom Filters in Probabilistic Verification
- a literate programming example

__END__