Bloom Filter

Andrew L. Johnson

I was recently looking up something in Knuth's TAOCP vol. III, 2nd ed. when I came across a mention of Burton H. Bloom and his data structure for set-membership queries that now carries his name: Bloom filters. This structure has seen something of a resurgence in the past few years, so I thought I'd contribute my own explanation and implementation.

Superimposed encoding techniques on secondary keys (attributes) had been around for quite some time (manual card-filing systems, for example) when Bloom, in 1970, published an article that essentially turned the idea on its side by considering an entire dataset as one record and the primary keys as the attributes. This provides the basis of a relatively compact structure (an m-bit field) that allows fast membership queries on the primary keys with a probabilistically controlled rate of false positives.

What does that mean? Let's start with something a little more concrete. We can simulate the structure with a hole punch, a strip of paper, and a Ruby irb shell (to calculate a few numbers for us).

Let’s assume I want to make a set consisting of the names of myself and my three brothers: Andrew, Bradford, Gregory, and John. I can take a strip of paper with 12 small circles drawn on it numbered from 0 to 11 (this is a 12-bit field).

For each name, we will calculate two identifying numbers between 0 and 11 — for the purposes of this little simulation it will suffice to use Ruby’s String#sum and String#hash methods and use the modulo operator to reduce these to numbers between 0 and 11. For example, the two identifying numbers for my name in this situation are:

  "Andrew".sum  % 12    # => 9
  "Andrew".hash % 12    # => 7

We generally refer to such operations as hashing operations or functions, so our example is using two hashing functions. To add my name to the set, I just use my hole punch to punch out those two circles in the strip of paper.

Continuing by hashing Bradford, Gregory, and John, we get:

  "Bradford".sum  % 12    # => 0
  "Bradford".hash % 12    # => 5

  "Gregory".sum  % 12     # => 3
  "Gregory".hash % 12     # => 9

  "John".sum  % 12        # => 3
  "John".hash % 12        # => 4

After a bit more hole-punching we arrive at a strip of paper that represents our loaded Bloom filter. Note that some positions are shared: Gregory and John both hash to position 3, and Andrew and Gregory both hash to position 9. Such collisions are expected.

Now, if I want to see if any old Tom, Dick, or Harry is in the Johnson Gang, I just calculate the two hash values for that name, punch out those two holes in a new strip of paper, then line it up with my master strip (the Bloom filter) and see if both holes align with existing holes in the master strip.

Checking the strips, neither Tom nor Harry is in the gang, because at least one hole in each of their strips fails to match a corresponding hole in the Bloom filter strip. However, Dick's name just happens to hash to two holes that are already punched in my Bloom filter. This is the false-positive problem mentioned earlier. In fact, using a 12-bit field to store 4 items by setting two bits per item results in a false-positive rate of approximately 25%. Reducing this rate is accomplished by increasing the size of the bit-field and using more bits per item. For example, to get the false-positive rate down to 10% we could use a 20-bit field with 3 hashes, and to reach 1% we'd use a 39-bit field and 7 hashes.
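
For the curious, the whole paper-strip simulation can be expressed in a few lines of Ruby, using an integer as the 12-bit field (a sketch only; as noted above, String#hash values vary between Ruby implementations, so the punched positions may differ from those shown):

  # build the 12-bit "strip" as an integer bit-field
  strip = 0
  %w[Andrew Bradford Gregory John].each do |name|
    strip |= 1 << (name.sum  % 12)   # punch the first hole
    strip |= 1 << (name.hash % 12)   # punch the second hole
  end

  # a name is (probably) in the set if both of its holes line up;
  # Integer#[] reads the bit at a given position
  def in_set?(strip, name)
    [name.sum % 12, name.hash % 12].all? {|pos| strip[pos] == 1}
  end

  in_set?(strip, "Andrew")   # => true
  in_set?(strip, "Tom")      # => false (barring a false positive)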

The way the math falls out, for optimum performance with a given set size n and an acceptable false-positive rate err, is to calculate the size of the bit-field and the number of hashing functions as follows:

  # log is the natural logarithm (Math.log in Ruby);
  # note that log(1.0 / 2**log(2)) == -log(2)**2
  m = (n * log(err) / log(1.0 / 2**log(2))).ceil
  k = (log(2) * m / n).round

Where m is the size of the bit-field and k is the number of hashes to use. So, for example, given n = 1000 here are a few data points for various error rates just to get a general idea of the relationships:

         err      m      k

        0.25     2886    2
        0.10     4793    3
        0.01     9586    7
        0.001   14378   10
        0.0001  19171   13
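
Wrapped up as a quick helper method (bloom_params is just an illustrative name, not part of the implementation below), the calculation reproduces the table above:

  def bloom_params(n, err)
    m = (n * Math.log(err) / Math.log(1.0 / 2**Math.log(2))).ceil
    k = (Math.log(2) * m / n).round
    [m, k]
  end

  bloom_params(1000, 0.01)   # => [9586, 7]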

One of the assumptions is a good set of hashing functions (MD5 or SHA1 with various salts are common choices) to achieve a good random distribution of the k bits per key. However, a recent paper demonstrates that distributing the k bits using only two hashes and a simple cycling algorithm can achieve results comparable to using k high-quality hashing functions. This can lead to definite speedups as k grows and numerous keys are checked against the filter.

My Ruby implementation utilizes the above-mentioned algorithm. The core of the implementation resides in the private bits_on method, which takes a key, computes two hashes (salted MD5 digests), and then cycles through k bit positions, yielding each position in turn. The add and has? methods call bits_on and supply blocks to set or test the bit positions respectively:

  # add one or more keys, setting k bits per key
  def add(*keys)
    keys.flatten.each do |key|
      raise RangeError, "Too many keys" if (self.nkeys += 1) > capacity
      bits_on(key) {|pos| self.bv[pos] = 1}
    end
  end
  alias :insert :add

  # a key is present only if all k of its bits are set; a single
  # unset bit means the key is definitely not in the set
  def has?(key)
    bits_on(key){|pos| self.bv[pos] == 1 or return false}
    true
  end

  # yields each bit-position for a given key: two salted digests
  # seed the positions, and a simple cycle generates the rest
  def bits_on(key)
    pos1, pos2 = salts.map{|s| Digest[s + key.to_s].hex % bits}
    hashes.times do |i|
      yield pos1
      pos1 = (pos1 + pos2) % bits
      pos2 = (pos2 + i) % bits
    end
  end
  private :bits_on
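
For context, the snippet above lives inside a class that also defines salts, bits, hashes, nkeys, capacity, and the bit-vector bv. Assuming that class is called BloomFilter and is constructed from a capacity and an error rate (illustrative names, not necessarily the actual API), usage would look something like:

  bf = BloomFilter.new(1000, 0.001)   # expected capacity, false-positive rate
  bf.add("Andrew", "Bradford")
  bf.insert("Gregory")                # insert is an alias for add
  bf.has?("Andrew")                   # => true
  bf.has?("Tom")                      # => false (barring a false positive)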

On a simple test using a smallish dictionary of n = 38619 words and a false-positive rate of err = 0.0001, this implementation was more than an order of magnitude faster than the pre-existing Ruby version and the Perl version from which it was derived. I've run a number of ad-hoc tests and confirmed that the false-positive rate does not appear to be affected by the dependency introduced in the k bit positions.

Traditionally, Bloom filters were used in memory-constrained applications, such as a spell-checker when available RAM couldn't hold a dictionary of appreciable size. More recently, Bloom filters have found uses in network routing, cache summaries, and peer-to-peer applications (see this survey article, or read about loaf for a more specific example).

Further Resources

Here are a few further resources. My implementation drew heavily from the C# version given in the literate programming example at the end of the following links:

__END__