Simple Things
Syndicate: full/short
Article Calendar
<= April, 2014
Search this blog

Key links
External Blogs
Brought to you by ...

Bloom Filter

Andrew L. Johnson

I was recently looking up something in Knuth’s TAOCP vol. III, 2nd ed. and I came across mention of Burton H. Bloom and his data structure for set membership queries that now carries his name: Bloom filters. This structure has seen something of resurgence in the past few years, so I thought I’d contribute my own explanation and implementation.

Superimposed encoding techniques on secondary keys (attributes) had been around for quite some (manual card filing systems for example) when Bloom, in 1970, published an article that basically turned the idea on its side by considering an entire dataset as one record and the primary keys as the attributes. This provides the basis of a relatively compact structure (an m-bit field) that allows for fast membership queries on the primary with a probabilistically controlled rate of false positives.

What does that mean? Let’s start with something a little more concrete. We can simulate the stucture with a hole-punch, a strip of paper, and a Ruby irb shell open (to calculate a few numbers for us).

Let’s assume I want to make a set consisting of the names of myself and my three brothers: Andrew, Bradford, Gregory, and John. I can take a strip of paper with 12 small circles drawn on it numbered from 0 to 11 (this is a 12-bit field).

For each name, we will calculate two identifying numbers between 0 and 11 — for the purposes of this little simulation it will suffice to use Ruby’s String#sum and String#hash methods and use the modulo operator to reduce these to numbers between 0 and 11. For example, the two identifying numbers for my name in this situation are:

  "Andrew".sum  % 12    # => 9
  "Andrew".hash % 12    # => 7

We generally refer to such operations as hashing operations or functions — so our example is using two hashing functions. To add my name to the set, I just use my hole punch to punch out those two circles in the strip of paper:

And continuing by hashing Gregory, Bradford, and John we get

  "Bradford".sum  % 12    # => 0
  "Bradford".hash % 12    # => 5

  "Gregory".sum  % 12     # => 3
  "Gregory".hash % 12     # => 9

  "John".sum  % 12        # => 3
  "John".hash % 12        # => 4

And after a bit of hole-punching (some holes will have already been punched and that’s expected) we arrive at this strip of paper which represents our loaded Bloom filter:

Now, if I want to see if any old Tom, Dick, or Harry is in the Johnson Gang, I just calculate the 2 hashes for that name, punch out those two holes in a new strip of paper, then line it up with my master strip (the Bloom filter) and see if both holes align with existing holes in the master strip.

As we can see, neither Tom nor Harry are in the gang because at least one hole in each person’s strip doesn’t match up with a corresponding hole in the Bloom filter strip. However, Dick’s name just happens to hash to two holes in my Bloom filter. This is the false-positive problem mentioned earlier. In fact, using a 12-bit field to store 4 items by setting two bits per item results in a false positive rate of approximately 25%. Reducing this false positive rate is accomplished by increasing the size of the bit-field and utilizing more bits per item. For example, to get the false positive rate down to 10% we could use a 20-bit field with 3 hashes, and to reach 1% we’d use a 39-bit field and 7 hashes.

The way the math falls out for optimum performance with a given set size n and an acceptable rate err of false positives is to calculate the size of the bit-field and number of hashing functions as follows:

  m = (n * log(err) / log(1.0 / 2**log(2))).ceil
  k = (log(2) * m / n).round

Where m is the size of the bit-field and k is the number of hashes to use. So, for example, given n = 1000 here are a few data points for various error rates just to get a general idea of the relationships:

         err      m      k

        0.25     2886    2
        0.10     4793    3
        0.01     9586    7
        0.001   14378   10
        0.0001  19171   13

One of the assumptions is a good set of hashing functions — and MD5 or SHA1 (with various salts) are common choices — for a good random distribtion of the k-bits per key. However, a recent paper demonstrates that distributing k-bits using only 2 hashes and a simple cycling algorithm can achieve results comparable to using k high-quality hashing functions. This can lead to definite speedups as k grows and numerous keys are checked against the filter.

My Ruby implementation utilizes the above mentioned algorithm — the core of the implementation resides in the private bits_on method which takes a key and computes two hashes (aka MD5 digests) and then cycles through k bit positions yielding each position in turn. The add and has? methods call bits_on and supply blocks to set or test the bit positions respectively:

  def add(*keys)
    keys.flatten.each do |key|
      raise RangeError, "Too many keys" if (self.nkeys += 1) > capacity
      bits_on(key) {|pos|[pos] = 1}
  alias :insert :add

  def has?(key)
    bits_on(key){|pos|[pos] == 1 or return false}

  # yields each bit-position for a given key
  def bits_on(key)
    pos1, pos2 ={|s|Digest[s + key.to_s].hex % bits}
    hashes.times do |i|
      yield pos1
      pos1 = (pos1 + pos2) % bits
      pos2 = (pos2 + i)  % bits
  private :bits_on

On a simple test using a smallish dictionary of n = 38619 words and a false-positive rate of err = 0.0001, this implementation was better than an order of magnitude faster than the pre-existing Ruby version or the Perl version from which it was derived. I’ve run a number of ad-hoc tests and confirmed that the false positive rate does not appear to be affected by the dependency introduced in the k-bit positions.

Traditionally, Bloom filters were used in memory constrained applications, such as a spell-checker when available RAM couldn’t hold a dictionary of appreciable size. More recently, Bloom filters have found uses in network routing, cache-summary, and peer-peer applications (see this survey article, or read about loaf for a more specific example).

Further Resources

Here are a few further resources — my implementation drew heavily from the C# version given in the literate programming example at the end of the following links:



Simplicity via Complexity

Andrew L. Johnson

…If you aim to dispense with method, learn method
…If you aim at facility, work hard
…If you aim for simplicity, master complexity
   — Lu Ch’ai (Wang Kai), [chinese painter, ca 17 century]

The economy and simplicity of traditional oriental painting is not arrived at by squinting at the subject in order to blur or filter out the complexity — subjects are studied in detail (with both eye and mind). The simplicity and elegance are distilled from the far side of that process.

Consider the tale/fable of the gentleman who entered a well known chinese painter’s shop and described a painting he would like to have painted. The painter talked the painting over with him over a cup of tea, thought for a few minutes and said he could do such a painting but he wouldn’t have it ready for a year. The gentleman frowned at such a delay, but this painter came highly recommended, so he paid a downpayment and left the shop.

A year passed and the gentleman returns and asks the painter if his painting is ready. Without a word, the painter rolls out a new length of rice paper, grinds some fresh ink, and proceeds to paint the picture before the gentleman’s very eyes. In a few minutes he is finished. The gentleman is stunned, and quite impressed that this picture, painted in just a few minutes, captured exactly what he had wanted. "Don’t get me wrong", the gentleman spoke, "the painting is exquisite. But it took you less time to paint than we spent drinking tea a year ago, could you not have found the few minutes needed to paint it any sooner?"

The painter went to a cupboard and retrieved a large box full of hundreds of discarded versions of the masterpiece now drying on the table. "Which of these", the painter replied, "would you have wished to have sooner?"

Well, there may be as many versions of this fable as there were discarded paintings in the box, and, Damn, could that painter estimate a project or what!? But project estimation isn’t my point. Try this one:

I would not give a fig for the simplicity this side of complexity,
but I would give my life for the simplicity on the other side of
complexity.    — Justice Oliver Wendell Holmes (1809-1894)

I think we can all recognize a certain Truth in these quotes that applies to any human activity — and certainly to software development. Getting over the complexity divide takes work. You can’t just skirt around it, and you shouldn’t fear it.

Behind complexity, there is always simplicity to be revealed. Inside simplicity, there is always complexity to be discovered.   — Gang Yu


Oniguruma (Ruby with Demon wheels)

Andrew L Johnson

Oniguruma is a regular expression C library you can use in your own projects under the BSD license, or you can install it as Ruby’s regular expression engine (in which case it falls under the Ruby license). Oniguruma may be translated to English as Demon Wheel (or something along those lines).

Oniguruma is slated to become Ruby’s default regular expression engine, and Ruby 1.9 already has it included. But you don’t have to wait to try it out — it is easily incorporated into 1.8* ruby builds and basically just involves:

  1. downloading and unpacking the latest oniguruma sources for Ruby
  2. configure oniguruma with your Ruby source directory
  3. make oniguruma (which applies the patches to the Ruby sources)
  4. rebuild and test your ruby (make clean;make;make test) in Ruby directory
  5. test oniguruma (make test) in oniguruma directory

The only danger in doing this is forgetting that oniguruma is not yet standard Ruby and shouldn’t be a dependency in released code. You might want to build both a standard ruby and an oni-ruby (or perhaps guru-ruby).

Oniguruma brings several features to Ruby’s regexen, notably:

  • positive and negative look-behind
  • possessive quantifiers (like atomic/independent subexpressions but as quantifier)
  • named backreferences
  • callable backreferences

Look-behind and callable backreferences are probably the main reasons you’d want to install oniguruma.


Look-ahead assertions have been around for some time, in many regular expression flavors. Look-behind assertions are less prevalent. Oniguruma brings positive and negative look-behind assertions ((?<=…) and (?<!…) respectively) to Ruby. Just like look-ahead assertions, these are zero-width assertions — they match the current position if the assertion about what follows (look-aheads) or precedes (look-behinds) is true. They do not consume any part of the string.

Unlike look-ahead assertions, look-behinds must contain fixed-width patterns which means: no indeterminate quantifiers. However, alternation is allowed at the top level of the look-behind, and the alternations need not be of the same fixed width. Capturing is allowed within positive look-behinds, but not in negative look-behinds (which makes sense).

Callable Backreferences

Callable backreferences give us recursively defined regular expressions, which allow one to match/extract arbitrarily nested balanced parentheses (or other delimiters).

  # to match a group of nested unescaped parentheses:

  re = %r/((?<pg>\((?:\\[()]|[^()]|\g<pg>)*\)))/
  s = 'some(stri\)\((()x)(((c)d)e)\))ng'
  mt = s.match re
  puts mt[1]

    ==> (stri\)\((()x)(((c)d)e)\))

Difference between Oniguruma and Standard Ruby Regular Expressions

The main behavioral difference I’ve noted between the two regular expression engines involves capturing with zero-length subexpression matches. In the following, sruby is standard ruby, and oruby is compiled with oniguruma:

  $ sruby -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  $ oruby -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'

No difference there, but note that Perl handles this differently (and, IMHO more correctly):

  $ perl -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'

In my mind, with nested capturing such as this I would expect that the contents of $2 and $3 would be substrings (even if only empty strings) of $1 — like Perl handles it. However, Ruby isn’t alone in that Python and the pcre both handle it as Ruby does.

If this behavior doesn’t seem strange, consider this more obvious example:

  $ sruby -e '"ba" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  $ oruby -e '"ba" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'

I understand the interpretation, I just don’t think it is the most correct interpretation to follow.

The difference between Oniguruma and Ruby becomes apparent in the following example, when the subexpressions themselves may be zero-length:

  $ sruby -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  $ oruby -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  $ perl -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'

Here, Oniguruma sides with Perl instead of Ruby, and all the captured subexpressions are the empty string. However, pcre agrees with standard Ruby on this one, and Python won’t even compile the regular expression.

  Versions used in testing:
    sruby       => Ruby 1.8.4 (2006-01-21)
    oruby       => Ruby 1.8.4 (2006-01-21) with Oniguruma 2.5.2
    Perl 5.8.7
    Python 2.4
    pcre 6.3


Asynchronous Sequential Messages

Andrew L. Johnson

The Io language has asynchronous messages that return transparent futures — asynchronous messages are put in a per-object queue and processed sequentially by a lightweight thread (coroutine). The return value is a transparent future which _turns into_ the actual result when it becomes available (using the future blocks if the result isn’t ready). I thought it might be interesting to POC the idea in Ruby code:

  require 'thread'
  class Async
    @@keep =  %w-__id__ __send__-
    (instance_methods - @@keep).each{|m| undef_method m}

    def initialize(&blk)
      @th =
      @th.abort_on_exception = true
      at_exit {@th.join}

    def method_missing(sym, *args, &blk)
      __getobj__.__send__(sym, *args, &blk)

    def __getobj__
      @obj ||= @th.value

    # control/status messages
    def arun

    def await

    def aready?
      ! @th.alive?

    def aresult_to(obj,meth) {obj.send(meth,__getobj__)}

  class Object
    def async(msg,*args,&blk)
      @async_queue ||=
      fut = {Thread.stop;self.send(msg,*args,&blk)}
      @async_queue << fut
      @async_thread ||= do
        loop{@async_queue.pop.arun.await; Thread.pass}

Each object gets its own queue for asynchronous messages, which are handled in turn (FIFO). Thread control is explicitly passed upon completion of an asynchronous message (control can also be passed within an asynchronous message). The return value is a proxy-future that will block when used (aside from a few control messages) until the result is ready and then proxy that result. This version also allows for the result of an asynchronous message to be automatically passed to another object via the aresult_to method.

There is probably more wrong than right with this proof-of-concept (deadlock, exceptions, garbage collection, etc.). Still, it was a cute little exercise — and I should mention that I freely borrowed ideas from Jim Weirich’s BlankSlate and kjana’s UDelegator (with its own versions of futures and promises).



PLEAC -- Ruby version still growing (slowly)

Andrew L. Johnson

Progress is slowly creeping forward on the Ruby version of PLEAC (Programming Languages Alike Cookbook), which is an attempt to represent the recipes in the Perl Cookbook in a variety of languages. I’ve been aware of the project for a couple of years, but only got around to joining the mailing list a few weeks ago.

I’ve submitted some 20+ Ruby sections in the last couple of weeks, trying to top off a few early chapters to the 100% mark. Currently, chapters 1,2,3,4,9, and 10 are done. Chapter 5 is missing an ordered hash recipe (should be coming soon). Chapter 6 (patterns) is just over 81% now, and chapter 7 should be close to 60% once a few recent submissions are accepted and committed. Chapter 8 sits at 33% at the moment. The first half (chapters 1-10) of the Ruby version could be brought to 100% in a relatively short time (though a few longish programs are still needed).

There are also significant inroads in many of the chapters in the second half (chapters 11-20). It might be a nice boon for Ruby to become the first language to achieve a 100% complete PLEAC (at the moment Ruby sits at 58.71% of completion, behind only Python’s 59.14%) hint hint :-)