Siaris
Simple Things

Oniguruma (Ruby with Demon wheels)

Andrew L Johnson

Oniguruma is a regular expression C library you can use in your own projects under the BSD license, or you can install it as Ruby’s regular expression engine (in which case it falls under the Ruby license). Oniguruma may be translated to English as Demon Wheel (or something along those lines).

Oniguruma is slated to become Ruby’s default regular expression engine, and Ruby 1.9 already includes it. But you don’t have to wait to try it out: it is easily incorporated into Ruby 1.8.x builds, which basically just involves:

  1. downloading and unpacking the latest oniguruma sources for Ruby
  2. configuring oniguruma, pointing it at your Ruby source directory
  3. making oniguruma (which applies the patches to the Ruby sources)
  4. rebuilding and testing your ruby (make clean; make; make test) in the Ruby source directory
  5. testing oniguruma (make test) in the oniguruma directory

The only danger in doing this is forgetting that oniguruma is not yet standard Ruby and shouldn’t be a dependency in released code. You might want to build both a standard ruby and an oni-ruby (or perhaps guru-ruby).

Oniguruma brings several features to Ruby’s regexen, notably:

  • positive and negative look-behind
  • possessive quantifiers (like atomic/independent subexpressions, but as a quantifier)
  • named backreferences
  • callable backreferences

Look-behind and callable backreferences are probably the main reasons you’d want to install oniguruma.

Look-Behinds

Look-ahead assertions have been around for some time, in many regular expression flavors. Look-behind assertions are less prevalent. Oniguruma brings positive and negative look-behind assertions ((?<=…) and (?<!…) respectively) to Ruby. Just like look-ahead assertions, these are zero-width assertions — they match the current position if the assertion about what follows (look-aheads) or precedes (look-behinds) is true. They do not consume any part of the string.

Unlike look-ahead assertions, look-behinds must contain fixed-width patterns, which means no indeterminate quantifiers such as * or +. However, alternation is allowed at the top level of the look-behind, and the alternatives need not be of the same width. Capturing is allowed within positive look-behinds, but not within negative look-behinds (which makes sense, since a negative assertion succeeds only when its pattern fails to match, leaving nothing to capture).
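A small sketch of both kinds of look-behind may help here. It assumes an Oniguruma-enabled Ruby (or Ruby 1.9 and later), and the sample strings are invented for the illustration.

```ruby
prices = 'list $42, qty 17'

# Positive look-behind: digits that are immediately preceded by a "$".
after_dollar = prices.scan(/(?<=\$)\d+/)

# Negative look-behind: digit runs NOT preceded by "$" or by another digit
# (the \d in the class stops the match from starting in the middle of "42").
not_after_dollar = prices.scan(/(?<![\$\d])\d+/)

# Top-level alternation with alternatives of different widths ("USD " is
# four characters wide, "$" is one).
mixed = 'USD 10 and $20'.scan(/(?<=USD |\$)\d+/)

# Zero-width: "foo" is only asserted, not consumed, so it survives the sub.
replaced = 'foobar'.sub(/(?<=foo)bar/, 'BAZ')

p after_dollar      # ["42"]
p not_after_dollar  # ["17"]
p mixed             # ["10", "20"]
p replaced          # "fooBAZ"
```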

Callable Backreferences

Callable backreferences give us recursively defined regular expressions, which allow one to match/extract arbitrarily nested balanced parentheses (or other delimiters).

  # to match a group of nested unescaped parentheses:

  re = %r/((?<pg>\((?:\\[()]|[^()]|\g<pg>)*\)))/
  s = 'some(stri\)\((()x)(((c)d)e)\))ng'
  mt = s.match re
  puts mt[1]

    ==> (stri\)\((()x)(((c)d)e)\))
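The same idea carries over to other delimiters. The following is only a sketch, assuming an Oniguruma-enabled Ruby (or Ruby 1.9 and later); BALANCED and balanced_groups are hypothetical names chosen for this example, and escaped braces are deliberately not handled to keep the pattern short.

```ruby
# A brace group is "{", then any mix of non-brace characters and nested
# brace groups (the recursive call \g<grp>), then "}".
BALANCED = /(?<grp>\{(?:[^{}]|\g<grp>)*\})/

# Collect every top-level balanced {...} group in the text, left to right.
def balanced_groups(text)
  groups = []
  pos = 0
  while (m = BALANCED.match(text, pos))
    groups << m[0]   # the whole match, i.e. the outermost group found
    pos = m.end(0)   # resume the search just past this group
  end
  groups
end

p balanced_groups('a{b{c}}d{e}f')  # ["{b{c}}", "{e}"]
```

Each match is at least two characters long, so the search position always advances and the loop terminates.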

Difference between Oniguruma and Standard Ruby Regular Expressions

The main behavioral difference I’ve noted between the two regular expression engines involves capturing with zero-length subexpression matches. In the following, sruby is standard ruby, and oruby is compiled with oniguruma:

  $ sruby -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba::a:b
  $ oruby -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba::a:b

No difference there, but note that Perl handles this differently (and, IMHO, more correctly):

  $ perl -e '"abax" =~ /((a)*(b)*)*/; print "$&:$1:$2:$3\n"'
  aba:::

In my mind, with nested capturing such as this, I would expect the contents of $2 and $3 to be substrings (even if only empty strings) of $1, which is how Perl handles it. However, Ruby isn’t alone here: Python and PCRE both handle it as Ruby does.

If this behavior doesn’t seem strange, consider this more obvious example:

  $ sruby -e '"ba" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  ba::a:b
  $ oruby -e '"ba" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  ba::a:b

I understand the interpretation; I just don’t think it is the most correct one to follow.

The difference between Oniguruma and standard Ruby becomes apparent in the following example, where the subexpressions themselves may be zero-length:

  $ sruby -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba::a:b
  $ oruby -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba:::
  $ perl -e '"abax" =~ /((a*)*(b*)*)*/; print "$&:$1:$2:$3\n"'
  aba:::

Here, Oniguruma sides with Perl instead of standard Ruby, and all the captured subexpressions are the empty string. However, PCRE agrees with standard Ruby on this one, and Python won’t even compile the regular expression.
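To repeat the comparison on whatever Ruby build you have at hand, something like the following sketch can be used. The helper name capture_report is made up for this example, and no expected output is shown on purpose, since the captures depend on the engine your ruby was built with.

```ruby
# Format a match the same way as the one-liners above: $&:$1:$2:$3
def capture_report(string, regexp)
  m = regexp.match(string)
  return "(no match)" if m.nil?
  # Groups that did not participate are nil; print them as empty strings.
  ([m[0]] + m.captures.map(&:to_s)).join(":")
end

# The two patterns discussed in this section, run against the same input.
[/((a)*(b)*)*/, /((a*)*(b*)*)*/].each do |re|
  puts "#{re.source}  =>  #{capture_report('abax', re)}"
end
```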

  Versions used in testing:
    sruby       => Ruby 1.8.4 (2006-01-21)
    oruby       => Ruby 1.8.4 (2006-01-21) with Oniguruma 2.5.2
    Perl 5.8.7
    Python 2.4
    pcre 6.3

__END__