Siaris

Regular Expressions Tutorial, Part 4: More Quantifiers
04 Jan 01 - http://www.siaris.net/index.cgi/Programming/LanguageBits/Perl/20010104.rdoc

In our first tutorial we introduced the * quantifier, which means match zero or more of the previous regex item. Perl’s regular expressions have several additional quantifiers, and we will explore them here.

The quantifier most resembling the * in both appearance and action is the + quantifier, which means match one or more of the previous regex item. In other words, if we wanted to match one or more vowel characters we could do the following:

    /[aeiou]+/   which is equivalent to  /[aeiou][aeiou]*/
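
As a quick check of how + differs from *, here is a minimal sketch. The helper name first_vowel_run and the sample words are made up for illustration; they are not from the earlier installments.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first run of one or more vowels,
# or undef when the word contains no vowels at all.
sub first_vowel_run {
    my ($word) = @_;
    return $word =~ /([aeiou]+)/ ? $1 : undef;
}

# Sample words chosen only for illustration.
for my $word (qw(queue beat sky)) {
    my $run = first_vowel_run($word);
    if (defined $run) {
        print "$word: first vowel run is '$run'\n";
    }
    else {
        print "$word: no match with +\n";
    }
}

# With * the same character class succeeds even on 'sky', because
# zero vowels is an acceptable match (the match is an empty string).
print "sky matches /[aeiou]*/\n" if 'sky' =~ /[aeiou]*/;
```

With these words the + form finds 'ueue' in queue and 'ea' in beat, and fails on sky, while the * form still succeeds on sky by matching nothing.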

There is one other single-character quantifier — the ? symbol which means zero or one of the previous regex item. Think of this one as simply being an optional match. If we want to match something resembling numbers we need to be able to grab the integer portion and optionally the decimal portion if it exists:

    $_ = 'here is an integer 123 and a floating point 123.456 number';
    while ( /(\d+(\.\d+)?)/g ) {
        print "Found: $1\n";
    }

The regex reads: match one or more digits optionally followed by a decimal point and one or more digits.

So far our quantifiers have been rather indeterminate: we haven’t been able to say exactly how many items to match, only a minimum number (zero in the case of * and ?, and one in the case of the + quantifier), with ? alone also capping the match at one. Perl regular expressions have another form of quantifier that allows you to match an exact number of times, or to set specific lower and upper limits on the number of times you want to match a certain item. I call this the bounded quantifier.

The basic form of the bounded quantifier is {n,m}, where n stands for the lower bound and m stands for the upper bound. So, the pattern /a{2,5}/ means match an ‘a’ at least two times but no more than five times. Like all of the quantifiers I’ve mentioned, this is greedy and will match as many as it can within the given bounds (we will consider greediness in the next installment).
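
A small sketch of that behavior, assuming some made-up input strings and a hypothetical helper named a_run:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text that /a{2,5}/ matched, or undef.
sub a_run {
    my ($str) = @_;
    return $str =~ /(a{2,5})/ ? $1 : undef;
}

# Made-up inputs: one, three, and seven 'a' characters.
for my $str ('a', 'aaa', 'aaaaaaa') {
    my $hit = a_run($str);
    if (defined $hit) {
        print "$str: matched '$hit' (", length($hit), " chars)\n";
    }
    else {
        print "$str: no match\n";
    }
}
```

A single ‘a’ falls below the lower bound and fails, three are all taken, and out of seven only the first five are matched because the upper bound stops the quantifier there.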

Variations of the bounded quantifier are as follows:

    {n,m}   match at least n times, but no more than m times
    {n,}    match n or more times
    {n}     match exactly n times
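
The three forms can be compared side by side with a short sketch. The helper name matching_forms is hypothetical, and the ^ and $ anchors are used here only to force each pattern to account for the whole string rather than a part of it.

```perl
use strict;
use warnings;

# Hypothetical helper: lists which bounded forms match the whole string.
sub matching_forms {
    my ($str) = @_;
    my @forms;
    push @forms, '{2,5}' if $str =~ /^a{2,5}$/;   # between two and five
    push @forms, '{3,}'  if $str =~ /^a{3,}$/;    # three or more
    push @forms, '{4}'   if $str =~ /^a{4}$/;     # exactly four
    return @forms;
}

# Try runs of one to six 'a' characters.
for my $n (1 .. 6) {
    my $str   = 'a' x $n;
    my @forms = matching_forms($str);
    print "$str: ", (@forms ? "@forms" : 'none'), "\n";
}
```

Only the four-character string satisfies all three forms, and the six-character string satisfies only the open-ended {3,} form.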

From these forms you can see that the three previous quantifiers (*, +, ?) all have an equivalent representation using the bounded quantifier:

    *  ==  {0,}  zero or more
    +  ==  {1,}  one or more
    ?  ==  {0,1} zero or one
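
One way to convince yourself of these equivalences is to run both spellings over the same strings and compare the results. The sketch below assumes a hypothetical helper named forms_agree and made-up sample strings; it also uses qr//, Perl’s operator for storing a compiled regex in a variable, which this tutorial has not covered.

```perl
use strict;
use warnings;

# Hypothetical helper: true when each shorthand quantifier and its
# bounded spelling give the same verdict on $str.
sub forms_agree {
    my ($str) = @_;
    my @pairs = (
        [ qr/^ab*c$/, qr/^ab{0,}c$/  ],
        [ qr/^ab+c$/, qr/^ab{1,}c$/  ],
        [ qr/^ab?c$/, qr/^ab{0,1}c$/ ],
    );
    for my $pair (@pairs) {
        my ($short, $bounded) = @$pair;
        my $short_hit   = $str =~ $short   ? 1 : 0;
        my $bounded_hit = $str =~ $bounded ? 1 : 0;
        return 0 if $short_hit != $bounded_hit;
    }
    return 1;
}

# Made-up samples with zero, one, two, and three 'b' characters.
for my $str (qw(ac abc abbc abbbc)) {
    print "$str: ", (forms_agree($str) ? 'forms agree' : 'forms differ'), "\n";
}
```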

As I mentioned, all of these quantifiers are referred to as ‘greedy’ quantifiers: given a choice, they will match the maximum number of times they can while still allowing the rest of the regex to match. Next week we will look more closely at what greediness means, and how we can use non-greedy forms of the above quantifiers.