Siaris
Simple Things
Syndicate: full/short
Siaris
Categories
General0
News2
Programming2
LanguageBits0
Perl50
Ruby10
VersionControl1
Misc1
Article Calendar
<= July, 2008
S M T W T F S
12345
6789101112
13141516171819
20212223242526
2728293031
Search this blog

Key links
External Blogs
Brought to you by ...
Ruby
1and1.com

Regular Expression Tutorial Part 6: Capturing, Grouping, and Backreferences

Andrew L. Johnson (First published by ItWorld.com 2001-01-18)

In this final installment of our regular expression tutorial we will look at capturing, grouping, and backreferences. We have looked at some of these briefly before, but I couldn’t very well have a multi-part tutorial without including them here.

Parentheses are used for both grouping subexpressions in a pattern, and capturing the text matched for later use in the special digit variables ($1, $2, $3, … etc). For example:

    $_ = 'this string has a bar in it';
    m/has a (foo|bar)/;
    print "$1\n";   # prints: bar

As we saw in the first installment, the grouping effect also limits the scope of the alternation operator. In the above, we match the sequence ‘has a ’ followed by either a ‘foo’ or ‘bar’ sequence and we capture whatever was matched in the parenthesized sub-pattern into the $1 variable.

We can also achieve grouping without capturing by using the grouping only construct: (?:pattern). Sometimes we do not wish to capture a particular subexpression, but merely group it for use with a quantifier, or alternator. This construct can be more efficient because there is no special variable associated with it that must be updated during the match.

We know we can capture something and use it later, either in the replacement side of a s/// operation, or in any expression that follows the match. We can also refer to the text previously matched within the same regular expression by using a backslash followed by the number of the subexpression you want to match again. For example, if we want to read through a wordlist (one word per line) and print out every word that contains two repeated letters we can do this:

    while(<>){
        print if /(\w)\1/;
    }

This means if we find a \w character immediately followed by the same character that was matched (not just followed by any other \w character) then we print that line. Consider how you might try to do that without backreferences. Here’s a very interesting (but not efficient) program for printing out prime numbers using backreferences in a regex (based on a posted to comp.lang.perl.misc by Abigail some time ago):

    #!/usr/bin/perl -w
    use strict;
    my $number = $ARGV[0] || 100;
    for(my $n = 2; $n <= $number; $n++){
        $_ = '1' x $n;
        print "$n is prime\n" unless /^(11+)\1+$/;
    }

You pass it a positive integer on the command line and it will print all the primes up to that number (by default it will print the primes up to 100). All the testing of divisibility is done by the regex engine attempting to match equal groupings of the digit ‘1’ from the start to end of the string.

*****

Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification

Andrew L. Johnson (First published by ItWorld.com 2001-01-11)

Last week I mentioned that the standard quantifiers are greedy. This week we will look at non-greedy forms of quantifiers, but first let’s discuss just what it means to be greedy.

    my $string = 'bcdabdcbabcd';
    $string =~ m/^(.*)ab/;
    print "$1\n";                # prints: bcdabdcb

The * is greedy and therefore the .* portion of the regex will match as much as it can and still allow the remainder of the regex to match. In this case it will match everything up to the last ‘ab’ in the string. Actually, the .* will match right to the end of the string, and then start backing up until it can match an ‘ab’ (this is called backtracking).

To make the quantifier non-greedy you simply follow it with a ’?’ symbol:

    my $string = 'bcdabdcbabcd';
    $string =~ m/^(.*?)ab/;
    print "$1\n";                # prints: bcd

In this case the .*? portion attempts to match the least amount of data while allowing the remainder of the regex to match. Here the regex engine will match the beginning of the string then it will try to match zero of anything and check to see if the rest can match (that fails), next it will match the ‘b’ and then check again if the ‘ab’ can match (still fails). This continues until the the .*? has matched the first 3 characters and then the following ‘ab’ is matched.

You can make any of the standard quantifiers that aren’t exact non-greedy by appending a ’?’ symbol to them: *?, +?, ??, {n,m}?, and {n,}?.

One thing to watch out for: given a pattern such as /^(.*?)%(.*?)/ one could match and extract the first two fields of a like of % separated data:

    #!/usr/bin/perl -w
    use strict;
    $_ = 'Johnson%Andrew%AX321%37';
    m/^(.*?)%(.*?)%/;
    print "$2 $1\n";

And one can easily begin to think of each subexpression as meaning ‘match up to the next % symbol’, but that isn’t exactly what it means. Let’s say that the third field represents an ID tag and we want to extract only those names of people with ID tags starting with ‘A’. We might be tempted to do this:

    #!/usr/bin/perl -w
    use strict;
    while (<DATA>) {
        print "$2 $1\n" if m/^(.*?)%(.*?)%A/;
    }
    __DATA__
    Johnson%Andrew%AX321%Manitoba
    Smith%John%BC142%Alberta

This would print out:

    Andrew Johnson
    John%BC142 Smith

But that isn’t what we wanted at all — what happened? Well, the second half of the regex does not say match up to the next % symbol and then match an ‘A’, it says, match up to the next % symbol that is followed by an ‘A’. The pattern ’(.*?)’ part is not prevented from matching and proceeding past a % character if that is what is necessary to cause the whole regex to succeed. What we really wanted in this case was a negated character class:

    #!/usr/bin/perl -w
    use strict;
    while (<DATA>) {
        print "$2 $1\n" if m/^([^%]*)%([^%]*)%A/;
    }
    __DATA__
    Johnson%Andrew%AX321%Manitoba
    Smith%John%BC142%Alberta

Now we are saying exactly what we want: the first subexpression grabs zero or more of anything except a % character, then we match a % character, then the second subexpression also grabs zero or more of anything but a % character, and finally we match ’%A’ or we fail.

To summarize, a greedy quantifier takes as much as it can get, and a non-greedy quantifier takes as little as possible (in both cases only while still allowing the entire regex to succeed). Take care in how you use non-greedy quantifiers — it is easy to get fooled into using one where a negated character class is more appropriate.

*****

Regular Expressions Tutorial, Part 4: More Quantifiers

Andrew L. Johnson (First published by ItWorld.com 2001-01-04)

In our first tutorial we introduced the * quantifier which means match zero or more of the previous regex item. Perl’s regular expressions have several additional quantifiers and we will explore them here.

The quantifier most resembling the * in both appearance and action is the + quantifier — this one means match one or more of the previous regex item. In other words, if we wanted to match one more vowel characters we could do the following:

    /[aeiou]+/   which is equivelant to  /[aeiou][aeiou]*/

There is one other single-character quantifier — the ? symbol which means zero or one of the previous regex item. Think of this one as simply being an optional match. If we want to match something resembling numbers we need to be able to grab the integer portion and optionally the decimal portion if it exists:

    $_ = 'here is an integer 123 and a floating point 123.456 number';
    while ( /(\d+(\.\d+)?)/g ) {
        print "Found: $1\n";
    }

The regex reads: match one or more digits optionally followed by a decimal point and one or more digits.

So far our quantifiers have been rather indeterminate — that is, we haven’t been able to say how many items to match, only a minimum number to match (zero in the case of * and ?, and one in the case of the + quantifier). Perl regular expressions have another form of quantifier that allows you to match a certain number of times, or to set specific lower and upper limits on the number of times you want to match a certain item. I call this the bounded quantifier.

The basic form of the bounded quantifier is {n,m}, where the n stands for the lower bound, and the m stands for the upper bound. So, the pattern /a{2,5}/ means match an ‘a’ at least two times but no more than 5 times. Like all of the quantifiers I’ve mentioned, this is greedy and will match as many as it can within the given bounds (we will consider greediness in the next installment).

Variations of the bounded quantifier are as follows:

    {n,m}   match at least n times, but no more than m times
    {n,}    match n or more times
    {n}     match exactly n times

From these forms you can see that the three previous quantifiers (*, +, ?) all have some equivelant form of representation using the bounded quantifier:

    *  ==  {0,}  zero or more
    +  ==  {1,}  one or more
    ?  ==  {0,1} zero or one

As I mentioned, all of these quantifiers are referred to as ‘greedy’ quantifiers — given a choice they will match the maximum number of times they can while still allowing the rest of the regex to match. Next week we will look closer at what greediness means, and how we can use non-greedy forms of the above quantifiers.

Regular Expressions Tutorial, Part 3: Anchoring Matches

Andrew L. Johnson (First published by ItWorld.com 2000-12-21)

In the last couple of weeks we have covered what I refer to as the five basic concepts (concatenation, alternation, quantification, grouping, and wildcards) and we have introduced character classes. This week we introduce a new kind of regex element: an anchor.

An anchor is a way to specify a position in the target string that has particular properties. The two main anchors are the caret ’^’ and the dollar ’$’ symbols, which refer to the start and end of the target string respectively. Thus, the pattern /foo/ will match if the target string contains ‘foo’ anywhere, but the pattern /^foo/ will only match if the target string contains ‘foo’ at the beginning of the string. The pattern reads: match the start of the string followed by an ‘f’ followed by an ‘o’ followed by an ‘o’. Similarly, the pattern /foo$/ will match if the target contains ‘foo’ at the end of the string. The precise meaning of these two anchors can be changed with the /m modifier which will cause them to match at the beginning and end of each line within the string rather than just at the beginning and end of the entire string. The \A and \Z anchors are similar to ^ and $ respectively, but they always the beginning and end of the string and never at internal line boundaries.

If we wanted a script to count the lines of code (LOC) in a Perl program and ignore comment lines, blank lines, and lines after the END or DATA tags, we could write it as:

    #!/usr/bin/perl -w
    use strict;
    my $count = 0;
    while(<>){
        next if /^\s*#/;  # ignore comment lines
        next if /^\s*$/;  # ignore blank lines
        last if /^(__END__|__DATA__)/;  # stop
        $count++;
    }
    print "There are $count lines of code\n";

It isn’t perfect (there could be POD markup anywhere within the program and not just after the END or DATA tokens), but it provides a reasonable measure. The first regex /^\s*#/ matches any line starting with optional whitespace and a # character (a line containing only a comment). The second regex /^\s*$/ matches a line containing only optional whitespace (blank lines), and the final regex matches lines beginning with one of the two program-ending tokens (if you are putting your subroutines after such a token for auto-loading purposes then you wouldn’t want to include this line in your LOC counting program).

Another anchor is the \b metacharacter which matches what is often called word boundary. This matches a position in the target string between a \w and \W character, or between the start or end of the string and a \w character. The \G anchor matches the point where the previous m//g match left off (that is, at the current pos() for the target string).

Anchors are also referred to as zero-width assertions because they match a position in the string and do not consume any characters in the string. Thus, other zero-width constructs such as positive and negative look ahead assertions can also be thought of as anchors. A positive look ahead is written as (?=some pattern) and means that we match the current position in the string (without consuming anything) only if ‘some pattern’ could match at this point. A negative look ahead (?!some pattern) matches the current position if the given pattern fails at the current position in the target string.

Consider a case where we want to print any line containing all search terms in any order. If you know how many terms you’ll have in advance (say 3), you could something along the lines of:

    print if /foo/ && /bar/ && /baz/;

However, the following construct of multiple look ahead assertions can be useful in other cases:

    print if /^(?=.*foo)(?=.*bar)(?=.*baz)/;

This works because none of the look ahead assertions consume any of the target string, so that each assertion is tested from the beginning of the string in turn.

Consider a program that accepts as its first argument a string of space separated search terms — you do not know how many you’ll get but you want to print any line containing all of the terms in any order:

    #!/usr/bin/perl -w
    use strict;
    my $search = shift @ARGV;
    my $pattern = join('', map{"(?=.*$_)"} split " ", $search);
    while(<>){
        print if /^$pattern/o;
    }

Here we have constructed the multiple look ahead pattern by splitting the first argument into component search terms and wrapping each inside of a look ahead assertion.

*****

Regular Expressions Tutorial, Part 2 Character Classes

Andrew L. Johnson (First published by ItWorld.com 2000-12-14)

A character class is a way of specifying a list of characters, any of which you would like to match at that point in the pattern. We create a character class by enclosing the characters within square brackets like so: [abc]. That would match one of either ‘a’, ‘b’, or ‘c’, and as you can see it is just a simpler way to write (a|b|c).

However, character classes also allow us to specify a range of characters like so: [a-z]. That class matches any lower case letter from ‘a’ to ‘z’ inclusive. To match any alphabetical character regardless of case we can combine two ranges in the same character class: [a-zA-Z]. We can also use match any digit character using the same technique [0-9]. Last week we mentioned how problematic it would be to use only grouping and alternation to match an ‘f’ followed by any single digit followed by any alphabetical character (regardless of case). Now we can do so easily: m/f[0-9][a-zA-Z]/.

Character classes have a useful opposite called a negated or negative character class, and we specify one of those by using the ’^’ character as the very first character in the class: [^a-m], this class means match any character that is not a lower case letter from ‘a’ to ‘m’.

There are three shortcut sequences that represent three widely used character classes: \w means any alphanumeric or underscore character and is equivalent to the class: [a-zA-Z0-9_]; \s means any whitespace character including spaces, tabs, newlines, carriage returns, or line-feeds and represents the class: [ \t\r\f\n]. Finally, \d represents a digit as in the class [0-9]. Each of these three shortcuts also have an opposite that represents the negated version of each: \W (anything but an alphanumeric or underscore), \S (anything but a whitespace character), and \D (anything but a digit).

With these shortcuts we can begin to really put together some complex patterns with ease. Let’s consider a case where we are given a file of % separated records that contain date stamps in the last field, and these dates are of the form: Thu Nov 30 17:42:47 2000. Our problem is that the data-entry operator who worked the night shift was unreliable so we want to print out all records whose time portion of the date stamp is between 5:30pm (or 17:30:00) and the end of the night shift at 11:30pm (21:30:00) — there won’t be any later than that time so we only have to ensure we are past 17:30:00.

    while(<DATA>){
        my @fields = split /%/;
        print if $fields[2] =~ m/17:[3-6]\d:\d\d|1[89]:\d\d:\d\d|2\d:\d\d:\d\d/;
    }
    __DATA__
    foo%bar%Thu Nov 30 17:42:47 2000
    foo%bar%Thu Nov 30 19:42:47 2000
    foo%bar%Thu Nov 30 20:42:47 2000
    foo%bar%Thu Nov 30 17:20:47 2000
    foo%bar%Thu Nov 30 21:20:47 2000
    foo%bar%Thu Nov 30 16:42:47 2000

Character classes allow us great convenience in specifying a set of characters to match (or not match in the case of a negated character class) — think of them as restricted wildcard items, a dot is a (mostly) unrestricted wildcard and a character class is a wildcard restricted to the elements inside of the class.

*****