Regular Expressions Tutorial, Part 3: Anchoring Matches

Andrew L. Johnson (First published by ItWorld.com 2000-12-21)

In the last couple of weeks we have covered what I refer to as the five basic concepts (concatenation, alternation, quantification, grouping, and wildcards) and we have introduced character classes. This week we introduce a new kind of regex element: an anchor.

An anchor specifies a position in the target string that has particular properties. The two main anchors are the caret ‘^’ and the dollar ‘$’ symbols, which refer to the start and end of the target string respectively. Thus, the pattern /foo/ will match if the target string contains ‘foo’ anywhere, but the pattern /^foo/ will only match if the target string has ‘foo’ at the beginning. The pattern reads: match the start of the string followed by an ‘f’ followed by an ‘o’ followed by an ‘o’. Similarly, the pattern /foo$/ will match if the target has ‘foo’ at the end of the string (or just before a newline that ends the string). The precise meaning of these two anchors can be changed with the /m modifier, which causes them to match at the beginning and end of each line within the string rather than just at the beginning and end of the entire string. The \A and \Z anchors are similar to ^ and $ respectively, but they always match at the beginning and end of the string and never at internal line boundaries.
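As a small illustration of the difference the /m modifier makes, the following sketch counts matches against a three-line string. The sample text and variable names are made up for this example.

```perl
use strict;
use warnings;

# Sample data (invented for this example): three lines in one string.
my $text = "foo one\ntwo foo\nfoo three\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^foo/g;

# With /m, ^ also matches just after each embedded newline,
# so the first and third lines both match.
my @multi = $text =~ /^foo/mg;

# \A ignores /m and still anchors only at the start of the string.
my @abs = $text =~ /\Afoo/mg;

# $ matches at the end of the string, or just before a final newline.
my $at_end = ("two foo\n" =~ /foo$/) ? 1 : 0;

printf "plain=%d multi=%d abs=%d at_end=%d\n",
    scalar @plain, scalar @multi, scalar @abs, $at_end;
# prints: plain=1 multi=2 abs=1 at_end=1
```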

If we wanted a script to count the lines of code (LOC) in a Perl program and ignore comment lines, blank lines, and lines after the __END__ or __DATA__ tokens, we could write it as:

    #!/usr/bin/perl -w
    use strict;
    my $count = 0;
    while(<>){
        next if /^\s*#/;  # ignore comment lines
        next if /^\s*$/;  # ignore blank lines
        last if /^(__END__|__DATA__)/;  # stop
        $count++;
    }
    print "There are $count lines of code\n";

It isn’t perfect (there could be POD markup anywhere within the program and not just after the __END__ or __DATA__ tokens), but it provides a reasonable measure. The first regex /^\s*#/ matches any line starting with optional whitespace and a # character (a line containing only a comment). The second regex /^\s*$/ matches a line containing only optional whitespace (blank lines). The final regex matches lines beginning with one of the two program-ending tokens; if you are putting your subroutines after such a token for auto-loading purposes, you wouldn’t want to include this line in your LOC counting program.
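To see how the three patterns classify individual lines, here is a minimal sketch that wraps the same loop in a subroutine and runs it on an in-memory sample. The name count_loc and the sample lines are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: the same three tests as the script above,
# applied to a list of lines instead of to <>.
sub count_loc {
    my @lines = @_;
    my $count = 0;
    for (@lines) {
        next if /^\s*#/;                 # comment-only line
        next if /^\s*$/;                 # blank line
        last if /^(__END__|__DATA__)/;   # stop at the end token
        $count++;
    }
    return $count;
}

# Invented sample program text.
my @sample = (
    "#!/usr/bin/perl\n",                  # starts with '#': skipped
    "use strict;\n",                      # counted
    "\n",                                 # blank: skipped
    "   # indented comment\n",            # skipped
    "my \$x = 1;  # trailing comment\n",  # counted (code comes first)
    "__END__\n",                          # counting stops here
    "print 'not counted';\n",
);

my $loc = count_loc(@sample);
print "There are $loc lines of code\n";
# prints: There are 2 lines of code
```

Note that the shebang line is skipped as a comment, and a line with a trailing comment still counts as code because the # is not the first non-whitespace character.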

Another anchor is the \b metacharacter, which matches what is often called a word boundary. This matches a position in the target string between a \w and a \W character, or between the start or end of the string and a \w character. The \G anchor matches the point where the previous m//g match left off (that is, at the current pos() for the target string).
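A brief sketch of both anchors follows. The sample strings and the tiny tokenizer are invented for illustration. The tokenizer also uses the /c modifier, which is not discussed above: it keeps pos() from being reset when a match fails, so the next alternative can be tried at the same position.

```perl
use strict;
use warnings;

my $line = "cat concatenate cat's scatter";

# \b on both sides: 'cat' must stand alone as a word. This matches the
# first 'cat' and the 'cat' in "cat's" (the apostrophe is a \W character).
my $whole = () = $line =~ /\bcat\b/g;

# No anchors: 'cat' anywhere, including inside longer words.
my $any = () = $line =~ /cat/g;

print "whole=$whole any=$any\n";   # prints: whole=2 any=4

# \G: each match must start exactly where the previous one ended.
my $input = "12 + 34";
my @tokens;
pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)       { next }
    elsif ($input =~ /\G(\d+)/gc)     { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([-+*\/])/gc) { push @tokens, "OP:$1" }
    else  { die "unexpected character at offset " . pos($input) . "\n" }
}
print "@tokens\n";                 # prints: NUM:12 OP:+ NUM:34
```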

Anchors are also referred to as zero-width assertions because they match a position in the string and do not consume any characters in the string. Thus, other zero-width constructs such as positive and negative look ahead assertions can also be thought of as anchors. A positive look ahead is written as (?=some pattern) and means that we match the current position in the string (without consuming anything) only if ‘some pattern’ could match at this point. A negative look ahead (?!some pattern) matches the current position if the given pattern fails at the current position in the target string.
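The following sketch shows that a look ahead tests what follows without consuming it. The sample string is invented, and the match offsets are read from Perl's built-in @- array.

```perl
use strict;
use warnings;

my $str = "foobar foobaz";

# Positive look ahead: match 'foo' only where 'bar' follows.
# Only 'foo' is consumed; 'bar' is checked but not part of the match.
my ($pos_match) = $str =~ /(foo(?=bar))/;
my $pos_at = $-[0];    # offset where the match started

# Negative look ahead: match 'foo' only where 'bar' does NOT follow.
# The first 'foo' is rejected, so the match is found in 'foobaz'.
$str =~ /foo(?!bar)/;
my $neg_at = $-[0];

print "positive: '$pos_match' at $pos_at, negative at $neg_at\n";
# prints: positive: 'foo' at 0, negative at 7
```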

Consider a case where we want to print any line containing all search terms in any order. If you know how many terms you’ll have in advance (say 3), you could do something along the lines of:

    print if /foo/ && /bar/ && /baz/;

However, the following construct of multiple look ahead assertions can be useful in other cases:

    print if /^(?=.*foo)(?=.*bar)(?=.*baz)/;

This works because none of the look ahead assertions consume any of the target string, so that each assertion is tested from the beginning of the string in turn.
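As a quick check that the order of the terms does not matter, this sketch runs the same pattern over a few invented sample lines.

```perl
use strict;
use warnings;

# The pattern from above, precompiled with qr//.
my $re = qr/^(?=.*foo)(?=.*bar)(?=.*baz)/;

# Invented sample lines.
my @lines = (
    "foo bar baz",             # all three, in order
    "baz then bar then foo",   # all three, reversed
    "foo and bar only",        # the third term is missing
    "bazbarfoo",               # all three, run together
);

my @hits = grep { /$re/ } @lines;
print "$_\n" for @hits;        # prints every line except "foo and bar only"
```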

Consider a program that accepts as its first argument a string of space-separated search terms. You do not know how many you’ll get, but you want to print any line containing all of the terms in any order:

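    #!/usr/bin/perl -w
    use strict;
    # first argument: space-separated search terms
    my $search = shift @ARGV;
    # wrap each term in its own look ahead: (?=.*term1)(?=.*term2)...
    my $pattern = join('', map{"(?=.*$_)"} split " ", $search);
    while(<>){
        print if /^$pattern/o;  # /o: compile the pattern only once
    }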

Here we have constructed the multiple look ahead pattern by splitting the first argument into component search terms and wrapping each inside of a look ahead assertion.
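The program above interpolates each term as a regex, so a term such as ‘c++’ would be read as pattern syntax. If the terms should instead be treated as literal text, one possible variation (an assumption, not part of the original program) escapes each term with quotemeta and precompiles the result with qr//. The helper name build_all_terms_pattern is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build an "all terms, any order" pattern from a
# space-separated string, treating each term as literal text.
sub build_all_terms_pattern {
    my ($search) = @_;
    my $lookaheads = join '',
        map { '(?=.*' . quotemeta($_) . ')' }
        split ' ', $search;
    return qr/^$lookaheads/;
}

my $re = build_all_terms_pattern('a.b c++');

# Invented sample lines.
my @hits = grep { $_ =~ $re } (
    'c++ and a.b',   # both literal terms present
    'c and axb',     # would match 'a.b' only if '.' were a wildcard
    'a.b only',      # second term missing
);
print "$_\n" for @hits;   # prints: c++ and a.b
```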
