Siaris
Simple Things
The /c Regex Modifier

Andrew L. Johnson (First published by ItWorld.com 2001-10-11)

To understand the /c regex modifier you first need to know how the /g modifier and the \G anchor behave. In scalar context (such as a while condition), the /g modifier, as you probably already know, means ‘each time the match is evaluated, continue searching from where the previous match ended, until the pattern can no longer be found’:

    $_ = '123456abc789';
    my $pattern = '\d\d\d';
    while ( m/($pattern)/g ) {
        print "$1\n";
    }

The above matches each sequence of three digits in turn and runs the loop body once per match, printing ‘123’, ‘456’, and ‘789’. Each string has a positional marker associated with it that records where the last successful /g match ended (you can read or set this marker directly with the pos() function), so the regex engine knows where to continue searching from in the string. When the pattern can no longer be found, the match operator returns false (ending the while loop in this case) and the positional marker is reset: pos() returns undef, and the next /g match starts from the beginning of the string.
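
To watch the positional marker move, we can record pos() after every successful match. This is only a minimal sketch, not something from the original article: trace_matches is a hypothetical helper name, and the string 'undef' is just a printable stand-in for an undefined pos(), which is how Perl reports a marker that has been reset to the beginning of the string.

```perl
use strict;
use warnings;

# trace_matches is a hypothetical helper (not from the article).
# It returns a reference to a list of [match, offset] pairs, plus a
# printable description of pos() once the /g loop has run out of matches.
sub trace_matches {
    my ($string) = @_;
    my @trace;
    while ( $string =~ m/(\d\d\d)/g ) {
        # pos() is the offset just past the end of the match we just made
        push @trace, [ $1, pos($string) ];
    }
    # The failed match that ended the loop has cleared the marker
    my $after = defined pos($string) ? pos($string) : 'undef';
    return ( \@trace, $after );
}

my ( $trace, $after ) = trace_matches('123456abc789');
print "matched $_->[0], marker now at $_->[1]\n" for @$trace;
print "after the failed match, pos() is $after\n";
```

Run on the article's string, the recorded offsets should be 3, 6, and 12, with the marker cleared once the loop ends.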

One thing to notice is that the above snippet skips over the ‘abc’ part of the string. On the third attempt to match, we start at position 6 (right before the ‘a’), but nothing forces the match to begin at that point, so the engine scans ahead and finds ‘789’. To require each match to begin exactly where the previous one left off, we add the \G anchor:

    $_ = '123456abc789';
    my $pattern = '\d\d\d';
    while ( m/\G($pattern)/g ) {
        print "$1\n";
    }

In this case, each occurrence of $pattern must be found immediately following the positional marker (either the beginning of the string, or wherever the last successful match left off). Thus, this snippet finds and prints only ‘123’ and ‘456’, and then the match fails at the ‘a’.
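
Because a failed /g match clears the marker, the only way to learn where an anchored run stopped is to save pos() while the loop is still succeeding. The sketch below does that under a few assumptions of our own: anchored_prefix and $stopped_at are hypothetical names, and the pattern is hard-coded to three digits to mirror the snippet above.

```perl
use strict;
use warnings;

# anchored_prefix is a hypothetical helper (not from the article).
# It consumes three-digit groups with \G and reports the offset at
# which anchored matching could go no further.
sub anchored_prefix {
    my ($string) = @_;
    my @groups;
    my $stopped_at = 0;
    while ( $string =~ m/\G(\d\d\d)/g ) {
        push @groups, $1;
        # Save the offset now: the failed match that ends the loop clears pos()
        $stopped_at = pos($string);
    }
    return ( \@groups, $stopped_at );
}

my $text = '123456abc789';
my ( $groups, $stopped_at ) = anchored_prefix($text);
print "groups: @$groups\n";
print "unconsumed: ", substr( $text, $stopped_at ), "\n";
```

For the article's string this should report the groups 123 and 456 and leave ‘abc789’ unconsumed, since \G refuses to skip past the letters.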

What if we wanted to be able to match different patterns while stepping through the string (say, sequences of three digits or three lowercase letters)? We could set up an alternation pattern and then test the captured results:

    $_ = '123456abc789';
    my $pattern = '\d\d\d|[a-z]{3}';
    while ( m/\G($pattern)/g ) {
        my $result = $1;
        if ($result =~ /\d/) {
            print "We got 3 digits\n";
        } else {
            print "We got 3 letters\n";
        }
    }

That’s not horrible, though we had to test for digits twice (once in the original pattern, and once in the if test). This gets more cumbersome as the number of alternatives to distinguish grows (and potentially slower, because alternations in regexen are somewhat slow).

The /c modifier allows a /g match to fail without resetting the positional marker — so we can try another match:

    $_ = '123XYZ456abc789';
    while (1) {
        print "Got digits ($1)\n" and next if m/\G(\d\d\d)/gc;
        print "Got UCase  ($1)\n" and next if m/\G([A-Z]{3})/gc;
        print "Got LCase  ($1)\n" and next if m/\G([a-z]{3})/gc;
        print "End of Parsing\n"  and last if m/\G$/gc;
        print "Parse Error at position: ", pos(), "\n" and last;
    }

Now we never skip over any data that we haven’t accounted for, yet when any regex fails we simply try the next regex from the same position. Our parse of the string only fails if all of the regexen fail and we reach the last line of the loop. The above parses the whole string successfully, but if you try $_ = '123ABC456ab789'; you’ll get a parse error message at position 9. Without the /c modifier you would have a problem: as soon as the first regex failed, it would reset the positional marker (meaning the next regex would start from the beginning of the string rather than where you wanted).
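
The same loop can be wrapped in a subroutine that hands back the tokens it found instead of printing them. Treat this as a sketch under our own assumptions: tokenize, the token labels, and the ( \@tokens, $error_offset ) return convention are all hypothetical, and the defined-or operator (//) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# tokenize is a hypothetical wrapper around the /gc loop (not from the article).
# It returns ( \@tokens, $error_offset ); $error_offset is undef on success.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    while (1) {
        # Each failed /gc match leaves pos($input) alone, so the next
        # branch tries again from exactly the same offset.
        if    ( $input =~ m/\G(\d\d\d)/gc )   { push @tokens, [ digits => $1 ] }
        elsif ( $input =~ m/\G([A-Z]{3})/gc ) { push @tokens, [ ucase  => $1 ] }
        elsif ( $input =~ m/\G([a-z]{3})/gc ) { push @tokens, [ lcase  => $1 ] }
        elsif ( $input =~ m/\G$/gc )          { return ( \@tokens, undef ) }
        else {
            # pos() is still undef if nothing has matched yet, hence the // 0
            return ( \@tokens, pos($input) // 0 );
        }
    }
}

for my $input ( '123XYZ456abc789', '123ABC456ab789' ) {
    my ( $tokens, $error_at ) = tokenize($input);
    print join( ' ', map { "$_->[0]=$_->[1]" } @$tokens ), "\n";
    print defined $error_at
        ? "parse error at position $error_at\n"
        : "parsed cleanly\n";
}
```

Returning the offset rather than printing it lets the caller decide how to report a failure; for the second string the offset should again be 9.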
