In this final installment of our regular expression tutorial we will look at capturing, grouping, and backreferences. We have looked at some of these briefly before, but I couldn’t very well have a multi-part tutorial without including them here.
Parentheses serve two purposes: they group subexpressions in a pattern, and they capture the matched text for later use in the special digit variables ($1, $2, $3, and so on). For example:
$_ = 'this string has a bar in it';
m/has a (foo|bar)/;
print "$1\n"; # prints: bar
As we saw in the first installment, the grouping effect also limits the scope of the alternation operator. In the above, we match the sequence ‘has a ’ followed by either a ‘foo’ or ‘bar’ sequence and we capture whatever was matched in the parenthesized sub-pattern into the $1 variable.
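When a pattern has more than one pair of capturing parentheses, the groups are numbered from left to right by their opening parenthesis. Here is a small sketch of that; the date string and variable names are made up purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $date = '2024-03-15';

# Groups are numbered by the position of their opening parenthesis.
if ($date =~ /^(\d{4})-(\d{2})-(\d{2})$/) {
    print "year=$1 month=$2 day=$3\n";   # prints: year=2024 month=03 day=15
}

# In list context a successful match returns the captured substrings.
my ($year, $month, $day) = $date =~ /^(\d{4})-(\d{2})-(\d{2})$/;

# The captures are also available on the replacement side of s///.
(my $reordered = $date) =~ s/^(\d{4})-(\d{2})-(\d{2})$/$3.$2.$1/;
print "$reordered\n";                    # prints: 15.03.2024
```

Note that the digit variables are only meaningful after a successful match, which is why the first print sits inside the if block.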
We can also group without capturing by using the grouping-only construct: (?:pattern). Sometimes we do not wish to capture a particular subexpression, but merely group it for use with a quantifier or an alternation. This construct can be more efficient because there is no special variable associated with it that must be updated during the match, and it leaves the numbering of the remaining capturing groups undisturbed.
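To see the difference, here is a minimal sketch that reuses the sample string from earlier. The extra (a|the) alternation and the 'hohoho' string are my own additions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'this string has a bar in it';

# Capturing version: (a|the) is group 1, so (foo|bar) becomes group 2.
if ($text =~ /has (a|the) (foo|bar)/) {
    print "$2\n";   # prints: bar
}

# Grouping-only version: (?:a|the) groups but does not capture,
# so (foo|bar) is group 1 again.
if ($text =~ /has (?:a|the) (foo|bar)/) {
    print "$1\n";   # prints: bar
}

# (?:...) also lets a quantifier apply to a whole sequence.
my $chant = 'hohoho';   # hypothetical sample
print "three times ho\n" if $chant =~ /^(?:ho){3}$/;   # prints: three times ho
```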
We know we can capture something and use it later, either in the replacement side of an s/// operation, or in any expression that follows the match. We can also refer to previously matched text within the same regular expression by using a backslash followed by the number of the subexpression we want to match again. For example, if we want to read through a wordlist (one word per line) and print out every word that contains the same character twice in a row (a doubled letter, as in 'book'), we can do this:
while (<>) {
    print if /(\w)\1/;
}
This prints a line only if it contains a \w character immediately followed by the very same character that was just captured, not merely by another \w character. Consider how you might try to do that without backreferences.
Here’s a very interesting (but not efficient) program for printing out prime numbers using backreferences in a regex (based on a post to comp.lang.perl.misc by Abigail some time ago):
#!/usr/bin/perl -w
use strict;

my $number = $ARGV[0] || 100;
for (my $n = 2; $n <= $number; $n++) {
    $_ = '1' x $n;
    print "$n is prime\n" unless /^(11+)\1+$/;
}
You pass it a positive integer on the command line and it prints all the primes up to that number (up to 100 by default). All the testing of divisibility is done by the regex engine: the string of $n ‘1’ characters matches the pattern only if it can be split, from start to end, into two or more equal groups of at least two ‘1’s each. In other words, the match succeeds only when $n is composite, so we print $n when it fails.
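To make the mechanism a little more visible, the sketch below wraps the same pattern in a helper and reports the size of the group that matched. The name largest_divisor is hypothetical, not part of the original program. Because (11+) is greedy, the engine tries the longest group first, so for a composite number the length of $1 is its largest proper divisor:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the length of the group captured by the
# prime-testing regex, or 0 if the regex fails (meaning $n is prime
# for any $n >= 2).
sub largest_divisor {
    my ($n) = @_;
    my $ones = '1' x $n;
    return $ones =~ /^(11+)\1+$/ ? length($1) : 0;
}

for my $n (7, 9, 12) {
    my $d = largest_divisor($n);
    if ($d) {
        printf "%d = %d x %d\n", $n, $d, $n / $d;
    }
    else {
        print "$n is prime\n";
    }
}
# prints:
# 7 is prime
# 9 = 3 x 3
# 12 = 6 x 2
```

For 9, for instance, the engine gives up on groups of nine down to four ‘1’s before a group of three, repeated twice more by \1+, reaches the end of the string.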