Siaris
Simple Things
Syndicate: full/short
Siaris
Categories
General0
News2
Programming2
LanguageBits0
Perl50
Ruby10
VersionControl1
Misc1
Article Calendar
<= May, 2012
S M T W T F S
12345
6789101112
13141516171819
20212223242526
2728293031
Search this blog

Key links
External Blogs
Brought to you by ...
Ruby
1and1.com

Regular Expressions Tutorial, Part 3: Anchoring Matches

Andrew L. Johnson (First published by ItWorld.com 2000-12-21)

In the last couple of weeks we have covered what I refer to as the five basic concepts (concatenation, alternation, quantification, grouping, and wildcards) and we have introduced character classes. This week we introduce a new kind of regex element: an anchor.

An anchor is a way to specify a position in the target string that has particular properties. The two main anchors are the caret ’^’ and the dollar ’$’ symbols, which refer to the start and end of the target string respectively. Thus, the pattern /foo/ will match if the target string contains ‘foo’ anywhere, but the pattern /^foo/ will only match if the target string contains ‘foo’ at the beginning of the string. The pattern reads: match the start of the string followed by an ‘f’ followed by an ‘o’ followed by an ‘o’. Similarly, the pattern /foo$/ will match if the target contains ‘foo’ at the end of the string. The precise meaning of these two anchors can be changed with the /m modifier which will cause them to match at the beginning and end of each line within the string rather than just at the beginning and end of the entire string. The \A and \Z anchors are similar to ^ and $ respectively, but they always the beginning and end of the string and never at internal line boundaries.

If we wanted a script to count the lines of code (LOC) in a Perl program and ignore comment lines, blank lines, and lines after the END or DATA tags, we could write it as:

    #!/usr/bin/perl -w
    use strict;
    my $count = 0;
    while(<>){
        next if /^\s*#/;  # ignore comment lines
        next if /^\s*$/;  # ignore blank lines
        last if /^(__END__|__DATA__)/;  # stop
        $count++;
    }
    print "There are $count lines of code\n";

It isn’t perfect (there could be POD markup anywhere within the program and not just after the END or DATA tokens), but it provides a reasonable measure. The first regex /^\s*#/ matches any line starting with optional whitespace and a # character (a line containing only a comment). The second regex /^\s*$/ matches a line containing only optional whitespace (blank lines), and the final regex matches lines beginning with one of the two program-ending tokens (if you are putting your subroutines after such a token for auto-loading purposes then you wouldn’t want to include this line in your LOC counting program).

Another anchor is the \b metacharacter which matches what is often called word boundary. This matches a position in the target string between a \w and \W character, or between the start or end of the string and a \w character. The \G anchor matches the point where the previous m//g match left off (that is, at the current pos() for the target string).

Anchors are also referred to as zero-width assertions because they match a position in the string and do not consume any characters in the string. Thus, other zero-width constructs such as positive and negative look ahead assertions can also be thought of as anchors. A positive look ahead is written as (?=some pattern) and means that we match the current position in the string (without consuming anything) only if ‘some pattern’ could match at this point. A negative look ahead (?!some pattern) matches the current position if the given pattern fails at the current position in the target string.

Consider a case where we want to print any line containing all search terms in any order. If you know how many terms you’ll have in advance (say 3), you could something along the lines of:

    print if /foo/ && /bar/ && /baz/;

However, the following construct of multiple look ahead assertions can be useful in other cases:

    print if /^(?=.*foo)(?=.*bar)(?=.*baz)/;

This works because none of the look ahead assertions consume any of the target string, so that each assertion is tested from the beginning of the string in turn.

Consider a program that accepts as its first argument a string of space separated search terms — you do not know how many you’ll get but you want to print any line containing all of the terms in any order:

    #!/usr/bin/perl -w
    use strict;
    my $search = shift @ARGV;
    my $pattern = join('', map{"(?=.*$_)"} split " ", $search);
    while(<>){
        print if /^$pattern/o;
    }

Here we have constructed the multiple look ahead pattern by splitting the first argument into component search terms and wrapping each inside of a look ahead assertion.

*****

Regular Expressions Tutorial, Part 2 Character Classes

Andrew L. Johnson (First published by ItWorld.com 2000-12-14)

A character class is a way of specifying a list of characters, any of which you would like to match at that point in the pattern. We create a character class by enclosing the characters within square brackets like so: [abc]. That would match one of either ‘a’, ‘b’, or ‘c’, and as you can see it is just a simpler way to write (a|b|c).

However, character classes also allow us to specify a range of characters like so: [a-z]. That class matches any lower case letter from ‘a’ to ‘z’ inclusive. To match any alphabetical character regardless of case we can combine two ranges in the same character class: [a-zA-Z]. We can also use match any digit character using the same technique [0-9]. Last week we mentioned how problematic it would be to use only grouping and alternation to match an ‘f’ followed by any single digit followed by any alphabetical character (regardless of case). Now we can do so easily: m/f[0-9][a-zA-Z]/.

Character classes have a useful opposite called a negated or negative character class, and we specify one of those by using the ’^’ character as the very first character in the class: [^a-m], this class means match any character that is not a lower case letter from ‘a’ to ‘m’.

There are three shortcut sequences that represent three widely used character classes: \w means any alphanumeric or underscore character and is equivalent to the class: [a-zA-Z0-9_]; \s means any whitespace character including spaces, tabs, newlines, carriage returns, or line-feeds and represents the class: [ \t\r\f\n]. Finally, \d represents a digit as in the class [0-9]. Each of these three shortcuts also have an opposite that represents the negated version of each: \W (anything but an alphanumeric or underscore), \S (anything but a whitespace character), and \D (anything but a digit).

With these shortcuts we can begin to really put together some complex patterns with ease. Let’s consider a case where we are given a file of % separated records that contain date stamps in the last field, and these dates are of the form: Thu Nov 30 17:42:47 2000. Our problem is that the data-entry operator who worked the night shift was unreliable so we want to print out all records whose time portion of the date stamp is between 5:30pm (or 17:30:00) and the end of the night shift at 11:30pm (21:30:00) — there won’t be any later than that time so we only have to ensure we are past 17:30:00.

    while(<DATA>){
        my @fields = split /%/;
        print if $fields[2] =~ m/17:[3-6]\d:\d\d|1[89]:\d\d:\d\d|2\d:\d\d:\d\d/;
    }
    __DATA__
    foo%bar%Thu Nov 30 17:42:47 2000
    foo%bar%Thu Nov 30 19:42:47 2000
    foo%bar%Thu Nov 30 20:42:47 2000
    foo%bar%Thu Nov 30 17:20:47 2000
    foo%bar%Thu Nov 30 21:20:47 2000
    foo%bar%Thu Nov 30 16:42:47 2000

Character classes allow us great convenience in specifying a set of characters to match (or not match in the case of a negated character class) — think of them as restricted wildcard items, a dot is a (mostly) unrestricted wildcard and a character class is a wildcard restricted to the elements inside of the class.

*****

Regular Expressions Tutorial -- Part 1 Back to Basics

Andrew L. Johnson (First published by ItWorld.com 2000-12-07)

A regular expression (regex) is a way of describing a pattern of text (rather than merely a literal substring of text) that we may want to match, extract, or replace with something else. We create such patterns using the regex language features which consists largely of literal characters (alphanumeric and a few others) that stand for themselves, and several special characters or character sequences that have particular meanings within a regex pattern.

In this first part of the tutorial we will outline the 5 basic concepts needed to understand regular expressions:

  1. Concatenation: This is an implicit assumption meaning simply that we can create larger, more complex patterns by simply combining simpler patterns. For example, m/f/ is a pattern that matches the character ‘f’, and m/o/ is a pattern that matches the character ‘o’. We can combine these into m/foo/ to match the character sequence ‘foo’.
  2. Alternation: The ’|’ character is a meta-character inside a regular expression. It acts as an operator that allows us to specify two or more alternative sub-patterns. For example: the pattern m/ab|cd/ will match either ‘ab’ or ‘cd’.
  3. Iteration: The ’*’ meta-character is an iterative or quantitative operator that means to match zero-or-more of the previous element. For example: the pattern m/a*b/ would match ‘ab’, ‘aab’, ‘aaab’, etc., or even just ‘b’ (ie, zero-or-more ‘a’ characters followed by a ‘b’ character).
  4. Grouping: Parentheses give us a way to create subexpressions that are treated as a unit. For example, if we want to match zero-or-more occurrences of the substring ‘foo’ we could specify our pattern as: m/(foo)*/. Here the * is used outside of the parentheses and applies the whole parenthesized subexpression. Parentheses also govern the scope of alternation: the pattern m/ab|cd/ means match either ‘ab’ or ‘cd’, but the pattern m/a(b|c)d/ means match an ‘a’, then either a ‘b’ or a ‘c’, and finally a ‘d’.
  5. Wildcard: The dot . is wildcard character that matches any character other than a newline character (this can be changed to include the newline as well). Thus, the pattern: m/f.*bar/ will match an ‘f’ followed by zero-or-more of any characters, followed by ‘bar’.

Those are the primary concepts for regular expressions, and although there are many more meta-characters and concepts, many are derived from just these basics. Let’s consider a couple of simple examples.

If we want to read in a file line-by-line and print out only lines that have a ‘foo’ followed somewhere on the same line by ‘bar’ we could use this pattern:

    while(<>){
        print if /foo.*bar/;
    }

However, if we want to print lines that match either ‘foodbar’ or ‘footbar’ we could do this:

    while(<>){
        print if /foo(d|t)bar/;
    }

We can write quite complicated regular expressions using just the above concepts, but they would very quickly get too long to manage. For example, if we wanted to print out lines containing two digits together we could write:

    while(<>){
        print if /(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)/;
    }

As you can see, while we can do it, it won’t be a pleasant task to try to match something like an ‘f’ followed by any digit followed by any alphabetical character (regardless of case) using only alternation as in the above example.

Next week we’ll look at the character class and several shortcut sequences that will make such tasks a great deal simpler (not to mention a lot shorter as well).

*****

Sorting in Perl II (Schwartzian Transform)

Andrew L. Johnson (First published by ItWorld.com 2000-11-30)

Last week we explored using Perl’s sort() function for simple sorting uses. This week we will take a brief look at using the sort() function efficiently when the sort fields require extraction or some computation. We will do this by demonstrating the Schwartzian Transform (named after Randal Schwartz, and often abbreviated to just ST).

Let’s consider a simple case of wanting to sort colon separated records on one or more fields. Here is some sample data:

    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    qux:19:foo:1.2

Now, the following is one attempt to sort this data on the 3rd data field:

    #!/usr/bin/perl -w
    use strict;
    my @data   = <DATA>;
    my @sorted = sort custom @data;
    print @sorted;

    sub custom {
        (split /:/,$a)[2] cmp (split /:/,$b)[2];
    }
    __DATA__
    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    qux:19:foo:1.2

This works but there is an efficiency problem here — when sorting a list the comparison function (custom() in this case) is called many times, and each time in the above case it is performing 2 split() operations. A far better approach is to preprocess the data so that the sort fields may be more directly accessed, and any operations performed on the data need only be performed once per record. After sorting, the data can then be post-processed to obtain the original records. Here is one way of doing that with each step separated:

    #!/usr/bin/perl -w
    use strict;
    my @data = <DATA>;
    my @pre    = map { [$_, split /:/ ] } @data;
    my @post   = sort custom @pre;
    my @sorted = map { $_->[0] } @post;
    print @sorted;

    sub custom {
        $a->[3] cmp $b->[3];
    }
    __DATA__
    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    aux:19:foo:1.2

In the above we have first preprocessed our records into a list of anonymous arrays where the first element is the entire record and the remaining elements are the split() fields of that record. We then sort those anonymous arrays on the field we desire (in this case the third index of each anonymous array). Now our @post array contains a sorted list of anonymous arrays sorted on the correct field and we merely extract the first element of each anon array to recover our original records. The three steps above can be combined by simply passing the results of the preprocessing stage directly into the sort stage and passing those results through the post-processing stage and out into our sorted array:

    #!/usr/bin/perl -w
    use strict;
    my @data = <DATA>;
    my @sorted = map { $_->[0] }
                 sort custom
                 map { [$_, split /:/ ] } @data;
    print @sorted;

    sub custom {
        $a->[3] cmp $b->[3];
    }
    __DATA__
    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    aux:19:foo:1.2

This ‘map/sort/map’ algorithm is what is referred to as the ST and provides a nice clean and relatively efficient way to sort on computed or extracted fields. We can also easily adjust our custom() comparison routine to sort on multiple fields as shown in last weeks column. For a more extended look at sorting, and an alternative efficient algorithm, please see the following article:

    http://www.hpl.hp.com/personal/Larry_Rosler/sort/sorting.html
Sorting in Perl

Andrew L. Johnson (First published by ItWorld.com 2000-11-23)

Perl’s sort function is a little different than the standard built-in functions — like map() and grep() it can accept a block as a first argument, but it can also accept a named subroutine, or a reference to a subroutine. This allows you to specify the comparison method used in sorting — the default sorting method is alphabetical (ascii-betical) and the comparison operator is the ‘cmp’ operator. Thus, the following two examples are the same:

    my @sorted = sort @unsorted;
    my @sorted = sort {$a cmp $b} @unsorted;

Inside the block or subroutine you use the package-global variables $a and $b to refer to the elements being compared (these are specially exempt from the ‘use strict’ pragma for just this reason). To specify a reverse alphabetical sort you switch the variables:

    my @reverse_sorted = sort {$b cmp $a} @unsorted;

You can define any comparison method you like, but you must make sure that your block or subroutine returns only -1, 0, or 1 depending on the ordering you want (the ‘cmp’ and ’<=>’ operators are usually used for this purpose).

To sort a list numerically, we’ll use the <=> operator:

    my @ascending  = sort {$a <=> $b} @unsorted;
    my @descending = sort {$b <=> $a} @unsorted;

A hash does not give us a way to store information in an ordered fashion but we often need to print out a hash’s contents in some sorted order. If we wish to print out a hash sorted by keys (alphabetically) we can simply do this:

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9, Joey => 16);
    foreach my $name (sort keys %ages) {
        print "$name : $ages{$name}\n";
    }

But what if I wanted to produce output sorted by age from oldest to youngest rather than names (by hash values rather than keys)? That’s also very easy:

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9, Joey => 16);
    foreach my $name (sort { $ages{$b} <=> $ages{$a} } keys %ages) {
        print "$name : $ages{$name}\n";
    }

In this case, we still get a list of keys, but in the comparison block we compare the hash values for each key rather than the keys themselves — thus, we are iterating over the list of hash keys sorted in descending numerical order by the hash values for those keys. Now, let’s consider the case where we want to dual sort — that is, if two ages (values) are equal we want to order data alphabetically by the name (key). We’ll have to add some additional children to my hash to demonstrate:

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9,  Joey => 16,
                 Karen  => 11, John => 9, Kevin  => 16, Lisa => 9);
    foreach my $name (sort by_age keys %ages) {
        print "$name : $ages{$name}\n";
    }

    sub by_age {
        $ages{$b} <=> $ages{$a}
                  ||
               $a cmp $b
    }

In this case, I’ve use a separate routine ‘by_age’ instead of a block. In the routine, the <=> returns 0 if the ages are equal and this is a false value so Perl looks to the other side of the logical OR operator (||) and evaluates the ‘cmp’ operation to get the return value of the function.

Starting with version 5.6.0 of Perl you no longer have to use $a and $b as your comparison variables when you use a separate subroutine. If you prototype the subroutine with ($$) then the arguments are passed to the routine in the @_ array:

    sub by_age ($$); # declare with proto before using

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9,  Joey => 16,
                 Karen  => 11, John => 9, Kevin  => 16, Lisa => 9);
    foreach my $name (sort by_age keys %ages) {
        print "$name : $ages{$name}\n";
    }

    sub by_age ($$) {
        $ages{$_[1]} <=> $ages{$_[0]}
                     ||
               $_[0] cmp $_[1]
    }

This is slower than using $a and $b, but it means you can use sort routines defined in other packages without having namespace problems. (see: ‘perldoc -f sort’ for further documentation).

In next week’s article we will examine ways to sort more complicated data such as sorting multi-field records on any field.

*****