
A character class specifies a list of characters, any one of which may match at that point in the pattern. We create a character class by enclosing the characters within square brackets, like so: [abc]. That matches exactly one of ‘a’, ‘b’, or ‘c’, and as you can see it is just a simpler way to write (a|b|c).
However, character classes also allow us to specify a range of characters, like so: [a-z]. That class matches any lowercase letter from ‘a’ to ‘z’ inclusive. To match any alphabetical character regardless of case we can combine two ranges in the same character class: [a-zA-Z]. We can also match any digit character using the same technique: [0-9]. Last week we mentioned how problematic it would be to use only grouping and alternation to match an ‘f’ followed by any single digit followed by any alphabetical character (regardless of case). Now we can do so easily: m/f[0-9][a-zA-Z]/.
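As a quick check of that last pattern, here is a minimal sketch that runs it against a handful of strings. The sample strings and the variable names are invented for illustration only.

```perl
use strict;
use warnings;

# hypothetical sample strings, invented for this illustration
my @samples = ('f1a', 'f9Z', 'fa1', 'f12', 'xf0q');

# keep strings containing 'f', then one digit, then one letter of either case
my @matched = grep { m/f[0-9][a-zA-Z]/ } @samples;

print "$_\n" for @matched;
```

Note that ‘xf0q’ is kept as well: the pattern is not anchored, so it may match anywhere inside the string.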
Character classes have a useful opposite called a negated (or negative) character class. We specify one by using the ‘^’ character as the very first character in the class: [^a-m]. This class matches any single character that is not a lowercase letter from ‘a’ to ‘m’.
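A small sketch may help show what [^a-m] does and does not accept. The word list and variable names below are made up for this example.

```perl
use strict;
use warnings;

# hypothetical test words, invented for this illustration
my @words = ('zebra', 'Apple', 'milk', 'deck');

# contains at least one character that is NOT a lowercase a-m
my @has_other = grep { m/[^a-m]/ } @words;

# contains no such character, i.e. the negated class never matches
my @only_a_to_m = grep { !m/[^a-m]/ } @words;

print "outside a-m: @has_other\n";
print "only a-m:    @only_a_to_m\n";
```

‘Apple’ lands in the first list because the uppercase ‘A’ (like the ‘p’) is not in the lowercase range a-m. A negated class is case-sensitive just like an ordinary one.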
There are three shortcut sequences that represent three widely used character classes: \w means any alphanumeric or underscore character and, for plain ASCII text, is equivalent to the class: [a-zA-Z0-9_]; \s means any whitespace character including spaces, tabs, newlines, carriage returns, or form feeds and represents the class: [ \t\r\f\n]. Finally, \d represents a digit as in the class [0-9]. Each of these three shortcuts also has an opposite that represents the negated version: \W (anything but an alphanumeric or underscore), \S (anything but a whitespace character), and \D (anything but a digit).
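The following sketch tries the shortcuts on a single line of text. The input string and the variable names are assumptions made for this illustration; each test stores 1 for a match and 0 otherwise.

```perl
use strict;
use warnings;

# hypothetical input line, invented for this illustration
my $line = "id_42\t2000 ok";

my $has_word  = $line =~ m/\w\w\w\w\w/ ? 1 : 0;  # five word characters in a row: "id_42"
my $has_gap   = $line =~ m/\d\s\d/     ? 1 : 0;  # digit, whitespace (the tab), digit
my $has_year  = $line =~ m/\d\d\d\d/   ? 1 : 0;  # four digits in a row: "2000"
my $has_punct = $line =~ m/[^\w\s]/    ? 1 : 0;  # a character that is neither \w nor \s

print "word=$has_word gap=$has_gap year=$has_year punct=$has_punct\n";
```

The last test shows that the shortcuts can also be used inside a bracketed class, here a negated one. Since the line contains only word characters and whitespace, that test finds nothing.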
With these shortcuts we can begin to really put together some complex patterns with ease. Let’s consider a case where we are given a file of %-separated records that contain date stamps in the last field, and these dates are of the form: Thu Nov 30 17:42:47 2000. Our problem is that the data-entry operator who worked the night shift was unreliable, so we want to print out all records whose time portion of the date stamp is between 5:30pm (17:30:00) and the end of the night shift at 11:30pm (23:30:00). There won’t be any records later than that time, so we only have to ensure the time is at or past 17:30:00.
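while (<DATA>) {
    # split the current record ($_) on the '%' delimiter
    my @fields = split /%/;
    # print records stamped 17:30-17:59, 18:00-19:59, or 20:00-23:59
    print if $fields[2] =~ m/17:[3-5]\d:\d\d|1[89]:\d\d:\d\d|2[0-3]:\d\d:\d\d/;
}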
__DATA__
foo%bar%Thu Nov 30 17:42:47 2000
foo%bar%Thu Nov 30 19:42:47 2000
foo%bar%Thu Nov 30 20:42:47 2000
foo%bar%Thu Nov 30 17:20:47 2000
foo%bar%Thu Nov 30 21:20:47 2000
foo%bar%Thu Nov 30 16:42:47 2000
Character classes give us great convenience in specifying a set of characters to match (or not to match, in the case of a negated character class). Think of them as restricted wildcards: a dot is a (mostly) unrestricted wildcard, and a character class is a wildcard restricted to the elements inside the class.
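To make the wildcard comparison concrete, here is a minimal sketch that applies a dot and a digit class to the same strings. The strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

# hypothetical strings, invented for this illustration
my @codes = ('a1c', 'abc', 'a-c', "a\nc");

# dot: any single character except a newline (by default)
my @dot_hits = grep { m/a.c/ } @codes;

# class: the middle character must be a digit
my @class_hits = grep { m/a[0-9]c/ } @codes;

print scalar(@dot_hits), " strings match a.c\n";
print scalar(@class_hits), " string(s) match a[0-9]c\n";
```

The dot accepts the first three strings and rejects only the one with a newline in the middle, which is why it is “mostly” unrestricted. The class accepts only ‘a1c’.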
*****