Regular Expressions Tutorial -- Part 1 Back to Basics

Andrew L. Johnson (First published 2000-12-07)

A regular expression (regex) is a way of describing a pattern of text (rather than merely a literal substring) that we may want to match, extract, or replace with something else. We build such patterns from the features of the regex language, which consist largely of literal characters (alphanumerics and a few others) that stand for themselves, plus several special characters or character sequences that have particular meanings within a pattern.

In this first part of the tutorial we will outline the five basic concepts needed to understand regular expressions:

  1. Concatenation: This is implicit, meaning that we can create larger, more complex patterns simply by placing simpler patterns one after another. For example, m/f/ is a pattern that matches the character ‘f’, and m/o/ is a pattern that matches the character ‘o’. We can combine these into m/foo/ to match the character sequence ‘foo’.
  2. Alternation: The ‘|’ character is a meta-character inside a regular expression. It acts as an operator that allows us to specify two or more alternative sub-patterns. For example, the pattern m/ab|cd/ will match either ‘ab’ or ‘cd’.
  3. Iteration: The ‘*’ meta-character is an iterative or quantitative operator that means to match zero-or-more of the previous element. For example, the pattern m/a*b/ would match ‘ab’, ‘aab’, ‘aaab’, etc., or even just ‘b’ (i.e., zero-or-more ‘a’ characters followed by a ‘b’ character).
  4. Grouping: Parentheses give us a way to create subexpressions that are treated as a unit. For example, if we want to match zero-or-more occurrences of the substring ‘foo’ we could specify our pattern as: m/(foo)*/. Here the * is used outside of the parentheses and applies to the whole parenthesized subexpression. Parentheses also govern the scope of alternation: the pattern m/ab|cd/ means match either ‘ab’ or ‘cd’, but the pattern m/a(b|c)d/ means match an ‘a’, then either a ‘b’ or a ‘c’, and finally a ‘d’.
  5. Wildcard: The dot . is a wildcard character that matches any character other than a newline character (the /s modifier changes this so that the dot matches a newline as well). Thus, the pattern m/f.*bar/ will match an ‘f’ followed by zero-or-more of any characters, followed by ‘bar’.
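
To see the five concepts side by side, here is a minimal sketch that tries each pattern from the list against a sample string. The sample strings and the @checks table are made up for illustration; only the patterns themselves come from the list above.

```perl
use strict;
use warnings;

# Each entry: [concept, pattern, sample string].
# The sample strings are hypothetical; the patterns are those from the list.
my @checks = (
    ['concatenation', qr/foo/,     'seafood'],
    ['alternation',   qr/ab|cd/,   'xcdx'],
    ['iteration',     qr/a*b/,     'b'],        # zero 'a' characters, then 'b'
    ['grouping',      qr/a(b|c)d/, 'acd'],
    ['grouping',      qr/a(b|c)d/, 'ab'],       # no trailing 'd'
    ['wildcard',      qr/f.*bar/,  'foobar'],
    ['wildcard',      qr/f.*bar/,  "f\nbar"],   # dot does not cross a newline by default
);

my @results;
for my $check (@checks) {
    my ($name, $re, $text) = @$check;
    my $matched = $text =~ $re ? 1 : 0;
    push @results, $matched;
    printf "%-13s %-14s %s\n", $name, $re, $matched ? 'match' : 'no match';
}
```

Note that a pattern only has to match somewhere inside the string, which is why m/foo/ succeeds against ‘seafood’.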

Those are the primary concepts for regular expressions, and although there are many more meta-characters and concepts, many are derived from just these basics. Let’s consider a couple of simple examples.

If we want to read in a file line-by-line and print out only lines that have a ‘foo’ followed somewhere on the same line by ‘bar’ we could use this pattern:

        print if /foo.*bar/;

However, if we want to print only lines that contain either ‘foodbar’ or ‘footbar’, we could do this:

        print if /foo(d|t)bar/;
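
The two one-liners above assume they run inside a loop over the input lines (for example under perl -n or within while (<>) { ... }). The following self-contained sketch applies the same patterns to a small, hypothetical list of lines instead of a file, and adds a third, ungrouped pattern to show why the parentheses matter.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines of a file.
my @lines = (
    "a foo and then a bar\n",
    "bar comes before foo here\n",
    "serve foodbar at noon\n",
    "a footbar is not a football\n",
    "good food today\n",
);

# Lines with 'foo' followed somewhere later on the same line by 'bar'.
my @foo_then_bar = grep { /foo.*bar/ } @lines;

# Lines containing 'foodbar' or 'footbar'.
my @grouped = grep { /foo(d|t)bar/ } @lines;

# Without parentheses the alternation splits the whole pattern,
# so this matches 'food' OR 'tbar' anywhere in the line.
my @ungrouped = grep { /food|tbar/ } @lines;

print "foo.*bar:\n",     map { "  $_" } @foo_then_bar;
print "foo(d|t)bar:\n",  map { "  $_" } @grouped;
print "food|tbar:\n",    map { "  $_" } @ungrouped;
```

The ungrouped pattern also accepts ‘good food today’, which contains neither ‘foodbar’ nor ‘footbar’; the grouped version rejects it.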

We can write quite complicated regular expressions using just the above concepts, but they would very quickly get too long to manage. For example, if we wanted to print out lines containing two digits together we could write:

        print if /(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)/;
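
As a quick check of the two-digit pattern, this sketch runs it over a few made-up strings (the @samples list is hypothetical) and keeps those that contain two adjacent digits.

```perl
use strict;
use warnings;

# The pattern from the text: one digit alternation followed by another.
my $two_digits = qr/(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)/;

# Hypothetical sample strings.
my @samples = ('room 42', 'a1b2c3', 'year 2000', 'no digits');

# Keep only the strings in which two digits appear side by side.
my @hits = grep { $_ =~ $two_digits } @samples;

print "$_\n" for @hits;
```

The string ‘a1b2c3’ is rejected because its digits are never adjacent, even though it contains three of them.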

As you can see, this works, but it won’t be a pleasant task to match something like an ‘f’ followed by any digit followed by any alphabetical character (regardless of case) using only alternation as in the above example.
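
To get a feel for how long such a pattern becomes, the sketch below builds the two alternations with join rather than typing them by hand. It relies on Perl interpolating variables into a pattern, which this part of the tutorial has not covered, and the variable names are chosen here only for illustration.

```perl
use strict;
use warnings;

# Build each alternation from a list instead of typing it out.
my $digit  = join '|', 0 .. 9;                  # 0|1|2|...|9
my $letter = join '|', 'a' .. 'z', 'A' .. 'Z';  # a|b|...|z|A|...|Z

# An 'f', then any one digit, then any one letter of either case.
my $pattern = qr/f($digit)($letter)/;

# Show how much pattern text the alternation-only approach needs.
print "pattern text: f($digit)($letter)\n";

for my $text ('f3x', 'xf9Qz', 'fxx', 'f33') {
    printf "%-6s %s\n", $text, $text =~ $pattern ? 'match' : 'no match';
}
```

Even for this small task the written-out pattern runs past a hundred characters, most of which are ‘|’ separators.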

Next week we’ll look at the character class and several shortcut sequences that will make such tasks a great deal simpler (not to mention a lot shorter as well).