Siaris

More Record Oriented Reading
17 May 01 - http://www.siaris.net/index.cgi/Programming/LanguageBits/Perl/20010517.rdoc

Last week we looked at three common cases for defining the ‘record’ that Perl’s input operator reads: line oriented records, paragraph oriented records, and file oriented records. Now we’ll look at grabbing and parsing records in a more custom format.

One fairly common type of multi-line record is the name/value pair format:

    name = andrew
    age = 37
    beer = dark ale
    *****
    name = john
    beer    = pale ale
    age=35
    *****

In this case we have records consisting of three fields and the record separator is "*****\n". One simple thing we can do is to read this data in and output it with a different record separator:

    $/ = "*****\n";
    while(<>){
        chomp;
        print "$_#####\n";
    }

Not very exciting, but it illustrates the concept of the input record separator. We set the input record separator, $/, to the string "*****\n"; then each read from the file returns one multi-line record: everything up to and including the next record separator string. It also shows that chomp() removes a trailing copy of whatever $/ currently holds, not just a trailing newline character (and not simply the last character of the string, as the chop() function does). In the above we are merely removing the record separator and producing output with a new separator string.
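
To make the chomp()/chop() difference concrete, here is a minimal sketch. The sample data, the in-memory filehandle, and the variable names are assumptions made for illustration; they are not part of the example above.

```perl
use strict;
use warnings;

# Hypothetical sample data, held in a string so the sketch is self-contained.
my $text = "name = andrew\nage = 37\n*****\nname = john\nage = 35\n*****\n";

# Open an in-memory filehandle on the string.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @records;
{
    local $/ = "*****\n";        # record separator, restored when the block ends
    while (my $record = <$fh>) {
        my $chopped = $record;
        chop $chopped;           # removes only the final character ("\n")
        chomp $record;           # removes the whole trailing "*****\n"
        push @records, [$record, $chopped];
    }
}
close $fh;

for my $pair (@records) {
    my ($chomped, $chopped) = @$pair;
    print "after chomp: ", length($chomped), " chars; ",
          "after chop: ",  length($chopped), " chars\n";
}
```

Using local on $/ inside a block is one way to keep the custom separator from leaking into other reads in a larger program; the short scripts in this article simply assign to it globally.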

A slightly less trivial task might be to convert such a file into a csv style data file. Let’s assume we’ve been asked to read a file such as the above and produce a csv file with ’:’ as the field separator and the fields in the order: name, age, beer.

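    $/ = "*****\n";                     # read one multi-line record at a time
    my @fields = qw/name age beer/;     # output column order
    while(<DATA>){
        chomp;                          # strip the trailing "*****\n"
        # split on '=' or newline, discarding surrounding whitespace;
        # the resulting name, value, name, value... list fills the hash
        my %hash = split /\s*[\n=]\s*/;
        print join(':',@hash{@fields}),"\n";
    }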

    __DATA__
    name = andrew
    age = 37
    beer = dark ale
    *****
    name = john
    beer    = pale ale
    age=35
    *****

Here we’ve chomp()’d off the record separator and created a hash by splitting the record at every ’=’ and "\n" character. We were careful to account for possible whitespace around the ’=’ character in our split(), but we may need to be a little more careful: What if the first line of a record is indented? The trailing \s* in the pattern swallows indentation on every later line, because that whitespace directly follows a "\n", but nothing in the pattern removes whitespace at the very start of the record. In that case we would have a hash key like " name" instead of just "name", and the hash slice that produces our output would find no value for that field. A more careful approach would be to split the data into an array and strip any leading spaces before we create our hash:

    $/ = "*****\n";
    my @fields = qw/name age beer/;
    while(<DATA>){
        chomp;
        my @data = split /\s*[\n=]\s*/;
        s/^\s+// for @data;             # strip leading whitespace from each element

        my %hash = @data;
        print join(':',@hash{@fields}),"\n";
    }

    __DATA__
        name = andrew
    age = 37
    beer = dark ale
    *****
    name = john
      beer    = pale ale
    age=35
    *****
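
To see exactly what the whitespace cleanup changes, the following sketch runs the same split() on a single record and prints the list before and after stripping. The record string and the variable names @raw and @clean are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical record, already chomp()'d, with two indented lines.
my $record = "    name = john\n  beer    = pale ale\nage=35\n";

# Same split pattern as in the article's examples.
my @raw = split /\s*[\n=]\s*/, $record;

# Copy the list, then strip leading whitespace from each element.
my @clean = @raw;
s/^\s+// for @clean;

my %hash = @clean;

# Only the first element keeps its indentation in @raw; the indent before
# "beer" follows a newline, so the split pattern has already consumed it.
print "raw:   ", join('|', @raw), "\n";    # raw:       name|john|beer|pale ale|age|35
print "clean: ", join('|', @clean), "\n";  # clean: name|john|beer|pale ale|age|35
print join(':', @hash{qw/name age beer/}), "\n";   # john:35:pale ale
```

An equally reasonable variation, under the same assumptions, would be to strip leading whitespace from the whole record once with s/^\s+// before calling split(), since the start of the record is the only place the pattern leaves it behind.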

Whatever the form of your plain text database, be it csv or multi-line records, Perl makes it rather easy to process the data one record at a time.

*****