Siaris
Simple Things
Syndicate: full/short
Siaris
Categories
General0
News2
Programming2
LanguageBits0
Perl50
Ruby10
VersionControl1
Misc1
Article Calendar
<= July, 2008
S M T W T F S
12345
6789101112
13141516171819
20212223242526
2728293031
Search this blog

Key links
External Blogs
Brought to you by ...
Ruby
1and1.com

Reading Record Oriented Data

Andrew L. Johnson (First published by ItWorld.com 2001-05-10)

When we read data files in Perl using the input operator (the <> operator), we are, by default, reading one line at a time. However, it is helpful to realize we aren’t really processing one line at a time but rather one record at a time — it is just that the default record separator happens to be a newline character, which is a useful default value.

Very often we will process files that have no real record structure (in the sense of records containing specific fields), such as when we want to print out lines in an ordinary text file that match a certain pattern, or perhaps when doing some search and replace operations on a plain text article we have written. In these cases, treating lines as records works as we expect.

More importantly, one of the most common forms of structured plain text data files (often just called flat-file databases) is the csv format (comma separated values — although, other characters besides commas are commonly used). Such records are usually terminated by newlines (one record to a line) and contain various fields separated by a specific character (or character string):

    name, rank, serial number

Processing such files is often simply a matter of:

    while(<>){
        chomp;
        my @fields = split /,/;
        process_fields(@fields);
    }

Assuming a suitable ‘process_fields()’ subroutine is defined.

However, we are certainly not limited to reading one-line records, whether in structured or unstructured files. The input record separator variable in Perl is $/ and we can set that variable to any string we wish to use as a record separator (the default, as mentioned, is "\n").

In unstructured text we may wish to use a blank line as a record separator — attempting to read in one paragraph at a time for processing. One way would be to set:

    $/ = "\n\n";

Anytime there are two newlines in a row, there must be a blank line in the file. Unfortunately, this is not very fault tolerant: If we had a file with 2 blank lines between 2 paragraphs, we’d read the first paragraph, and our second "paragraph" would contain an initial blank line. Worse, if we happened to have several blank lines between 2 paragraphs, we’d be reading in one or more empty paragraphs (records).

Fortunately, Perl has a special case for ‘paragraph’ records. Assigning an empty string to $/ puts Perl into paragraph mode and multiple blank lines are squashed into a single blank line (meaning we won’t get a bunch of empty paragraphs, but on the other hand, our output won’t have those extra blank lines either).

Another useful value for $/ is the undefined value (Perl’s ‘undef’ keyword). When we are reading a file in a typical while(<>) loop, the <> operator returns undef at the end of the file. Thus, if we set $/ to undef, we will read in the whole file as a single record (a single string). This can make a program more efficient speed-wise because we can operate on the whole file as a single string. Compare these two snippets:

    $/ = undef;       #   while(<>){
    $_ = <>;          #     tr/A-Z/a-z/;
    tr/A-Z/a-z/;      #     print;
    print;            #   }

They both accomplish the same thing (translating uppercase letters to lowercase letters), but the one on the left only calls the tr/// operator once while the second calls it once for each line in the file. for small to moderately sized files, the first will be faster. However, if the file is very large (relative to your available memory), then reading in the whole file may exhaust your memory entirely (which means the program won’t work at all), or perhaps cause your OS to start swapping (which will likely make it perform much worse than the version on the right).

*****

Speeding up a function with Memoization

Andrew L. Johnson (First published by ItWorld.com 2001-05-03)

Often we may have one or more functions in a program that take a non-trivial amount of time to compute a result. If such a ‘slow’ function may be called repeatedly, and often with the same arguments as previous calls, caching the return values may result in significant speed increases in the overall program.

Consider, for example, a subroutine ‘factorial’ that returns the factorial of its integer argument. A recursive version might be:

    sub factorial {
        my $n = shift;
        print "Computing factorial($n)\n";  # just for testing!!
        return undef if $n < 0;
        return 1 if $n == 0;
        return $n * factorial($n - 1) ;
    }
    print factorial(4), "\n";
    print factorial(5), "\n";

The print statement in the above is for illustrative purposes only, and running the above snippet produces:

    Computing factorial(4)
    Computing factorial(3)
    Computing factorial(2)
    Computing factorial(1)
    Computing factorial(0)
    24
    Computing factorial(5)
    Computing factorial(4)
    Computing factorial(3)
    Computing factorial(2)
    Computing factorial(1)
    Computing factorial(0)
    120

The problem here is that even though we had already computed the factorial of 4 (and 3, and 2, and 1, and 0) in the first call, we then recomputed all of those again to get the factorial of 5. By caching the return value we can save unnecessary recomputing of results:

    {
        my @cache = ();
        sub factorial {
            my $n = shift;
            print "Called factorial($n)\n";          # testing
            return $cache[$n] if defined $cache[$n];
            print "\tComputing factorial($n)\n";     # testing
            return undef if $n < 0;
            return 1 if $n == 0;
            return $cache[$n] = $n * factorial($n - 1) ;
        }
    }
    print factorial(4), "\n";
    print factorial(5), "\n";

Now we have used an array (@cache) to store the results of calls to the factorial() function and we can look up those values easily. Running the above produces:

    Called factorial(4)
            Computing factorial(4)
    Called factorial(3)
            Computing factorial(3)
    Called factorial(2)
            Computing factorial(2)
    Called factorial(1)
            Computing factorial(1)
    Called factorial(0)
            Computing factorial(0)
    24
    Called factorial(5)
            Computing factorial(5)
    Called factorial(4)
    120

So we have a significant savings when we called the subroutine the second time (it only needed to retrieve the result of the previous call to factorial(4) instead of recomputing it and its recursive dependents). Imagine the savings if we had called factorial(50) followed by factorial(51).

The Memoize module automates this caching for us by providing a function called memoize() which we can use to install cached versions of our ‘slow’ functions. Using this module and our original factorial() function we get:

    #!/usr/bin/perl -w
    use strict;
    use Memoize;
    memoize('factorial');

    print factorial(4), "\n";
    print factorial(5), "\n";

    sub factorial {
        my $n = shift;
        print "Computing factorial($n)\n";  # just for testing!!
        return undef if $n < 0;
        return 1 if $n == 0;
        return $n * factorial($n - 1) ;
    }

Which gives output of:

    Computing factorial(4)
    Computing factorial(3)
    Computing factorial(2)
    Computing factorial(1)
    Computing factorial(0)
    24
    Computing factorial(5)
    Computing factorial(4)
    120

And we see that although the factorial(5) resulted in a call to factorial(4), no further recursion occurs because the memoize() function has installed a cached version of the same function for us.

Memoization (or caching) is a very good technique when it is applicable, and it is not limited to functions that take just a single integer argument. The documentation for the Memoize module will instruct you on using multiple arguments and normalizing the arguments for functions with variable-order arguments. There are also ways to "expire" values in the cache to free up memory (perhaps if the function has not been called for some specified amount of time its value in the cache will be deleted).

The documentation also points out three particular cases where Memoization is not appropriate: 1) a function that uses (depends on) variables or state not defined within the function or its arguments; 2) a function that has side effects (does other things besides simply computing and returning a result); and 3) a function that returns a reference that is modified by the caller.

*****

Understanding References, Part 4: Anonymous Structures

Andrew L. Johnson (First published by ItWorld.com 2001-04-26)

Taking references to arrays or hashes is one way to build nested or multidimensional structures — but it isn’t necessary to actually use a reference to a named array or hash, Perl has anonymous arrays and hashes as well. To get one of these you use the anonymous array or hash composers — which are just the familiar [] for arrays and {} for hashes:

    my $aref = [42, 37, 11];
    my $href = {name => 'andrew', beer => 'dark ale'};
    print "$aref->[1] $href->{name}\n";  #prints: 37 andrew

In these instances, the [] and {} simply create the structure in memory (either an array or a hash respectively) and return a reference to it — there is no array or hash variable. This makes specifying a nested structure rather easier. Recall last week’s simple example:

    my @beer = ('dark-ale', 'pale-ale', 'stout');
    my %hash = ( name => 'andrew', beer => \@beer);
    print "$hash{name}'s favorite beer is $hash{beer}[0]\n";
    print "$hash{name} will accept: @{$hash{beer}}\n";

This can now be done more directly (without the @beer array) as:

    my %hash = ( name => 'andrew',
                 beer => ['dark-ale', 'pale-ale', 'stout'],
               );
    print "$hash{name}'s favorite beer is $hash{beer}[0]\n";
    print "$hash{name} will accept: @{$hash{beer}}\n";

You do not have to specify the list explicitly within the [] or {} composers — any expression that returns a list may be used as well. This means we can easily read in a csv file and turn it into a table (a two dimensional array):

    #!/usr/bin/perl -w
    use strict;
    my @data_table;
    while(<DATA>){
        chomp;
        push @data_table, [split /:/];
    }
    # print out all rows:
    foreach my $row (@data_table) {
        print "@$row\n";
    }

    # print out second column of second row:

    print "$data_table[1][1]\n";
    __DATA__
    one:two:three
    four:five:six
    seven:eight:nine

That sums up the brief tutorial on using references. For further information please see the following docs:

    perldoc perlreftut
    perldoc perlref
    perldoc perllol
    perldoc perldsc

I would also like to point out that version 5.6.1 is now available in the usual locations — this is a maintenance release in the stable track and fixes several bugs present in the 5.6.0 version.

I appreciate your suggestions for topics and tips — keep them coming. Next week we will look at speeding up certain kinds of functions by caching return values and look at the Memoize module.

Understanding References, Part 3: Nested Structures

Andrew L. Johnson (First published by ItWorld.com 2001-04-19)

We finished last week with an example of solving passing multiple arrays to a subroutine by passing references to arrays rather than the arrays themselves. You may not have noticed, but in doing so we made use of a multidimensional array (though only briefly). When we pass arguments to a subroutine, the arguments are stored in the @_ array — so when we pass array references we’ve actually made an array of arrays (well, an array of array references — but that’s what a two dimensional array is in Perl). Look at this example:

    my @one = (9,8,7);
    my @two = (6,5,4);
    foo(\@one, \@two);
    sub foo {
        print "$_[1][1]\n";
    }

The above prints 5 (the second element of the second array). Ignoring how the dereference works for a minute, let’s look at a similar example not involving subroutines:

    my @one = (9,8,7);
    my @two = (6,5,4);
    my @rows = (\@one, \@two);
    print "$rows[1][1]\n";      # prints: 5

The same thing is going on here except that we are explicitly assigning to an array @rows rather than implicitly assigning to the special @_ array (which holds subroutine arguments). Now, let’s return to dereferencing to figure out why this works as it does.

We have already seen two ways of dereferencing an array reference:

    my @array = (42, 13, 12);
    my $aref  =  \@array;
    print "${$aref}[1]\n";    # explicit
    print "$$aref[1]\n";      # shortcut

An alternate method is to use the "arrow" dereference syntax:

    my @array = (42, 13, 12);
    my $aref  =  \@array;
    print "$aref->[1]\n";

In this method, perl assumes that whatever is to the left of the "arrow" resolves to a reference, so perl dereferences it and then returns the value given by the subscript on the right. This works for hashes as well:

    my %hash = (name => 'andrew', beer => 'dark-ale');
    my $href = \%hash;
    print "$href->{beer}\n";

Now, we can return to the two dimensional array example and use the arrow syntax:

    my @one = (9,8,7);
    my @two = (6,5,4);
    my @rows = (\@one, \@two);
    print "$rows[1]->[1]\n";      # prints: 5

Here, everything to the left of the arrow ($rows[1]) resolves to a reference, and then we look up the [1] subscript in that array. Because Perl has no multidimensional arrays, Perl knows that ordinarily an expression like: $rows[1][1] wouldn’t make sense. So, Perl assumes that there is always an implicit arrow between any two subscripts — this is why $rows[1][1] actually works, because internally Perl assumes it to mean $rows[1]->[1].

The same is true for hash references and any mixture of hash and array references:

    my @beer = ('dark-ale', 'pale-ale', 'stout');
    my %hash = ( name => 'andrew', beer => \@beer);
    print "$hash{name}'s favorite beer is $hash{beer}[0]\n";
    print "$hash{name} will accept: @{$hash{beer}}\n";

In the above, the $hash{beer}[0] is the same as: $hash{beer}->[0]. The thing to the left of the array resolves to an array reference and the subscript on the right gets the value at that index. Now, in the second print statement we had to use the extra curly braces to dereference to entire array reference — we could not just do:

    @$hash{beer}

Because Perl assumes that $hash is the reference (we don’t even have a $hash variable in our example) instead of $hash{beer}. In other words, without the curly braces, it is parsed as:

    @{$hash}{beer}

which isn’t what we wanted. So we have to group the expression that actually resolves to the reference inside of curly braces

    @{ $hash{beer} }

and then perl knows exactly what we are trying to dereference. Creating multidimensional structures is not difficult — but Perl gives another very useful convenience called anonymous arrays and hashes that we will look at next week.

Next Week: Understanding References (part 4: anonymous structures)

*****

Understanding References, Part 2: References to Arrays and Hashes

Andrew L. Johnson (First published by ItWorld.com 2001-04-12)

Last week we looked at just the basics of taking a reference and then dereferencing it — this week we will look at taking references to arrays and hashes and expand on our dereferencing syntax.

Remember that a scalar variable can hold one thing — basically: a string, a number, or a reference. We’ve already stored a reference to a scalar variable in another scalar variable, but we can also store a reference to an array (or a hash) in a scalar variable as well:

    my @array = (42, 13, 99);
    my $aref  = \@array;

In this case, the variable $aref holds a reference to @array (or, in terms of last week’s discussion, it holds the memory slot number (address) of the array — actually, an array is many slots but we don’t have to worry about that at the moment). But how do we get at the array through the reference?

Recall the syntax we used to dereference a scalar variable:

    my $foo = 42;
    my $bar = \$foo;
    print $$bar;

That last line is actually a slight shortcut — the more explicit version would be:

    print ${$bar};

That is, we put the reference inside of curly braces and then precede it with the type of value of value we are dereferencing — in this case, we use a $ because the value we are getting at is a scalar value. So, for that array case we do the following:

    my @array = (42, 13, 99);
    my $aref  = \@array;
    print "@{$aref}\n";        # print whole array (explicit method)
    print "@$aref\n";          # print whole array (shortcut method)

To get at just one element, let’s first look at getting one element from the real array:

    print $array[1];     # prints: 13

Breaking this down we have: a type symbol for the type of value (arrays hold scalar values as elements), the array name, and the index. To do the same thing with a reference we replace the array name with the reference to it:

    print ${$aref}[1];  # explicit method
    print $$aref[1];    # shortcut method

We can do the same things with hashes as well:

    my %hash = ( name => 'Andrew', beer => 'Dark Ale' );
    my $href = \%hash;

    print "${$href}{name} ";       # explicit
    print "drinks $$href{beer}\n"; # shortcut

    # or to iterate over the hash:

    foreach my $key ( keys %$href ) {
        print "$key : $$href{$key}\n";
    }

How do references help us? One useful thing they allow us to do is to pass multiple arrays and/or hashes into a subroutine. Remember that a subroutine receives its parameters as one long flat list of arguments, so to create a subroutine that compares two arrays to see if they contain the same numerical elements we need to use references:

    my @one = (42, 13, 99, 72, 1);
    my @two = (42, 13, 99, 72, 1);
    if ( cmp_arrays(\@one, \@two) ) {
        print "Arrays are the same\n";
    } else {
        print "Arrays are not the same\n";
    }

    sub cmp_arrays {
        my ($one, $two) = @_;
        return 0 unless @$one == @$two; # check sizes
        foreach my $i (0 .. @$one - 1) {
            return 0 unless $$one[$i] == $$two[$i];
        }
        return 1;
    }

Next week we will look at alternate dereferencing methods and building nested data structures such as multidimensional arrays and hashes.

*****