Siaris
Simple Things
Syndicate: full/short
Siaris
Categories
General0
News2
Programming2
LanguageBits0
Perl50
Ruby10
VersionControl1
Misc1
Article Calendar
<= July, 2008
S M T W T F S
12345
6789101112
13141516171819
20212223242526
2728293031
Search this blog

Key links
External Blogs
Brought to you by ...
Ruby
1and1.com

Sorting in Perl

Andrew L. Johnson (First published by ItWorld.com 2000-11-23)

Perl’s sort function is a little different than the standard built-in functions — like map() and grep() it can accept a block as a first argument, but it can also accept a named subroutine, or a reference to a subroutine. This allows you to specify the comparison method used in sorting — the default sorting method is alphabetical (ascii-betical) and the comparison operator is the ‘cmp’ operator. Thus, the following two examples are the same:

    my @sorted = sort @unsorted;
    my @sorted = sort {$a cmp $b} @unsorted;

Inside the block or subroutine you use the package-global variables $a and $b to refer to the elements being compared (these are specially exempt from the ‘use strict’ pragma for just this reason). To specify a reverse alphabetical sort you switch the variables:

    my @reverse_sorted = sort {$b cmp $a} @unsorted;

You can define any comparison method you like, but you must make sure that your block or subroutine returns only -1, 0, or 1 depending on the ordering you want (the ‘cmp’ and ’<=>’ operators are usually used for this purpose).

To sort a list numerically, we’ll use the <=> operator:

    my @ascending  = sort {$a <=> $b} @unsorted;
    my @descending = sort {$b <=> $a} @unsorted;

A hash does not give us a way to store information in an ordered fashion but we often need to print out a hash’s contents in some sorted order. If we wish to print out a hash sorted by keys (alphabetically) we can simply do this:

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9, Joey => 16);
    foreach my $name (sort keys %ages) {
        print "$name : $ages{$name}\n";
    }

But what if I wanted to produce output sorted by age from oldest to youngest rather than names (by hash values rather than keys)? That’s also very easy:

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9, Joey => 16);
    foreach my $name (sort { $ages{$b} <=> $ages{$a} } keys %ages) {
        print "$name : $ages{$name}\n";
    }

In this case, we still get a list of keys, but in the comparison block we compare the hash values for each key rather than the keys themselves — thus, we are iterating over the list of hash keys sorted in descending numerical order by the hash values for those keys. Now, let’s consider the case where we want to dual sort — that is, if two ages (values) are equal we want to order data alphabetically by the name (key). We’ll have to add some additional children to my hash to demonstrate:

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9,  Joey => 16,
                 Karen  => 11, John => 9, Kevin  => 16, Lisa => 9);
    foreach my $name (sort by_age keys %ages) {
        print "$name : $ages{$name}\n";
    }

    sub by_age {
        $ages{$b} <=> $ages{$a}
                  ||
               $a cmp $b
    }

In this case, I’ve use a separate routine ‘by_age’ instead of a block. In the routine, the <=> returns 0 if the ages are equal and this is a false value so Perl looks to the other side of the logical OR operator (||) and evaluates the ‘cmp’ operation to get the return value of the function.

Starting with version 5.6.0 of Perl you no longer have to use $a and $b as your comparison variables when you use a separate subroutine. If you prototype the subroutine with ($$) then the arguments are passed to the routine in the @_ array:

    sub by_age ($$); # declare with proto before using

    my %ages = ( Andrew => 37, Sue => 41, Thomas => 9,  Joey => 16,
                 Karen  => 11, John => 9, Kevin  => 16, Lisa => 9);
    foreach my $name (sort by_age keys %ages) {
        print "$name : $ages{$name}\n";
    }

    sub by_age ($$) {
        $ages{$_[1]} <=> $ages{$_[0]}
                     ||
               $_[0] cmp $_[1]
    }

This is slower than using $a and $b, but it means you can use sort routines defined in other packages without having namespace problems. (see: ‘perldoc -f sort’ for further documentation).

In next week’s article we will examine ways to sort more complicated data such as sorting multi-field records on any field.

*****

Persistent Hashes and DBM files

Andrew L. Johnson (First published by ItWorld.com 2000-11-16)

Hashes are quite useful as databases — you store key/value pairs and later look up the data associated with any key. Wouldn’t it be nice if we could connect a Perl hash directly to a DBM file on disk so we could easily have a persistent hash database? We can do exactly that by using the tie() function to tie a hash to a dbm file.

Perl ships with support for four types of DBM files (which ones will be compiled into your perl depend on what is available on your platform when you built perl). These are: NDBM, DB_File, GDBM, and SDBM. DB_File is the most flexible (so we will use it here), but NDBM is widespread (was used with perl 4) and SDBM libraries ship with perl so you always have at least one possibility to use.

Tieing a hash to a DBM is quite simple — you merely ‘use’ the appropriate DB module, declare a hash to tie, and call the tie() function to complete the process of associating the hash with the DBM file.

    #!/usr/bin/perl -w
    use strict;
    use DB_File; # or NDBM_File, GDBM_File, SDBM_File, AnyDBM_File
    my %dbase;
    my $file = 'phones.db';
    tie(%dbase, 'DB_File', $file) || die "can't open $file: $!";

    $dbase{andrew} = '555-1234'; # not really
    $dbase{john}   = '555-4321';

    foreach my $key (keys %dbase) {
        print "$key: $dbase{$key}\n";
    }

    untie(%dbase);
    __END__

The call to tie() takes 3 arguments: the hash to tie; the classname of the DB module, and the filename for the database (additional arguments can set flags and permissions on the database file — see the documentation for details).

The above example defines a hash and then tie()’s it to a DB_File DBM file. We then enter two key/value pairs (names and phone numbers) and iterate over the hash. The big deal is, if you run this program and then comment out the two lines assigning to the hash, you’ll find that it still prints out the same data because the data persists in the DBM file ‘phones.db’. We now have a persistent hash — it works just like any other hash, but it is stored on disk as well.

One thing that I need to mention is how we iterate over the hash. In the above example we simply used the keys() function to iterate over the list of keys in the hash. If this were a large database, this could take more memory than you desire. A better way to iterate over the hash is by using the each() function which returns key/value pairs sequentially (without building the entire list in memory):

    while( my($k, $v) = each(%dbase) ) {
        print "$k: $v\n";
    }

You can use this mechanism for all sorts of persistent data. Netscape maintains its history list of recently visited web-sites between invocations by using such a mechanism. You could build an address book application using DBM files, or you store configuration information for your application.

The DB_File module also allows you to tie() a hash to a BTREE structure (which can maintain an order on your keys) or to tie a hash to a flat text file using line numbers as keys for each line (record) in the file.

For further information see the following documentation:

    perldoc -f tie
    perldoc AnyDBM_File
    perldoc DB_File

*****

Hash Basics

Andrew L. Johnson (First published by ItWorld.com 2000-11-09)

A hash (associative array) is a structure that stores data in the form of key/value pairs. The keys are unique strings, and the values can be any scalar value (string, number, or reference). In essence, a hash is a look-up table, or, in some people’s terms, a dictionary.

Hashes have a % prefix. You can assign a simple list to a hash and it will be interpreted as sequential key/value pairs. Consider a set of translations of color names from English to French:

    my %color_map = ('red', 'rouge', 'blue', 'bleu', 'green', 'vert');

In the above hash the string ‘red’ is the key for the value ‘rouge’. Perl gives us an alternative to the comma that is particularly useful in making hash assignments more readable — this is the ’=>’ or fat-comma:

    my %color_map = ( red => 'rouge', blue => 'bleu', green => 'vert');

The => operator is a synonym for an ordinary comma with one extra feature: It automatically quotes the string on the left. The second example makes it much more clear that we are assigning 3 associations to the hash rather than a 6-element list. You can access individual elements of a hash using the $ symbol and using curly braces to denote which key you want to access:

    print "$color_map{red}\n";  # prints: rouge

Now, consider a simple program that asks a user to input a color name and which then prints out the French translation. Without hashes you will probably wind up using a series if if/elsif tests to discover what color was entered and print out the appropriate translation. However, with the %color_map hash we need use a single if/else construct to see if the word has a translation in our hash and act accordingly:

    my %color_map = ( red    => 'rouge',
                      blue   => 'bleu',
                      green  => 'vert',
                      black  => 'noir',
                      purple => 'violet',
                      yellow => 'jeune',
                      white  => 'blanc',
                      brown  => 'brun',
                      grey   => 'gris',
                    );
    print "Enter a color in English: ";
    chomp( my $color = <STDIN>);
    $color = lc($color);
    if ( $color_map{$color} ) {
        print "$color translates to '$color_map{$color}'\n";
    } else {
        print "No translation available for '$color'\n";
    }

One thing to notice is that keys in a hash are case-sensitive, hence we used the lc() function to lowercase the user’s input before we looked up the color in the hash.

An important distinction between hashes and arrays are that hashes are not ordered — in the above example the key ‘red’ is not the first element in the hash. We can use the keys() function to retrieve the list of keys in a hash, and the values() function to retrieve the list of values. To walk through a hash and print out each pairing you can simply iterate over the keys:

    foreach my $key (keys %hash) {
        print "$key : $hash{$key}\n";
    }

The each() function returns key/value pairs in scalar context so you may also iterate over the hash like so:

    while( my($key, $val) = each(%hash) ) {
        print "$key : $val\n";
    }

A couple of earlier articles describe ways of using hashes and you might want to look them up in the archive (and archive link is provided below in the resources section) — see the articles dated July 20, 2000 and August 31, 2000.

*****

Array Basics

Andrew L. Johnson (First published by ItWorld.com 2000-11-02)

We’ve looked at arrays and some of the basic array operators in a previous article (back in March 2000), so in this article we’ll revisit arrays from a slightly different angle — when and how to use them.

Often I see cases where a list of data is assigned to an equal list of scalar variables as in the following hypothetical database reformatting task:

    #!/usr/bin/perl -w
    use strict;
    while(<>){
        chomp;
        my ($id, $lname, $fname, $department,
            $location, $shift) = split /:/;
        print "$lname|$fname|$id|$shift|$department|$location\n";
    }
    __END__

While assigning each field to a named scalar can be beneficial when many of them will be used in various places in the rest of the block (self documenting code), the above usage is trivial and unnecessary. If we are merely reformatting the data, or perhaps storing it in a larger data structure for later use, we do not need to create all of these variables:

    #!/usr/bin/perl -w
    use strict;
    while(<>){
        chomp;
        my @fields = split /:/;
        print join('|', @fields[1,2,0,5,3,4]),"\n";
    }
    __END__

In this case, we are just reformatting the data so rather than specify each field name in the new ordering, we need only re-join the fields in their new order using an array slice.

Another good place to use an array is in building a hash. That’s right, we can easily use an array as a list of keys when building a hash by using a hash-slice. Consider reading in the same type of data above but this time we wish to build a hash using field names for keys:

    #!/usr/bin/perl -w
    use strict;
    my @fields = qw/id lname fname department location shift/;
    while(<>){
        chomp;
        my %data;
        @data{@fields} = split /:/;
        process(\%data); # do something with %data hash
    }
    # process subroutine definition ...
    __END__

Besides being easy to follow, an advantage of the above method is that if you want to eventually process the data fields in the order they occur, you can loop through the array of keys rather than using the keys() function (which does not return keys in the order they were entered in the hash).

As a final array example, let pick a random element from an array:

    #!/usr/bin/perl -w
    use strict;
    my @array = qw/andrew john greg brad/;
    my $draw  =  @array[rand @array];
    print "I picked $draw\n";
    __END__

This works because the rand() function returns a number from 0 up to but not including the argument passed to it. In the above, it returns a number from 0 to 3.999… — and an array index always uses the integer value of the number provided, so in the above case the index can be 0, 1, 2, or 3. If you want to randomize an entire array in place then the perlfaq’s have a good example of a shuffle routine called the fisher_yates shuffle — see the entry in perlfaq4 entitled:

    How do I shuffle an array randomly?

*****