Siaris
Simple Things
Syndicate: full/short
Siaris
Categories
General0
News2
Programming2
LanguageBits0
Perl50
Ruby10
VersionControl1
Misc1
Article Calendar
<= May, 2012
S M T W T F S
12345
6789101112
13141516171819
20212223242526
2728293031
Search this blog

Key links
External Blogs
Brought to you by ...
Ruby
1and1.com

Sorting in Perl II (Schwartzian Transform)

Andrew L. Johnson (First published by ItWorld.com 2000-11-30)

Last week we explored using Perl’s sort() function for simple sorting uses. This week we will take a brief look at using the sort() function efficiently when the sort fields require extraction or some computation. We will do this by demonstrating the Schwartzian Transform (named after Randal Schwartz, and often abbreviated to just ST).

Let’s consider a simple case of wanting to sort colon separated records on one or more fields. Here is some sample data:

    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    qux:19:foo:1.2

Now, the following is one attempt to sort this data on the 3rd data field:

    #!/usr/bin/perl -w
    use strict;
    my @data   = <DATA>;
    my @sorted = sort custom @data;
    print @sorted;

    sub custom {
        (split /:/,$a)[2] cmp (split /:/,$b)[2];
    }
    __DATA__
    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    qux:19:foo:1.2

This works but there is an efficiency problem here — when sorting a list the comparison function (custom() in this case) is called many times, and each time in the above case it is performing 2 split() operations. A far better approach is to preprocess the data so that the sort fields may be more directly accessed, and any operations performed on the data need only be performed once per record. After sorting, the data can then be post-processed to obtain the original records. Here is one way of doing that with each step separated:

    #!/usr/bin/perl -w
    use strict;
    my @data = <DATA>;
    my @pre    = map { [$_, split /:/ ] } @data;
    my @post   = sort custom @pre;
    my @sorted = map { $_->[0] } @post;
    print @sorted;

    sub custom {
        $a->[3] cmp $b->[3];
    }
    __DATA__
    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    aux:19:foo:1.2

In the above we have first preprocessed our records into a list of anonymous arrays where the first element is the entire record and the remaining elements are the split() fields of that record. We then sort those anonymous arrays on the field we desire (in this case the third index of each anonymous array). Now our @post array contains a sorted list of anonymous arrays sorted on the correct field and we merely extract the first element of each anon array to recover our original records. The three steps above can be combined by simply passing the results of the preprocessing stage directly into the sort stage and passing those results through the post-processing stage and out into our sorted array:

    #!/usr/bin/perl -w
    use strict;
    my @data = <DATA>;
    my @sorted = map { $_->[0] }
                 sort custom
                 map { [$_, split /:/ ] } @data;
    print @sorted;

    sub custom {
        $a->[3] cmp $b->[3];
    }
    __DATA__
    foo:23:bar:2.1
    bar:42:qux:3.0
    baz:19:foo:1.1
    aux:19:foo:1.2

This ‘map/sort/map’ algorithm is what is referred to as the ST and provides a nice clean and relatively efficient way to sort on computed or extracted fields. We can also easily adjust our custom() comparison routine to sort on multiple fields as shown in last weeks column. For a more extended look at sorting, and an alternative efficient algorithm, please see the following article:

    http://www.hpl.hp.com/personal/Larry_Rosler/sort/sorting.html