Siaris
Simple Things

Simple Object Oriented Tutorial (soot part I)

Andrew L. Johnson (First published by ItWorld.com 2001-05-31)

Object Oriented Programming (OOP) means a lot of different things to a lot of different people — but for our purposes let’s just say that OOP is simply a way to create a "thing" that your programs can use (obviously, this is rather vague at the moment).

Before we delve into making a simple object, let’s look at using an object and explore a few concepts. Consider a radio, the kind your grandfather might have had: a large wooden box with a speaker, a volume dial, and a tuning dial with the frequencies shown. We do not necessarily need to know what the insides look like or how it functions; we need only know how to operate the device (that is, how to use the device’s interface). Radios have changed a great deal since those days (instead of tubes inside we have solid-state circuits) and they are generally much smaller. But the interface has hardly changed at all: we still have a speaker, and volume and tuning controls (though we may have a digital display of the radio frequencies, and the controls may operate by mere touch rather than actually turning a dial). Everything on the inside has changed, but the only difficulty grand-dad might have using it is finding a station that plays something he could stand to listen to. We will return to this radio analogy as we progress.

Now let’s look at using an existing object oriented module in Perl. The LWP modules provide us with easy methods of fetching web pages from the Internet (the modules are available in the libwww package on CPAN). The following is a very simple script to fetch a web page, based on an example in the lwpcook.pod documentation that comes with the module set:

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();
    $ua->agent("TestBot/0.1"); # pretend we are a very capable browser

    my $req = HTTP::Request->new('GET' => 'http://www.perl.com/');
    $req->header('Accept' => 'text/html');

    # send request
    my $res = $ua->request($req);

    # check the outcome
    if ($res->is_success) {
       print $res->content;
    } else {
       print "Error: " . $res->status_line . "\n";
    }

To fetch a web page, we have to open a socket and communicate with a web server on a particular port number and request the page using the HTTP protocol. But conspicuously missing from the above program is anything resembling socket handling or communication with a web-server. Instead we created two objects, told those objects what we wanted, and they handled all the messy details. All we needed to know was how to use the objects themselves — just like I do not need to know how my radio actually tunes in radio stations and plays music seemingly out of thin air, I only need to know how to operate the controls.

We first pulled in the LWP::UserAgent module (this provides us with the "agent" that does the talking to a web-server), which also pulls in the HTTP::Request module. We obtained a new UserAgent object by calling the new() method of the LWP::UserAgent class.

Brief working definitions: A class is something that defines objects and their methods; a method is simply a subroutine that is connected to a class or objects defined by a class. We will get to these in a later segment of the tutorial.

The documentation tells us that we should give our agent a name using the agent() method (called as: $ua->agent("name/version")). Next we construct an HTTP request (which web page we are requesting) by creating a new() HTTP::Request object and telling it what URL we want to GET. We also use the request object’s header() method to tell it we will Accept a text/html document.

Internally, these two objects "know" what to do when their methods are called; we only have to press the right buttons or turn the right dial. So, now that we have an agent and a request, we ask the agent to make the request() and assign the result to a new variable. This result is also an object (we did not create it, the request method gave it to us) that knows whether it was successful or what error might have occurred. We need only test it for success, or print the error.

In the real world, we might have a radio tuned to a particular station, but that doesn’t help us hear a particular song; for that we’d need to make a request to the station (the radio won’t help us here). We may call the station on the phone and talk to the DJ to make our request, and he may tell us whether he’ll play the song so we can sit back and wait for it. If we had a box next to our radio that allowed us to type in a radio station and a song request and just hit a button, that would not be dissimilar to the above snippet to fetch a web-page. All the messy details of phone numbers and talking to the DJ would be hidden from us, and we’d simply have to know what buttons to push on our little ‘Radio::Song::Request’ black box.
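
To make the analogy concrete, here is a minimal sketch of what using such a black box might look like. The Radio::Song::Request class, its new(), song(), and submit() methods, and the station name are all hypothetical (invented for illustration, not a real CPAN module); a real version would hide the phone call inside submit():

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class for illustration only; not a real CPAN module.
package Radio::Song::Request;

# Constructor: returns a new request object for the given station.
sub new {
    my ($class, %args) = @_;
    my $self = { station => $args{station}, song => undef };
    return bless $self, $class;
}

# Set the song when given an argument; always return the current song.
sub song {
    my $self = shift;
    $self->{song} = shift if @_;
    return $self->{song};
}

# "Push the button". A real version would contact the station here;
# this sketch only builds a status message.
sub submit {
    my $self = shift;
    return "Error: no song chosen" unless defined $self->{song};
    return "Requested '$self->{song}' on $self->{station}";
}

package main;

my $request = Radio::Song::Request->new(station => 'KPRL');
$request->song('Blue Moon');
print $request->submit, "\n";
```

The calling code at the bottom has the same shape as the LWP script: create an object, tell it what we want through its methods, and let it deal with the details.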

This initial segment of our tutorial is only meant to give you a handle on the basic concept of an object. In subsequent installments we will begin building our own little black box.

*****

Special Perl Variables

Andrew L. Johnson (First published by ItWorld.com 2001-05-24)

Last week we discussed the input record separator variable ($/), one of Perl’s special global variables. Perl has a large number of special variables (all listed and explained in the perlvar manpage), but you really only need to be familiar with a handful (or two) for most programming requirements. The following subset lists the most common such variables; the remainder can be looked up in perlvar as needed.

    $_      The default variable (often used, seldom seen)
    $/      Input record separator (default is "\n")
    $\      Output record separator (default is "")
    $,      Output field separator (default is "")
    $"      Field separator for interpolated arrays (default is " ")

    $|      Autoflush variable for currently selected filehandle
    $ARGV   Name of current file being read by <ARGV>
    $.      Current line number being read

    $0      The name of the current script
    $$      The current process id

    $1..$N  Captured data in regular expressions
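
As a quick illustration of the three output-related variables in the list, the following sketch (the @beers data is made up for the example) sets each one with local inside a block, so the change disappears again when the block ends:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @beers = ('dark ale', 'pale ale', 'stout');

{
    # local restores the previous values when this block exits.
    local $, = ':';     # printed between the arguments of print
    local $\ = "\n";    # printed after each print
    print @beers;       # prints: dark ale:pale ale:stout
}

my $sentence;
{
    local $" = ', ';    # used when an array is interpolated in a string
    $sentence = "I like @beers";
}
print "$sentence\n";    # prints: I like dark ale, pale ale, stout
```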

A couple of things to be aware of: $ARGV provides the name of the current file being read via <ARGV>, which matters whenever you do something like:

    while (<>) {
        print;
    }

The empty <> reads either from STDIN or from ARGV (the latter if there are any arguments in the @ARGV array). When reading from ARGV, the $ARGV variable is set to each filename in turn. Also, when reading from ARGV, the $. variable does not automatically reset between files, so it holds the running total line number across all files (see the documentation for the eof() function to work around this).
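
The workaround described in the eof() documentation is to close ARGV at the end of each file, which resets $. to zero. Here is a small self-contained sketch of it; the two temporary files and their contents exist only to give <> something to read:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build two small temporary files so the sketch can run on its own.
my @files;
for my $text ("a\nb\n", "c\nd\ne\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print $fh $text;
    close $fh;
    push @files, $name;
}

@ARGV = @files;         # <> will now read these files in turn
my @seen;
while (<>) {
    chomp;
    push @seen, "$.:$_";
} continue {
    close ARGV if eof;  # end of the current file: reset $.
}
print "@seen\n";        # prints: 1:a 2:b 1:c 2:d 3:e
```

Without the continue block the same loop would number the lines 1 through 5.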

There are also some special array and hash variables you need to know:

    @ARGV   The command line argument array
    @_      The subroutine argument array

    %ENV    Hash of environment variables
    %INC    Hash of filenames that have been included (via do(),
            require() or use).
    @INC    Search paths to find included files

    @EXPORT     List of things to export from a module by default
    @EXPORT_OK  List of things to export from a module on demand
    @ISA        List of parent classes (inheritance)

The latter ones listed above are only relevant for creating modules, and you shouldn’t need them for general programming. The @INC array lets Perl know where to look for modules or libraries that you include via ‘use’, do(), or require(). By default it holds all the necessary paths created when perl itself was built and installed (which is where most new modules will be installed as well). Sometimes you need to install modules in non-standard places, and you will need to be able to get these paths into the @INC array so perl can find them.

There are a couple of ways you can modify the @INC array. The first is to use the PERL5LIB environment variable — if that is defined when you start your script, then the paths defined will be prepended to the @INC array. To add paths to this array within your script, you can use the ‘use lib’ pragma:

    #!/usr/bin/perl -w
    use strict;
    use lib qw(/home/jandrew/perl/lib);
    use MyModule;
    ...

The above first prepends the path ’/home/jandrew/perl/lib’ to the @INC array, then the call to ‘use MyModule’ will search that path first to try to locate the module in question.
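
To see the effect for yourself, you can print @INC and %INC directly. In this sketch the directory /tmp/my_perl_lib is a hypothetical path (it does not need to exist), and File::Basename is used only because it is a core module that is sure to be available:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use lib '/tmp/my_perl_lib';     # hypothetical path; it need not exist
use File::Basename;             # any core module, loaded to populate %INC

# The path given to 'use lib' is now first in the search list.
print "Search path:\n";
print "  $_\n" for @INC;

# %INC maps each loaded file to the location it was found in.
print "File::Basename was loaded from $INC{'File/Basename.pm'}\n";
```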

*****

More Record Oriented Reading

Andrew L. Johnson (First published by ItWorld.com 2001-05-17)

Last week we looked at three common cases for defining the ‘record’ that Perl’s input operator reads: line oriented records, paragraph oriented records, and file oriented records. Now we’ll look at grabbing and parsing records in a more custom format.

One fairly common type of multi-line record is the name/value pair format:

    name = andrew
    age = 37
    beer = dark ale
    *****
    name = john
    beer    = pale ale
    age=35
    *****

In this case we have records consisting of three fields and the record separator is "*****\n". One simple thing we can do is to read this data in and output it with a different record separator:

    $/ = "*****\n";
    while(<>){
        chomp;
        print "$_#####\n";
    }

Not very exciting, but it illustrates the concept of the input record separator. We set the input record separator to the string "*****\n" and then when we read from the file, the input operator reads one multi-line record at a time: everything up to and including the next record separator string is returned. This also illustrates that chomp() removes any trailing record separator, not just any trailing newline character (and not just the last character of the string as does the chop() function). In the above we are merely removing the record separator and producing output with a new separator string.
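
The same behaviour can be checked without an input file. This sketch reads from an in-memory filehandle (the $data string stands in for a real file) and collects the chomp()’d records so we can inspect them:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = "name = andrew\n*****\nname = john\n*****\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "*****\n";       # record separator for this block only
    while (my $record = <$fh>) {
        chomp $record;          # strips the whole "*****\n", not just "\n"
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records read\n";
print "[$_]" for @records;
```

Each record still ends with the newline that preceded the separator line, because chomp() removes only the separator string itself.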

A slightly less trivial task might be to convert such a file into a csv style data file. Let’s assume we’ve been asked to read a file such as the above and produce a csv file with ’:’ as the field separator and the fields in the order: name, age, beer.

    $/ = "*****\n";
    my @fields = qw/name age beer/;
    while(<DATA>){
        chomp;
        my %hash = split /\s*[\n=]\s*/;
        print join(':',@hash{@fields}),"\n";
    }

    __DATA__
    name = andrew
    age = 37
    beer = dark ale
    *****
    name = john
    beer    = pale ale
    age=35
    *****

Here we’ve chomp()’d off the record separator and created a hash by splitting the record at every ’=’ and "\n" character. We were careful to account for possible whitespace around the ’=’ character in our split(), but we may need to be a little more careful: What if the first line of a record is indented? (Indentation on later lines is already swallowed by the trailing \s* in the split pattern, but nothing precedes the first field.) In that case we would have a hash key like " name" instead of just "name", and our hash slice to produce the output wouldn’t work as we wished. A more careful approach would be to split the data into an array and strip any leading spaces before we create our hash:

    $/ = "*****\n";
    my @fields = qw/name age beer/;
    while(<DATA>){
        chomp;
        my @data = split /\s*[\n=]\s*/;
        s/^\s+// for @data;

        my %hash = @data;
        print join(':',@hash{@fields}),"\n";
    }

    __DATA__
        name = andrew
    age = 37
    beer = dark ale
    *****
    name = john
      beer    = pale ale
    age=35
    *****

Whatever the form of your plain text database, be it csv or multi-line records, Perl makes it rather easy to process the data one record at a time.

*****

Reading Record Oriented Data

Andrew L. Johnson (First published by ItWorld.com 2001-05-10)

When we read data files in Perl using the input operator (the <> operator), we are, by default, reading one line at a time. However, it is helpful to realize we aren’t really processing one line at a time but rather one record at a time — it is just that the default record separator happens to be a newline character, which is a useful default value.

Very often we will process files that have no real record structure (in the sense of records containing specific fields), such as when we want to print out lines in an ordinary text file that match a certain pattern, or perhaps when doing some search and replace operations on a plain text article we have written. In these cases, treating lines as records works as we expect.

More importantly, one of the most common forms of structured plain text data files (often just called flat-file databases) is the csv format (comma separated values — although, other characters besides commas are commonly used). Such records are usually terminated by newlines (one record to a line) and contain various fields separated by a specific character (or character string):

    name, rank, serial number

Processing such files is often simply a matter of:

    while(<>){
        chomp;
        my @fields = split /,/;
        process_fields(@fields);
    }

This assumes that a suitable ‘process_fields()’ subroutine is defined.
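
As one possible illustration, here is a runnable sketch with a made-up process_fields() that simply formats each record. The sample data is invented, and the split pattern here also discards whitespace around the commas, which a plain split /,/ would leave in the fields:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample data; an in-memory filehandle stands in for a file.
my $csv = "Smith, Captain, 12345\nJones, Private, 67890\n";
open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

my @rows;
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /\s*,\s*/, $line;    # drop spaces around commas
    push @rows, process_fields(@fields);
}
close $fh;

print "$_\n" for @rows;

# Hypothetical processing step: build a display string from one record.
sub process_fields {
    my ($name, $rank, $serial) = @_;
    return "$rank $name ($serial)";
}
```

A simple split like this breaks down if a field can itself contain the separator (inside quotes, for example); for that kind of data a dedicated parser such as the Text::CSV module on CPAN is a better fit.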

However, we are certainly not limited to reading one-line records, whether in structured or unstructured files. The input record separator variable in Perl is $/ and we can set that variable to any string we wish to use as a record separator (the default, as mentioned, is "\n").

In unstructured text we may wish to use a blank line as a record separator — attempting to read in one paragraph at a time for processing. One way would be to set:

    $/ = "\n\n";

Anytime there are two newlines in a row, there must be a blank line in the file. Unfortunately, this is not very fault tolerant: If we had a file with 2 blank lines between 2 paragraphs, we’d read the first paragraph, and our second "paragraph" would contain an initial blank line. Worse, if we happened to have several blank lines between 2 paragraphs, we’d be reading in one or more empty paragraphs (records).

Fortunately, Perl has a special case for ‘paragraph’ records. Assigning an empty string to $/ puts Perl into paragraph mode and multiple blank lines are squashed into a single blank line (meaning we won’t get a bunch of empty paragraphs, but on the other hand, our output won’t have those extra blank lines either).
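
The difference between the two settings is easy to see side by side. In this sketch read_records() is a helper written just for the example, and the sample text has three blank lines between its two paragraphs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "First paragraph,\nline two.\n\n\n\nSecond paragraph.\n";

# Helper for this example: read $text using the given record separator.
sub read_records {
    my ($string, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$string or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    chomp @records;             # chomp honours the current value of $/
    return @records;
}

my @naive      = read_records($text, "\n\n");  # two newlines as separator
my @paragraphs = read_records($text, "");      # paragraph mode

print "Separator \"\\n\\n\": ", scalar(@naive), " records\n";
print "Paragraph mode:   ", scalar(@paragraphs), " records\n";
```

With "\n\n" the extra blank lines produce an empty record in the middle, so three records come back; paragraph mode returns just the two paragraphs.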

Another useful value for $/ is the undefined value (Perl’s ‘undef’ keyword). When we are reading a file in a typical while(<>) loop, the <> operator returns undef at the end of the file. If we set $/ to undef, there is no separator for a read to stop at, so we will read in the whole file as a single record (a single string). This can make a program more efficient speed-wise because we can operate on the whole file as a single string. Compare these two snippets:

    $/ = undef;       #   while(<>){
    $_ = <>;          #     tr/A-Z/a-z/;
    tr/A-Z/a-z/;      #     print;
    print;            #   }

They both accomplish the same thing (translating uppercase letters to lowercase letters), but the one on the left only calls the tr/// operator once while the one on the right calls it once for each line in the file. For small to moderately sized files, the first will be faster. However, if the file is very large (relative to your available memory), then reading in the whole file may exhaust your memory entirely (which means the program won’t work at all), or perhaps cause your OS to start swapping (which will likely make it perform much worse than the version on the right).
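
Because $/ is global, a common way to slurp a file is to undefine it with local inside a small block, so that later reads elsewhere in the program are unaffected. A minimal sketch, with an in-memory string standing in for the file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Hello\nWORLD\n";    # stands in for a file on disk
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# local leaves $/ undefined only until the do block exits.
my $content = do { local $/; <$fh> };
close $fh;

$content =~ tr/A-Z/a-z/;        # one tr/// call for the whole file
print $content;
```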

*****

Speeding up a function with Memoization

Andrew L. Johnson (First published by ItWorld.com 2001-05-03)

Often we may have one or more functions in a program that take a non-trivial amount of time to compute a result. If such a ‘slow’ function may be called repeatedly, and often with the same arguments as previous calls, caching the return values may result in significant speed increases in the overall program.

Consider, for example, a subroutine ‘factorial’ that returns the factorial of its integer argument. A recursive version might be:

    sub factorial {
        my $n = shift;
        print "Computing factorial($n)\n";  # just for testing!!
        return undef if $n < 0;
        return 1 if $n == 0;
        return $n * factorial($n - 1) ;
    }
    print factorial(4), "\n";
    print factorial(5), "\n";

The print statement in the above is for illustrative purposes only, and running the above snippet produces:

    Computing factorial(4)
    Computing factorial(3)
    Computing factorial(2)
    Computing factorial(1)
    Computing factorial(0)
    24
    Computing factorial(5)
    Computing factorial(4)
    Computing factorial(3)
    Computing factorial(2)
    Computing factorial(1)
    Computing factorial(0)
    120

The problem here is that even though we had already computed the factorial of 4 (and 3, and 2, and 1, and 0) in the first call, we then recomputed all of those again to get the factorial of 5. By caching the return value we can save unnecessary recomputing of results:

    {
        my @cache = ();
        sub factorial {
            my $n = shift;
            print "Called factorial($n)\n";          # testing
            return undef if $n < 0;                  # reject before indexing @cache
            return $cache[$n] if defined $cache[$n];
            print "\tComputing factorial($n)\n";     # testing
            return 1 if $n == 0;
            return $cache[$n] = $n * factorial($n - 1) ;
        }
    }
    print factorial(4), "\n";
    print factorial(5), "\n";

Now we have used an array (@cache) to store the results of calls to the factorial() function and we can look up those values easily. Running the above produces:

    Called factorial(4)
            Computing factorial(4)
    Called factorial(3)
            Computing factorial(3)
    Called factorial(2)
            Computing factorial(2)
    Called factorial(1)
            Computing factorial(1)
    Called factorial(0)
            Computing factorial(0)
    24
    Called factorial(5)
            Computing factorial(5)
    Called factorial(4)
    120

So we have significant savings when we call the subroutine the second time (it only needed to retrieve the result of the previous call to factorial(4) instead of recomputing it and its recursive dependents). Imagine the savings if we had called factorial(50) followed by factorial(51).

The Memoize module automates this caching for us by providing a function called memoize() which we can use to install cached versions of our ‘slow’ functions. Using this module and our original factorial() function we get:

    #!/usr/bin/perl -w
    use strict;
    use Memoize;
    memoize('factorial');

    print factorial(4), "\n";
    print factorial(5), "\n";

    sub factorial {
        my $n = shift;
        print "Computing factorial($n)\n";  # just for testing!!
        return undef if $n < 0;
        return 1 if $n == 0;
        return $n * factorial($n - 1) ;
    }

Which gives output of:

    Computing factorial(4)
    Computing factorial(3)
    Computing factorial(2)
    Computing factorial(1)
    Computing factorial(0)
    24
    Computing factorial(5)
    Computing factorial(4)
    120

And we see that although the call to factorial(5) resulted in a call to factorial(4), no further recursion occurs because the memoize() function has installed a cached version of the same function for us.

Memoization (or caching) is a very good technique when it is applicable, and it is not limited to functions that take just a single integer argument. The documentation for the Memoize module will instruct you on using multiple arguments and normalizing the arguments for functions with variable-order arguments. There are also ways to "expire" values in the cache to free up memory (perhaps if the function has not been called for some specified amount of time its value in the cache will be deleted).

The documentation also points out three particular cases where Memoization is not appropriate: 1) a function that uses (depends on) variables or state not defined within the function or its arguments; 2) a function that has side effects (does other things besides simply computing and returning a result); and 3) a function that returns a reference that is modified by the caller.
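
The first case is easy to demonstrate. In this sketch, scaled() and $rate are made-up names; the function depends on a variable outside itself, so once it is memoized a change to that variable goes unnoticed:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Memoize;

my $rate = 2;                   # state that lives outside the function

# Hypothetical function whose result depends on $rate, not just on $n.
sub scaled {
    my $n = shift;
    return $n * $rate;
}
memoize('scaled');

my $first  = scaled(10);        # computed: 10 * 2
$rate = 3;
my $second = scaled(10);        # served from the cache; new $rate ignored

print "first=$first second=$second\n";   # prints: first=20 second=20
```

The second call should have returned 30, but the cache only knows that the argument 10 once produced 20.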

*****