Siaris
Simple Things

Using Here-docs

Andrew L. Johnson (First published by ItWorld.com 2001-02-22)

Perl has several quoting mechanisms, but the here-doc is the most convenient for quoting multiline strings. You may be familiar with such constructs from shell programming. The basic syntax is to follow doubled left-angle brackets with a terminator string and a semicolon; all lines after that, up to the terminator, are quoted.

    my $string = <<EOS;
    this is part of the string
    and so is this
    EOS

    # now we are out of the quoted string

There are a few points to be aware of. By default, such a string is treated as double quoted (so variable interpolation works as expected), and the final terminator must be flush with the left margin and alone on its line (no trailing spaces, tabs, or anything else). Thus, even though all the code is indented here, you are to assume that it is flush left. To get a single-quoted version, surround the initial terminator specifier with single quotes:

    my $string = <<'EOS';
        this is a
        single quoted
        here-doc
    EOS

Note that all indentation and spacing is preserved; this is simply a multiline string. Besides single or double quotes, you can also use backticks to have your multiline string executed as shell commands:

    print <<`SHELL`;
    echo Hello
    ls -l
    SHELL
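
To make the quoting rules above concrete, here is a minimal sketch comparing the default (double-quoted) form with the single-quoted form. The variable name and the strings are made up for illustration, and the terminators really are flush left here:

```perl
use strict;
use warnings;

my $name = 'World';   # hypothetical value, just for the demonstration

# Default form: behaves like a double-quoted string, so $name interpolates.
my $double = <<EOS;
Hello, $name
EOS

# Single-quoted terminator: the text is taken literally.
my $single = <<'EOS';
Hello, $name
EOS

print $double;   # prints: Hello, World
print $single;   # prints: Hello, $name
```

Both strings end with a newline, because the line break before the terminator is part of the quoted text.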

One good reason to use here-docs is to avoid multiple print statements, say when printing several lines of HTML:

    print "<html>\n";
    print "<head><title>Whatever</title></head>\n";
    print "<body>\n";
    # and more of the same

    print <<HTML;
    <html>
    <head><title>Whatever</title></head>
    <body>
    HTML

Now we’ve avoided explicitly using quotes and newlines and our output is contained in a nice little chunk.

Here-docs can also be stacked, either for direct printing (not so useful perhaps) or when providing multiline quoted strings as arguments to some function:

    $two = "this is the second\n";
    print <<FIRST, $two, <<THIRD;
    Here is the
    first string
    FIRST
    And here is
    the third
    THIRD

    foo(<<FIRST, $two, <<THIRD);
    Here is the
    first argument
    FIRST
    And here is
    the third
    THIRD

    sub foo {
        my($first, $second, $third) = @_;
        print "$first$second$third";
    }

The perldata manpage documents here-doc syntax, along with a method allowing for indentation (in current Perl releases the here-doc documentation lives in perlop). The FAQs (perlfaq4 in particular) also show several methods of arranging indentation and block-like formatting of here-docs.
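
As one possible take on the indentation techniques those documents describe, the sketch below strips leading whitespace from every line of an indented here-doc with a substitution. This is an illustration rather than the exact recipe from the manpages; the variable and terminator names are arbitrary, and the terminator itself still has to be flush left:

```perl
use strict;
use warnings;

# Assign the here-doc, then strip leading spaces and tabs from each line.
# The /m modifier lets ^ match at the start of every line, /g repeats it.
(my $text = <<'END_TEXT') =~ s/^[ \t]+//gm;
    first line
      second line
END_TEXT

print $text;
```

Note that this removes all leading whitespace, so any relative indentation inside the block is lost as well. Perl 5.26 and later also accept the `<<~` form, which removes the common indentation and allows an indented terminator.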

Using File::Find

Andrew L. Johnson (First published by ItWorld.com 2001-02-15)

Reading the contents of a directory is easy using the opendir() and readdir() built-in functions, but once you need to dig a little deeper in the filesystem (recursively searching one or more directories), this method becomes cumbersome and sometimes tricky if your system supports symbolic links.

The File::Find module (which ships with perl) makes traversing directories and acting on the contents very easy and safe. Its main function is find() (the module also exports finddepth()), which can be called in two ways. In the simplest method, you pass it a function reference and a list of directories to search. The referenced function contains the code you want to run on each filesystem entry that find() encounters. For example, let’s say we want to do some routine cleanup in a directory. I often use LaTeX for my writing and generate either .dvi, .ps, or .pdf files for typeset output. I also often fail to delete these generated files and they end up needlessly using space. Even though hard drive space is cheap these days, I still want to clean out my writing directory from time to time (which contains a large number of subdirectories of various depths).

If I haven’t accessed a generated typeset file (of any variety) for more than say a week, I might as well delete it (I can always regenerate it from the LaTeX sources if I need to print it again later).

    #!/usr/bin/perl -w
    use strict;
    use File::Find;
    my @dirs = @ARGV or die "No valid directory argument(s)";
    find( sub{
              m/\.(dvi|ps|pdf)$/
              and -A > 7
              and print "$File::Find::name\n";
          },
          @dirs,
    );

The first argument to the find function is my subroutine reference (in this case an anonymous subroutine, but we could have used a reference to a named subroutine), followed by the list of directories from the command line. When find() runs it does a chdir() into each directory in turn (and any subdirectories) and sets $_ to each filename found (the complete pathname of the file is given in the special package variable $File::Find::name). So we first check whether the file has one of the extensions we are interested in and whether it was last accessed more than seven days ago, and then we print the full pathname. In the real program we would use the unlink() function to delete each such file rather than just printing it (but if you’re developing such a script, please test it with print() until you are sure it is finding only what you wanted before you start deleting files with the unlink() function).

The alternate method of calling find() is to pass a hash reference as the first argument, and then the list of directories to search. The hash allows you to specify alternate behaviors for the find() function. There are several keys you may set in this hash; the more common ones are:

    wanted   => your sub reference
    bydepth  => visit a directory's contents before the directory itself
    follow   => follow symbolic links (be careful!)
    no_chdir => don't chdir() into each directory

A few variables are also available to you within your subroutine; the most useful are:

    $_                    name of file
    $File::Find::dir      directory currently being searched
    $File::Find::name     Full pathname of file
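
The hash-reference calling style can be sketched as follows. To keep the example self-contained it builds a small throwaway directory tree first; the file names and layout are invented for the demonstration, and only the find() call itself reflects the interface described above:

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Build a scratch tree (hypothetical names) that is removed on exit.
my $root = tempdir(CLEANUP => 1);
make_path("$root/a/b");
for my $file ("$root/top.dvi", "$root/a/mid.tex", "$root/a/b/deep.pdf") {
    open(my $fh, '>', $file) or die "Can't create $file: $!";
    close $fh;
}

my @found;
find(
    {
        wanted => sub {
            # With no_chdir set, $_ is the same as $File::Find::name.
            push @found, $File::Find::name
                if -f $_ && /\.(dvi|ps|pdf)$/;
        },
        no_chdir => 1,
    },
    $root,
);

print "$_\n" for sort @found;
```

Because no_chdir is set, the file test and the pattern match both work on the full pathname, which avoids any dependence on the current working directory.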

File::Find can be utilized for all sorts of filesystem searching and administrative tasks and is a good tool to add to your repertoire.

*****

The Dating Game (Date::Calc and Date::Manip)

Andrew L. Johnson (First published by ItWorld.com 2001-02-08)

Working with dates is a common enough task and Perl comes with a couple of built-in functions for lower level date calculations (localtime() and gmtime()) and a Time::Local module which provides complementary functions (timelocal() and timegm()). However, for doing higher level date calculations you’ll want to grab either the Date::Manip or Date::Calc modules from CPAN.
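
As a rough sketch of what those lower-level functions look like in practice, the following uses only timegm() and gmtime() to count the days between two dates. The dates are arbitrary examples; note the zero-based month and the 1900 offset on the year that gmtime() returns, which are exactly the details that are easy to get wrong by hand:

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# timegm(sec, min, hour, mday, month, year); month is 0-based (1 = February).
my $start = timegm(0, 0, 0, 8, 1, 2001);     # 2001-02-08 00:00:00 UTC
my $end   = timegm(0, 0, 0, 25, 11, 2001);   # 2001-12-25 00:00:00 UTC

# Working in UTC sidesteps daylight-saving jumps, so a day is 86,400 seconds.
my $days = ($end - $start) / 86_400;
print "$days days apart\n";

# Going the other way: gmtime() returns month 0..11 and year minus 1900.
my ($mday, $mon, $year) = (gmtime($end))[3, 4, 5];
my $formatted = sprintf "%04d-%02d-%02d", $year + 1900, $mon + 1, $mday;
print "$formatted\n";
```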

The Date::Manip module is pure Perl (so it’ll install and run anywhere you have Perl) and is quite large. The Date::Calc module uses a C library, so you’ll need to compile your own or get a prebuilt version (ActiveState has a version for Win32). Date::Calc is leaner, much faster, and probably the one you want to work with (we’ll use it in the examples here).

The Date::Calc module contains a large number of convenience routines such as: Day_of_Year(), check_date(), leap_year(), Delta_Days(), Date_to_Text(), and too many more to list. For example, to check to see if a given date is valid:

    use Date::Calc qw/check_date/;
    print "Bad Date\n" unless check_date(2001, 9, 31);

This tells us that the date is bad because there is no September 31st. We can calculate the difference between two dates with the Delta_Days function:

    use Date::Calc qw/Delta_Days/;
    my ($year1, $month1, $day1) = (2001, 2, 8);
    my ($year2, $month2, $day2) = (2001, 12, 25);
    my $delta = Delta_Days($year1, $month1, $day1,
                           $year2, $month2, $day2);
    print "Only $delta more days until Christmas!\n";

Have you been told your invoice will be paid in 45 days and want to circle that date on your calendar?

    use Date::Calc qw/Add_Delta_Days Date_to_Text_Long/;
    my ($year, $month, $day) = (localtime())[5,4,3];
    $year += 1900;
    $month++;
    print "Enter number of days from today: ";
    chomp(my $ddays = <STDIN>);
    my $date = Date_to_Text_Long(
                    Add_Delta_Days($year, $month, $day, $ddays)
               );
    print "$date is $ddays days from today\n";

You can also get or add deltas with hours, minutes, and seconds for more precise calculations. And, of course, there is a good deal more functionality in the Date::Calc and Date::Manip modules than I’ve highlighted here.

Working with dates isn’t hard, but it also isn’t hard to work with them incorrectly — using one of the Date modules helps you avoid mistakes and often greatly simplifies your task.

*****

MD5 Checksums with Digest::MD5

Andrew L. Johnson (First published by ItWorld.com 2001-02-01)

MD5 is a one-way hashing algorithm commonly used to create ‘fingerprints’ or ‘signatures’ of strings (often whole files). The Digest::MD5 module (available from CPAN, and bundled with Perl since version 5.8) creates 128-bit signatures of any input string.

MD5 has often been used for password checking (strictly speaking it is a one-way hash rather than an encryption method, and it is no longer considered strong enough for storing passwords): you can take a password, create the MD5 fingerprint and store it. Later, when a user logs in again, you can take their password, create an MD5 digest and compare it to the stored digest. Digests are also frequently used as file signatures: if you make a file available for download, you could also supply an MD5 digest, and when a user downloads your file they can create a new digest and check it against yours to make sure the file has not been corrupted. Finally, if you created a database of signatures for all crucial files on your system (say all the system binaries), you could regularly run a script that tested whether any files had been modified by creating new digests and comparing them with the ones stored in the database (this could alert you to the fact that your system has been cracked).

Let’s take a quick look at creating a digest from a string or a file (in the first case we’ll use the functional interface and in the second we will use the OO interface):

    #!/usr/bin/perl -w
    use strict;
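    use Digest::MD5 qw/md5/;

    # functional interface: md5() returns the raw 16-byte digest
    my $string = "This is a test";
    my $digest = md5($string);

    # OO interface: feed an open filehandle to a Digest::MD5 object
    my $md5 = Digest::MD5->new();
    open(my $fh, '<', './somefile') or die "Can't open ./somefile: $!";
    binmode($fh);
    $md5->addfile($fh);
    close $fh;
    my $file_digest = $md5->digest;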
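
The raw digest is binary, so for printing or storing in a text file the hex form is usually handier. The sketch below shows, under the assumption that a plain string is the input, that the functional and OO interfaces produce the same result:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

my $string = "This is a test";

my $raw    = md5($string);       # 16 raw bytes
my $hex    = md5_hex($string);   # 32 hexadecimal characters
my $oo_hex = Digest::MD5->new->add($string)->hexdigest;

printf "raw digest is %d bytes\n", length $raw;
printf "hex digest is %d characters\n", length $hex;
print "functional and OO interfaces agree\n" if $hex eq $oo_hex;

# The hex form is just the raw bytes written out in hexadecimal.
print "hex matches unpacked raw digest\n" if unpack('H*', $raw) eq $hex;
```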

If we want to monitor a directory for changes in files we could first create our database of file digests (perhaps using a tied hash with the DB_File module). Now, we periodically wish to run through this directory and test each file against its digest.

    #!/usr/bin/perl -w
    use strict;
    use DB_File;
    use Digest::MD5;

    my %MD5hash;
    tie %MD5hash, 'DB_File', 'md5.db' or die "Can't tie md5.db: $!";

    my $dir = './testdir';
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    # readdir returns bare names, so test each one relative to $dir
    my @files = grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;

    foreach my $file (@files) {
        my $md5 = Digest::MD5->new();
        open(my $fh, '<', "$dir/$file") or die "Can't open $dir/$file: $!";
        binmode($fh);
        $md5->addfile($fh);
        my $digest = $md5->digest();
        close $fh;
        next unless exists $MD5hash{$file};
        if($MD5hash{$file} ne $digest){
            print "Hey, $file has changed!!!\n";
        }
    }

    untie %MD5hash;

You would probably replace the ‘next unless’ line with code that notifies you when a new file has been added to the directory, so that you could check that it is supposed to be there and rerun your original script to update your database (that original database script is not very different from the one above; it just creates the entries rather than checking them).
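
The build-then-check cycle described above could look roughly like this. To stay self-contained the sketch keeps the digests in an ordinary in-memory hash instead of a tied DB_File hash, and works on a scratch directory with made-up files; the helper names file_digest() and write_file() are hypothetical:

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempdir);

# Hypothetical helper: hex MD5 digest of one file's contents.
sub file_digest {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# Hypothetical helper: (over)write a file with the given text.
sub write_file {
    my ($path, $text) = @_;
    open(my $fh, '>', $path) or die "Can't write $path: $!";
    print {$fh} $text;
    close $fh or die "Can't close $path: $!";
}

my $dir = tempdir(CLEANUP => 1);
write_file("$dir/a.txt", "alpha\n");
write_file("$dir/b.txt", "beta\n");

# Step 1: build the table of digests (the "original database script").
my %digest_for;
opendir(my $dh, $dir) or die "Can't open $dir: $!";
for my $name (grep { -f "$dir/$_" } readdir $dh) {
    $digest_for{$name} = file_digest("$dir/$name");
}
closedir $dh;

# Step 2: something modifies a file behind our back...
write_file("$dir/b.txt", "BETA\n");

# Step 3: ...and the periodic check spots it.
my @changed = grep { file_digest("$dir/$_") ne $digest_for{$_} }
              sort keys %digest_for;
print "changed: @changed\n";
```

Swapping the plain hash for a hash tied to DB_File, as in the script above, would make the table persist between runs.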

*****

Invoking Subroutines

Andrew L. Johnson (First published by ItWorld.com 2001-01-25)

There are a couple of different ways to invoke a subroutine, and I sometimes get asked which way is needed when and what are the differences among them. The question arises because people see some code that uses the ampersand (&) for function calls, and other code that does not, and some code that uses parentheses and some that does not.

Do we need the ampersand, and if so, when? The short answer is no, we do not need the ampersand to invoke a function (but we do need it to take a reference to an existing named function). Here are the main ways of calling a function you’ve created named ‘foo’:

    &foo();     # calls foo with no arguments
    &foo($arg); # calls foo with $arg
    foo();      # calls foo with no arguments
    foo($arg);  # calls foo with $arg

There doesn’t seem to be much difference, right? Well, there is one: if ‘foo’ is defined with a prototype, the versions above using the ampersand disable that prototype (which could lead to subtle bugs if you are using a module whose routines have prototypes and you are silently ignoring them by using the ampersand to call its functions).
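
A small sketch of that prototype effect, using a made-up function count_args() whose (\@) prototype asks for a single array passed by reference:

```perl
use strict;
use warnings;

# Hypothetical function: the (\@) prototype makes Perl pass the array
# as one reference instead of flattening it into a list.
sub count_args(\@) {
    return scalar @_;
}

my @list = (1, 2, 3);

my $with_prototype = count_args(@list);    # receives \@list: one argument
my $with_ampersand = &count_args(@list);   # prototype ignored: three arguments

print "plain call saw $with_prototype argument(s)\n";
print "ampersand call saw $with_ampersand argument(s)\n";
```

The same function body sees a different argument list depending only on how it was called, which is the kind of subtle bug the paragraph above warns about.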

All of the above used parentheses in the calls, but parentheses are not always required — if you define or declare the subroutine prior to calling it then you may call it without parentheses:

    sub foo {
        print "you passed: @_\n";
    }
    foo;
    foo 1, 2, 3;

Now you can call ‘foo’ without parentheses just like a built-in routine may be called without parentheses. However, if you use the ampersand and no parentheses, something very different happens: The function is called using the current value of @_ as the argument list. This can be used to pass arguments on from one function to another:

    foo(1,2,3);   # call foo with (1,2,3)
    foo();        # calls foo with no arguments

    sub foo {
        die "no arguments given to foo" unless @_;
        &bar;  # same as: bar(@_);
    }

    sub bar {
        print "You passed: (@_ ) to foo()\n";
    }

The above all holds true for subroutine references as well (except there aren’t prototypes on sub references):

    my $sref = sub{print "You passed: @_\n";};
    @_ = (1,2,3);

    &$sref();     # calls sref with no args
    $sref->();    # calls sref with no args
    &$sref;       # calls sref with @_

However, if you want to take a reference to a named subroutine, then you have to use an ampersand to get the reference:

    sub foo {print "You passed: @_\n"}

    my $sref = \&foo;
    $sref->(1,2,3);

With all this variation, what’s the best way to do things? Different folks have different answers, but my answer is to always call a function without the ampersand, and always use parentheses. This rule is very simple, leads to clean code, and won’t give you surprises with regard to prototypes if you ever decide to use them (or use another module that uses them).

(Note: &bar; and bar(@_) are actually slightly different: the former calls bar() with the caller's own @_ array, while the latter builds a new argument list for bar() from its contents. That makes no difference for most purposes.)
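
One way to see that difference is to have the called function change the length of @_. In this sketch (all three function names are invented for the illustration) the callee shifts off its first argument, and each caller reports how many arguments it still has afterwards:

```perl
use strict;
use warnings;

# Hypothetical callee: removes and returns the first element of @_.
sub take_first {
    return shift @_;
}

# Calls take_first with the ampersand and no parentheses: the very same
# @_ array is reused, so the shift is visible here afterwards.
sub call_shared {
    &take_first;
    return scalar @_;
}

# Calls take_first(@_): a fresh argument list is built, so this
# function's own @_ keeps all of its elements.
sub call_with_list {
    take_first(@_);
    return scalar @_;
}

my $left_shared = call_shared(1, 2, 3);
my $left_list   = call_with_list(1, 2, 3);

print "after &take_first;     $left_shared argument(s) remain\n";
print "after take_first(\@_)   $left_list argument(s) remain\n";
```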