Siaris
Simple Things
The Dating Game (Date::Calc and Date::Manip)

Andrew L. Johnson (First published by ItWorld.com 2001-02-08)

Working with dates is a common enough task and Perl comes with a couple of built-in functions for lower level date calculations (localtime() and gmtime()) and a Time::Local module which provides complementary functions (timelocal() and timegm()). However, for doing higher level date calculations you’ll want to grab either the Date::Manip or Date::Calc modules from CPAN.
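To see what the lower-level route involves, here is a minimal sketch that uses only the core Time::Local module. The helper names days_between() and add_days() are made up for this illustration, and doing the arithmetic in UTC with timegm() is an assumption chosen to sidestep daylight-saving shifts.

```perl
use strict;
use warnings;
use Time::Local qw/timegm/;

# Hypothetical helper: whole days from the first date to the second.
sub days_between {
    my ($y1, $m1, $d1, $y2, $m2, $d2) = @_;
    my $t1 = timegm(0, 0, 0, $d1, $m1 - 1, $y1);   # months are 0-based
    my $t2 = timegm(0, 0, 0, $d2, $m2 - 1, $y2);
    return ($t2 - $t1) / 86400;                    # 86400 seconds per UTC day
}

# Hypothetical helper: the date $delta days after the given date.
sub add_days {
    my ($y, $m, $d, $delta) = @_;
    my $t = timegm(0, 0, 0, $d, $m - 1, $y) + $delta * 86400;
    my ($day, $mon, $year) = (gmtime($t))[3, 4, 5];
    return ($year + 1900, $mon + 1, $day);
}

print days_between(2001, 2, 8, 2001, 12, 25), "\n";          # prints: 320
printf "%04d-%02d-%02d\n", add_days(2001, 2, 8, 45);         # prints: 2001-03-25
```

This works, but every offset (0-based months, years since 1900, seconds per day) has to be handled by hand, which is exactly the bookkeeping the CPAN modules take over.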

The Date::Manip module is pure Perl (so it’ll install and run anywhere you have Perl) and is quite large. The Date::Calc module uses a C library, so you’ll need to compile it yourself or get a prebuilt version (ActiveState has a version for Win32). Date::Calc is leaner, much faster, and probably the one you want to work with (we’ll use it in the examples here).

The Date::Calc module contains a large number of convenience routines such as: Day_of_Year(), check_date(), leap_year(), Delta_Days(), Date_to_Text(), and too many more to list. For example, to check to see if a given date is valid:

    use Date::Calc qw/check_date/;
    print "Bad Date\n" unless check_date(2001, 9, 31);

This tells us that the date is bad because there is no September 31st. We can calculate the difference between two dates with the Delta_Days function:

    use Date::Calc qw/Delta_Days/;
    my ($year1, $month1, $day1) = (2001, 2, 8);
    my ($year2, $month2, $day2) = (2001, 12, 25);
    my $delta = Delta_Days($year1, $month1, $day1,
                           $year2, $month2, $day2);
    print "Only $delta more days until Christmas!\n";

Have you been told your invoice will be paid in 45 days and want to circle that date on your calendar?

    use Date::Calc qw/Add_Delta_Days Date_to_Text_Long/;
    my ($year, $month, $day) = (localtime())[5,4,3];
    $year += 1900;
    $month++;
    print "Enter number of days from today: ";
    chomp(my $ddays = <STDIN>);
    my $date = Date_to_Text_Long(
                    Add_Delta_Days($year, $month, $day, $ddays)
               );
    print "$date is $ddays days from today\n";

You can also get or add deltas with hours, minutes, and seconds for more precise calculations. And, of course, there is a good deal more functionality in the Date::Calc and Date::Manip modules than I’ve highlighted here.
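As a rough sketch of the hours, minutes, and seconds variants, the following uses Date::Calc's Delta_DHMS() and Add_Delta_DHMS(). The start time of 09:30 is an arbitrary value picked for illustration, and the example assumes Date::Calc is installed.

```perl
use strict;
use warnings;
use Date::Calc qw/Delta_DHMS Add_Delta_DHMS/;

# Time remaining from 2001-02-08 09:30:00 until midnight on Christmas Day.
my @now  = (2001, 2, 8, 9, 30, 0);
my @xmas = (2001, 12, 25, 0, 0, 0);
my ($dd, $dh, $dm, $ds) = Delta_DHMS(@now, @xmas);
print "$dd days, $dh hours, $dm minutes, $ds seconds to go\n";

# Adding that delta back to the start time should land on the target again.
my @back = Add_Delta_DHMS(@now, $dd, $dh, $dm, $ds);
printf "%04d-%02d-%02d %02d:%02d:%02d\n", @back;
```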

Working with dates isn’t hard, but it also isn’t hard to work with them incorrectly — using one of the Date modules helps you avoid mistakes and often greatly simplifies your task.

*****

MD5 Checksums with Digest::MD5

Andrew L. Johnson (First published by ItWorld.com 2001-02-01)

MD5 is a one-way hashing algorithm commonly used to create ‘fingerprints’ or ‘signatures’ of strings (often whole files). The Digest::MD5 module (available from CPAN) creates a 128-bit signature of any input string.

MD5 is often used to verify passwords (it is a one-way hash rather than an encryption method, so the original password cannot be recovered from the digest): you can take a password, create the MD5 fingerprint, and store it. Later, when the user logs in again, you take the password they supply, create an MD5 digest, and compare it to the stored digest. Digests are also frequently used as file signatures: if you make a file available for download you can also supply an MD5 digest, and users who download the file can create a new digest and check it against yours to make sure the file has not been corrupted. Finally, if you created a database of signatures for all crucial files on your system (say, all the system binaries), you could regularly run a script that tests whether any files have been modified by creating new digests and comparing them with the ones stored in the database (this could alert you to the fact that your system has been cracked).

Let’s take a quick look at creating a digest from a string or a file (in the first case we’ll use the functional interface and in the second we will use the OO interface):

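    #!/usr/bin/perl -w
    use strict;
    use Digest::MD5 qw/md5/;

    # functional interface: digest of a string (16 raw bytes)
    my $string = "This is a test";
    my $digest = md5($string);

    # OO interface: digest of a file, read in binary mode
    my $md5 = Digest::MD5->new();
    open(my $fh, '<', './somefile') || die "Can't open ./somefile: $!";
    binmode($fh);
    $md5->addfile($fh);
    close $fh;
    my $file_digest = $md5->digest;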
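The digests produced above are raw binary strings, which are awkward to print or paste into a web page. As a small sketch, Digest::MD5 also exports md5_hex() and md5_base64(), which return the same digest in printable form (the OO interface has matching hexdigest() and b64digest() methods):

```perl
use strict;
use warnings;
use Digest::MD5 qw/md5 md5_hex md5_base64/;

my $string = "This is a test";
my $raw    = md5($string);          # 16 raw bytes
my $hex    = md5_hex($string);      # 32 hexadecimal characters
my $b64    = md5_base64($string);   # 22 base64 characters (unpadded)

print length($raw), " ", length($hex), " ", length($b64), "\n";  # prints: 16 32 22
print "hex form: $hex\n";

# The hex form is simply the raw bytes written out as hex digits.
print "same digest\n" if unpack('H*', $raw) eq $hex;
```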

If we want to monitor a directory for changes in files we could first create our database of file digests (perhaps using a tied hash with the DB_File module). Now, we periodically wish to run through this directory and test each file against its digest.

    #!/usr/bin/perl -w
    use strict;
    use DB_File;
    use Digest::MD5;

    my %MD5hash;
    tie %MD5hash, 'DB_File', 'md5.db';

    my $dir = './testdir';
    opendir(DIR, $dir)||die "Can't open $dir: $!";
    # test each name relative to $dir, not the current directory
    my @files = grep { -f "$dir/$_" } readdir DIR;
    closedir DIR;

    foreach my $file (@files) {
        my $md5 = Digest::MD5->new();
        open(FILE, "$dir/$file")||die "Can't open $dir/$file: $!";
        binmode(FILE);
        $md5->addfile(*FILE);
        my $digest = $md5->digest();
        close FILE;
        next unless exists $MD5hash{$file};
        if($MD5hash{$file} ne $digest){
            print "Hey, $file has changed!!!\n";
        }
    }

    untie %MD5hash;

You would probably replace the ‘next unless’ line with code that notifies you when a new file has been added to the directory, so that you can check that it is supposed to be there and rerun your original script to update your database (that original database script is not very different from the one above; it just creates the entries rather than checking them).
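One way to cover both jobs (building the snapshot and reporting new or changed files) is sketched below. The function names digest_dir() and report_changes() are hypothetical, and plain hashes stand in for the tied DB_File hash so that the sketch has no dependencies beyond Digest::MD5; storing hex digests instead of raw ones is also just a choice made here for readability.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: map each plain file in $dir to its MD5 hex digest.
sub digest_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) || die "Can't open $dir: $!";
    my @files = grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;

    my %digest;
    foreach my $file (@files) {
        open(my $fh, '<', "$dir/$file") || die "Can't open $dir/$file: $!";
        binmode($fh);
        # addfile() returns the digest object, so the calls can be chained
        $digest{$file} = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
    }
    return \%digest;
}

# Hypothetical helper: compare a stored snapshot with a fresh one.
sub report_changes {
    my ($old, $new) = @_;
    my @report;
    foreach my $file (sort keys %$new) {
        if (!exists $old->{$file}) {
            push @report, "new file: $file";
        } elsif ($old->{$file} ne $new->{$file}) {
            push @report, "changed: $file";
        }
    }
    return @report;
}
```

A first run would save the result of digest_dir('./testdir') (for example into the tied %MD5hash), and later runs would pass the saved snapshot and a fresh digest_dir() result to report_changes() and print whatever comes back.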

*****

Invoking Subroutines

Andrew L. Johnson (First published by ItWorld.com 2001-01-25)

There are a couple of different ways to invoke a subroutine, and I sometimes get asked which way is needed when and what are the differences among them. The question arises because people see some code that uses the ampersand (&) for function calls, and other code that does not, and some code that uses parentheses and some that does not.

Do we need the ampersand, and if so, when? The short answer is no: we do not need the ampersand to invoke a function (but we do need it to take a reference to an existing named function). Here are the main ways of calling a function you’ve created named ‘foo’:

    &foo();     # calls foo with no arguments
    &foo($arg); # calls foo with $arg
    foo();      # calls foo with no arguments
    foo($arg);  # calls foo with $arg

Doesn’t seem to be much difference, right? Well, there is one: if ‘foo’ is defined with a prototype, the versions above that use the ampersand disable that prototype. This can lead to subtle bugs if you are using a module whose routines have prototypes and you unknowingly bypass them by calling its functions with the ampersand.
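A small sketch makes the prototype difference visible. The function first_arg_kind() is invented for this illustration; its (\@@) prototype asks Perl to pass the first argument, which must be an array, as a reference.

```perl
use strict;
use warnings;

# Hypothetical function with a prototype: first argument is an array,
# passed by reference; any further arguments are passed as a plain list.
sub first_arg_kind (\@@) {
    my ($first) = @_;
    return ref($first) ? 'reference' : 'plain value';
}

my @list = (10, 20, 30);

print first_arg_kind(@list, 1), "\n";    # prints: reference
print &first_arg_kind(@list, 1), "\n";   # prints: plain value
```

In the second call the ampersand bypasses the prototype, so @list is flattened into the argument list and the function sees 10 as its first argument instead of a reference to the array.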

All of the above used parentheses in the calls, but parentheses are not always required — if you define or declare the subroutine prior to calling it then you may call it without parentheses:

    sub foo {
        print "you passed: @_\n";
    }
    foo;
    foo 1, 2, 3;

Now you can call ‘foo’ without parentheses just like a built-in routine may be called without parentheses. However, if you use the ampersand and no parentheses, something very different happens: The function is called using the current value of @_ as the argument list. This can be used to pass arguments on from one function to another:

    foo(1,2,3);   # call foo with (1,2,3)
    foo();        # calls foo with no arguments (so foo dies)

    sub foo {
        die "no arguments given to foo" unless @_;
        &bar;  # same as: bar(@_);
    }

    sub bar {
        print "You passed: (@_) to foo()\n";
    }

The above all holds true for subroutine references as well (except that prototypes never apply to calls made through a reference):

    my $sref = sub{print "You passed: @_\n";};
    @_ = (1,2,3);

    &$sref();     # calls sref with no args
    $sref->();    # calls sref with no args
    &$sref;       # calls sref with @_

However, if you want to take a reference to a named subroutine, then you have to use an ampersand to get the reference:

    sub foo {print "You passed: @_\n"}

    my $sref = \&foo;
    $sref->(1,2,3);

With all this variation, what’s the best way to do things? Different folks have different answers, but my answer is to always call a function without the ampersand, and always use parentheses. This rule is very simple, leads to clean code, and won’t give you surprises with regard to prototypes if you ever decide to use them (or use another module that uses them).

(Note: &bar; and bar(@_) are actually slightly different: the former calls bar() reusing the current @_ array, while the latter builds a new argument list from the contents of @_. Not that it makes a difference for most purposes.)
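The difference described in the note can be observed when the called function changes the shape of @_. In this sketch (all three function names are made up for the demonstration), take_one() shifts an element off its argument array, and the caller then counts what is left in its own @_:

```perl
use strict;
use warnings;

sub take_one { return shift @_ }

# &take_one; hands over the caller's own @_, so the shift is visible here.
sub shared { &take_one; return scalar @_ }

# take_one(@_) gets a fresh argument array, so the caller's @_ keeps its size.
sub copied { take_one(@_); return scalar @_ }

print shared(1, 2, 3), "\n";   # prints: 2
print copied(1, 2, 3), "\n";   # prints: 3
```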

*****

Regular Expression Tutorial Part 6: Capturing, Grouping, and Backreferences

Andrew L. Johnson (First published by ItWorld.com 2001-01-18)

In this final installment of our regular expression tutorial we will look at capturing, grouping, and backreferences. We have looked at some of these briefly before, but I couldn’t very well have a multi-part tutorial without including them here.

Parentheses are used both for grouping subexpressions in a pattern and for capturing the matched text for later use in the special digit variables ($1, $2, $3, and so on). For example:

    $_ = 'this string has a bar in it';
    m/has a (foo|bar)/;
    print "$1\n";   # prints: bar

As we saw in the first installment, the grouping effect also limits the scope of the alternation operator. In the above, we match the sequence ‘has a ’ followed by either a ‘foo’ or ‘bar’ sequence and we capture whatever was matched in the parenthesized sub-pattern into the $1 variable.

We can also achieve grouping without capturing by using the grouping-only construct: (?:pattern). Sometimes we do not wish to capture a particular subexpression, but merely group it for use with a quantifier or an alternation. This construct can be more efficient because there is no special variable associated with it that must be updated during the match.
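As a brief sketch of grouping without capturing, the pattern below repeats the group (?:abc) with a quantifier, and because that group does not capture, the parenthesized part that follows is still $1. The function name and sample strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return what follows one or more 'abc' groups and a dash.
sub suffix_after_repeats {
    my ($text) = @_;
    # (?:abc)+ groups for the quantifier only; (\w+) is the first capture
    return $text =~ /^(?:abc)+-(\w+)$/ ? $1 : undef;
}

print suffix_after_repeats('abcabcabc-xyz'), "\n";   # prints: xyz
```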

We know we can capture something and use it later, either in the replacement side of a s/// operation, or in any expression that follows the match. We can also refer to the text previously matched within the same regular expression by using a backslash followed by the number of the subexpression you want to match again. For example, if we want to read through a wordlist (one word per line) and print out every word that contains a doubled letter (the same letter twice in a row) we can do this:

    while(<>){
        print if /(\w)\1/;
    }
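A backreference inside the pattern and a capture variable on the replacement side can also work together. The following sketch (the function name dedupe_words() is made up here) collapses a word that is accidentally repeated, such as ‘the the’, into a single copy:

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of the same word into one occurrence.
sub dedupe_words {
    my ($text) = @_;
    # \1 inside the pattern must match the word just captured;
    # $1 in the replacement puts a single copy of it back.
    $text =~ s/\b(\w+)(?:\s+\1\b)+/$1/g;
    return $text;
}

print dedupe_words('the the cat sat on on the mat'), "\n";
# prints: the cat sat on the mat
```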

This means that if we find a \w character immediately followed by the same character that was matched (not just by any other \w character), we print that line. Consider how you might try to do that without backreferences. Here’s a very interesting (but not efficient) program for printing out prime numbers using backreferences in a regex (based on a post to comp.lang.perl.misc by Abigail some time ago):

    #!/usr/bin/perl -w
    use strict;
    my $number = $ARGV[0] || 100;
    for(my $n = 2; $n <= $number; $n++){
        $_ = '1' x $n;
        print "$n is prime\n" unless /^(11+)\1+$/;
    }

You pass it a positive integer on the command line and it will print all the primes up to that number (by default it will print the primes up to 100). All the testing of divisibility is done by the regex engine attempting to match equal groupings of the digit ‘1’ from the start to end of the string.
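To make the mechanism explicit: (11+) picks a candidate divisor of at least 2 (written as that many ‘1’ characters), and \1+ demands one or more further copies of it reaching exactly to the end of the string. The match therefore succeeds only when the length of the string is a product of two integers that are both 2 or more, that is, when it is composite. Here is the same test wrapped in a function (the name is_prime_unary() is invented for this sketch):

```perl
use strict;
use warnings;

# Hypothetical helper: true if $n is prime, using the unary-string regex.
sub is_prime_unary {
    my ($n) = @_;
    return 0 if $n < 2;
    # composite lengths match: a block of 2+ ones repeated 2+ times in total
    return ('1' x $n) !~ /^(11+)\1+$/ ? 1 : 0;
}

print join(' ', grep { is_prime_unary($_) } 1 .. 30), "\n";
# prints: 2 3 5 7 11 13 17 19 23 29
```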

*****

Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification

Andrew L. Johnson (First published by ItWorld.com 2001-01-11)

Last week I mentioned that the standard quantifiers are greedy. This week we will look at non-greedy forms of quantifiers, but first let’s discuss just what it means to be greedy.

    my $string = 'bcdabdcbabcd';
    $string =~ m/^(.*)ab/;
    print "$1\n";                # prints: bcdabdcb

The * is greedy and therefore the .* portion of the regex will match as much as it can and still allow the remainder of the regex to match. In this case it will match everything up to the last ‘ab’ in the string. Actually, the .* will match right to the end of the string, and then start backing up until it can match an ‘ab’ (this is called backtracking).

To make the quantifier non-greedy you simply follow it with a ‘?’ symbol:

    my $string = 'bcdabdcbabcd';
    $string =~ m/^(.*?)ab/;
    print "$1\n";                # prints: bcd

In this case the .*? portion attempts to match the least amount of data while allowing the remainder of the regex to match. Here the regex engine will match the beginning of the string, then try to match zero of anything and check to see if the rest can match (that fails). Next it will match the ‘b’ and then check again whether the ‘ab’ can match (that still fails). This continues until the .*? has matched the first 3 characters and then the following ‘ab’ is matched.

You can make any of the standard quantifiers that aren’t exact non-greedy by appending a ‘?’ symbol to them: *?, +?, ??, {n,m}?, and {n,}?.
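The following sketch runs each quantifier and its non-greedy twin against the string ‘aaaa’ and shows what ends up in $1. The helper first_capture() and the test string are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return $1 from matching $regex against $string.
sub first_capture {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $1 : undef;
}

my @quantified = ('a*', 'a*?', 'a+', 'a+?', 'a?', 'a??',
                  'a{2,3}', 'a{2,3}?', 'a{2,}', 'a{2,}?');

foreach my $q (@quantified) {
    my $got = first_capture('aaaa', qr/^($q)/);
    printf "%-8s captures '%s'\n", $q, $got;
}
# a*       captures 'aaaa'      a*?      captures ''
# a+       captures 'aaaa'      a+?      captures 'a'
# a?       captures 'a'         a??      captures ''
# a{2,3}   captures 'aaa'       a{2,3}?  captures 'aa'
# a{2,}    captures 'aaaa'      a{2,}?   captures 'aa'
```

Nothing follows the quantified part in these patterns, so each non-greedy form settles for the minimum its quantifier allows.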

One thing to watch out for: given a pattern such as /^(.*?)%(.*?)%/ one could match and extract the first two fields of a line of %-separated data:

    #!/usr/bin/perl -w
    use strict;
    $_ = 'Johnson%Andrew%AX321%37';
    m/^(.*?)%(.*?)%/;
    print "$2 $1\n";

And one can easily begin to think of each subexpression as meaning ‘match up to the next % symbol’, but that isn’t exactly what it means. Let’s say that the third field represents an ID tag and we want to extract only those names of people with ID tags starting with ‘A’. We might be tempted to do this:

    #!/usr/bin/perl -w
    use strict;
    while (<DATA>) {
        print "$2 $1\n" if m/^(.*?)%(.*?)%A/;
    }
    __DATA__
    Johnson%Andrew%AX321%Manitoba
    Smith%John%BC142%Alberta

This would print out:

    Andrew Johnson
    John%BC142 Smith

But that isn’t what we wanted at all. What happened? Well, the second half of the regex does not say ‘match up to the next % symbol and then match an A’; it says ‘match up to the next % symbol that is followed by an A’. The ‘(.*?)’ part of the pattern is not prevented from matching and proceeding past a % character if that is what is necessary to cause the whole regex to succeed. What we really wanted in this case was a negated character class:

    #!/usr/bin/perl -w
    use strict;
    while (<DATA>) {
        print "$2 $1\n" if m/^([^%]*)%([^%]*)%A/;
    }
    __DATA__
    Johnson%Andrew%AX321%Manitoba
    Smith%John%BC142%Alberta

Now we are saying exactly what we want: the first subexpression grabs zero or more of anything except a % character, then we match a % character, then the second subexpression also grabs zero or more of anything but a % character, and finally we match ‘%A’ or we fail.
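For delimited records like these, another option is to skip the capturing regex and let split break each line into fields, then test the ID field on its own. This is only a sketch of that alternative; the function name names_with_id_prefix() is invented, and the records are passed in as a list rather than read from DATA.

```perl
use strict;
use warnings;

# Hypothetical helper: return "First Last" for records whose ID field
# (the third %-separated field) starts with $prefix.
sub names_with_id_prefix {
    my ($prefix, @lines) = @_;
    my @names;
    foreach my $line (@lines) {
        chomp $line;
        my ($last, $first, $id) = split /%/, $line;
        next unless defined $id && index($id, $prefix) == 0;
        push @names, "$first $last";
    }
    return @names;
}

print "$_\n" for names_with_id_prefix('A',
    "Johnson%Andrew%AX321%Manitoba\n",
    "Smith%John%BC142%Alberta\n",
);
# prints: Andrew Johnson
```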

To summarize, a greedy quantifier takes as much as it can get, and a non-greedy quantifier takes as little as possible (in both cases only while still allowing the entire regex to succeed). Take care in how you use non-greedy quantifiers — it is easy to get fooled into using one where a negated character class is more appropriate.

*****