Siaris

Using File::Find
15 Feb 01 - http://www.siaris.net/index.cgi/Programming/LanguageBits/Perl/20010215.rdoc

Reading the contents of a directory is easy using the opendir() and readdir() built-in functions, but once you need to dig a little deeper in the filesystem (recursively searching one or more directories), this method becomes cumbersome and sometimes tricky if your system supports symbolic links.
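
To see why the hand-rolled approach gets awkward, here is a minimal sketch of a recursive listing built on opendir() and readdir(). The walk() helper is a hypothetical name used only for illustration, and refusing to descend into any symbolic link is just one simple way to avoid loops, not the only policy you might want.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the path of every entry below $dir.
sub walk {
    my ($dir) = @_;
    my @found;

    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @entries = grep { $_ ne '.' and $_ ne '..' } readdir $dh;
    closedir $dh;

    for my $entry (sort @entries) {
        my $path = "$dir/$entry";
        push @found, $path;
        # Do not descend into symlinked directories: a link that points
        # back up the tree would otherwise recurse forever.
        push @found, walk($path) if -d $path and not -l $path;
    }
    return @found;
}

# List the tree under the first command-line argument, if one was given.
if (@ARGV) {
    print "$_\n" for walk($ARGV[0]);
}
```

Even this small version has to filter out "." and "..", build each path by hand, and decide what to do about links.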

The File::Find module (which ships with Perl) makes traversing directories and acting on their contents easy and safe. The module exports two functions, find() and finddepth(); this article concentrates on find(), which can be called in two ways. In the simplest, you pass it a function reference and a list of directories to search. The function reference contains the code you want to run on each filesystem entry that find() discovers.

For example, let’s say we want to do some routine cleanup in a directory. I often use LaTeX for my writing and generate .dvi, .ps, or .pdf files as typeset output. I also often fail to delete these generated files, and they end up needlessly using space. Even though hard drive space is cheap these days, I still want to clean out my writing directory (which contains a large number of subdirectories of various depths) from time to time.

If I haven’t accessed a generated typeset file (of any variety) for more than, say, a week, I might as well delete it (I can always regenerate it from the LaTeX sources if I need to print it again later).

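    #!/usr/bin/perl -w
    use strict;
    use File::Find;

    # Directories to search come from the command line.
    my @dirs = @ARGV or die "No valid directory argument(s)\n";

    find( sub{
              # $_ holds the current filename
              m/\.(dvi|ps|pdf)$/          # a generated typeset file
              and -A $_ > 7               # last accessed more than 7 days ago
              and print "$File::Find::name\n";
          },
          @dirs,
    );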

The first argument to find() is my subroutine reference (in this case an anonymous subroutine, but a reference to a named subroutine works just as well), followed by the list of directories taken from the command line. When find() runs, it does a chdir() into each directory in turn (and any subdirectories) and sets $_ to each filename found; the pathname of the file, beginning with the directory you passed to find(), is available in the package variable $File::Find::name. So we first check whether the file has one of the extensions we are interested in and whether it was last accessed more than seven days ago (-A gives the age of the last access in days, measured from the time the script started), and if so we print its pathname. In the real program we would use the unlink() function to delete each such file rather than just printing it. If you’re developing such a script, please test it with print() until you are sure it finds only what you want before you start deleting files with unlink().
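
As a sketch of what that real program might look like, the version below keeps printing as the default and only calls unlink() when asked to. The clean_typeset() helper and the --delete switch are assumptions made for this illustration; neither is part of File::Find. It also adds a -f test so that only plain files are considered.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not part of File::Find): returns the stale typeset
# files found under @dirs and deletes them only when $delete is true.
sub clean_typeset {
    my ($delete, $max_days, @dirs) = @_;
    my @stale;
    find( sub {
              return unless m/\.(dvi|ps|pdf)$/ and -f $_;
              return unless -A $_ > $max_days;
              push @stale, $File::Find::name;
              return unless $delete;
              # find() has chdir()'d here, so the bare name in $_ is enough
              unlink $_ or warn "Could not delete $File::Find::name: $!\n";
          },
          @dirs,
    );
    return @stale;
}

# Dry run unless the (assumed) --delete switch comes first on the command line.
if (@ARGV) {
    my $delete = $ARGV[0] eq '--delete' ? shift @ARGV : 0;
    print "$_\n" for clean_typeset($delete, 7, @ARGV);
}
```

Run it without --delete first and read the list it prints; add the switch only once that list contains nothing you want to keep.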

The alternate method of calling find() is to pass a hash reference as the first argument, followed by the list of directories to search. The hash lets you specify alternate behaviors for find(). There are several keys you may set in this hash; the more common ones are:

    wanted   => your sub reference
    bydepth  => process a directory's contents before the directory itself
    follow   => follow symbolic links (be careful!)
    no_chdir => don't chdir() into each directory ($_ then holds the full path)
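
A minimal sketch of the hash-reference form follows. It builds a small throwaway tree with File::Temp (an assumption made so the example is self-contained) and uses a hypothetical visit_order() helper to show how bydepth changes the order in which entries are reported, and what the no_chdir key does to $_.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a tiny throwaway tree: $top/sub/file.txt
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir failed: $!";
open my $fh, '>', "$top/sub/file.txt" or die "open failed: $!";
close $fh;

# Hypothetical helper: run find() with extra options and return the
# pathnames in the order the wanted sub saw them.
sub visit_order {
    my (%opts) = @_;
    my @seen;
    find( { wanted => sub { push @seen, $File::Find::name }, %opts }, $top );
    return @seen;
}

my @pre  = visit_order();                # each directory before its contents
my @post = visit_order(bydepth => 1);    # contents before their directory

# With no_chdir, find() stays where it is and $_ carries the whole path.
my @names;
find( { wanted => sub { push @names, $_ }, no_chdir => 1 }, $top );

print "default:  @pre\n";
print "bydepth:  @post\n";
print "no_chdir: @names\n";
```

The bydepth ordering is the one you want when a directory can only be handled after everything inside it, for example when removing directories that have become empty.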

A few variables are also available to you within your subroutine; the most useful are:

    $_                    name of file within $File::Find::dir
    $File::Find::dir      directory currently being searched
    $File::Find::name     pathname of file ($File::Find::dir/$_)
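
To make the three variables concrete, this sketch records them for every entry in a small temporary tree. The directory and file names (chapter1, draft.tex) are made up for the example, and File::Temp is used only to keep it self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Throwaway tree: $root/chapter1/draft.tex
my $root = tempdir(CLEANUP => 1);
mkdir "$root/chapter1" or die "mkdir failed: $!";
open my $out, '>', "$root/chapter1/draft.tex" or die "open failed: $!";
close $out;

# Record [ $_, $File::Find::dir, $File::Find::name ] for each entry visited.
my @rows;
find( sub { push @rows, [ $_, $File::Find::dir, $File::Find::name ] }, $root );

for my $row (@rows) {
    my ($file, $dir, $name) = @$row;
    print "\$_ = $file\n  dir  = $dir\n  name = $name\n";
}
```

Note that for the starting directory itself $_ is "." and both $File::Find::dir and $File::Find::name hold the directory as you passed it in.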

File::Find can be utilized for all sorts of filesystem searching and administrative tasks and is a good tool to add to your repertoire.
