When we read data files in Perl using the input operator (the <> operator), we are, by default, reading one line at a time. However, it is helpful to realize we aren’t really processing one line at a time but rather one record at a time — it is just that the default record separator happens to be a newline character, which is a useful default value.
Very often we will process files that have no real record structure (in the sense of records containing specific fields), such as when we want to print out lines in an ordinary text file that match a certain pattern, or perhaps when doing some search and replace operations on a plain text article we have written. In these cases, treating lines as records works as we expect.
More importantly, one of the most common forms of structured plain-text data files (often just called flat-file databases) is the CSV format (comma-separated values, although characters other than commas are also commonly used as separators). Such records are usually terminated by newlines (one record per line) and contain fields separated by a specific character (or character string):
name, rank, serial number
Processing such files is often simply a matter of:
while (<>) {
    chomp;
    my @fields = split /,/;
    process_fields(@fields);
}
This assumes that a suitable ‘process_fields()’ subroutine is defined.
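As a rough sketch of how the loop above might be fleshed out: the sample data, the in-memory filehandle (used in place of <> so the example is self-contained), and the body of process_fields() are all assumptions made for illustration. The split pattern is also widened to swallow whitespace around each comma, since a sample line like the one above has a space after each comma.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for a file read via <>.
my $data = "name, rank, serial number\nSmith, Sergeant, 12345\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;

# Hypothetical stand-in for process_fields(): it simply collects
# each record's fields as an array reference.
sub process_fields {
    my @fields = @_;
    push @rows, [@fields];
}

while (my $line = <$fh>) {
    chomp $line;
    # Split on a comma together with any whitespace around it,
    # so "a, b" yields "a" and "b" rather than "a" and " b".
    my @fields = split /\s*,\s*/, $line;
    process_fields(@fields);
}
close $fh;

print join('|', @$_), "\n" for @rows;
```

Note that a plain split like this does not handle quoted fields that themselves contain the separator character.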
However, we are certainly not limited to reading one-line records, whether in structured or unstructured files. The input record separator variable in Perl is $/ and we can set that variable to any string we wish to use as a record separator (the default, as mentioned, is "\n").
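For instance, here is a small sketch that reads records separated by a line containing only "%%". The separator string, the sample data, and the in-memory filehandle are assumptions chosen for illustration. Note that chomp removes whatever $/ currently holds, not just a newline.

```perl
use strict;
use warnings;

# Hypothetical sample data: records separated by a line containing only "%%".
my $data = "first record\nstill first\n%%\nsecond record\n%%\nthird record\n";

# Open the string as an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "%%\n";          # custom record separator, for this block only
    while (my $record = <$fh>) {
        chomp $record;          # strips a trailing "%%\n", if present
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Using local confines the change to the enclosing block, so code elsewhere in the program still sees the default "\n".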
In unstructured text we may wish to use a blank line as a record separator — attempting to read in one paragraph at a time for processing. One way would be to set:
$/ = "\n\n";
Anytime there are two newlines in a row, there must be a blank line in the file. Unfortunately, this is not very fault tolerant: If we had a file with 2 blank lines between 2 paragraphs, we’d read the first paragraph, and our second "paragraph" would contain an initial blank line. Worse, if we happened to have several blank lines between 2 paragraphs, we’d be reading in one or more empty paragraphs (records).
Fortunately, Perl has a special case for ‘paragraph’ records. Assigning an empty string to $/ puts Perl into paragraph mode, in which any run of consecutive blank lines is treated as a single blank line. This means we won’t get a bunch of empty paragraphs; on the other hand, each record ends with just one blank line, so our output won’t have those extra blank lines either.
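The difference between the two settings can be seen in a short sketch. The sample text (two paragraphs separated by three blank lines) and the read_records() helper are hypothetical, introduced only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical sample text: three blank lines between the two paragraphs.
my $text = "para one, line 1\npara one, line 2\n\n\n\npara two\n";

# Hypothetical helper: read every record from $text using the given $/.
sub read_records {
    my ($separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;      # restored automatically when the sub returns
    my @records = <$fh>;        # list context: read all records at once
    close $fh;
    return @records;
}

my @naive     = read_records("\n\n");   # literal two-newline separator
my @paragraph = read_records("");       # paragraph mode

print 'With $/ = "\n\n": ', scalar(@naive), " records\n";
print 'With $/ = "":     ', scalar(@paragraph), " records\n";
```

With the literal "\n\n" separator, the extra blank lines produce a record that contains nothing but newlines; in paragraph mode only the two real paragraphs come back.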
Another useful value for $/ is the undefined value (Perl’s ‘undef’). When we are reading a file in a typical while(<>) loop, the <> operator returns undef at the end of the file. If we set $/ to undef, there is no record separator to stop at, so a single read returns the rest of the file as one record (a single string). This can make a program faster because we can operate on the whole file as a single string. Compare these two snippets:
$/ = undef;      # while(<>){
$_ = <>;         #     tr/A-Z/a-z/;
tr/A-Z/a-z/;     #     print;
print;           # }
They both accomplish the same thing (translating uppercase letters to lowercase letters), but the version on the left calls the tr/// operator only once, while the version on the right (shown in the comments) calls it once for each line in the file. For small to moderately sized files, the version on the left will be faster. However, if the file is very large (relative to your available memory), then reading in the whole file may exhaust your memory entirely (which means the program won’t work at all), or perhaps cause your OS to start swapping (which will likely make it perform much worse than the version on the right).
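Because $/ is a global variable, it is usually safer to undefine it only for the duration of the read. The sketch below does this with local inside a hypothetical slurp_file() helper; the helper name and the temporary test file are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the entire contents of a file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;                   # undef inside this sub only; restored on return
    my $content = <$fh>;        # one read returns the whole file
    close $fh;
    return $content;
}

# Create a small temporary file so the sketch is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "Hello\nWORLD\n";
close $tmp_fh;

my $text = slurp_file($tmp_name);
(my $lower = $text) =~ tr/A-Z/a-z/;     # a single tr/// over the whole file
print $lower;
```

After slurp_file() returns, $/ is back to its previous value, so any line-by-line reading elsewhere in the program is unaffected.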
*****



