Chapter 7 -- Perl Overview



The Perl Quick Reference part is designed as a reference guide for the Perl language, rather than an introductory text. However, there are some aspects of the language that are better summarized in a short paragraph than in a table in a reference section. Therefore, this part of the book puts the reference material in context by giving an overview of the Perl language in general.

Running Perl

The simplest way to run a Perl program is to invoke the Perl interpreter with the name of the Perl program as an argument:

perl sample.pl

The name of the Perl file is sample.pl, and perl is the name of the Perl interpreter. This example assumes that Perl is in the execution path; if not, you will need to supply the full path to Perl too:

/usr/local/bin/perl sample.pl

This is the preferred way of invoking Perl because it eliminates the possibility that you might accidentally invoke a copy of Perl other than the one you intended. We will use the full path from now on to avoid any confusion.

This type of invocation is the same on all systems with a command-line interface. The following line will do the trick on Windows NT, for example:

c:\NTperl\perl sample.pl

Invoking Perl on UNIX

UNIX systems have another way to invoke an interpreter on a script file. Place a line like

#!/usr/local/bin/perl

at the start of the Perl file. This tells UNIX that the rest of this script file is to be interpreted by /usr/local/bin/perl. Then make the script itself executable:

chmod +x sample.pl

You can then "execute" the script file directly and let the script file tell the operating system what interpreter to use while running it.

Invoking Perl on Windows NT

Windows NT, on the other hand, is quite different. You can use File Manager (Explorer under Windows NT 4 or Windows 95) to create an association between the file extension .PL and the Perl executable. Whenever a file ending in .PL is invoked, Windows will know that Perl should be used to interpret it.

Command-Line Arguments

Perl takes a number of optional command-line arguments for various purposes. These are listed in Table 7.1. Most are rarely used but are given here for reference purposes.

Table 7.1  Perl 5 Command-Line Switches

Option | Arguments | Purpose | Notes
-0 | octal character code | Specify record separator | Default is newline (\n)
-a | none | Automatically split records | Used with -n or -p
-c | none | Check syntax only | Do not execute
-d | none | Run script using Perl debugger | If Perl debugging option was included when Perl was installed
-D | flags | Specify debugging behavior | See Table 7.2
-e | command | Pass a command to Perl from the command line | Useful for quick operations
-F | regular expression | Expression to split by if -a used | Default is white space
-i | extension | Replace original file with results | Useful for modifying contents of files
-I | directory | Specify location of include files |
-l | octal character code | Drop newlines when used with -n and -p and use designated character as line termination character |
-n | none | Process the script using each specified file as an argument | Used for performing the same set of actions on a set of files
-p | none | Same as -n but each line is printed |
-P | none | Run the script through the C preprocessor before Perl compiles it |
-s | none | Enable passing of arbitrary switches to Perl | Use -s -what -ever to have the Perl variables $what and $ever defined within your script
-S | none | Tell Perl to look along the path for the script |
-T | none | Use taint checking; don't evaluate expressions supplied on the command line |
-u | none | Make Perl dump core after compiling your script; intended to allow for generation of Perl executables | Very messy; wait for the Perl compiler
-U | none | Unsafe mode; overrides Perl's natural caution | Don't use this!
-v | none | Print Perl version number |
-w | none | Print warnings about script syntax | Extremely useful, especially during development

Tip
The -e option is handy for quick Perl operations from the command line. Want to change all instances of "oldstring" in Wiffle.bat to "newstrong"? Try
perl -i.old -p -e "s/oldstring/newstrong/g" wiffle.bat
This says: "Take each line of Wiffle.bat (-p); store the original in Wiffle.bat.old (-i.old); substitute all instances of oldstring with newstrong (-e); write the result (-p) to the original file (-i)."
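
As a rough sketch of what the -p switch does with the -e code, the loop below applies the same substitution to each line in turn and then prints it. It works on an in-memory list of lines rather than on real files, and the names substitute_line, @lines, and @result are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the code that would follow -e on the command line.
sub substitute_line {
    my ($line) = @_;
    $line =~ s/oldstring/newstrong/g;   # replace every occurrence in the line
    return $line;
}

# Stand-in for the lines of Wiffle.bat.
my @lines = ("echo oldstring\n", "rem nothing here\n", "oldstring oldstring\n");

# -p behaves much like this loop: take a line, run the code on it, print it.
my @result;
foreach my $line (@lines) {
    push @result, substitute_line($line);
}
print @result;
```

With -i added, the printed lines go back into the file instead of to the screen.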

You can supply Perl command-line arguments on the interpreter invocation line in UNIX scripts. The following line is a good start to any Perl script:

#!/usr/local/bin/perl -w -T

Table 7.2 shows the debug flags, which can be specified with the -D command-line option. If you specify a number, you can simply add all the numbers of each flag together so that 6 is 4 and 2. If you use the letter as a flag then simply list all the options required. The following two calls are equivalent:

#perl -d -D6 test.pl
#perl -d -Dls test.pl

Table 7.2  Perl Debugging Flags

Flag Number  Flag Letter  Meaning of Flag
1            p            Tokenizing and parsing
2            s            Stack snapshots
4            l            Label stack processing
8            t            Trace execution
16           o            Operator node construction
32           c            String/numeric conversions
64           P            Print preprocessor command for -P
128          m            Memory allocation
256          f            Format processing
512          r            Regular expression parsing
1024         x            Syntax tree dump
2048         u            Tainting checks
4096         L            Memory leaks (not supported anymore)
8192         H            Hash dump; usurps values()
16384        X            Scratchpad allocation (Perl 5 only)
32768        D            Cleaning up (Perl 5 only)
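
The rule for combining flags can be sketched in a few lines of Perl. This is only an illustration of the arithmetic: the %flag_value table holds just the first four letters from Table 7.2, and flags_to_number is a hypothetical helper, not part of Perl.

```perl
use strict;
use warnings;

# Flag letters and their numeric values, copied from Table 7.2 (first four only).
my %flag_value = ( 'p' => 1, 's' => 2, 'l' => 4, 't' => 8 );

# Hypothetical helper: add up the values of the letters in a string like "ls".
sub flags_to_number {
    my ($letters) = @_;
    my $total = 0;
    foreach my $letter ( split //, $letters ) {
        $total += $flag_value{$letter};
    }
    return $total;
}

print flags_to_number('ls'), "\n";    # l (4) + s (2)
```

This prints 6, which is why -D6 and -Dls ask for the same debugging output.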

A Perl Script

A Perl program consists of an ordinary text file containing a series of Perl commands. Commands are written in what looks like a bastardized amalgam of C, shell script, and English. In fact, that's pretty much what it is.

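Perl code can be quite free-flowing. The broad syntactic rules governing where a statement starts and ends are simple: leading white space is ignored, a statement ends with a semicolon and may span several lines, and anything from a # to the end of the line is a comment.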

Here's a Perl statement inspired by Kurt Vonnegut:

print "My name is Yon Yonson\n";

No prizes for guessing what happens when Perl runs this code; it prints

My name is Yon Yonson

If the \n doesn't look familiar, don't worry; it simply means that Perl should print a newline character after the text; in other words, Perl should go to the start of the next line.

Printing more text is a matter of either stringing together statements or giving multiple arguments to the print function:

print "My name is Yon Yonson,\n";
print "I live in Wisconsin,\n",
      "I work in a lumbermill there.\n";

That's right, print is a function. It may not look like it in any of the examples so far, where there are no parentheses to delimit the function arguments, but it is a function, and it takes arguments. You can use parentheses in Perl functions if you like; it sometimes helps to make an argument list clearer. More accurately, in this example the function takes a single argument consisting of an arbitrarily long list. We'll have much more to say about lists and arrays later, in the "Data Types" section. There will be a few more examples of the more common functions in the remainder of this chapter, but refer to the "Functions" chapter for a complete run-down on all of Perl's built-in functions.

So what does a complete Perl program look like? Here's a trivial UNIX example, complete with the invocation line at the top and a few comments:

#!/usr/local/bin/perl -w                    # Show warnings

print "My name is Yon Yonson,\n";           # Let's introduce ourselves
print "I live in Wisconsin,\n",
      "I work in a lumbermill there.\n";    # Remember the line breaks

That's not at all typical of a Perl program though; it's just a linear sequence of commands with no structural complexity. The "Flow Control" section later in this overview introduces some of the constructs that make Perl what it is. For now, we'll stick to simple examples like the preceding for the sake of clarity.

Data Types

Perl has a small number of data types. If you're used to working with C, where even characters can be either signed or unsigned, this makes a pleasant change. In essence, there are only two data types: scalars and arrays. There is also a very special kind of array called an associative array that merits a section all to itself.

Scalars

All numbers and strings are scalars. Scalar variable names start with a dollar sign.

Note
All Perl variable names, including scalars, are case sensitive. $Name and $name, for example, are two completely different quantities.

Perl converts automatically between numbers and strings as required, so that

$a = 2;
$b = 6;
$c = $a . $b; # The "." operator concatenates two strings
$d = $c / 2;
print $d;

yields the result

13

This example involves converting two integers into strings, concatenating the strings into a new string variable, converting this new string to an integer, dividing it by two, converting the result to a string, and printing it. All of these conversions are handled implicitly, leaving the programmer free to concentrate on what needs to be done rather than the low-level details of how it is to be done.

This might be a problem if Perl were regularly used for tasks where, for example, explicit memory offsets were used and data types were critical. But for the type of task where Perl is normally used, these automatic conversions are smooth, intuitive, and useful.

We can use this to develop the earlier example script using some string variables:

#!/usr/local/bin/perl -w                    # Show warnings

$who = 'Yon Yonson';
$where = 'Wisconsin';
$what = 'in a lumbermill';

print "My name is $who,\n";                 # Let's introduce ourselves
print "I live in $where,\n",    
      "I work $what there.\n";                   # Remember the line breaks

print "\nSigned: \t$who,\n\t\t$where.\n";

which yields

My name is Yon Yonson,
I live in Wisconsin,
I work in a lumbermill there.

Signed:         Yon Yonson,
                Wisconsin.

Arrays

A collection of scalars is an array. An array variable name starts with an @ sign, while an explicit array of scalars is written as a comma-separated list within parentheses:

@trees = ("Larch", "Hazel", "Oak");

Array subscripts are denoted using square brackets: $trees[0] is the first element of the @trees array. Notice that it's @trees but $trees[0]; individual array elements are scalars, so they start with a $.

Mixing scalar types in an array is not a problem. For example,

@items = (15, 45.67, "case");
print "Take $items[0] $items[2]s at \$$items[1] each.\n";

results in

Take 15 cases at $45.67 each.

All arrays in Perl are dynamic. You never have to worry about memory allocation and management because Perl does all that stuff for you. Combine that with the fact that an array named inside a list is expanded into its individual elements, and you're free to say things like the following:

@A = (1, 2, 3);
@B = (4, 5, 6);
@C = (7, 8, 9);
@D = (@A, @B, @C);

which results in the array @D containing numbers 1 through 9. The power of constructs such as

@Annual = (@Spring, @Summer, @Fall, @Winter);

takes some getting used to.

Note
An aspect of Perl that often confuses newcomers (and occasionally the old hands too) is the context-sensitive nature of evaluations. Perl keeps track of the context in which an expression is being evaluated and can return a different value in an array context than in a scalar context. In the following example:
@A = (1, 2, 3, 4);
@B = @A;
$C = @A;
the array @B contains 1 through 4 while $C contains "4", the number of values in the array. This context-sensitivity becomes more of an issue when you use functions and operators that can take either a single argument or multiple arguments. The results can be quite different depending on what is passed to them.
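
A minimal sketch of the same idea, using made-up variable names, shows how the left-hand side of an assignment decides the context in which @A is evaluated:

```perl
use strict;
use warnings;

my @A = (1, 2, 3, 4);

my @copy    = @A;        # array context: all four elements are copied
my $count   = @A;        # scalar context: the number of elements
my ($first) = @A;        # parentheses on the left give array context again

print "copy:  @copy\n";
print "count: $count\n";
print "first: $first\n";
```

The last assignment is a common surprise: with parentheses, $first receives the first element (1) rather than the count (4).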

Many of Perl's built-in functions take arrays as arguments. One example is sort, which takes an array as argument and returns the same array sorted alphabetically:

print sort ( 'Beta', 'Gamma', 'Alpha' );

prints AlphaBetaGamma.

We can make this neater using another built-in function, join. This function takes two arguments: A string to connect with and an array of strings to connect. It returns a single string consisting of all elements in the array joined with the connecting string:

print join ( ' : ', 'Name', 'Address', 'Phone' );

returns the string Name : Address : Phone.

Because sort returns an array, we can feed its output straight into join:

print join( ', ', sort ( 'Beta', 'Gamma', 'Alpha' ) );

prints Alpha, Beta, Gamma.

Note that we haven't separated the initial scalar argument of join from the array that follows it: The first argument is the string to join things with; the rest of the arguments are treated as a single argument, the array to be joined. This is true even if we use parentheses to separate groups of arguments:

print join( ': ', ('A', 'B', 'C'), ('D', 'E'), ('F', 'G', 'H', 'I'));

returns A: B: C: D: E: F: G: H: I. That's because of the way Perl treats arrays; adding an array to an array gives us one larger array, not two arrays. In this case, all three arrays get bundled into one.
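
The flattening behavior is easy to check for yourself. The short sketch below, with illustrative variable names, counts the elements of a combined array and looks one of them up by subscript:

```perl
use strict;
use warnings;

my @A = (1, 2, 3);
my @B = (4, 5, 6);
my @D = (@A, @B);        # one flat array of six scalars, not two sub-arrays

my $size = @D;           # scalar context gives the element count
print "D has $size elements; the fourth is $D[3]\n";
print join(': ', ('A', 'B'), ('C')), "\n";
```

If @D held two sub-arrays, $size would be 2; because the lists are merged, it is 6 and $D[3] is simply 4.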

Tip
For even more powerful array manipulation capabilities, refer to the splice function in Chapter 10, "Perl Functions."

Associative Arrays

There is a certain elegance to associative arrays that makes experienced Perl programmers a little snobbish about their language of choice. Rightly so! Associative arrays give Perl a degree of database functionality at a very low yet useful level. Many tasks that would otherwise involve complex programming can be reduced to a handful of Perl statements using associative arrays.

Arrays of the type we've already seen are lists of values indexed by subscripts. In other words, to get an individual element of an array, you supply a subscript as a reference:

@fruit = ("Apple", "Orange", "Banana");
print $fruit[2];

This example yields Banana because subscripts start at 0 and so 2 is the subscript for the third element of the @fruit array. A reference to $fruit[7] here returns the null value, as no array element with that subscript has been defined.

Now, here's the point of all this: Associative arrays are lists of values indexed by strings. Conceptually, that's all there is to them. Their implementation is more complex, obviously, as all of the strings need to be stored in addition to the values to which they refer.

When you want to refer to an element of an associative array, you supply a string (also called the key) instead of an integer (also called the subscript). Perl returns the corresponding value. Consider the following example:

%fruit = ("Green", "Apple", "Orange", "Orange",
"Yellow", "Banana");
print $fruit{"Yellow"};

This prints Banana as before. The first line defines the associative array in much the same way as we have already defined ordinary arrays; the difference is that instead of listing values, we list key/value pairs. The first value is Apple and its key is Green; the second value is Orange, which happens to have the same string for both value and key; and the final value is Banana and its key is Yellow.

On a superficial level, this can be used to provide mnemonics for array references, allowing us to refer to $Total{'June'} instead of $Total[5]. But that's not even beginning to use the power of associative arrays. Think of the keys of an associative array as you might think of a key linking tables in a relational database, and you're closer to the idea:

%Folk =   ( 'YY', 'Yon Yonson',
            'TC', 'Terra Cotta',
            'RE', 'Ron Everly' );

%State = ( 'YY', 'Wisconsin',
           'TC', 'Minnesota',
           'RE', 'Bliss' );

%Job = ( 'YY', 'work in a lumbermill',
         'TC', 'teach nuclear physics',
         'RE', 'watch football');

foreach $person ( 'TC', 'YY', 'RE' )  {
        print "My name is $Folk{$person},\n",
              "I live in $State{$person},\n",
              "I $Job{$person} there.\n\n";
        }

The foreach construct is explained later in the "Flow Control" section; for now, you just need to know that it makes Perl execute the print statement once for each of the people in the list after the foreach keyword.

The keys and values of an associative array may be treated as separate (ordinary) arrays as well, by using the keys and values keywords respectively:

print keys %Folk;
print values %State;

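prints a string such as YYRETCWisconsinBlissMinnesota. The order in which an associative array returns its keys is not defined, but keys and values always come back in the same order as each other. String handling will be discussed later in this chapter.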
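
Because a script should not rely on the order in which keys come back, a common approach is to sort the keys and look up each value in turn. The following is a small sketch along those lines; the @report array is introduced only for this illustration.

```perl
use strict;
use warnings;

my %State = ( 'YY', 'Wisconsin', 'TC', 'Minnesota', 'RE', 'Bliss' );

# Sorting the keys gives a predictable order; each key then looks up its value.
my @report;
foreach my $key ( sort keys %State ) {
    push @report, "$key=$State{$key}";
}
print join(', ', @report), "\n";
```

This prints RE=Bliss, TC=Minnesota, YY=Wisconsin regardless of how Perl happens to store the associative array internally.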

Note
There is a special associative array, %ENV, that stores the contents of all environment variables, indexed by variable name. So $ENV{'PATH'} returns the current search path, for example. Here's a way to print the current value of all environment variables, sorted by variable name for good measure:
foreach $var (sort keys %ENV ) {
     print "$var: \"$ENV{$var}\".\n";
}
The foreach clause sets $var to each of the environment variable names in turn (in alphabetical order), and the print statement prints each name and value. As the symbol " is used to specify the beginning and end of the string being printed, when we actually want to print a " we have to tell Perl to ignore the special meaning of the character. This is done by prefixing it with a backslash character (this is sometimes called quoting a character).

File Handles

We'll finish our look at Perl data types with a look at file handles. Really this is not a data type but a special kind of literal string. A file handle behaves in many ways like a variable, however, so this is a good time to cover them. Besides, you won't get very far in Perl without them…

A file handle can be regarded as a pointer to a file from which Perl is to read, or to which it will write. C programmers will be familiar with the concept. The basic idea is that you associate a handle with a file or device, and then refer to the handle in the code whenever you need to perform a read or write operation.

File handles are generally written in all uppercase. Perl has some useful predefined file handles, which are listed in Table 7.3.

Table 7.3  Perl's Predefined File Handles

File Handle  Points To
STDIN        Standard input, normally the keyboard
STDOUT       Standard output, normally the console
STDERR       Device where error messages should be written, normally the console

The print statement can take a file handle as its first argument:

print STDERR "Oops, something broke.\n";

Note that there is no comma after the file handle, which helps Perl to figure out that the STDERR is not something to be printed. If you're uneasy with this implicit list syntax, you can put parentheses around all of the print arguments:

print (STDERR "Oops, something broke.\n");

Note that there is still no comma after the file handle.

Tip
Use the standard file handles explicitly, especially in complex programs. It is sometimes convenient to redefine the standard input or output device for a while; make sure that you don't accidentally wind up writing to a file what should have gone to the screen.

The open function may be used to associate a new file handle with a file:

open (INDATA, "/etc/stuff/Friday.dat");
open (LOGFILE, ">/etc/logs/reclaim.log");
print LOGFILE "Log of reclaim procedure\n";

By default, open opens files for reading only. If you want to override this default behavior, add one of the special direction symbols from Table 7.4 to the file name. That's what the > at the start of the file name in the second open statement is for; it tells Perl that we intend to write to the named file.

Table 7.4  Perl File Access Symbols

Symbol                 Meaning
<                      Opens the file for reading. This is the default action.
>                      Opens the file for writing.
>>                     Opens the file for appending.
+<                     Opens the file for both reading and writing.
+>                     Opens the file for both reading and writing, truncating it first.
| (before file name)   Treats file as command into which Perl is to pipe text.
| (after file name)    Treats file as command from which input is to be piped to Perl.
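
To see the first three symbols in action, here is a small sketch that writes a file, appends to it, and reads it back. The scratch file name is a made-up one for this illustration, and the || die idiom used to catch failures is covered in the "Logical Operators" section.

```perl
use strict;
use warnings;

# Hypothetical scratch file name, used only for this illustration.
my $file = "modes_demo.$$.tmp";

open(OUT, ">$file") || die "Cannot write $file: $!\n";       # > creates or truncates
print OUT "first line\n";
close(OUT);

open(OUT, ">>$file") || die "Cannot append to $file: $!\n";  # >> adds to the end
print OUT "second line\n";
close(OUT);

open(IN, "<$file") || die "Cannot read $file: $!\n";         # < reads (the default)
my @lines = <IN>;                                            # array context: every line
close(IN);
unlink($file);                                               # tidy up the scratch file

print @lines;
```

Had the second open used > instead of >>, the file would have been emptied first and only the second line would survive.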

To take a more complex example, the following is one way to feed output to the mypr printer on a UNIX system:

open (MYLPR, "|lpr -Pmypr");
print MYLPR "A line of output\n";
close MYLPR;

There is a special Perl operator for reading from files. It consists of two angle brackets around the file handle of the file from which we want to read, and it returns the next line or lines of input from the file or device, depending on whether the operator is used in a scalar or an array context. When no more input remains, the operator returns False.

For example, a construct like the following

while (<STDIN>) {
print;
}

simply echoes each line of input back to the console until the Ctrl and D keys are pressed. That's because the print function takes the current default argument here, the most recent line of input. Refer to Chapter 8, "Perl Special Variables," for an explanation.

If the user types

A
Bb
Ccc
^D

then the screen will look like

A
A
Bb
Bb
Ccc
Ccc
^D

Note that in this case, <STDIN> is in a scalar context and so one line of standard input is returned at a time. Compare that with the following example:

print <STDIN>;

In this case, because print expects an array of arguments (it can be a single element array, but it's an array as far as print is concerned), the <> operator obligingly returns all the contents of STDIN as an array and print then prints it. This means that nothing is written to the console until the user presses the Ctrl and D keys:

A
Bb
Ccc
^D
A
Bb
Ccc

This script prints out the contents of the file .signature, double-spaced:

open (SIGFILE, ".signature");
while ( <SIGFILE> ) {
print; print "\n";
}

The first print has no arguments, so it takes the current default argument and prints it. The second has an argument, so it prints that instead. Perl's habit of using default arguments extends to the <> operator: if used with no file handle, it is assumed that <ARGV> is intended. This expands to each line in turn of each file listed on the command line.

If no files are listed on the command line, it is instead assumed that STDIN is intended. So for example,

while (<>) {
print "more.... ";
}

keeps printing more.... as long as something other than Ctrl+D appears on standard input.

Note
Perl 5 allows array elements to be references to any data type. This makes it possible to build arbitrary data structures of the kind used in C and other high-level languages, but with all the power of Perl; you can, for example, have an array of associative arrays.

Flow Control

The examples we've seen so far have been quite simple, with little or no logical structure beyond a linear sequence of steps. We managed to sneak in the occasional while and foreach. Perl has all of the flow control mechanisms you'd expect to find in a high-level language, and this section takes you through the basics of each.

Logical Operators

Let's start with two operators that are used like glue holding Perl programs together: the || (or) and && (and) operators. They take two operands and return either True or False depending on the operands:

$Weekend = $Saturday || $Sunday;

If either $Saturday or $Sunday is true, then $Weekend is true.

$Solvent = ($income > 3) && ($debts < 10);

$Solvent is true only if $income is greater than 3 and $debts is less than 10.

Now consider the logic of evaluating one of these expressions. It isn't always necessary to evaluate both operands of either a && or a || operator. In the first example, if $Saturday is true, then we know $Weekend is true, regardless of whether $Sunday is also true.

This means that having evaluated the lefthand side of an || expression as true, the righthand side will not be evaluated. Combine this with Perl's easy way with data types, and you can say things like the following:

$value > 10 || print "Oops, low value $value ...\n";

If $value is greater than 10, the righthand side of the expression is never evaluated, so nothing is printed. If $value is not greater than 10, Perl needs to evaluate the righthand side to decide whether the expression as a whole is True or False. That means it evaluates the print statement, printing a message like

Oops, low value 6 ...

Okay, it's a trick, but it's a very useful one.

Something analogous applies to the && operator. In this case, if the lefthand side of an expression is False, then the expression as a whole is false and so Perl will not evaluate the righthand side. This can be used to produce the same kind of effect as our || trick but with the opposite sense:

$value > 10 && print "OK, value is high enough...\n";

As with most Perl constructs, the real power of these tricks comes when you apply a little creative thinking. Remember that the left- and righthand sides of these expressions can be any Perl expression; think of them as conjunctions in a sentence rather than as logical operators and you'll get a better feel for how to use them. Expressions such as

$length <= 80 || die "Line too long.\n";
$errorlevel > 3 && warn "Hmmm, strange error level ($errorlevel)...\n";
open ( LOGFILE, ">install.log") || &bust("Log file");

give a little of the flavor of creative Perl.

The &bust in that last line is a subroutine call, by the way. Refer to the "Subroutines" section later in this chapter for more information.
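
One way to convince yourself of the short-circuit behavior is to record which righthand sides actually run. In the sketch below, note is a hypothetical subroutine standing in for a side effect such as print or die, and @log is an array invented for the purpose.

```perl
use strict;
use warnings;

my @log;    # records which right-hand sides were actually evaluated

# Hypothetical helper standing in for a side effect such as print or die.
sub note {
    my ($message) = @_;
    push @log, $message;
    return 1;
}

my $value = 6;
$value > 10 || note("low value $value");      # left side false: right side runs
$value > 10 && note("high value $value");     # left side false: right side skipped
$value < 10 || note("not reached");           # left side true: right side skipped

print join("\n", @log), "\n";
```

Only the first call is made, so the script prints the single line low value 6.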

Conditional Expressions

The basic kind of flow control is a simple branch: A statement is either executed or not depending on whether a logical expression is True or False. This can be done by following the statement with a modifier and a logical expression:

open ( INFILE, "./missing.txt") if $missing;

The execution of the statement is contingent upon both the evaluation of the expression and the sense of the operator.

The expression evaluates as either True or False and can contain any of the relational operators listed in Table 7.5, although it doesn't have to. Examples of valid expressions are

$full
$a == $b
<STDIN>

Table 7.5  Perl's Relational Operators

Operator                       Numeric Context  String Context
Equality                       ==               eq
Inequality                     !=               ne
Inequality with signed result  <=>              cmp
Greater than                   >                gt
Greater than or equal to       >=               ge
Less than                      <                lt
Less than or equal to          <=               le

Note
What exactly does "less than" mean when we're comparing strings? It means "lexically less than." If $left comes before $right when the two are sorted alphabetically, $left is less than $right.
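
The difference between the two columns of Table 7.5 shows up as soon as the operands look like numbers. The following sketch, with illustrative variable names, compares the same two values both ways:

```perl
use strict;
use warnings;

my ($left, $right) = ("10", "9");

my $numeric = $left <=> $right;   # 10 is numerically greater than 9, so 1
my $string  = $left cmp $right;   # "1" sorts before "9", so -1

print "numeric: $numeric, string: $string\n";
print "10 lt 9 is true as strings\n" if $left lt $right;
print "10 < 9 is false as numbers\n" unless $left < $right;
```

Choosing the wrong family of operators is an easy mistake to make, since Perl will happily convert the operands either way.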

There are four modifiers, each of which behaves the way you might expect from the corresponding English word:

$max = 100 if $min < 100;
print "Empty!\n" if !$full;
open (ERRLOG, "test.log") unless $NoLog;
print "Success" unless $error>2;
$total -= $decrement while $total > $decrement;
$n=1000; print "$n\n" while $n-- > 0;
$total += $value[$count++] until $total > $limit;
print RESULTS "Next value: $value[$n++]" until $value[$n] == -1;

Note that the logical expression is evaluated once only in the case of if and unless but multiple times in the case of while and until. In other words, the first two are simple conditionals, while the last two are loop constructs.

Compound Statements

The syntax changes when we want to make the execution of multiple statements contingent on the evaluation of a logical expression. The modifier comes at the start of a line, followed by the logical expression in parentheses, followed by the conditional statements contained in braces. Note that the parentheses around the logical expression are required, unlike with the single statement branching described in the previous section. For example,

if ( ( $total += $value ) > $limit )  {
   print LOGFILE "Maximum limit $limit exceeded.",
                 " Offending value was $value.\n";
   close (LOGFILE);
   die "Too many! Check the log file for details.\n";
   }

This is somewhat similar to C's if syntax, except that the braces around the conditional statement block are required rather than optional.

The if statement is capable of a little more complexity, with else and elsif operators:

if ( !open( LOGFILE, ">install.log") )   {
   close ( INFILE );
   die "Unable to open log file!\n";
   }
elsif ( !open( CFGFILE, ">system.cfg") )  {
   print LOGFILE "Error during install:",
                 " Unable to open config file for writing.\n";
   close ( LOGFILE );
   die "Unable to open config file for writing!\n";
   }
else  {
   print CFGFILE "Your settings go here!\n";
   }

Loops

The loop modifiers (while, until, for, and foreach) are used with compound statements in much the same way:

until ( $total >= 50 )  {
   print "Enter a value: ";
   $value = scalar (<STDIN>);
   $total += $value;
   print "Current total is $total\n";
   }
print "Enough!\n";

The while and until statements were described in the earlier "Conditional Expressions" section. The for statement resembles the one in C: It is followed by an initial value, a termination condition, and an iteration expression, all enclosed in parentheses and separated by semicolons:

for ( $count = 0; $count < 100; $count++ )  {
   print "Something";
   }

The foreach operator is special. It iterates over the contents of an array and executes the statements in a statement block for each element of the array. A simple example is the following:

@numbers = ("one", "two", "three", "four");
foreach $num ( @numbers )   {
   print "Number $num...\n";
   }

The variable $num first takes on the value one, then two, and so on. That example looks fairly trivial, but the real power of this operator lies in the fact that it can operate on any array:

foreach $arg ( @ARGV )   {
   print "Argument: \"$arg\".\n";
   }
foreach $namekey ( sort keys %surnames )  {
   print REPORT "Surname: $surnames{$namekey}.\n",
                "Address: $address{$namekey}.\n";
   }

Labels

Labels may be used with the next, last, and redo statements to provide more control over program flow through loops. A label consists of any word, usually in uppercase, followed by a colon. The label appears just before the loop operator (while, for, or foreach) and can be used as an anchor for jumping to from within the block:

RECORD:  while ( <INFILE> )  {
   $even = !$even;
   next RECORD if $even;
   print;
   }

That code snippet prints every second record in INFILE: $even starts out false, so the first record is skipped, the second is printed, and so on.
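
Labels earn their keep when loops are nested, because next and last otherwise act on the innermost loop only. Here is a small sketch using made-up sample data, in which an inner loop uses the ROW label to control the outer one:

```perl
use strict;
use warnings;

my @rows = ( "1 2 3", "4 -1 6", "7 8 9" );   # made-up sample data
my @seen;

ROW: foreach my $row (@rows) {
    foreach my $number ( split ' ', $row ) {
        next ROW if $number < 0;     # abandon the rest of this row
        last ROW if $number == 8;    # leave both loops altogether
        push @seen, $number;
    }
}
print "@seen\n";
```

This prints 1 2 3 4 7: the negative number cuts the second row short, and the 8 in the third row ends the whole scan.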

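The three label control statements are next, which skips to the next iteration of the labeled loop; last, which leaves the labeled loop altogether; and redo, which restarts the current iteration without re-evaluating the loop condition.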

Subroutines

The basic subunit of code in Perl is a subroutine. This is similar to a function in C and a procedure or a function in Pascal. A subroutine may be called with various parameters and returns a value. Effectively, the subroutine groups together a sequence of statements so that they can be re-used.

The Simplest Form of Subroutine. Subroutines can be declared anywhere in a program. If more than one subroutine with the same name is declared, each new version replaces the older ones, so that only the last one is effective. It is possible to declare subroutines within an eval() expression; these will not actually be declared until the runtime execution reaches the eval() statement.

Subroutines are declared using the following syntax:

sub subroutine-name {
                statements
}

The simplest form of subroutine is one that does not return any value and does not access any external values. The subroutine is called by prefixing the name with the & character. (There are other ways of calling subroutines, which are explained in more detail later.) An example of a program using the simplest form of subroutine illustrates this:

#!/usr/bin/perl -w
# Example of subroutine which does not use
# external values and does not return a value
&egsub1; # Call the subroutine once
&egsub1; # Call the subroutine a second time
sub egsub1 {
     print "This subroutine simply prints this line.\n";
}

Tip
While it is possible to refer from a subroutine to any global variable directly, it is normally considered bad programming practice. Reference to global variables from subroutines makes it more difficult to re-use the subroutine code. It is best to make any such references to external values explicit by passing explicit parameters to the subroutine as described in the following section. Similarly, it is best to avoid programming subroutines that directly change the values of global variables because doing so can lead to unpredictable side-effects if the subroutine is re-used in a different program. Use explicit return values or explicit parameters passed by reference as described in the following section.

Returning Values from Subroutines. Subroutines can also return values, thus acting as functions. The return value is the value of the last statement executed. This can be a scalar or an array value.

Caution
Take care not to add seemingly innocuous statements near the end of a subroutine. A print statement returns 1, for example, so a subroutine that prints just before it returns will always return 1.
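The pitfall described in this caution can be seen in a minimal sketch. It assumes Perl 5 (for my() and use strict), and the subroutine names careless_total and careful_total are invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: a trailing print changes the return value
# (careless_total and careful_total are hypothetical names)
use strict;

sub careless_total {
    my $total = $_[0] + $_[1];
    print "Total is $total.\n";   # last statement: the sub returns print's result
}

sub careful_total {
    my $total = $_[0] + $_[1];
    print "Total is $total.\n";
    $total;                       # last statement: the sub returns the total
}

my $careless = careless_total(2, 3);   # the result of print, not the sum
my $careful  = careful_total(2, 3);    # the sum itself
print "careless: $careless, careful: $careful\n";
```

Ending the subroutine with the value itself, or with an explicit return, keeps the result independent of whatever statement happens to come just before it.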

It is possible to test whether the calling context requires an array or a scalar value using the wantarray construct, thus returning different values depending on the required context. For example,

     wantarray ? ('a', 'b', 'c') : 0;

as the last line of a subroutine returns the array ('a', 'b', 'c') in an array context, and the scalar value 0 in a scalar context.

#!/usr/bin/perl -w
# Example of subroutine which does not use
# external values but does return a value
# Call the subroutine once, returning a scalar value
$scalar_return = &egsub2;
print "Scalar return value: $scalar_return.\n";
# Call the subroutine a second time, returning an array value
@array_return = &egsub2;
print "Array return value:", @array_return, ".\n";
sub egsub2 {
     print "This subroutine prints this line and returns a value.\n";
     wantarray ? ('a', 'b', 'c') : 0;
}

It is possible to return from a subroutine before the last statement by using the return() function. The argument to the return() function is the returned value in this case. This is illustrated in the following example, which is not a very efficient way to do the test but illustrates the point:

#!/usr/bin/perl -w
# Example of subroutine which does not use
# external values but does return a value using "return"
$returnval = &egsub3; # Call the subroutine once
print "The current time is $returnval.\n";
sub egsub3 {
     print "This subroutine prints this line and returns a value.\n";
     local($sec, $min, $hour, @rest) =
          gmtime(time);
     # Return early if it is exactly 12:00
     ($min == 0) && ($hour == 12) && (return "noon");
     if ($hour >= 12) {
          return "after noon";
     }
     else {
          return "before noon";
     }
}

Note that it is usual to make any variables used within a subroutine local() to the enclosing block. This means that they will not interfere with any variables that have the same name in the calling program. In Perl 5, these may be made lexically local rather than dynamically local, using my() instead of local() (this is discussed in more detail later).

When returning multiple arrays, the result is flattened into one list so that, effectively, only one array is returned. So in the following example all the return values are in @return_a1 and the second array @return_a2 is empty.

#!/usr/bin/perl -w
# Example of subroutine which does not use
# external values returning an array
(@return_a1, @return_a2) = &egsub4; # Call the subroutine once
print "Return array a1 ",@return_a1,
" Return array a2 ",@return_a2, ".\n";
sub egsub4 {
    print "This subroutine returns a1 and a2.\n";
    local(@a1) = ('a', 'b', 'c');
    local(@a2) = ('d', 'e', 'f');
    return(@a1,@a2);
}

In Perl 4, this problem can be avoided by passing the arrays by reference using a typeglob (see the following section). In Perl 5, you can do this and also manipulate any variable by reference directly (see the following section).
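As a minimal sketch of the Perl 5 approach, the subroutine below returns a reference to each array instead of the arrays themselves. The name two_lists is invented for the illustration, and the reference syntax (\@ to create a reference, @$ to dereference it) is the one explained in the following section.

```perl
#!/usr/bin/perl -w
# Sketch: keeping two returned arrays separate by returning references (Perl 5)
# (two_lists is a hypothetical name)
use strict;

sub two_lists {
    my @a1 = ('a', 'b', 'c');
    my @a2 = ('d', 'e', 'f');
    return (\@a1, \@a2);    # two scalar references, so nothing is flattened
}

my ($ref1, $ref2) = two_lists();
my @first  = @$ref1;        # dereference to recover each array
my @second = @$ref2;
print "First: @first. Second: @second.\n";
```

Because each reference is a single scalar, the returned list has exactly two elements and the caller can tell where one array ends and the next begins.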

Passing Values to Subroutines. The next important aspect of subroutines is that the call can pass values to the subroutine. The call simply lists the values to be passed, and these arrive in the subroutine in the list @_. They are known as the parameters or the arguments. It is customary to copy each value into a named variable at the start of the subroutine so that it is clear what is going on. Manipulating these copies is equivalent to passing arguments by value (that is, the copies may be altered, but this does not alter the value of the variable in the calling program).

#!/usr/bin/perl -w
# Example of subroutine that is passed external values by value
$returnval = &egsub5(45,3); # Call the subroutine once
print "The (45+1) * (3+1) is $returnval.\n";
$x = 45;
$y = 3;
$returnval = &egsub5($x,$y);
print "The ($x+1) * ($y+1) is $returnval.\n";
print "Note that \$x still is $x, and \$y still is $y.\n";
sub egsub5 { # Access $x and $y by value
    local($x, $y) = @_;
    # Pre-increment so that the incremented values are multiplied
    return (++$x * ++$y);
}

To pass scalar values by reference, rather than by value, the elements in @_ can be accessed directly. This will change their values in the calling program. In such a case, the argument must be a variable rather than a literal value, as literal values cannot be altered.

#!/usr/bin/perl -w
# Example of subroutine that is passed external values by reference
$x = 45;
$y = 3;
print "The ($x+1) * ($y+1) ";
$returnval = &egsub6($x,$y);
print "is $returnval.\n";
print "Note that \$x now is $x, and \$y now is $y.\n";
sub egsub6 { # Access $x and $y by reference
    # $_[0] is an alias for $x and $_[1] is an alias for $y
    return (++$_[0] * ++$_[1]);
}

Array values can be passed by reference in the same way. However, several restrictions apply. First, as with returned array values, the @_ list is one single flat array, so passing multiple arrays this way is tricky. Also, although individual elements may be altered in the subroutine using this method, the size of the array cannot be altered within the subroutine (so push() and pop() cannot be used to change the caller's array).
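Both restrictions can be seen in a short sketch. It assumes Perl 5, and the subroutine names double_all and try_to_grow are invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: elements of @_ alias the caller's data, but @_ itself is a separate list
# (double_all and try_to_grow are hypothetical names)
use strict;

sub double_all {
    foreach my $i (0 .. $#_) {
        $_[$i] *= 2;        # alters the caller's array element
    }
}

sub try_to_grow {
    push(@_, 99);           # lengthens @_ only; the caller's array is unchanged
    return scalar(@_);      # number of elements the subroutine now sees
}

my @nums = (1, 2, 3);
double_all(@nums);
my $seen = try_to_grow(@nums);
print "Array is now (@nums); the subroutine saw $seen elements.\n";
```

After both calls, @nums still has three elements, each doubled, even though try_to_grow worked with a four-element @_.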

Therefore, another method has been provided to facilitate the passing of arrays by reference. This method is known as typeglobbing and works with Perl 4 or Perl 5. The principle is that the subroutine declares that one or more of its parameters are typeglobbed, which means that all the references to that identifier in the scope of the subroutine are taken to refer to the equivalent identifier in the namespace of the calling program. The syntax for this declaration is to prefix the identifier with an asterisk rather than an @ sign; thus *array1 typeglobs @array1. In fact, typeglobbing links all forms of the identifier, so *array1 typeglobs @array1, %array1, and $array1 (any reference to any of these in the local subroutine actually refers to the equivalent variable in the calling program's namespace). It only makes sense to use this construct within a local() list, effectively creating a local alias for a set of global variables. So the previous example becomes the following:

#!/usr/bin/perl -w
# Example of subroutine using arrays passed by reference (type globbing)
&egsub7(*a1,*a2); # Call the subroutine once, passing typeglobs
print "Modified array a1 ",@a1," Modified array a2 ",@a2, ".\n";
sub egsub7 {
    local(*a1,*a2) = @_;
    print "This subroutine modifies a1 and a2.\n";
    @a1 = ('a', 'b', 'c');
    @a2 = ('d', 'e', 'f');
}

In Perl 4, this is the only way to use references to variables, rather than variables themselves. In Perl 5, there is also a generalized method for dealing with references. Although this method looks more awkward in its syntax, because of the abundance of underscores, it is actually more precise in its meaning. Typeglobbing automatically aliases the scalar, the array, and the hashed array form of an identifier, even if only the array name is required. With Perl 5 references this distinction can be made explicit; only the array form of the identifier is referenced.

#!/usr/bin/perl -w
# Example of subroutine using arrays passed
# by reference (Perl 5 references)
&egsub7(\@a1,\@a2); # Call the subroutine once
print "Modified array a1",@a1," Modified array a2 ",@a2, ".\n";
sub egsub7 {
    local($a1ref,$a2ref) = @_;
    print "This subroutine modifies a1 and a2.\n";
    @$a1ref = ('a', 'b', 'c');
    @$a2ref = ('d', 'e', 'f');
}

Subroutine Recursion. One of the most powerful features of subroutines is their ability to call themselves. There are many problems that can be solved by repeated application of the same procedure. However, care must be taken to set up a termination condition where the recursion stops and the execution can unravel itself. Typical examples of this approach are found when processing lists: process the head item and then process the tail; if the tail is empty, do not recurse. Another neat example is the calculation of a factorial value:

#!/usr/bin/perl -w
#
# Example factorial using recursion

for ($x=1; $x<100; $x++) {
        print "Factorial $x is ",&factorial($x), "\n";
}

sub factorial {
        local($x) = @_;
        if ($x == 1) {
                return 1;
        }
        else {
                return ($x * &factorial($x-1));
        }
}
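The list-processing pattern mentioned above (process the head, then recurse on the tail) can be sketched in the same way. This is a minimal illustration assuming Perl 5; the name sum_list is invented here.

```perl
#!/usr/bin/perl -w
# Sketch: recursive list processing (head first, then the tail)
# (sum_list is a hypothetical name)
use strict;

sub sum_list {
    my ($head, @tail) = @_;
    return 0 unless defined $head;       # called with an empty list
    return $head unless @tail;           # empty tail: termination condition
    return $head + sum_list(@tail);      # process the head, recurse on the tail
}

print "Sum is ", sum_list(1, 2, 3, 4), "\n";
```

Each call handles one element and passes a shorter list to the next call, so the recursion is guaranteed to reach the empty tail.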

Issues of Scope with my() and local(). Issues of scope are very important with relation to subroutines. In particular all variables inside subroutines should be made lexical local variables (using my()) or dynamic local variables (using local()). In Perl 4, the only choice is local() because my() was only introduced in Perl 5.

Variables declared using the my() construct are considered to be lexical local variables. They are not entered in the symbol table for the current package. Therefore, they are totally hidden from all contexts other than the local block within which they are declared. Even subroutines called from the current block cannot access lexical local variables in that block. Lexical local variables must begin with an alphanumeric character or an underscore.

Variables declared using the local() construct are considered to be dynamic local variables. The value is local to the current block and to any subroutines called from that block. It is possible to localize special variables as dynamic local variables, but these cannot be made into lexical local variables. These two differences from lexical local variables mark the two cases in Perl 5 where it is still advisable to use local() rather than my(): when a called subroutine must see the localized value, and when a special variable needs a temporary value.
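The difference in visibility can be sketched as follows. This is a minimal Perl 5 illustration; the variable $level and the subroutine names show_level, with_local, and with_my are invented for the purpose.

```perl
#!/usr/bin/perl -w
# Sketch: dynamic scope (local) versus lexical scope (my)
# ($level, show_level, with_local and with_my are hypothetical names)
use strict;
use vars qw($level);                 # declare a package (global) variable

$level = "global";

sub show_level { return $level; }    # always reads the package variable

sub with_local {
    local($level) = "dynamic";       # temporary value, visible to called subs
    return show_level();
}

sub with_my {
    my($level) = "lexical";          # private to this block
    return show_level();             # so the called sub still sees the global
}

my $from_local = with_local();
my $from_my    = with_my();
print "local: $from_local, my: $from_my, afterwards: $level\n";
```

The call made under local() sees the temporary value, the call made under my() does not, and the global value is restored once with_local returns.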

Pattern Matching

We'll finish this overview of Perl with a look at Perl's pattern matching capabilities. The ability to match and replace patterns is vital to any scripting language that claims to be capable of useful text manipulation. By this stage, you probably won't be surprised to read that Perl matches patterns better than any other general-purpose language. Perl 4's pattern matching was excellent, but Perl 5 has introduced some significant improvements, including the capability to match even more arbitrary strings than before.

The basic pattern matching operations we'll be looking at are

The patterns referred to here are more properly known as regular expressions, and we'll start by looking at them.

Regular Expressions

A regular expression is a set of rules describing a generalized string. If the characters that make up a particular string conform to the rules of a particular regular expression, then the regular expression is said to match that string.

A few concrete examples usually help after an overblown definition like that. The regular expression b. will match the strings bovine, above, Bobby, and Bob Jones but not the strings Bell, b, or Bob. That's because the expression insists that the lowercase letter b must be in the string and it must be followed immediately by another character.

The regular expression b+, on the other hand, requires the lowercase letter b at least once. This matches b and Bob in addition to the example matches for b.. The regular expression b* requires zero or more bs, so it will match any string. That is fairly useless, but it makes more sense as part of a larger regular expression; for example, Bob*y matches Boy, Boby, and Bobby but not Boboby.
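These claims are easy to check by running the sample strings through the match operator, which is introduced properly in the "Matching" section. The following is a minimal sketch assuming Perl 5; the helper name matching is invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: checking the sample strings against the expressions discussed above
# (matching is a hypothetical helper name)
use strict;

# Return those strings from the list that match the given pattern
sub matching {
    my ($pattern, @strings) = @_;
    return grep { /$pattern/ } @strings;
}

my @dot  = matching('b.',    'bovine', 'above', 'Bobby', 'Bob Jones', 'Bell', 'b', 'Bob');
my @plus = matching('b+',    'b', 'Bob', 'Bell');
my @star = matching('Bob*y', 'Boy', 'Boby', 'Bobby', 'Boboby');

print "b. matches:    ", join(', ', @dot),  "\n";
print "b+ matches:    ", join(', ', @plus), "\n";
print "Bob*y matches: ", join(', ', @star), "\n";
```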

Assertions. There are a number of so-called assertions that are used to anchor parts of the pattern to word or string boundaries. The ^ assertion matches the start of a string, so the regular expression ^fool matches fool and foolhardy but not tomfoolery or April fool. The assertions are listed in Table 7.6.

Table 7.6  Perl's Regular Expression Assertions

Assertion  Matches            Example  Example Matches  Example Doesn't Match
^          Start of string    ^fool    foolish          tomfoolery
$          End of string      fool$    April fool       foolish
\b         Word boundary      be\b     be side          beside
\B         Non-word boundary  be\B     beside           be side
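As a rough illustration of how the four assertions behave, the sketch below reports which anchored forms of the word fool match a given string. It assumes Perl 5, and the subroutine name anchors and its labels are invented for the example.

```perl
#!/usr/bin/perl -w
# Sketch: which anchored forms of "fool" match a given string
# (anchors and its labels are hypothetical names)
use strict;

sub anchors {
    my ($string) = @_;
    my @found;
    push(@found, 'start')  if $string =~ /^fool/;      # at the start of the string
    push(@found, 'end')    if $string =~ /fool$/;      # at the end of the string
    push(@found, 'word')   if $string =~ /\bfool\b/;   # as a whole word
    push(@found, 'inside') if $string =~ /\Bfool\B/;   # buried inside a word
    return join('+', @found) || 'none';
}

foreach my $string ('foolish', 'tomfoolery', 'April fool') {
    print "$string: ", anchors($string), "\n";
}
```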

Atoms. The . we saw in b. is an example of a regular expression atom. Atoms are, as the name suggests, the fundamental building blocks of a regular expression. A full list appears in Table 7.7.

Table 7.7  Perl's Regular Expression Atoms

Atom                                   Matches                                   Example    Example Matches  Example Doesn't Match
.                                      Any character except newline              b.b        bob              bb
List of characters in square brackets  Any one of those characters               ^[Bb]      Bob, bob         Rbob
Regular expression in parentheses      Anything that regular expression matches  ^a(b.b)c$  abobc            abbc

Quantifiers. A quantifier is a modifier for an atom. It can be used to specify that a particular atom must appear at least once, for example, as in b+. The atom quantifiers are listed in Table 7.8.

Table 7.8  Perl's Regular Expression Atom Quantifiers

Quantifier  Matches                                      Example   Example Matches  Example Doesn't Match
*           Zero or more instances of the atom           ab*c      ac, abc          abb
+           One or more instances of the atom            ab+c      abc              ac
?           Zero or one instances of the atom            ab?c      ac, abc          abbc
{n}         n instances of the atom                      ab{2}c    abbc             abbbc
{n,}        At least n instances of the atom             ab{2,}c   abbc, abbbc      abc
{n,m}       At least n, at most m instances of the atom  ab{2,3}c  abbc             abbbbc
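The quantifiers can be compared side by side by trying each one against the same set of strings. This is a minimal sketch assuming Perl 5; the helper name matches_for and the candidate strings are chosen for the illustration, and each pattern is anchored with ^ and $ here so that the whole string must match.

```perl
#!/usr/bin/perl -w
# Sketch: the effect of each quantifier on the atom b
# (matches_for is a hypothetical helper name)
use strict;

# Return the candidate strings that the pattern matches in full
sub matches_for {
    my ($pattern) = @_;
    my @candidates = ('ac', 'abc', 'abbc', 'abbbc', 'abbbbc');
    return join(' ', grep { /^$pattern$/ } @candidates);
}

foreach my $pattern ('ab*c', 'ab+c', 'ab?c', 'ab{2}c', 'ab{2,}c', 'ab{2,3}c') {
    print "$pattern: ", matches_for($pattern), "\n";
}
```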

Special Characters. There are a number of special characters denoted by the backslash; \n is perhaps especially familiar to C programmers. Table 7.9 lists the special characters.

Table 7.9  Perl's Regular Expression Special Characters

Symbol  Matches                                         Example  Example Matches  Example Doesn't Match
\d      Any digit                                       b\dd     b4d              bad
\D      Non-digit                                       b\Dd     bdd              b4d
\n      Newline
\r      Carriage return
\t      Tab
\f      Formfeed
\s      Whitespace character
\S      Non-whitespace character
\w      Alphanumeric character or underscore            a\wb     a2b              a^b
\W      Any character other than those matched by \w    a\Wb     aa^b             aabb
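A quick way to get a feel for the character classes is to test single characters against them. The following is a minimal sketch assuming Perl 5; the subroutine name describe and its labels are invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: classifying a single character with \d, \w and \s
# (describe and its labels are hypothetical names)
use strict;

sub describe {
    my ($char) = @_;
    my @classes;
    push(@classes, 'digit')      if $char =~ /^\d$/;
    push(@classes, 'word')       if $char =~ /^\w$/;
    push(@classes, 'whitespace') if $char =~ /^\s$/;
    return join('+', @classes) || 'other';
}

foreach my $char ('4', 'a', '_', "\t", '^') {
    print "Character with code ", ord($char), ": ", describe($char), "\n";
}
```

Note that a digit belongs to both \d and \w, and that the underscore counts as a \w character.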

Backslashed Tokens. It is essential that regular expressions are able to use all characters so that all possible strings that occur in the real world can be matched. With so many characters having special meanings, a mechanism is therefore required that allows us to represent any arbitrary character in a regular expression.

This is done using a backslash followed by a numeric quantity. This quantity can take on any of the following formats:

Matching

Let's start putting all of that together with some real pattern matching. The match operator normally consists of two forward slashes with a regular expression in between, and it normally operates on the contents of the $_ variable. So if $_ is serendipity, then /^ser/, /end/ and /^s.*y$/ are all True.

Matching on $_. The $_ variable is special; it is described in full in Chapter 8, "Perl Special Variables." In many ways, it is the default container for data being read in by Perl; the <> operator, for example, gets the next line from STDIN and stores it in $_ when it is used as the condition of a while loop. So the following code snippet lets you type lines of text and tells you when your line matches one of the regular expressions:

$prompt = "Enter some text or press Ctrl-Z to stop: ";
print $prompt;
while (<>)  {
   /^[aA]/ && print "Starts with a or A.  ";
   /[0-9]$/ && print "Ends with a digit.  ";
   /perl/ && print "You said it!   ";
   print $prompt;
   }

Bound Matches. Matching doesn't always have to operate on $_, although this default behavior is quite convenient. There is a special operator, =~, that evaluates to either True or False depending on whether its first operand matches on its second operand. For example, $filename =~ /dat$/ is True if $filename matches on /dat$/. This can be used in conditionals in the usual way:

$filename =~ /dat$/ && die "Can't use .dat files.\n";

There is a corresponding operator with the opposite sense, !~. This is True if the first operand does not match on the second:

$ENV{'PATH'} !~ /perl/ && warn "Not sure if perl is in your path...";

Alternate Delimiters. The match operator can use other characters instead of //; a useful point if you're trying to match a complex expression involving forward slashes. A more general form of the match operator than // is m//. If you use the leading m here, then any character may be used to delimit the regular expression. For example,

$installpath =~ m!^/usr/local!
|| warn "The path you have chosen is odd.\n";
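To see what the alternate delimiter saves, compare the same test written both ways. This is a minimal sketch assuming Perl 5; the subroutine names under_usr_local and under_usr_local_escaped are invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: the same match written with m!! and with //
# (both subroutine names are hypothetical)
use strict;

sub under_usr_local {
    my ($path) = @_;
    return ($path =~ m!^/usr/local/!) ? 1 : 0;       # slashes need no escaping
}

sub under_usr_local_escaped {
    my ($path) = @_;
    return ($path =~ /^\/usr\/local\//) ? 1 : 0;     # every slash must be escaped
}

foreach my $path ('/usr/local/bin/perl', '/opt/perl/bin/perl') {
    print "$path: ", under_usr_local($path), " ", under_usr_local_escaped($path), "\n";
}
```

Both forms behave identically; the m!! version is simply easier to read when the pattern itself contains forward slashes.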

Match Options. A number of optional switches may be applied to the match operator (either the // or m// forms) to alter its behavior. These options are listed in Table 7.10.

Table 7.10  Perl Match Operator's Optional Switches

Switch  Meaning
g       Perform global matching
i       Case-insensitive matching
o       Evaluate the regular expression once only

The g switch continues matching even after the first match has been found. This is useful when using backreferences to examine the matched portions of a string, as described in the later "Backreferences" section.

The o switch is used inside loops where a lot of pattern matching is taking place. It tells Perl that the regular expression (the match operator's operand) is to be evaluated once only. This can improve efficiency in cases where the regular expression is fixed for all iterations of the loop that contains it.
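The i and g switches can be tried out with a short sketch. It assumes Perl 5, where a match with the g switch in a scalar context resumes from the end of the previous match each time it is evaluated; the variable names are invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: the i and g switches
# ($text, $exact, $loose and $count are hypothetical names)
use strict;

my $text = "Perl, perl and PERL";

my $exact = ($text =~ /^perl/)  ? 1 : 0;   # case matters by default
my $loose = ($text =~ /^perl/i) ? 1 : 0;   # i ignores case

# With g, each evaluation continues after the previous match,
# so the loop runs once for every occurrence
my $count = 0;
$count++ while $text =~ /perl/gi;

print "exact: $exact, loose: $loose, occurrences: $count\n";
```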

Backreferences. As we mentioned earlier in the "Backslashed Tokens" section, pattern matching produces quantities known as backreferences. These are the parts of your string where the match succeeded. You need to tell Perl to store them by surrounding the relevant parts of your regular expression with parentheses. They may be referred to later in the same regular expression as \1, \2, and so on, and after the match as $1, $2, and so on. In this example, we check if the user has typed three consecutive four-letter words:

while (<>) {
   /\b(\S{4})\s(\S{4})\s(\S{4})\b/
&& print "Gosh, you said $1 $2 $3!\n";
}

The first four-letter word lies between a word boundary (\b) and some white space (\s) and consists of four non-whitespace characters (\S). If matched, the matching substring is stored in the special variable \1 and the search continues. Once the search is complete, the backreferences may be referred to as $1, $2, and so on.

What if you don't know in advance how many matches to expect? Perform the match in an array context, and Perl returns the matches in an array. Consider this example:

@hits = ("Yon Yonson, Wisconsin" =~ /(\won)/g);
print "Matched on ", join(', ', @hits), ".\n";
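The two ways of referring to a stored match, \1 within the regular expression and $1 afterwards, can be combined in one small sketch. It assumes Perl 5, and the subroutine name doubled_word is invented for the illustration.

```perl
#!/usr/bin/perl -w
# Sketch: \1 inside the pattern, $1 after the match
# (doubled_word is a hypothetical name)
use strict;

# Return the first word that is immediately repeated, or '' if there is none
sub doubled_word {
    my ($line) = @_;
    # \1 must repeat exactly what the first pair of parentheses matched
    if ($line =~ /\b(\w+)\s+\1\b/) {
        return $1;      # after the match, the same text is available as $1
    }
    return '';
}

print "Repeated word: ", doubled_word("Paris in the the spring"), "\n";
```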