by Paul Doyle
Before you get into the nitty-gritty of using Perl on World Wide Web servers, you need to take some time to look at Perl itself.
This chapter provides an overview of the Perl language. It is
not a detailed course in Perl, but it should give you enough Perl
to get by with. You'll probably want to delve into the language
more deeply after you've been programming in it
for a while.
The "Camel Book" |
When you're ready to learn more, you may want to purchase the excellent Programming Perl, by Larry Wall and Randal L. Schwartz (O'Reilly & Associates, Inc.). This book is the definitive work on Perl so far (as you might suspect with Wall's name on the cover). It's readable and humorous yet still sufficiently technical to be of genuine use in everyday Perl programming. Incidentally, the book is called the "Camel book" after the dromedary that happens to adorn the cover. Because of the ubiquitous nature of the book in Perl-literate circles, this animal has become the emblem of the language. |
We're not going to go into too much detail in this chapter; all the gory details are covered in Part V of this book. By the end of this chapter, you should know enough to find your way around the reference chapters for the answers to particular questions. If you already know Perl, you may want to just skim this chapter to refresh your memory of the language and how it works. If you don't already know how to program in at least one language, this book is not the place to start.
The story of how Perl began is a simple tale of one man's frustration
and (by his own account) inordinate laziness.
NOTE |
This chapter is supposed to be a snappy introduction to the language, so why am I wasting your time with this stuff? The fact is, Perl is a unique language in ways that cannot be conveyed simply by describing the technical details of the language. Perl is a state of mind as much as it is a language grammar. So we'll take a few minutes to look at the external realities that provoked Perl into being; this information should give you some insight into the way that Perl was meant to be used. |
Back in 1986, a UNIX programmer by the name of Larry Wall found himself working on a task that involved generating reports from a great many text files, with cross-references. Because he was a UNIX programmer, and because the problem involved manipulating the contents of text files, he started to use awk for the task. But it soon became clear that awk wasn't up to the job, and with no other obvious candidate for the job, he'd just have to write some code.
Now, here's the interesting bit: Larry could have written a utility to manage the particular job at hand and gotten on with his life. He could see, though, that it wouldn't be long before he'd have to write another special utility to handle something else that the standard tools couldn't quite hack. (He may have realized that most programmers are always writing special utilities to handle things that the standard tools can't quite hack.)
So rather than waste any more of his time, he invented a new language and wrote an interpreter for it. That statement may seem to be a paradox, but it isn't. Setting yourself up with the right tools is always an effort, but if you do it right, the effort pays off.
The new language emphasized system management and text handling. After a few revisions, it could handle regular expressions, signals, and network sockets, too. The language became known as Perl and quickly became popular with frustrated, lazy UNIX programmers-and with the rest of us.
Perl borrowed freely from many other tools, particularly sed and
awk. (That's Perl, the language, not perl, the interpreter; the perl code
is Larry's doing.) Perl does many of the things that sed, awk, and
UNIX shell scripting languages do, but (arguably) better every time.
NOTE |
Is it Perl or perl? The definitive word from Larry Wall is that it doesn't matter. Many programmers like to refer to languages with capitalized names (Perl), but the program originated on a UNIX system, on which short lowercase names (awk, sed, and so on) are the norm. As is true of many things about the language, there's no single "right way" to use the term; just use it the way you want. Perl is a tool, after all, and not a dogma. If you're sufficiently pedantic, you may want to call it [Pp]erl after you read the "Regular Expressions" section later in this chapter. |
Perl can handle low-level tasks quite well, particularly since Perl 5, when the whole messy business of references was put on a sound footing. In this sense, it has a great deal in common with C. But Perl handles the internals of data types, memory allocation, and so on automatically and seamlessly.
Perl code also bears a passing resemblance to C code, perhaps because Perl was written in C or perhaps because Larry found some of C's syntactic conventions to be handy. But Perl is less pedantic and much more concise than C is.
This magpie habit of picking up interesting features along the way-regular expressions here, database handling there-has been regularized in Perl 5. Now you can add your favorite bag of tricks to Perl fairly easily by using modules. Many of the added-on features of Perl, such as socket handling, are likely to be dropped from the core of Perl and moved out to modules in time.
Perl is free. The full source and documentation are free to copy, compile, print, and give away. Any programs that you write in Perl are yours to do with as you please; there are no royalties to pay and no restrictions on distribution, as far as Perl is concerned.
Perl is not completely a public domain product, though, and for very good reason. If the source were completely public domain, someone could make minor alterations in it, compile it, and then sell it-in other words, rip off its creator. On the other hand, without distributing the source code, it's hard to make sure that everyone who wants to can use Perl.
The GNU General Public License is one way to distribute free software without the danger of being taken advantage of. Under this type of license, source code may be distributed freely and used by anybody, but any programs derived from such code must be released under the same type of license. In other words, if you derive any of your source code from GNU-licensed source code, you have to release your source code to anyone who wants it.
This arrangement is often sufficient to protect the interests of the author, but it can lead to a plethora of derivative versions of the original package, which may deprive the original author of a say in the development of his or her own creation. The situation can also lead to confusion on the part of users-it becomes hard to establish which version of the package is the definitive version, whether a particular script will work with a given version, and so on.
That's why Perl is released under the terms of the Artistic License-a variation on the GNU General Public License that says that anyone who releases a package derived from Perl must make it clear that the package is not actually Perl. All modifications must be clearly flagged; executables must be renamed, if necessary; and the original modules must be distributed along with the modified version. The effect is that the original author is clearly recognized as the owner of the package. The general terms of the GNU General Public License also apply.
New versions of Perl are released on the Internet and distributed to Web sites and FTP archives across the world. The Perl source and documentation are distributed, as are executable files for many non-UNIX systems. UNIX binaries are generally not made available on the Internet, because it generally is better to build Perl on your system so you can be certain that it will work. All UNIX systems have a C compiler, after all.
The perl distribution comes with a nifty utility called Configure that tweaks the source files and the Makefile for your system. It probes your system software, shell, C compiler, and so on to determine the answers to various questions about how to build Perl-which compiler flags to use, the sizes of fundamental data types, and so on. You can override any of Configure's answers if you disagree with its findings, but it's generally very accurate indeed.
Running Configure before you make perl virtually guarantees you a perl installation that is not only successfully compiled and linked, but also well optimized for your particular system configuration-and with no tweaking or editing of source files on your part. You're more than welcome to tinker with obscure compiler flags if you want, however; that's why GNU C was invented.
After you install perl, how do you use it to do all those wonderful things to enrich the Web? What, in other words, is a Perl program, and how do you feed it to perl?
We're going to spend the rest of this chapter answering the first two questions, so we'll get the third question out of the way now. Invoking perl is quite simple, but the procedure varies a little from system to system.
Suppose that perl is correctly installed and working on your system. The simplest way to run perl on a Perl program is to invoke the perl interpreter with the name of the Perl program as an argument, as follows:
perl sample.pl
In this example, SAMPLE.PL is the name of a Perl file, and perl is the name of the perl interpreter. The example assumes that perl is in the execution path. If it isn't, you need to supply the full path to perl, too, as follows:
/usr/local/bin/perl sample.pl
This syntax is the preferred way of invoking perl, because it eliminates the possibility that you might invoke a copy of perl other than the one you intended to use. Because we'll be working with Web servers in this book-and, therefore, keenly aware of security issues-we'll use the full path from now on.
That much is the same on all systems that have a command-line interface. The following will do the trick in Windows NT, for example:
c:\NTperl\perl sample.pl
Invoking Perl in UNIX
UNIX systems have another way to invoke an interpreter in a script file. Place a line such as the following at the start of the Perl file:
#!/usr/local/bin/perl
This line tells UNIX that the rest of this script file is to be interpreted by /usr/local/bin/perl. Next, you make the script itself executable, as follows:
chmod +x sample.pl
Then you can execute the script file directly and have the script file tell the operating system what interpreter to use while running it.
Invoking Perl in Windows NT
The procedures in the preceding section are fine for UNIX, but Windows NT is quite
different. You can use File Manager (Explorer, in Windows NT 4)
to create an association between the file extension, .PL, and
the perl executable. Then, whenever a file that ends in .PL is
invoked, NT knows that perl should be used to interpret it.
NOTE |
Usually, a few more steps are required to get a Web server to execute Perl programs automatically. Refer to Appendix A, "Perl Acquisition and Installation," for platform-specific instructions on creating associations between scripts and interpreters. |
Perl takes several optional command-line arguments for various
purposes (see Table 1.1). Most of these arguments are rarely used
but are listed here for reference purposes. The -T switch
in particular is de rigueur in Web-based Perl scripts.
Option | Arguments | Purpose | Notes |
-0 | Octal character code | Specify record separator | Default is new line (\n) |
-a | | Automatically split records | Used with -n or -p |
-c | | Check syntax only; do not execute | |
-d | | Run script, using Perl debugger | If Perl debugger is installed |
-D | Flags | Specify debugging behavior | Refer to the PERLDEBUG man page on the CD-ROM that comes with this book |
-e | Command | Pass a command to Perl from the command line | Useful for quick operations; see tip after this table for an example |
-F | Regular expression | Expression to split by if -a is used | Default is white space |
-i | Extension | Replace original file with result | Useful for modifying contents of files; see tip after this table for an example |
-I | Directory | Specify location of include files | |
-l | Octal character code | Drop new lines when used with -n and -p, and use designated character as line-termination character | |
-n | | Process the script, using each specified file as an argument | Used for performing the same set of actions on a set of files |
-p | | Same as -n, but each line is printed | |
-P | | Run the script through the C preprocessor before Perl compiles it | |
-s | | Enable passing of arbitrary switches to Perl | Use -s -what -ever to have the Perl variables $what and $ever defined within your script |
-S | | Tell Perl to look along the path for the script | |
-T | | Use taint checking; don't evaluate expressions supplied in the command line | Very important for Web use |
-u | | Makes Perl dump core after compiling your script; intended to allow for generation of Perl executables | Very messy; wait for the Perl compiler |
-U | | Unsafe mode; overrides Perl's natural caution | Don't use this! |
-v | | Print Perl version number | |
-w | | Print warnings about script syntax | Extremely useful, especially during development; warning messages can confuse browsers if sent raw |
TIP |
The -e option is handy for quick Perl operations from the command line. Want to change all the foos in WIFFLE.BAT to bars? Try this:
perl -i.old -p -e "s/foo/bar/g" wiffle.bat
This code says, "Take each line of WIFFLE.BAT (-p), store the original in WIFFLE.BAT.OLD (-i), replace all instances of foo with bar (-e), and write the result (-p) to the original file (-i)." |
You can supply Perl command-line arguments in the interpreter-invocation line in UNIX scripts. Following is a good start for any Perl script:
#!/usr/local/bin/perl -w -T
CAUTION |
The -w switch is best omitted in versions of Perl older than 5.002, because it may produce spurious warnings. Also, take care when you use the -w switch in scripts that send data to Web browsers. Warning messages sent before the browser receives a content-type line may result in an error message. |
A Perl program consists of an ordinary text file that contains a series of Perl commands. Commands are written in what looks like a bastardized amalgam of C, shell script, and English. In fact, that's pretty much what it is.
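Perl code can be quite free-flowing. The broad syntactic rules governing where a statement starts and ends are simple: leading white space is ignored, each statement ends with a semicolon, white space outside string literals is insignificant, and anything from a pound sign (#) to the end of the line is a comment.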
Here's a Perl statement inspired by Kurt Vonnegut:
print "My name is Yon Yonson\n";
No prizes for guessing what happens when Perl runs this code-it prints My name is Yon Yonson. If the \n doesn't look familiar, don't worry; it simply means that Perl should print a new-line character after the text (or, in other words, go to the start of the next line).
Printing more text is a matter of either stringing together statements like the following or giving multiple arguments to the print function:
print "My name is Yon Yonson,\n";
print "I live in Wisconsin,\n",
      "I work in a lumbermill there.\n";
That's right-print is a function. It may not look like one in any of the earlier examples in this chapter, which have no parentheses to delimit the function arguments, but it is a function, and it takes arguments. More accurately, in this example print takes a single argument that consists of an arbitrarily long list.
We'll have much more to say about lists and arrays in "Data Types" later in this chapter. You'll find a few more examples of the more common functions in the remainder of this chapter, but refer to Chapter 15, "Function List," for a complete rundown on Perl's built-in functions.
For now, if you're uncomfortable with functions that take arbitrary numbers of arguments with no parentheses to corral them, pretend that you see parentheses. You can use them in Perl programs, if you like, but it would be better to get used to the idea that Perl syntax is loose and groovy in a way that C, for example, is not.
What does a complete Perl program look like? Here's a trivial UNIX example, complete with the invocation line at the top and a few comments:
#!/usr/local/bin/perl -w
# The -w switch shows warnings
print "My name is Yon Yonson,\n";    # Let's introduce ourselves
print "I live in Wisconsin,\n",
      "I work in a lumbermill there.\n";    # Remember the line breaks
This example is not at all typical of a Perl program, though; it's just a linear sequence of commands with no structural complexity. The "Flow Control" section later in this chapter introduces some of the constructs that make Perl what it is and provides a more authentic flavor of what is normal in a Perl program. For now, we'll stick to simple examples like this one for the sake of clarity.
Perl has a small number of data types. If you're used to working with C, in which even characters can be either signed or unsigned, this fact makes for a pleasant change. In essence, Perl has only two data types: scalars and arrays. Perl also has associative arrays, which are a very special type of array and which merit a section of their own.
All numbers and strings are scalars. Scalar-variable names
start with a dollar sign ($).
NOTE |
All Perl variable names, including scalars, are case-sensitive. $Name and $name, for example, are completely different quantities. |
Perl converts automatically between numbers and strings as required, so that
$a = 2;
$b = 6;
$c = $a . $b;    # The "." operator concatenates two strings
$d = $c / 2;
print $d;
yields the result
13
This example involves converting two integers to strings; concatenating the strings into a new string variable; converting this new string to an integer; dividing it by 2; converting the result to a string; and printing it. All these conversions are handled implicitly, leaving the programmer free to concentrate on what needs to be done rather than on the low-level details of how it is to be done.
This situation might be a problem if Perl were regularly used for tasks in which explicit memory offsets were used, for example, and data types were critical. But for the type of task for which Perl is normally used-and certainly for the types of tasks that we'll be using it for in this book-these automatic conversions are smooth, intuitive, and generally a Good Thing.
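A few more conversions can be sketched as follows. This is only an illustration: the variable names are made up, and the `use strict` and `my` lines (which make Perl insist that variables be declared) are a habit from later Perl practice, not something the examples in this chapter require.

```perl
use strict;
use warnings;

# A numeric string is treated as a number by arithmetic operators.
my $sum = "10" + 5;          # 15

# Numbers are treated as strings by the concatenation operator.
my $n = 3;
my $m = 4;
my $joined = $n . $m;        # "34"

# The result of a concatenation can be used as a number again.
my $half = $joined / 2;      # 17

print "$sum $joined $half\n";
```

The operator, not the variable, decides whether a value is used as a number or as a string.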
We can develop the earlier example script with some string variables, as follows:
#!/usr/local/bin/perl -w
# The -w switch shows warnings
$who = 'Yon Yonson';
$where = 'Wisconsin';
$what = 'in a lumbermill';
print "My name is $who,\n";    # Let's introduce ourselves
print "I live in $where,\n",
      "I work $what there.\n";    # Remember the line breaks
print "\nSigned: \t$who,\n\t\t$where.\n";
This script yields the following:
My name is Yon Yonson,
I live in Wisconsin,
I work in a lumbermill there.

Signed:         Yon Yonson,
                Wisconsin.
Don't worry-it gets better.
A collection of scalars is an array. An array-variable name starts with an at symbol (@), whereas an explicit array of scalars is written as a comma-separated list within parentheses, as follows:
@trees = ("Larch", "Hazel", "Oak");
Array subscripts are denoted by brackets. $trees[0], for example, is the first element of the @trees array. Notice that it's @trees but $trees[0]; individual array elements are scalars, so they start with $.
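As a small sketch of subscripting, the following reuses the @trees array. The negative subscript and the $#trees notation are not covered in the text above; they are standard Perl, shown here only for illustration, and the scalar variable names are made up.

```perl
use strict;
use warnings;

my @trees = ("Larch", "Hazel", "Oak");

my $first = $trees[0];        # "Larch": subscripts start at zero
my $last  = $trees[-1];       # "Oak": negative subscripts count from the end
my $top   = $#trees;          # 2: the subscript of the last element
my $count = scalar(@trees);   # 3: the number of elements

print "$first $last $top $count\n";
```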
Mixing scalar types in an array is not a problem. The code
@items = (15, '45.67', "case");
print "Take $items[0] $items[2]s at \$$items[1] each.\n";
results in the following:
Take 15 cases at $45.67 each.
All arrays in Perl are dynamic. You never have to worry about memory allocation and management; Perl does all that stuff for you. Combine that with the fact that arrays can be built from other arrays, and you're free to say things like the following:
@A = (1, 2, 3);
@B = (4, 5, 6);
@C = (7, 8, 9);
@D = (@A, @B, @C);
As a result of this code, the array @D contains the numbers 1 through 9. The power of constructs such as the following takes getting used to:
@Annual = (@Spring, @Summer, @Fall, @Winter);
This code example combines arrays that represent some aspect of
each of the seasons in a concise and intuitive way. The arrays
for the seasons might in turn consist of arrays of months, each
of which might consist of an array of daily values. The @Annual
array then would consist of a value for each day of the year.
By defining your data in chunks such as this, you give yourself
the option of handling it on a daily, monthly, or annual basis.
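To make the flattening concrete, here is a minimal sketch with invented values (three made-up readings per season rather than real daily data). It shows that the combined array is one flat list, not an array of four arrays.

```perl
use strict;
use warnings;

# Hypothetical readings, three per season, purely for illustration.
my @Spring = (11, 14, 17);
my @Summer = (21, 24, 22);
my @Fall   = (15, 12, 9);
my @Winter = (2, 0, 5);

# The four arrays are flattened into one list of twelve values.
my @Annual = (@Spring, @Summer, @Fall, @Winter);

my $count = scalar(@Annual);   # 12
print "$count values; the first is $Annual[0], the last is $Annual[-1]\n";
```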
NOTE |
An aspect of Perl that often confuses newcomers (and occasionally old hands, too) is the context-sensitive nature of evaluations. Perl keeps track of the context in which an expression is being evaluated and can return a different value in an array context than in a scalar context. In this example, the array @B contains 1-4, whereas $C contains 4 (the number of values in the array):
@A = (1, 2, 3, 4);
@B = @A;
$C = @A;
This context sensitivity becomes more of an issue when you use functions and operators that can take either a single argument or multiple arguments. The function or operator behaves one way when it is passed a single scalar argument and another when it is passed multiple arguments, which it may interpret as a single array argument. |
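The difference between the two contexts can be sketched in a few lines. The third assignment, with parentheses around the variable on the left, is an extra case not discussed in the note; the variable names are illustrative.

```perl
use strict;
use warnings;

my @A = (1, 2, 3, 4);

my @copy    = @A;    # array context: @copy gets all four values
my $count   = @A;    # scalar context: $count gets 4, the number of values
my ($first) = @A;    # parentheses force array context: $first gets 1

print "count=$count first=$first\n";
```

The parenthesized form is a common source of surprises: with the parentheses you get the first element, and without them you get the element count.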
Many of Perl's built-in functions take arrays as arguments. One example is sort, which takes an array as an argument and returns the same array, sorted alphabetically. The code
print sort ( 'Beta', 'Gamma', 'Alpha' );
prints AlphaBetaGamma.
You can make this code neater by using another built-in function, called join. This function takes two arguments: a string to connect with, and an array of strings to connect. join returns a single string that consists of all elements in the array joined with the connecting string. The code
print join ( ' : ', 'Name', 'Address', 'Phone' );
returns the string Name : Address : Phone.
Because sort returns an array, you can feed its output straight into join. The code
print join( ', ', sort ( 'Beta', 'Gamma', 'Alpha' ) );
prints Alpha, Beta, Gamma.
Notice that this code doesn't separate the initial scalar argument of join from the array that follows it. The first argument is the string to join things with. The rest of the arguments are treated as a single argument: the array to be joined. This is true even if you use parentheses to separate groups of arguments. The code
print join( ': ', ('A', 'B', 'C'), ('D', 'E'), ('F', 'G', 'H', 'I'));
returns A: B: C: D: E: F: G: H: I.
You can use one array or multiple arrays in a context such as
this because of the way that Perl treats arrays; adding an array
to an array gives you one larger array, not two arrays. In this
case, all three arrays are bundled into one.
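One consequence of sort working alphabetically is worth a quick sketch: numbers are compared as strings unless you say otherwise. The comparison block with the <=> operator is standard Perl but goes beyond what this section covers, so treat it as a preview.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100);

# Default sort compares items as strings, so "100" comes before "9".
my @as_strings = sort @numbers;                  # (10, 100, 9)

# A comparison block using <=> sorts numerically instead.
my @as_numbers = sort { $a <=> $b } @numbers;    # (9, 10, 100)

print join(', ', @as_strings), "\n";
print join(', ', @as_numbers), "\n";
```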
TIP |
For even more powerful list-manipulation capabilities, refer to the splice function in Chapter 15, "Function List." |
Associative arrays have a certain elegance that makes experienced Perl programmers a little snobbish about their language of choice. Rightly so! Associative arrays give Perl a degree of database functionality at a very low, yet useful, level. Many tasks that would otherwise involve complex programming can be reduced to a handful of Perl statements by means of associative arrays.
Arrays of the type that you've already seen are lists of values indexed by subscripts. In other words, to get an individual element of an array, you supply a subscript as a reference, as follows:
@fruit = ( "Apple", "Orange", "Banana" );
print $fruit[2];
This example yields Banana, because subscripts start at zero, so 2 is the subscript for the third element of the @fruit array. A reference to $fruit[7] here returns the null value, because no array element with that subscript has been defined.
Now, here's the point of all this: Associative arrays are lists of values indexed by strings. Conceptually, that's all there is to them. The implementation of associative arrays is more complex, because all the strings (keys) need to be stored in addition to the values to which they refer.
When you want to refer to an element of an associative array, you supply a string (the key) instead of an integer (the subscript). Perl returns the corresponding value. Consider the following example:
%fruit = ("Green", "Apple", "Orange", "Orange", "Yellow", "Banana" );
print $fruit{"Yellow"};
This code prints Banana, as before. The first line defines the associative array in much the same way that you have already defined ordinary arrays; the difference is that instead of listing values, you list key/value pairs. The first value is Apple, and its key is Green. The second value is Orange, which happens to have the same string for both value and key. Finally, the value Banana has the key Yellow.
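Once an associative array exists, pairs can be added, removed, and tested for. The following sketch extends the %fruit example; the "Red" entry is invented for illustration, and the exists and delete functions are standard Perl that this section does not otherwise describe.

```perl
use strict;
use warnings;

my %fruit = ("Green", "Apple", "Orange", "Orange", "Yellow", "Banana");

$fruit{"Red"} = "Cherry";          # add a new key/value pair
delete $fruit{"Orange"};           # remove a pair by its key

my $has_red    = exists $fruit{"Red"}    ? "yes" : "no";   # "yes"
my $has_orange = exists $fruit{"Orange"} ? "yes" : "no";   # "no"
my $pairs      = scalar(keys %fruit);                      # 3

print "Red: $has_red, Orange: $has_orange, pairs: $pairs\n";
```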
On a superficial level, you can use string subscripts to provide mnemonics for array references, allowing you to refer to $Total{'June'} instead of $Total[5]. But you wouldn't even be beginning to use the power of associative arrays. Think of the keys of an associative array as you might think of a key that links tables in a relational database, and you're closer to the idea. Consider this example:
%Folk = ( 'YY', 'Yon Yonson',
          'TC', 'Terra Cotta',
          'RE', 'Ron Everly' );
%State = ( 'YY', 'Wisconsin',
           'TC', 'Minnesota',
           'RE', 'Bliss' );
%Job = ( 'YY', 'work in a lumbermill',
         'TC', 'teach nuclear physics',
         'RE', 'watch football');
foreach $person ( 'TC', 'YY', 'RE' ) {
    print "My name is $Folk{$person},\n",
          "I live in $State{$person},\n",
          "I $Job{$person} there.\n\n";
}
We had to sneak the foreach construct in there for that example to work. That construct is explained in full in "Flow Control" later in this chapter. For now, you'll just have to take it on trust that foreach makes Perl execute the three print statements for each of the people in the list after the foreach keyword. Otherwise, you could try executing the code in the sample and see what happens.
You also can treat the keys and values of an associative array as separate (ordinary) arrays by using the keys and values keywords, respectively. The code
print keys %Folk;
print values %State;
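prints a string such as YYRETCWisconsinBlissMinnesota. (Perl stores associative arrays in its own internal order, so the keys and values do not necessarily come back in the order in which you defined them.)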
Looks as though we need to do some more work on string handling.
That task is best left until after we cover some flow-control
mechanisms, however.
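In the meantime, here is one hedged sketch of tidier output: sorting the keys gives a predictable order, and join puts each pair on its own line. The @lines array and the $id variable are illustrative names, and push (which appends to an array) is a standard function not yet introduced.

```perl
use strict;
use warnings;

my %State = ('YY', 'Wisconsin', 'TC', 'Minnesota', 'RE', 'Bliss');

# keys returns the keys in no particular order, so sort them first.
my @lines;
foreach my $id (sort keys %State) {
    push @lines, "$id = $State{$id}";
}

print join("\n", @lines), "\n";
```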
NOTE |
A special associative array called %ENV stores the contents of all environment variables, indexed by variable name. $ENV{'PATH'}, for example, returns the current search path. Following is a way to print the current values of all environment variables, sorted by variable name for good measure:
foreach $var (sort keys %ENV ) {
    print "$var: \"$ENV{$var}\".\n";
}
The foreach clause sets $var to each of the environment-variable names in turn (in alphabetical order), and the print statement prints each name and value. The backslash-quote (\") in there produces quotation marks around the values. |
This chapter finishes discussing Perl data types by discussing file handles. A file handle is not really a data type at all, but a special kind of literal string. A file handle behaves like a variable in many ways, however, so this section is a good place to cover them. (Besides, you won't get very far in Perl without them.)
You can regard a file handle as being a pointer to a file from which Perl is to read or to which it will write. (C programmers are familiar with the concept.) The basic idea is that you associate a handle with a file or device, and then refer to the handle in the code whenever you need to perform a read or write operation.
File handles generally are written in uppercase. Perl has some
useful predefined file handles, as Table 1.2 shows.
File Handle | Points to |
STDIN | Standard input (normally, the keyboard) |
STDOUT | Standard output (normally, the console; in many Web applications, the browser) |
STDERR | Device where error messages should be written (normally, the console; in a Web server environment, normally, the server-error log file) |
The print statement can take a file handle as its first argument, as follows:
print STDERR "Oops, something broke.\n";
Notice that no comma appears after the file handle in this example. That helps Perl figure out that the STDERR is not something to be printed. If you're uneasy with this implicit list syntax, you can put parentheses around all the print arguments, as follows:
print (STDERR "Oops, something broke.\n");
You still have no comma after the file handle, however.
TIP |
Use the standard file handles explicitly, especially in complex programs. Redefining the standard input or output device for a while is convenient sometimes; make sure that you don't accidentally wind up writing to a file what should have gone to the screen. |
You can use the open function to associate a new file handle with a file, as follows:
open (INDATA, "/etc/stuff/Friday.dat");
open (LOGFILE, ">/etc/logs/reclaim.log");
print LOGFILE "Log of reclaim procedure\n";
By default, open opens files for reading only. If you
want to override this default behavior, add to the file name one
of the special direction symbols listed in Table 1.3. (The >
at the start of the file name in the second open statement
of the preceding example, for example, tells Perl that you intend
to write to the named file.)
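The open function returns false when it fails, so it is usually worth checking the result. The sketch below is one way to do that, under a few assumptions: it uses the standard File::Temp module to create a scratch directory (so no real log file is touched), and it reports failures with die and the $! variable, neither of which has been introduced yet.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A scratch directory, so the sketch does not touch real files.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/reclaim.log";

# ">" opens for writing; die reports the reason held in $! on failure.
open(LOGFILE, ">$path") or die "Cannot write $path: $!";
print LOGFILE "Log of reclaim procedure\n";
close(LOGFILE);

# ">>" opens for appending, so the first line is preserved.
open(LOGFILE, ">>$path") or die "Cannot append to $path: $!";
print LOGFILE "Second entry\n";
close(LOGFILE);

# No symbol (or "<") opens for reading.
open(LOGFILE, $path) or die "Cannot read $path: $!";
my @entries = <LOGFILE>;
close(LOGFILE);

print "Read ", scalar(@entries), " lines\n";
```

Without the check, a failed open goes unnoticed and later reads or writes on the handle quietly do nothing.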
Symbol | Meaning |
< | Open the file for reading (the default action) |
> | Open the file for writing |
>> | Open the file for appending |
+< | Open the file for both reading and writing |
+> | Open the file for both reading and writing, emptying it first |
| (before file name) | Treat file as command into which Perl is to pipe text |
| (after file name) | Treat file as command from which input is to be piped to Perl |
To take a more complex example, here's one way to feed output to the mypr printer on a UNIX system:
open (MYLPR, "|lpr -Pmypr");
print MYLPR "A line of output\n";
close MYLPR;
A special Perl operator for reading from files consists of two angle brackets-<>-around the file handle of the file from which you want to read. This operator returns the next line or lines of input from the file or device, depending on whether the operator is used in a scalar or an array context. When no more input remains, the operator returns false.
A construct such as
while (<STDIN>) { print; }
simply echoes each line of input back to the console until Ctrl+D (Ctrl+Z in Windows NT) is pressed, because the print function takes the current default argument here: the most recent line of input. For an explanation, see "Special Variables" later in this chapter.
If the user types
A
Bb
Ccc
^D
the screen looks like this:
A
A
Bb
Bb
Ccc
Ccc
^D
Notice that in this case, <STDIN> is in a scalar context, so one line of standard input is returned at a time. Compare that example with the following example:
print <STDIN>;
In this case, because print expects an array of arguments (it can be a single-element array, but it's an array as far as print is concerned), the <> operator obligingly returns all the contents of STDIN as an array, and then print prints it. Because the array is fully built before it is printed, nothing is written to the console until the user presses Ctrl+D:
A
Bb
Ccc
^D
A
Bb
Ccc
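The two behaviors can be compared without a keyboard. The sketch below assumes Perl 5.8 or later, which can open a file handle on a string (using the three-argument form of open) instead of a real file; the INPUT handle and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "A\nBb\nCcc\n";

# Open a file handle on a string instead of a real file (Perl 5.8 or later).
open(INPUT, "<", \$text) or die "Cannot open in-memory file: $!";

my $first = <INPUT>;    # scalar context: just the next line, "A\n"
my @rest  = <INPUT>;    # array context: all remaining lines
close(INPUT);

print "First line: $first";
print "Lines left: ", scalar(@rest), "\n";
```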
This script prints out the contents of the file .SIGNATURE, double-spaced:
open (SIGFILE, ".signature");
while ( <SIGFILE> ) {
    print;
    print "\n";
}
The first print here has no arguments, so it takes the current default argument and prints it. The second print has an argument, so it prints that instead. Perl's habit of using default arguments extends to the <> operator; if that operator is used with no file handle, Perl assumes that <ARGV> is intended. <ARGV> expands to each line in turn of each file listed in the command line.
If no files are listed in the command line, Perl instead assumes that STDIN is intended. The following code, therefore, keeps printing more as long as something other than Ctrl+D appears in standard input:
while (<>) { print "more.... "; }
NOTE |
Perl 5 allows array elements to be references to any data type. As a result, you can build arbitrary data structures of the kind used in C and other high-level languages, but with all the power of Perl. You can have an array of associative arrays, for example. |
Like all languages, Perl has its special hieroglyphs, which are laden with meaning. This section briefly examines some of the most common and useful variables, and provides some examples of typical Perl idioms in which you might find them.
You have already seen one special variable: the environment-variable associative array %ENV. This special associative array allows you to easily use the value of any environment variable within your Perl scripts:
print "Looking for files along the path ($ENV{'PATH'}) \n";
The %ENV array is quite useful in CGI programming, in which parameters are passed from the browser to CGI programs as environment settings.
Any arguments specified in the Perl command line are passed to the Perl script in another special array: @ARGV.
CAUTION |
C programmers, beware: The first element of this array is the first actual argument, not the name of the program. The special variable $0 contains the name of the Perl script that is being executed. |
The following code prints the command-line arguments one per line, sorted alphabetically:
print join("\n", sort @ARGV);
The command-line arguments are of limited use in CGI scripts, in which arguments are passed via the environment rather than the command line. These arguments are quite useful in normal Perl work, of course.
The special variable $_ is often used to store the current line of input. This is the case when the <> input operator is used as the condition of a while loop. The following code, for example, prints a numbered listing of the file pointed to by SOMEFILE:
$line = 0;
while ( <SOMEFILE> ) {
    ++$line;
    print "Line $line : ", $_;
}
You occasionally need to store the contents of $_ somewhere, as in the following example:
$oldvalue = $_;
The opposite operation, setting the value of $_ manually as in this example, is rarely appropriate:
$_ = $oldvalue;
Pattern matching and substitution take place on the contents of this variable unless you specify otherwise. These topics are covered in "Regular Expressions" later in this chapter.
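To make the default concrete, here is a small sketch in which nothing names $_ except the one place where its value is stored. It assumes a foreach loop with no loop variable (which uses $_), plus the chomp function, neither of which has been introduced yet in this chapter; the sample data is invented.

```perl
use strict;
use warnings;

my @lines = ("apple\n", "banana\n", "cherry\n");
my @with_an;

foreach (@lines) {               # no loop variable is named, so $_ is used
    chomp;                       # chomp acts on $_ when given no argument
    push @with_an, $_ if /an/;   # a bare match tests $_ as well
}

print "Lines containing \"an\": @with_an\n";
```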
The special variable $! contains the current system-error number (errno, on UNIX systems) or system-error string, depending on whether it is evaluated in a numeric or string context. This variable may not contain anything meaningful; it should be used only if an error occurred.
This example reports failure if the open call failed:
open ( INFILE, "./missing.txt") || die "Couldn't open \"./missing.txt\" ($!).\n";
The || here is the Boolean or operator, which is covered in "Flow Control" later in this chapter. die causes Perl to terminate after printing the string given to die as an argument.
If the file does not exist, Perl terminates after displaying something like this:
Couldn't open "./missing.txt" (No such file or directory).
The form and content of error messages vary from one system to the next.
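The two faces of $! can be seen side by side. This is only a sketch: the path is a made-up name that is assumed not to exist, and the exact number and wording printed depend on the operating system.

```perl
use strict;
use warnings;

# A path that is assumed not to exist (hypothetical name).
my $path = "./no-such-directory/missing.txt";

my ($errno, $errstr) = (0, "");
if (!open(my $fh, '<', $path)) {
    $errno  = $! + 0;    # numeric context: the system-error number
    $errstr = "$!";      # string context: the system-error string
}

print "Error number: $errno\n";
print "Error text:   $errstr\n";
```

Both values are copied immediately after the failed open, because a later system call may overwrite $!.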
The examples that you have seen so far have been quite simple, with little or no logical structure beyond a linear sequence of steps. We managed to sneak in the occasional while and foreach; think of those as being sneak previews. Perl has all the flow-control mechanisms that you'd expect to find in a high-level language, and this section takes you through the basics of each mechanism.
Two operators, || (or) and && (and), are used like glue to hold Perl programs together. Each takes two operands and yields a true or false value, depending on the operands. In the following example, if either $Saturday or $Sunday is true, $Weekend will be true, too:
$Weekend = $Saturday || $Sunday;
In the next example, $Solvent is true only if $income is greater than 3 and $debts is less than 10:
$Solvent = ($income > 3) && ($debts < 10);
Now consider the logic of evaluating one of these expressions. It isn't always necessary to evaluate both operands of either an && or a || operator. In the first example earlier in this section, if $Saturday is true, you know that $Weekend will be true, regardless of whether $Sunday is also true (the midnight condition, perhaps?).
This means that when the left side of an or expression is evaluated as true, the right side is not evaluated. Combine this with Perl's easy way with data types, and you can say things like the following:
$value > 10 || print "Oops, low value \n";
If $value is greater than 10, the right side of the expression is never evaluated, so nothing is printed. If $value is not greater than 10, Perl needs to evaluate the right side, too, so as to decide whether the expression as a whole is true or false. That means that Perl evaluates the print statement, printing out the message.
OK, it's a trick, but it's a very useful one.
Something analogous applies to the && operator. In this case, if the left side of an expression is false, the expression as a whole is false, so Perl does not evaluate the right side. The && operator can, therefore, be used to produce the same kind of effect as the || trick, but with the opposite sense, as in the following example:
$value > 10 && print "OK, value is high enough \n";
As is true of most Perl constructs, the real power of these tricks comes when you apply a little creative thinking. Remember that the left and right sides of these expressions can be any Perl expressions; think of them as being conjunctions in a sentence rather than logical operators, and you'll get a better feel for how to use them. Expressions such as the following give you a little of the flavor of creative Perl:
$length <= 80 || die "Line too long.\n";
$errorlevel > 3 && warn "Hmmm, strange error level ($errorlevel) \n";
open ( LOGFILE, ">install.log") || &bust("Log file");
The &bust in this example is a subroutine call, by the way. Refer to "Subroutines" later in this chapter for more information.
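One way to convince yourself that the right side really is skipped is to make it leave a trace. In this sketch, note is a hypothetical helper written for the demonstration; it records its argument and returns true.

```perl
use strict;
use warnings;

my @calls;    # records which right-hand sides actually ran

# Hypothetical helper: notes that it was called and returns true.
sub note {
    my ($label) = @_;
    push @calls, $label;
    return 1;
}

my $value = 42;

$value > 10 || note("or-1");     # left side true:  right side skipped
$value > 10 && note("and-1");    # left side true:  right side runs
$value < 10 || note("or-2");     # left side false: right side runs
$value < 10 && note("and-2");    # left side false: right side skipped

print "Right sides evaluated: @calls\n";
```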
The most basic kind of flow control is a simple branch. A statement is either executed or not, depending on whether a logical expression is true or false. You can do this by following the statement with a modifier and a logical expression, as follows:
open ( INFILE, "./missing.txt") if $missing;
The execution of the statement is contingent upon both the evaluation of the expression and the sense of the operator.
The expression is evaluated as either true or false and can contain any of the relational operators listed in Table 1.4 (although it need not). Following are a few examples of valid expressions:
$full
$a == $b
<STDIN>
Operator | Numeric Context | String Context |
Equality | == | eq |
Inequality | != | ne |
Inequality with signed result | <=> | cmp |
Greater than | > | gt |
Greater than or equal to | >= | ge |
Less than | < | lt |
Less than or equal to | <= | le |
NOTE |
When we're comparing strings, less than means lexically less than. If $left comes before $right when the two are sorted alphabetically, $left is less than $right. |
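Perl has four modifiers, each of which behaves the way that you might expect from the corresponding English word:
if: the statement is executed if the logical expression is true.
unless: the statement is executed if the logical expression is false.
while: the statement is executed repeatedly for as long as the logical expression is true.
until: the statement is executed repeatedly for as long as the logical expression is false.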
Notice that the logical expression is evaluated only one time in the case of if and unless, but multiple times in the case of while and until. In other words, the first two are simple conditionals, and the last two are loop constructs.
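The four modifiers can be tried out in a few lines. This is a minimal sketch with invented variable names; it uses my declarations, which are covered in "Variable Scope" later in this chapter.

```perl
use strict;
use warnings;

my @log;
my $count = 0;

push @log, "if ran"     if     $count == 0;   # expression true: statement runs
push @log, "unless ran" unless $count == 0;   # expression true: statement skipped

$count++ while $count < 3;     # repeats as long as the expression is true
$count-- until $count <= 1;    # repeats as long as the expression is false

print "@log; count is now $count\n";
```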
The syntax changes when you want to make the execution of multiple statements contingent on the evaluation of a logical expression. The modifier comes at the start of a line, followed by the logical expression in parentheses, followed by the conditional statements in braces. Notice that the parentheses around the logical expression are required, although they are not required in the single statement branching described in the preceding section.
The following example is somewhat similar to C's if syntax:
if ( ( $total += $value ) > $limit ) {
    print LOGFILE "Maximum limit $limit exceeded. Offending value was $value.\n";
    close (LOGFILE);
    die "Too many! Check the log file for details.\n";
}
The if statement is capable of a little more complexity, with else and elsif operators, as in the following example:
# The log file is opened for writing (note the >) because we print to it below.
if ( !open( LOGFILE, ">install.log") ) {
    close ( INFILE );
    die "Unable to open log file!\n";
}
elsif ( !open( CFGFILE, ">system.cfg") ) {
    print LOGFILE "Error during install: Unable to open config file for writing.\n";
    close ( LOGFILE );
    die "Unable to open config file for writing!\n";
}
else {
    print CFGFILE "Your settings go here!\n";
}
The loop keywords (while, until, for, and foreach) are used with compound statements in much the same way, as the following example shows:
until ( $total >= 50 ) {
    print "Enter a value: ";
    $value = scalar (<STDIN>);
    $total += $value;
    print "Current total is $total\n";
}
print "Enough!\n";
The while and until statements are described in "Conditional Expressions" earlier in this chapter. The for statement resembles the one in C. for is followed by an initial value, a termination condition, and an iteration expression, all enclosed in parentheses and separated by semicolons, as follows:
for ( $count = 0; $count < 100; $count++ ) { print "Something"; }
The foreach operator is special; it iterates over the contents of an array and executes the statements in a statement block for each element of the array. Following is a simple example:
@numbers = ("one", "two", "three", "four");
foreach $num ( @numbers ) {
    print "Number $num \n";
}
The variable $num first takes on the value one, then two, and so on. That example looks fairly trivial, but the real power of this operator lies in the fact that it can operate on any array, as follows:
foreach $arg ( @ARGV ) {
    print "Argument: \"$arg\".\n";
}
foreach $namekey ( sort keys %surnames ) {
    print REPORT "Surname: $surnames{$namekey}.\n",
                 "Address: $address{$namekey}.\n";
}
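Here is a self-contained variation on the second loop that you can run as is. It is only a sketch: the keys and surnames are invented, and the results are collected in an array rather than written to a REPORT file handle.

```perl
use strict;
use warnings;

# Hypothetical data: surnames keyed by a short ID.
my %surnames = (
    u3 => "Shelley",
    u1 => "Austen",
    u2 => "Bronte",
);

my @report;
foreach my $namekey (sort keys %surnames) {
    push @report, "$namekey: $surnames{$namekey}";
}

print "$_\n" foreach @report;
```

An associative array has no inherent order, so sorting the keys is what makes the report come out the same way every time.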
You can use labels with the next, last, and redo statements to provide more control of program flow through loops. A label consists of any word, usually in uppercase, followed by a colon. The label appears just before the loop operator (while, for, or foreach) and can be used as an anchor for jumping to from within the block. The following code snippet prints all the odd-numbered records in INFILE:
RECORD: while ( <INFILE> ) {
    $odd = !$odd;              # true for records 1, 3, 5, and so on
    next RECORD unless $odd;
    print;
}
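Labels earn their keep when loops are nested, because an unlabeled next affects only the innermost loop. The following sketch, with an invented word list and label, skips any word that contains the letter l.

```perl
use strict;
use warnings;

my @words = ("camel", "llama", "emu");
my @kept;

WORD: foreach my $word (@words) {
    foreach my $letter (split //, $word) {
        # Give up on the whole word, not just on this letter:
        # jump to the next iteration of the outer loop.
        next WORD if $letter eq "l";
    }
    push @kept, $word;
}

print "Words without an l: @kept\n";
```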
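The three label-control statements are:
next: jumps to the next iteration of the labeled loop, skipping the rest of the block.
last: exits the labeled loop immediately.
redo: restarts the current iteration of the labeled loop without evaluating the loop condition again.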
Subroutines in Perl are defined with the sub keyword, as follows:
sub Usage {
    print "Usage: \n", "twiddle [-args] infile outfile\n";
    print "Copyleft 1996, Jonathan F. Squirmsby.";
}
Subroutines are called with &, as follows:
sub bust {
    print "Oops, some kind of error seems to have occurred.\n";
    die "Fatal error, terminating.\n";
}
open ( LOGFILE, ">install.log") || &bust;
In this example, the subroutine was defined before it was called. You can define and call subroutines in any order in Perl; the convention is to define them after the main routine.
Passing Arguments You can pass arguments to a subroutine in the usual way, as follows:
open ( LOGFILE, ">install.log") || &bust("Failed to open log file \"install.log\".");
But here is where Perl's subroutine syntax starts to get a little strange; C programmers may want to take a seat before reading on.
All Perl subroutines receive their arguments as an arbitrarily long array of scalars with the special name of @_. There is no mechanism for declaring the arguments when the subroutine is declared. There is no fixed number of arguments. Also, the calling function can pass any mixture of scalars and arrays; they are all treated as one big @_ array when they get to the subroutine.
In the example earlier in this section, in which bust is called with a single argument, you can pick it up in the subroutine and use it to provide a more sensible error message, as in the following example:
sub bust {
    ($errortext) = @_;
    print "Oops, an error occurred ($errortext).\n";
    die "Fatal error, terminating.\n";
}
Notice that we went to the trouble of assigning the contents of the argument array @_ to the scalar $errortext. This assignment may seem to be unnecessary; in fact, we could have simply used $_[0] instead of $errortext in the print statement. Explicitly assigning the contents of the @_ array to named variables is much clearer, though, especially when the subroutine takes multiple arguments. Compare the example
print "Error $_[0] opening file $_[1].\n";
with this one:
($errtext, $errfile) = @_;
print "Error $errtext opening file $errfile.\n";
Notice, too, that when we assigned the @_ array to the single variable $errortext in the bust example, we placed $errortext in parentheses. We did so to force an array context, so that what gets assigned to $errortext is the first (and only) value of the @_ array, not the number of values in @_. In effect, we're telling Perl to treat $errortext as a single-element array. The example that uses $errfile and $errtext is a clearer case of an array-to-array assignment.
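The effect of the parentheses is easy to check. In this sketch, first_arg and arg_count are hypothetical subroutines that differ only in whether the variable on the left is parenthesized.

```perl
use strict;
use warnings;

sub first_arg {
    my ($first) = @_;    # array context: the first element of @_
    return $first;
}

sub arg_count {
    my $count = @_;      # scalar context: the number of elements in @_
    return $count;
}

my $first = first_arg("disk full", "install.log");
my $count = arg_count("disk full", "install.log");

print "First argument: $first; number of arguments: $count\n";
```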
In "Variable Scope" later in this chapter, you learn how to protect local variables such as $errortext in subroutines by using the local and my keywords.
Passing Arrays Perl's grouping of all subroutine arguments into the single @_ array makes it impossible to pass more than one array to a Perl subroutine and have them arrive as separate arrays. Suppose that you have a subroutine call of the following form:
&PrintRes( "alpha", (1, 3, 5, 7), "beta", (2, 4, 6, 8) );
Try to unpack these arguments into the following values as they come into the subroutine:
$p1 = "alpha";
@p2 = (1, 3, 5, 7);
$p3 = "beta";
@p4 = (2, 4, 6, 8);
A statement like
( $p1, @p2, $p3, @p4 ) = @_;
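won't get beyond the second parameter. The following list explains what happens:
$p1 receives the first value, "alpha".
@p2 receives everything that is left: 1, 3, 5, 7, "beta", 2, 4, 6, and 8.
$p3 is left undefined, because @p2 has already taken all the remaining values.
@p4 is left empty for the same reason.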
There's no point in trying to specify subarrays, as in the following example, because Perl expands the array on the left to the same thing as before:
( $p1, (@p2), $p3, (@p4) ) = @_;
The moral of the story is: Don't pass more than one array into a subroutine. And if you do pass an array, make sure that it's the last argument.
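The flattening is easy to observe by counting what ends up in the first array. The sketch below also shows one possible way around the problem in Perl 5, which is to pass references to the arrays (the feature mentioned in the note on data structures earlier in this chapter) rather than the arrays themselves. Both subroutine names are invented for illustration.

```perl
use strict;
use warnings;

sub unpack_flat {
    my ($p1, @p2) = @_;    # @p2 soaks up everything after the first value
    return scalar(@p2);
}

sub unpack_refs {
    my ($p1, $p2, $p3, $p4) = @_;    # $p2 and $p4 are array references
    return (scalar(@$p2), scalar(@$p4));
}

# Parenthesized lists are flattened into one long argument list.
my $flat_size = unpack_flat("alpha", (1, 3, 5, 7), "beta", (2, 4, 6, 8));

# Square brackets build array references, which are ordinary scalars.
my ($size2, $size4) = unpack_refs("alpha", [1, 3, 5, 7], "beta", [2, 4, 6, 8]);

print "Flattened: the first array received $flat_size values\n";
print "By reference: the arrays hold $size2 and $size4 values\n";
```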
Returning Values Perl is just as casual about returning values from subroutines as it is about passing arguments to them. A subroutine returns the value of the last expression evaluated in the subroutine. If you pass (4, 3) to this subroutine, the value 7 is returned:
sub AddIt {
    ( $a, $b ) = @_;
    $a + $b;
}
That means that the value 7 is substituted for the subroutine call after evaluation. The code
print "Summing 4 and 3 yields ", &AddIt(4, 3), ".\n";
prints the following:
Summing 4 and 3 yields 7.
Notice that we had to keep the subroutine call outside the quotes to allow Perl to recognize & as a subroutine invocation.
It isn't always clear which statement is the last to be executed in a subroutine, particularly if it contains loops or conditional statements. One way to ensure that the correct value is returned is to place a reference to the variable on a line by itself at the end of the subroutine, as follows:
sub Maybe {
    # Various loops and conditionals here which set the value of "$result"
    $result;
}
CAUTION |
Take care not to add seemingly innocuous statements near the end of a subroutine. A print statement returns a value of 1 (if successful) for example, so a subroutine that prints something just before it returns always returns 1. |
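The trap described in the caution can be reproduced in a few lines. This sketch uses three invented subroutines; the last one relies on Perl's return statement, which is not otherwise covered in this chapter, to state the intended value explicitly.

```perl
use strict;
use warnings;

sub implicit_sum {
    my ($x, $y) = @_;
    $x + $y;             # the last expression evaluated, so this is returned
}

sub spoiled_sum {
    my ($x, $y) = @_;
    my $sum = $x + $y;
    print "";            # looks harmless, but print's result (1) is now last
}

sub explicit_sum {
    my ($x, $y) = @_;
    my $sum = $x + $y;
    print "";
    return $sum;         # return states the intended value outright
}

my @results = (implicit_sum(4, 3), spoiled_sum(4, 3), explicit_sum(4, 3));
print "Results: @results\n";
```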
The return value can be a scalar, an array, or an associative array. Listing 1.1 shows a complete example in which a subroutine builds an associative array of names keyed by initials and then returns the associative array. The keys of this array-the initials-are then printed in sorted order. Take your time reading through this example; a lot is going on in there, but it's comprehensively commented.
Listing 1.1 INITIALS.PL: Returning an Associative Array from a Subroutine
#!/usr/local/bin/perl -w

# Pass the names into the subroutine.
# Store the results in an associative array called "keyedNames".
%keyedNames = &GetInitials("Jane Austen", "Emily Bronte", "Mary Shelley" );

# Print out the initials, sorted:
print "Initials are ", join(', ', sort keys %keyedNames), ".\n";

# The GetInitials subroutine.
sub GetInitials {
    # Let's store the arguments in a "names" array for clarity.
    @names = @_;
    # Process each name in turn:
    foreach $name ( @names ) {
        # The "split" function is explained in Chapter 15, "Function List".
        # In this statement, we're getting split to look for the ' ' in the name;
        # It returns an array of chunks of the original string (i.e. $name) which were
        # separated by spaces, i.e. the forename and surname respectively in our case.
        # The variables "$forename" and "$surname" are then assigned from this array
        # using parentheses to force an array assignment.
        ( $forename, $surname ) = split( ' ', $name );
        # OK, now we have the forename and surname. We use the "substr" function,
        # also explained in Chapter 15, to extract the first character from each of these.
        # The "." operator concatenates two strings (for example, "aa"."bb" is "aabb")
        # so the variable "$inits" takes on the value of the initials of the name:
        $inits = substr( $forename, 0, 1 ) . substr( $surname, 0, 1 );
        # Now we store the name in an associative array using the initials as the key:
        $NamesByInitials{$inits} = $name;
    }
    # Having built the associative array, we simply refer to it at the end of the
    # subroutine so that its value is the last thing evaluated here. It will then
    # be passed back to the calling function.
    %NamesByInitials;
}
Perl uses separate name spaces to store scalars, arrays, associative arrays, and so on. As a result, you can use the same name for variables of different types without fear of confusion (at least on Perl's part; for your own sake, use unique names). This example uses three different kinds of variables, each called name:
$name = "Dana";
@name = ("Donna", "Dana", "Diana");
%name = ("Donna", "Elephants", "Dana", "Finches", "Diana", "Parakeets");
print "I said $name{$name}, not $name{$name[0]}!\n";
The bad news is that by default, Perl uses just one name space for each data type, for all functions. So if you have a variable called $temp in the main function, and you call a routine that uses another variable called $temp, the value of $temp in the main function gets clobbered. The references to the two variables are in fact two references to the same variable, as far as Perl is concerned.
That's where the local (Perl 4 and 5) and my (Perl 5 only) functions come in. These functions force Perl to treat variables as though they are local to the current code block, whether that block is a loop, an if-block, or a subroutine.
The following example uses two variables called $temp (one outside and one inside a while loop):
$temp = "Still here!\n";
print "Enter a few words at a time, Ctrl+D to terminate:\n";
while (<>) {
    local( $temp, @etc ) = split(' ', $_ );
    print "You said $temp";
    @etc && print " and then you said @etc";
    print ". Enter some more, or press Ctrl+D to end:\n";
}
print $temp;
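The difference between Perl 4's local() and Perl 5's my() is one of scoping. A local variable is really a global variable whose value is saved and temporarily replaced for the duration of the block, so the temporary value is also visible to any subroutine called from within that block. A my variable is truly private: it is visible only within the block in which it is declared.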
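The difference shows up as soon as a block calls another subroutine. The following sketch uses invented subroutine names and the our declaration (available from Perl 5.6) so that the global variable passes strict checking.

```perl
use strict;
use warnings;

our $temp = "global";    # a package (global) variable

# This subroutine always looks at the package variable.
sub show_temp { return $temp; }

sub with_local {
    local $temp = "localized";    # temporarily replaces the global's value
    return show_temp();           # the called subroutine sees "localized"
}

sub with_my {
    my $temp = "private";         # a new variable, visible only in this block
    return show_temp();           # the called subroutine still sees "global"
}

my $seen_local = with_local();
my $seen_my    = with_my();

print "With local: $seen_local; with my: $seen_my; afterward: $temp\n";
```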
We'll finish this overview of Perl by discussing its pattern-matching capabilities. The capability to match and replace patterns is vital to any scripting language that claims to be capable of useful text manipulation. By this stage, you probably won't be surprised to read that Perl matches patterns better than any other general-purpose language does. Perl 4's pattern matching is excellent, but Perl 5 introduces some significant improvements, including the capability to match on even more arbitrary strings than before.
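The basic pattern-matching operations discussed in this section are:
Matching: testing whether a string contains a given pattern.
Substitution: replacing the part of a string that matches a pattern with another string.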
The patterns referred to here are more properly known as regular expressions, and we'll start by looking at them.
A regular expression is a set of rules that describes a generalized string. If the characters that make up a particular string conform to the rules of a particular regular expression, the regular expression is said to match that string.
A few concrete examples usually help after an overblown definition like that one. The regular expression b. matches the strings bovine, above, Bobby, and Bob Jones, but not the strings Bell, b, or Bob. That's because the expression insists that the letter b (lowercase) must be in the string and must be followed immediately by another character.
The regular expression b+, on the other hand, requires the lowercase letter b at least once. This expression matches b and Bob in addition to the example matches for b. in the preceding paragraph. The regular expression b* requires zero or more bs, so it matches any string. That seems to be fairly useless, but it makes more sense as part of a larger regular expression. Bob*y, for example, matches all of Boy, Boby, and Bobby but not Boboby.
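You can check these claims by running the patterns against the same strings. This sketch uses the match operator and the grep function, both of which are explained only later (or elsewhere) in this book; grep simply keeps the elements for which the test is true.

```perl
use strict;
use warnings;

my @samples = ("bovine", "above", "Bobby", "Bob Jones", "Bell", "b", "Bob");

my @dot_hits  = grep { /b./ } @samples;    # a lowercase b plus any one character
my @plus_hits = grep { /b+/ } @samples;    # one or more lowercase bs
my @star_hits = grep { /Bob*y/ } ("Boy", "Boby", "Bobby", "Boboby");

print "b.    matches: ", join(", ", @dot_hits),  "\n";
print "b+    matches: ", join(", ", @plus_hits), "\n";
print "Bob*y matches: ", join(", ", @star_hits), "\n";
```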
Assertions Several so-called assertions are used to anchor parts of the pattern to word or string boundaries. The ^ assertion matches the start of a string, so the regular expression ^fool matches fool and foolhardy but not tomfoolery or April fool. Table 1.5 lists the assertions.
Assertion | Matches | Example | Matches | Doesn't Match |
^ | Start of string | ^fool | foolish | tomfoolery |
$ | End of string | fool$ | April fool | foolish |
\b | Word boundary | \bside | be side | beside |
\B | Nonword boundary | \Bside | beside | be side |
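The assertions can be exercised with a handful of test strings. This is a sketch only; it uses \bside and \Bside as the boundary patterns, and the word lists are invented.

```perl
use strict;
use warnings;

my @words = ("fool", "foolhardy", "tomfoolery", "April fool");

my @starts = grep { /^fool/ } @words;    # anchored to the start of the string
my @ends   = grep { /fool$/ } @words;    # anchored to the end of the string

# \b matches between a word character and a non-word character (or the
# edge of the string); \B matches wherever \b does not.
my @boundary    = grep { /\bside/ } ("be side", "beside");
my @no_boundary = grep { /\Bside/ } ("be side", "beside");

print "^fool:  ", join(", ", @starts), "\n";
print "fool\$:  ", join(", ", @ends), "\n";
print "\\bside: @boundary\n";
print "\\Bside: @no_boundary\n";
```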
Atoms The . (period) that you saw in b. earlier in this chapter is an example of a regular-expression atom. Atoms are, as the name suggests, the fundamental building blocks of a regular expression. A full list of atoms appears in Table 1.6.
Atom | Matches | Example | Matches | Doesn't Match |
period (.) | Any character except new line | b.b | bob | bb |
List of characters in brackets | Any one of those characters | ^[Bb] | Bob, bob | Rbob |
Regular expression in parentheses | Anything that regular expression matches | ^a(b.b)c$ | abobc | abbc |
Quantifiers A quantifier is a modifier for an atom. It can be used to specify that a particular atom must appear at least once, as in b+. The atom quantifiers are listed in Table 1.7.
Quantifier | Matches | Example | Matches | Doesn't Match |
* | Zero or more instances of the atom | ab*c | ac, abc | abb |
+ | One or more instances of the atom | ab+c | abc | ac |
? | Zero or one instances of the atom | ab?c | ac, abc | abbc |
{n} | n instances of the atom | ab{2}c | abbc | abbbc |
{n,} | At least n instances of the atom | ab{2,}c | abbc, abbbc | abc |
{n,m} | At least n and at most m instances of the atom | ab{2,3}c | abbc | abbbbcat |
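To see how the quantifiers differ, it helps to run all of them against strings that contain more and more bs. The following sketch, with an invented list of candidate strings, records which of the example patterns match each one.

```perl
use strict;
use warnings;

my @results;
foreach my $candidate ("ac", "abc", "abbc", "abbbc", "abbbbc") {
    my @matched;
    push @matched, "ab*c"     if $candidate =~ /ab*c/;
    push @matched, "ab+c"     if $candidate =~ /ab+c/;
    push @matched, "ab?c"     if $candidate =~ /ab?c/;
    push @matched, "ab{2}c"   if $candidate =~ /ab{2}c/;
    push @matched, "ab{2,}c"  if $candidate =~ /ab{2,}c/;
    push @matched, "ab{2,3}c" if $candidate =~ /ab{2,3}c/;
    push @results, "$candidate: @matched";
}

print "$_\n" foreach @results;
```

Because none of the patterns is anchored with ^ or $, each one succeeds if it matches anywhere within the candidate string.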
Special Characters Several special characters are denoted by backslashed letters, with \n being especially familiar to C programmers, perhaps. Table 1.8 lists the special characters.
Symbol | Matches |
\d | Any digit |
\D | Nondigit |
\n | New line |
\r | Carriage return |
\t | Tab |
\f | Form feed |
\s | White-space character |
\S | Non-white-space character |
\w | Alphanumeric character |
\W | Nonalphanumeric character |
Backslashed Tokens It is essential that regular expressions be capable of using all characters, so that all possible strings that occur in the real world can be matched. With so many characters having special meanings, a mechanism is required that allows you to represent any arbitrary character in a regular expression.
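This mechanism is a backslash (\), followed by a numeric quantity. This quantity can take any of the following formats:
A single digit, as in \1 or \2: a backreference to an earlier parenthesized match (see "Backreferences" later in this chapter).
Two or three octal digits, as in \033: the character with that octal code.
An x followed by two hexadecimal digits, as in \x7f: the character with that hexadecimal code.
A c followed by a single character, as in \cD: the corresponding control character.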
Now you're ready to start putting all that information together with some real pattern matching. The match operator normally consists of two forward slashes with a regular expression in between, and it normally operates on the contents of the $_ variable. So if $_ is serendipity, /^ser/, /end/, and /^s.*y$/ are all true.
Matching on $_ The $_ variable is special; see Chapter 13, "Special Variables," for full details. In many ways, $_ is the default container for data that is being read in by Perl. When the <> operator is the condition of a while loop, for example, it gets the next line of input and stores it in $_. So the following code snippet allows you to type lines of text and tells you when your line matches one of the regular expressions:
$prompt = "Enter some text or press Ctrl+D to stop: ";
print $prompt;
while (<>) {
    /^[aA]/ && print "Starts with a or A. ";
    /[0-9]$/ && print "Ends with a digit. ";
    /perl/ && print "You said it! ";
    print $prompt;
}
Bound Matches Matching doesn't always have to operate on $_, although this default behavior is quite convenient. A special operator, =~, evaluates to either true or false, depending on whether its first operand matches on its second operand. So $filename =~ /dat$/ is true if $filename matches on /dat$/. You can use =~ in conditionals in the usual way, as follows:
$filename =~ /dat$/ && die "Can't use .dat files.\n";
A corresponding operator, !~, has the opposite sense. !~ is true if the first operand does not match on the second, as follows:
$ENV{'PATH'} !~ /perl/ && warn "Not sure if perl is in your path ";
Alternative Delimiters The match operator can use delimiters other than //, which is useful if you're trying to match a complex expression that involves forward slashes. A more general form of the match operator than // is m//. If you use the leading m, you can use any character to delimit the regular expression. For example,
$installpath =~ m!^/usr/local! || warn "The path you have chosen is odd.\n";
warns that "The path you have chosen is odd.\n" if the variable $installpath does not start with /usr/local.
Match Options You can apply several optional switches to the match operator (either // or m//) to alter its behavior. These options are listed in Table 1.9.
Switch | Meaning |
g | Perform global matching |
i | Perform case-insensitive matching |
o | Evaluate the regular expression one time only |
The g switch continues matching even after the first match has been found. This switch is useful when you are using backreferences to examine the matched portions of a string, as described in the "Backreferences" section later in this chapter.
The i switch forces a case-insensitive match.
Finally, the o switch is used inside loops in which a great deal of pattern matching is taking place. This switch tells Perl that the regular expression (the match operator's operand) is to be evaluated one time only. The switch can improve efficiency when the regular expression is fixed for all iterations of the loop that contains it.
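The g and i switches can be combined, as this short sketch shows. The sample text is invented, and the match is evaluated in an array context so that every occurrence is returned.

```perl
use strict;
use warnings;

my $text = "Perl, perl, and more PERL";

# Without i, only the all-lowercase spelling matches.
my @exact = ($text =~ /perl/g);

# With i, case is ignored; with g, every occurrence is found.
my @any_case = ($text =~ /perl/gi);

print "Exact case:  ", scalar(@exact), " (@exact)\n";
print "Ignore case: ", scalar(@any_case), " (@any_case)\n";
```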
Backreferences As we mentioned in the "Backslashed Tokens" section earlier in this chapter, pattern matching produces quantities that are known as backreferences. These quantities are the parts of your string in which the match succeeded. You need to tell Perl to store them by surrounding the relevant parts of your regular expression with parentheses. You can refer to them within the pattern itself as \1, \2, and so on, and after the match as $1, $2, and so on. The following example determines whether the user typed three consecutive four-letter words:
while (<>) {
    /\b(\S{4})\s(\S{4})\s(\S{4})\b/ && print "Gosh, you said $1 $2 $3!\n";
}
The first four-letter word lies between a word boundary (\b) and some white space (\s), and consists of four non-white-space characters (\S). If there is a match on the expression \b(\S{4})\s (that is, if a four-letter word is found), the matching substring is stored as the first backreference, \1, and the search continues. When the search is complete, you can refer to the backreferences as $1, $2, and so on.
What if you don't know in advance how many matches to expect? Perform the match in an array context; Perl returns the matches in an array. Consider this example:
@hits = ("Yon Yonson, Wisconsin" =~ /(\won)/g);
print "Matched on ", join(', ', @hits), ".\n";
We'll start at the right side and work backward. The regular expression (\won) means that we match any alphanumeric character followed by on and store all three characters. The g option after the // operator means that we want to do this for the entire string, even after we find a match. The =~ operator means that we carry out this operation on a given string (Yon Yonson, Wisconsin). Finally, the whole thing is evaluated in an array context, so Perl returns the array of matches, and we store it in the @hits array. Following is the output from this example:
Matched on Yon, Yon, son, con.
When you get the hang of pattern matching, you'll find that substitutions are quite straightforward and very powerful. The substitution operator is s///, which resembles the match operator but has three rather than two slashes. Just as you can do with the match operator, you can substitute any other character for the forward slashes, and you can use the optional i, g, and o switches.
The pattern to be replaced goes between the first and second delimiters, and the replacement pattern goes between the second and third delimiters. This simple example changes $house from henhouse to doghouse:
$house = "henhouse";
$house =~ s/hen/dog/;
Notice that it isn't possible to use the =~ operator with a literal string as you can when matching, because you can't modify a literal constant. Instead, store the string in a variable and modify that variable.
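A few more substitutions round out the picture. This sketch goes slightly beyond what the text covers: it relies on two facts about s/// that are not described above, namely that the operator returns the number of replacements made, and that the backreferences $1 and $2 can be used in the replacement. The sample strings are invented.

```perl
use strict;
use warnings;

my $house = "henhouse";
$house =~ s/hen/dog/;                    # replaces the first match only

my $line = "the hen and the other hen";
my $changes = ($line =~ s/hen/dog/g);    # g replaces every match

my $name = "Austen, Jane";
$name =~ s/^(\w+), (\w+)$/$2 $1/;        # backreferences swap the two words

print "$house\n";
print "$line ($changes changes)\n";
print "$name\n";
```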
You have reached the end of your whirlwind tour of Perl. You saw how Perl's deceptively simple constructs can be used to write deceptively simple programs, and you got a brief look at the basic elements of the language. At minimum, you should have a clear idea of how the language works, and you should know where to go for more information on Perl as the need arises throughout the rest of this book.
This book now moves on to Web matters, but look in the following places for more information about Perl: