Chapter 3 Programming with Perl

CONTENTS

Regular Expressions
Chapter in Review

To program with Perl, we can take variables, assign them values from within the program or user input, and then manipulate them. Before we can get right into Perl programming, we have to get another central concept under our belt: regular expressions, which are dealt with in detail in this chapter.

Regular Expressions

Regular expressions are a concept from the UNIX world. They are like a pattern, or template, which is matched against a string. Strings are sequences of characters, like "hello" or "please pass the butter." When a regular expression tries to make a match, it either succeeds or fails. The regular expression is not a literal translation of the string, but a representation of it.

Think of a regular expression as being like a verbal expression in slang. When the guys are hanging out and one of them calls to another, "Yo, Homie! You look fat today!" he is not referring to a weight problem his friend may be having. "Fat" is a slang term, and means "looking good" or "your appearance is exceptional."

The key to using regular expressions is two-fold. The first thing you must understand is the pattern you are trying to match. The second is to understand the different patterns available to you to make a pattern match.

Regular expressions are used in many different operating systems, and by many different programs and processes. If you are already familiar with regular expressions in another context, then you're in luck. While the syntax of regular expressions varies between operating systems, the concepts remain the same.

Before getting into the nitty-gritty of regular expressions, it might help if we looked into some related programming issues in Perl. The issues include control structures, associative arrays, and data I/O using <STDIN>.

Perl Control Structures

It is important to be able to tell the Perl interpreter when you want things done in your script. To do this you use control structures, like a statement block, or different kinds of loops.

The Statement Block

The simplest control structure in Perl is the statement block, which is made up of a series of statements that are enclosed in curly braces, and might look something like this:

{
   $one = "1";
   @two = (1,2,3);
   $three = $two[0];
}

where the statements inside the statement block are indented one tab past the curly braces.

When Perl encounters a statement block it executes each statement consecutively, starting with the first, and working its way to the last. Perl will treat the entire block as a single statement in the script as a whole.

Statement blocks are often used as part of the syntax of statement loops.

The If/Unless Statement Loop

In an if/unless loop a designated expression is examined for truth, and if it is true, then one series of events is started. If it is false, then another path is taken in the script. A simple format for an if/unless statement loop is

if (the_expression)	{
statement_if_expression_true;
} else {
statement_if_expression_false;
}

where Perl will evaluate the_expression, called the control expression, to see if it is true or false. If it is true then it goes to the statement in the first block following the if command. If the control expression is false, Perl goes to the statement following the else command.

Applying this we can get a script that asks for some user input, then evaluates it with different possible outcomes depending on the input received, like this:

print "What is the temperature?";
$temp = <STDIN>;
chop ($temp);
if ($temp < 70) {
print "Brrr, you better get a sweater!\n";
} else {
print "Is it hot enough for ya?\n";
}

If you want to return a statement only if the result of the expression test is false, you can use the unless command:

print "What is the temperature?";
$temp = <STDIN>;
chop ($temp);
unless ($temp < 70) {
print "Is it hot enough for ya?\n";
}

so that the user will receive the print statement unless the control expression is true, that is, unless the temperature entered is below 70.

Another option in an if/else loop is the elsif command. You may want to have several choices for the script's execution, so you include the elsif command to handle the other options, like this:

print "What is the temperature?";
$temp = <STDIN>;
chop ($temp);
if ($temp < 70) {
    print "Brrr, you better get a sweater!\n";
} elsif ($temp < 80) {
    print "A little cool, but comfortable.\n";
} elsif ($temp < 90) {
    print "Nice and cozy.\n";
} else {
    print "Is it hot enough for ya?\n";
}

where you are not limited in your options with elsif, and can have as many of these branch control structures as you need.
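The branching above can also be wrapped in a subroutine, which makes each temperature range easy to try out. This is only a sketch: the name describe_temp is made up for the illustration, and the use strict and my lines are modern-Perl habits that the chapter's own scripts do not use.

```perl
use strict;
use warnings;

# describe_temp is a hypothetical helper, not part of the chapter's script.
# Each elsif is reached only when every earlier test has failed, so a single
# upper bound per branch is enough to describe a range.
sub describe_temp {
    my ($temp) = @_;
    if ($temp < 70) {
        return "Brrr, you better get a sweater!";
    } elsif ($temp < 80) {
        return "A little cool, but comfortable.";
    } elsif ($temp < 90) {
        return "Nice and cozy.";
    } else {
        return "Is it hot enough for ya?";
    }
}

print describe_temp(75), "\n";
```

Because the tests run from top to bottom, a value of exactly 70 falls through the first test and lands in the second branch.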

The While/Until Loop Statement

There may come a time (probably sooner rather than later) where you'll need to have a block of statements executed repeatedly until a certain condition is met. This is done with the while/until loop statement.

A typical use for this loop is to count something down, like:

print "How high for your countdown?";
# the user sets the upper limit
# of the countdown
$count = <STDIN>;
chop ($count);
while ($count > 0) {
print "T minus $count, and counting...\n";
$count--;
}

where the while loop is executed until the value of $count is equal to 0. You may have noticed the use of the autodecrement operator on $count to lower the value each cycle.

The while loop also has an opposite number, the until command, which repeats its block for as long as the condition is false and stops as soon as it becomes true. Used in the same way, until looks like this:

print "How long for your countdown?";
$count = <STDIN>;
chop ($count);
until ($count <= 0) {
    print "T minus $count, and counting...\n";
    $count--;
}
print "Lift off!\n";

The final print statement is not executed until the condition on $count is satisfied.

Another way to repeat, or iterate, a statement block is with the for and foreach commands.

The For/Foreach Loop Statements

When you need a script to evaluate an expression and then re-evaluate it in a countdown fashion, you can use the for command like this:

for ($count = 15; $count >= 1; $count--)	{
print "$count \n";
}

The countdown is printed from 15 to 1, with each number appearing on a new line. $count is given the value 15, tested against the condition of being >=1, printed, then autodecremented and looped. After the value 1 is printed, $count drops to 0, the test fails, and the loop dies.

There may be an instance where a loop needs a variable whose value changes inside the loop, but whose original value must be restored after the loop dies. In other words, you need the variable to be local to the loop. You can do this with the foreach command.

With foreach, a list of values is created, each value is placed into a scalar variable one at a time, and the statement block designated by the foreach command is executed once for each value. An example of this might be:

@letters = ("A","B","C","D");
foreach $new (reverse @letters){
print $new;
}

where the output will be DCBA. A special Perl variable can be used here to simplify the code. The $_ scalar variable is the default variable with many commands, like foreach, into which values can be placed. The same script above would look like this:

@letters = ("A","B","C","D");
foreach $_ (reverse @letters)	{
print $_;
}

and this can be shortened even more, because Perl will see the $_ variable, even if it is left out, like this:

@letters = ("A","B","C","D");
foreach (reverse @letters)	{
print;
}

The foreach statement is also handy if you want to change the values of an entire array. It works like this:

@numbers = (2,4,6,8);
foreach $two (@numbers)	{
$two *= 2;
}

which gives @numbers the new element values of (4,8,12,16). Just by changing the scalar $two you can change the entire array @numbers.
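Both points about foreach, that the loop variable is local to the loop and that changing it changes the array, can be seen in one short sketch. It assumes modern Perl, where the loop variable is normally declared with my; the outer $two is invented purely to show that it survives the loop.

```perl
use strict;
use warnings;

my @numbers = (2, 4, 6, 8);
my $two = "untouched";          # an outer variable with the same name

foreach my $two (@numbers) {    # this $two is an alias for each element in turn
    $two *= 2;                  # so assigning to it changes @numbers itself
}

print "@numbers\n";             # 4 8 12 16
print "$two\n";                 # untouched
```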

Associative Arrays

In Perl, associative arrays take the place of other recursive data types, like trees, that are used in other computer languages.

These kinds of arrays are very similar to the list kind of arrays discussed in Chapter 2. The main difference between them is that a list array has index values for its elements, which start at 0 and increment by whole numbers to the end of the array, whereas the associative array uses arbitrary scalars, also called keys.

What this means is that you are not limited to referring to an element in an array by its integer-based index value, but you can use whatever value you choose to associate with the array element. It is quite normal, and quite desirable, to associate strings with particular array elements in this way.

A good model for getting a better understanding of how associative arrays work might be a Rolodex. New names and phone numbers are written on separate cards and filed in the Rolodex's alphabetical sections. When you want to find out the number of a new friend, you go to the letter your friend's name is filed under to find the number.

With associative arrays, the new names and phone numbers are scalar values, while the letters on the Rolodex cards themselves are the keys. To find the values, you look for the key to that value in the associative array, like looking for the phone number using the letters.

Associative arrays use a variable in this format:

%variable_name

Unlike array list variables, associative arrays are usually referred to with their keys. Let's look at one:

%month = (
    'Jan', 'January',
    'Feb', 'February',
    'Mar', 'March',
    'Apr', 'April',
    'May', 'May',
    'Jun', 'June',
    'Jul', 'July',
    'Aug', 'August',
    'Sept', 'September',
    'Oct', 'October',
    'Nov', 'November',
    'Dec', 'December',
);

where single and double-quotes have the same powers they did in array lists.

If you want to subscript an associative array, the process is similar to subscripting an array list, but curly braces replace the square brackets, so:

# a simple array
@wolf = (4,5,6);
$wolf[3] = "moon"; # making @wolf
# now (4,5,6,"moon")
$wolf[5] = "howl"; # producing
# (4,5,6,"moon",undef,"howl")

becomes this as an adapted associative array:

%wolf = (4,"four",5,"five",6,"six");
$wolf{4} = "moon"; # making %wolf
# now (4,"moon",5,"five",6,"six")
$wolf{5} = "howl"; # producing %wolf values
# (4,"moon",5,"howl",6,"six")

Note that there is always an even number of elements in an associative array. There has to be a value and its key, otherwise the missing value or key is given the undef value.
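To tie the curly-brace subscript back to the month table, here is a small sketch of looking up one value by its key. It uses a shortened copy of the table, and the key 'Xyz' is invented to show what happens when a key was never stored.

```perl
use strict;
use warnings;

# A shortened copy of the month table, enough for the illustration.
my %month = (
    'Jan', 'January',
    'Feb', 'February',
    'Mar', 'March',
);

# Curly braces select a single value by its key.
my $full = $month{'Feb'};
print "Feb is short for $full\n";

# A key that was never stored yields undef rather than an error.
my $missing = $month{'Xyz'};
print "No entry for Xyz\n" unless defined $missing;
```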

When you call up an associative array's values there is no literal equivalent as there is with array lists. Instead, Perl creates a list of key/value pairs in whatever order is easiest at the time, depending on the phases of the moon. (Just kidding! The order comes from the way Perl stores the pairs internally, which is chosen for fast lookup rather than for readability, so you should never count on the pairs coming back in the order you entered them.)

Associative arrays also have their own operators.

The Keys Operator

To get a list of all the current keys of an associative array, put the array name in parentheses behind the operator keys, like this:

keys(%wolf); # which would create
# the list of keys (4,5,6), or
# something similar since the order
# does not have to remain the same

where the use of parentheses is an option, and can be used as you see fit.

The Values Operator

As you might imagine, the values operator works like the keys operator, except that it returns a list of values of an associative array

values(%wolf); # creates 
# ("moon","howl","six")

in no particular order.
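Because neither keys nor values promises an order, a common way to get repeatable output is to sort the keys first and then look up each value. The following is a sketch of that idea; the @report array is just a convenience for collecting the result.

```perl
use strict;
use warnings;

my %wolf = (4, "moon", 5, "howl", 6, "six");

# keys returns the keys in no guaranteed order; sorting them numerically
# (with the <=> operator) gives the same order on every run.
my @report;
foreach my $key (sort { $a <=> $b } keys %wolf) {
    push @report, "$key=$wolf{$key}";
}
print join(", ", @report), "\n";    # 4=moon, 5=howl, 6=six
```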

The Each Operator

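To step through all the elements of an associative array one pair at a time, you can use the each operator, written like the others:

each(%wolf);

Every call returns the next key/value pair as a two-element list, so the first call might return:

(4,"moon")

and later calls return the remaining pairs, in no particular order, until none are left.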
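In practice each is nearly always used inside a while loop, which keeps asking for pairs until the array is exhausted. A minimal sketch, with $pairs_seen added only so the result can be checked:

```perl
use strict;
use warnings;

my %wolf = (4, "moon", 5, "howl", 6, "six");

# Each call to each() hands back the next (key, value) pair; when the pairs
# run out it returns an empty list, which ends the while loop.
my $pairs_seen = 0;
while (my ($key, $value) = each %wolf) {
    print "$key => $value\n";    # the order of these lines is not guaranteed
    $pairs_seen++;
}
print "Saw $pairs_seen pairs\n";
```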

The Delete Operator

This is the operator which allows you to remove key-value pairs by designating the key of the pair you want removed:

%wolf = (4,"four",5,"five",6,"six");
delete $wolf{5};
# leaves %wolf with the element
# list (4,"four",6,"six")
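Two details are worth a quick sketch: delete hands back the value it removed, and the exists operator (Perl 5) can confirm that the key is gone. The variable $removed is introduced here only for the illustration.

```perl
use strict;
use warnings;

my %wolf = (4, "four", 5, "five", 6, "six");

# delete removes the pair and returns the value that was stored there.
my $removed = delete $wolf{5};
print "Removed: $removed\n";                     # five

# exists asks whether a key is present without creating it.
print "Key 5 is gone\n" unless exists $wolf{5};
print "Keys left: ", scalar(keys %wolf), "\n";   # 2
```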

I/O

We already know that you can use <STDIN> to take a line of user input and store it as a scalar variable value. We also know that you can put a collection of user input from <STDIN> into an array, where each line entered is kept as a separate element in the array. But we haven't yet really focused on how to manipulate these values.

Perhaps you want to go through each line of text and change some of them. You might do this by creating a simple loop, like this one:

while ($_ = <STDIN>) {
}

where the command while creates a repetition loop. Perl will keep putting the lines in <STDIN> into the scalar variable $_ until the file runs out of lines and the loop dies.
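The empty loop body is where the changes would go. As a sketch of what that might look like, the example below upper-cases only the lines that mention "second". To keep it self-contained, the input is a string opened as an in-memory file (this needs Perl 5.8 or later); a real script would read from STDIN instead, and the sample text is invented.

```perl
use strict;
use warnings;

# Stand-in for keyboard input: a string opened as a read-only file.
my $text = "first line\nsecond line\nthird line\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @changed;
while (<$in>) {            # same as: while (defined($_ = <$in>))
    chomp;                 # drop the trailing newline from $_
    push @changed, uc($_) if /second/;   # change only the lines we care about
}
close($in);

print "$_\n" foreach @changed;    # SECOND LINE
```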

When using Perl output, both the familiar print and related printf operators can be used to write to STDOUT. The print operator can do more than just produce text to display; this operator acts like a list operator and can move strings to STDOUT without adding any characters.

The printf operator gives you more command over your strings than print does. With printf you can designate a format control string with the first argument, which will determine how the rest of the arguments will be printed. An example might be:

printf "%10s %20s", $wolf, $moon;

where $wolf will be printed to a 10-character field and $moon will be printed to a 20-character field.

There are other modifiers that can be used with the printf operator that can designate the string to be printed as a decimal integer, floating-point value, or with spaces or tabs.
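A few of those modifiers are sketched below. The example uses sprintf, which takes the same format string as printf but returns the result instead of printing it, so the padding is easy to see between the square brackets. The values are arbitrary.

```perl
use strict;
use warnings;

my $right = sprintf("[%10s]", "wolf");     # right-justified in 10 columns
my $left  = sprintf("[%-10s]", "wolf");    # left-justified in 10 columns
my $int   = sprintf("[%5d]", 42);          # decimal integer, 5 columns
my $float = sprintf("[%.2f]", 3.14159);    # floating point, 2 decimal places

print "$right\n$left\n$int\n$float\n";
```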

Using a Regular Expression

To demonstrate a simple use of a regular expression, I want to introduce you to a simple command called grep, short for global regular expression print. Grep is a very powerful command that has come from the UNIX world. With grep you can take a regular expression and search a file line by line, trying to match the string indicated in the regular expression. Using grep at the command line might look like this:

grep crypt bonus.pl > bonus_change.pl

where grep will examine every line of code in bonus.pl to see if it contains the string "crypt," and the > then redirects those lines from standard output into the file bonus_change.pl.

To denote the string "crypt" as a regular expression in Perl, it is enclosed between slashes like

/crypt/

and would look like this in a script:

if (/crypt/) {
print "$_";
}

where the regular expression crypt is tested against the special variable $_, the default scalar variable that many Perl operators read from and write to when no other variable is named.

There are a number of special variables in Perl, like $_, that have their own designated features designed to make Perl easier to use, and they are touched on throughout the book where most appropriate.
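Perl also has a built-in function named grep that works on lists rather than files, and it leans on $_ in the same way. The sketch below filters a handful of made-up lines; the contents of @lines are invented for the example.

```perl
use strict;
use warnings;

my @lines = (
    'print "hello\n";',
    '$secret = crypt($password, $salt);',
    '# no match on this line',
    '$check = crypt($guess, $secret);',
);

# Inside grep, each element is placed in $_ in turn, and a bare /crypt/
# is tested against $_ just as it is in an if statement.
my @hits = grep { /crypt/ } @lines;

print "$_\n" foreach @hits;
```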

The Guestbook

Let's start with a simple example of a guestbook using WinPerl. First you'll want to create a Perl file called guest.pl.

NOTE

You can write Perl scripts in any text processing application, like Notepad, Write, or Microsoft Word, if you so desire. Whatever you use to create your scripts, remember to save your script as text only, and with the .pl file extension.

In keeping with good programming practice, the first line of the script will be a comment line stating the name of the file. This is very handy when you, or someone else, may have to work with the file later on. The script looks like this:

#!/usr/bin/perl
# guest.pl
print "What is your Name? ";
$name=<STDIN>; # get the response
# from the user
open (GUESTBOOK, ">>guest.pl");
# open a file with
# filehandle GUESTBOOK
print GUESTBOOK "$name"; # append
# the name to the guestbook file
chop($name); # remove the newline
print "Thank you, $name! Your name has been added to the Guestbook.\n";
close(GUESTBOOK);

So, what's going on here? We first print the prompt for the user. The print command will default to STDOUT, in this case, the screen.

We then use the <> construct to retrieve a line from a filehandle, <STDIN>, which is the keyboard input up to a newline. We next open a file called guest.pl, which is assigned a filehandle of GUESTBOOK. The filehandle is used to reference the opened file for reading or writing and it is usually all capitals. This filehandle, like all variables or arrays in Perl, can be just about anything you like, as long as it isn't a reserved word. You might notice that we put >> in front of the filename. This means we are opening the file for appending. The options for file opening are in Table 3.1.

Table 3.1 File Opening Options

Operator	Action
>	Write
>>	Append
+>	Read and Write (the file is emptied first)
Nothing	Read

Once the file is open, and we have a filehandle, we can put the name into the guestbook file with a print command. Notice that we do this by putting the filehandle directly after the print statement. In fact, the normal print "string" command is actually a short version of

print STDOUT "string";

We print the next line to the screen. We have put the variable $name inside the string, and it will be printed as we entered it. This is called variable interpolation; the variable is replaced in the string before it is printed.

The chop() command called before the print removes the last character in the string $name. This is used to remove the newline that is appended to the string when it is entered from the keyboard. If we didn't do this, we would end up with an extra newline printed right after the name, and the exclamation point on a line by itself. Not such a good idea, grammatically speaking. We left the newline on when we printed it to the file because we wanted each name on its own line.
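One caution about chop is that it removes the last character no matter what it is. Perl 5 adds a related function, chomp, which removes a trailing newline only when one is present. The sketch below compares the two on invented sample strings.

```perl
use strict;
use warnings;

my $with_newline    = "John\n";
my $without_newline = "John";

# chop always removes the last character, whatever it is.
chop(my $chopped_ok  = $with_newline);      # "John"
chop(my $chopped_bad = $without_newline);   # "Joh" (a real letter is lost)

# chomp removes a trailing newline only if one is present.
chomp(my $chomped_a = $with_newline);       # "John"
chomp(my $chomped_b = $without_newline);    # "John"

print "$chopped_ok $chopped_bad $chomped_a $chomped_b\n";
```

For keyboard input, which always ends in a newline, the two behave the same; the difference shows up with data that may or may not have one.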

Finally, we close the file and end the program. We don't actually need to close the file, as Perl will close it automatically if it is reopened or the program ends, but it is good programming practice to put it there.

Okay, so now we have a database full of the names of our guests. What next?

Well, the next logical step would be to check the file to see if you have already visited so the database isn't filled up with repeat customers. The procedure shown in Listing 3.1 shows you how.

Listing 3.1 Checking the Database for Repeat Visitors

print "What is your Name?";
$name=<STDIN>; # get the response
# from the user
open (GUESTBOOK, "guest.pl");
# open a file with filehandle
# GUESTBOOK
while ($line=<GUESTBOOK>) {
    if ($line eq $name) {
        print "Your name is already in the guestbook!\n";
        close(GUESTBOOK);
        exit;
    }
}
close (GUESTBOOK); # close file
# for read
open (GUESTBOOK, ">>guest.pl");
# open same file for append
print GUESTBOOK "$name"; # append
# the name to the guestbook file
chop($name); # remove the newline
print "Thank you, $name! Your name has been added to the Guestbook.\n";
close(GUESTBOOK);

Let's have a look at the new part. The while loop will continue until the condition is false. The condition in this case is assigning the variable $line to each line of GUESTBOOK, terminated with a newline.

This is the procedure we did when we got the name of the guest, except we are now obtaining it from a file rather than <STDIN>. This condition will continue to be true until there are no more lines in the file. Notice that we opened the file guest.pl with a read, so we can get the lines.

The if condition compares each $line with $name. We use the operator eq here because the variables are strings. If they were numbers, we would use a different set of comparison operators. Because Perl makes no distinction between string variables and numeric variables, we must be cautious as to which comparison operators we use. Perl will compare "5" and "10" differently depending on whether you use a string or numeric operator. "5" will be less than "10" in a numeric sense, but greater than "10" in a string context. Table 3.2 lists the possible operators in both string and numeric context.

Table 3.2 Numeric versus String Operators

Numeric Operator	String Operator	Definition
==	eq	Equal to
!=	ne	Not equal to
>	gt	Greater than
>=	ge	Greater than or equal to
<	lt	Less than
<=	le	Less than or equal to
<=>	cmp	Comparison, returning -1, 0, or 1

A quick note on the compare operator: It returns -1 if the left operand is less than the right, 0 if the two are equal, and +1 if the left operand is greater.
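The "5" versus "10" case is easy to try for yourself. The sketch below compares the same two values both ways, then uses the compare operators with sort, which is where they are most often seen. The sample numbers are arbitrary.

```perl
use strict;
use warnings;

my ($five, $ten) = ("5", "10");

# Numeric comparison: 5 is less than 10.
my $numeric = ($five < $ten)  ? "less" : "not less";
# String comparison: "5" sorts after "10" because "5" comes after "1".
my $string  = ($five lt $ten) ? "less" : "not less";

print "numeric: $numeric, string: $string\n";

# <=> and cmp return -1, 0, or 1, which is exactly what sort expects.
my @by_number = sort { $a <=> $b } (10, 9, 100);    # 9 10 100
my @by_string = sort { $a cmp $b } (10, 9, 100);    # 10 100 9
print "@by_number | @by_string\n";
```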

The exit command within the if body will stop the program after informing the user that his or her name is already in the guestbook. We could have accomplished the same thing in one line with the die command:

if ($line eq $name) {
        die "Your name is already in the guestbook!\n";
}

This is useful for command line programs, but does not work very well with CGI scripts, so it isn't used very often.

If the name isn't found in the GUESTBOOK file, we exit the while loop, and continue on with the program. We have to close the file and reopen it in append mode so that we can add the name to the end of the file. Once again, strictly speaking, we don't have to close the file before reopening it, but it is good practice so that we develop a sense of what is happening, and when, in our scripts. When dealing with larger programs it can get a little confusing as to what is happening at a particular point in the code.

Although we used ($line eq $name) in the above example, this is not necessarily the best way to test equivalency. In this case, the two strings must be exactly equal, which means "john" and "John" are not equal. Also, if there happens to be a newline in one string and not the other, then Perl will treat them as different. To get around some of these nuisances, we use something called a regular expression.

Regular expressions in Perl come in very handy, as they are much less cumbersome to use and a lot more flexible than string searching and comparison in some other languages (which will remain nameless). Let's look at an equivalent expression to the one used above:

if ($line =~ /john/) {
# do some stuff...
}

There are a few things going on here. First, we are using a new operator.

The =~ operator is the regular expression operator. We will use it a lot to do all types of searching and replacing. Here we use it to compare $line with the expression between the /'s. If the variable $line contains the string "john," then the condition is true. This means that the strings will match if $line is equal to "john," "johnathan," "joe johnston," or "john and the beanstalk."

If we want to be sure that the first name begins with "john," we can change the expression slightly to yield

if ($line =~/^john/) {

The ^ character tells Perl that the variable $line must start with "john" in order to match. This will match "john," "johnny," and "john smith" but not "big john" or "joe johnston."

Ok, so what about case? The current expression doesn't do uppercase, so "john" will match, but "John" will not. Another simple change:

if ($line =~/^john/i) {

The "i" after the last / tells Perl to ignore the case of the regular expression so that we now will match "John," "Johnathan," and "JOHN."

Of course, we don't want to match $line to just John, we want to match it to the user input in the variable $name. Well, remember variable interpolation, where a variable name in a string gets substituted for its value before the string gets printed? The same thing applies in regular expressions:

if ($line =~/^$name/i) {

Now the condition is true if $line is a string beginning with $name in any combination of upper- and lowercase. Amazing!

A note on regular expressions: Some of the characters in a regular expression are significant (like the ^) and will not become part of the expression itself.

If you want to match these characters, you must "escape" them. For instance, if you wanted to match the variable $line to "caret^," you would precede the ^ by a backslash in the regular expression

if ($line =~/caret\^/) {

For a list of special characters in regular expressions, see Appendix A.
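Escaping matters even more when the pattern comes from a variable, because whatever the user typed is interpolated as-is. Perl's \Q and \E markers escape the special characters in between for you. A small sketch, using an invented $name that contains a dot:

```perl
use strict;
use warnings;

my $name = "a.b";      # the dot is a regular expression metacharacter

# Interpolated as-is, the dot matches any character, so "axb" matches too.
my $loose  = ("axb" =~ /^$name/i)     ? 1 : 0;    # 1
# \Q ... \E escapes the metacharacters in $name for us.
my $strict = ("axb" =~ /^\Q$name\E/i) ? 1 : 0;    # 0
my $exact  = ("A.B" =~ /^\Q$name\E/i) ? 1 : 0;    # 1

print "$loose $strict $exact\n";
```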

You are also not limited to using the / as a delimiter for a regular expression. If you precede the regular expression by m, the next character becomes the delimiter, so this

if ($line =~/john/) {

is equivalent to

if ($line =~m#john#) {

and

if ($line =~m[john]) {

The delimiter can be just about any character, but notice that in the case of character pairs, like the square brackets ([]), the end delimiter is the opposite mate to the beginning delimiter.
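A different delimiter earns its keep when the pattern itself is full of slashes, as with file paths. The sketch below writes the same test twice; the path is only an example.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# With / as the delimiter, every slash in the pattern needs a backslash.
my $with_slashes = ($path =~ /^\/usr\/bin\//) ? 1 : 0;
# With m{ } the slashes can be written plainly.
my $with_braces  = ($path =~ m{^/usr/bin/})   ? 1 : 0;

print "$with_slashes $with_braces\n";    # 1 1
```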

Okay, now we have a database full of names, and we can check them against the data entered, and ignore the case. What next? The next logical step seems to be to glean some more useful information from this program. Let's ask the user for their last name and favorite color as well, as shown in Listing 3.2.

Listing 3.2 Asking for Additional User Information

print "What is your first name? ";
$name=<STDIN>; # get the response 
# from the user
chop($name); # remove the newline
print "What is your last name? ";
$lastname=<STDIN>;
chop($lastname);
print "What is your favorite color? ";
$color=<STDIN>;
chop($color);
$newline=$name.':'.$lastname.':'.$color."\n"; # make line
# delimited with colons
open (GUESTBOOK, ">>guest.pl"); # Open file for append
print GUESTBOOK "$newline"; 
# Append the field line to
# the guestbook file
print "Thank you, $name!  Your name has been added to the Guestbook.\n";
close(GUESTBOOK);

There are a few things different now. We are asking for three separate pieces of data, and assigning each to a variable. Notice that we are removing the newline character as soon as we get the data. The next thing we do is format a string of data with the three fields separated by a colon, and with a newline character tacked on the end. We'll end up with something like this:

John:Smith:magenta

This will make it easy to add or retrieve the data we want at a later time. The "." operator in Perl is the concatenation operator, which appends one string to another. To form the string we want, we are appending a colon to the end of $name, adding $lastname to the end of the resulting string, appending another colon, adding $color, then finally appending a newline. Confused yet? Don't worry: as with all things in Perl, there is an easier way. That line is equivalent to

$newline=join(':',$name, $lastname, $color);
$newline.="\n";

The join() function joins the variables or strings listed into one string, separating the fields with the specified delimiter (a : in this case). The .= operator on the following line appends the newline character to the end of the $newline variable. This is equivalent to: $newline=$newline."\n";

Now that we have this information, we'll want to check it. We do this using the split() command (Listing 3.3), the opposite of the join() command. Surprise, surprise.

Listing 3.3 The Split Command

print "What is your first name? ";
$name=<STDIN>; # get the response
# from the user
chop($name); # remove the newline
print "What is your last name? ";
$lastname=<STDIN>;
chop($lastname);
print "What is your favorite color? ";
$color=<STDIN>;
chop($color);
$newline=$name.':'.$lastname.':'.$color."\n"; # make
# line delimited with colons
open (GUESTBOOK, "guest.pl");
while ($line=<GUESTBOOK>) {
    ($gbname, $gblastname, $gbcolor)=split(':', $line);
    if ($gbname=~/^$name/i) {
        print "You are already in the guestbook, $name!\n";
        close (GUESTBOOK);
        exit;
    }
}
close (GUESTBOOK);
open (GUESTBOOK, ">>guest.pl");
# open file for appending
print GUESTBOOK "$newline";
# append the field line
# to the guestbook file
print "Thank you, $name! Your name has been added to the Guestbook.\n";
close(GUESTBOOK);

Here we assign $gbname, $gblastname, and $gbcolor to the first three items retrieved by the split command. We do this by putting parentheses around the variable names to form a list. We could have just as easily assigned all the variables to an array like this:

@data=split(':', $line);

and referenced the first three elements in the array as

$data[0], $data[1], and $data[2]
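Since split undoes what join does, the two can be checked against each other in a quick round trip. This sketch builds a record from sample values, then takes it apart again into an array. One detail to note is that the first argument to split is really a pattern, so it is written here as /:/.

```perl
use strict;
use warnings;

# Build a record the way the listing does, then take it apart again.
my ($name, $lastname, $color) = ("John", "Smith", "magenta");
my $record = join(':', $name, $lastname, $color) . "\n";

chomp(my $line = $record);            # remove the newline before splitting
my @data = split(/:/, $line);         # the first argument is a pattern

print "First name: $data[0]\n";
print "Last name:  $data[1]\n";
print "Color:      $data[2]\n";
```

Removing the newline first keeps it from riding along on the last field, which is what happens to $gbcolor when a raw line from the file is split.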

So now that we have some data to play with, let's do some more tests just for practice. Our program is getting a little long, so in Listing 3.4 we'll only deal with the part that has changed.

Listing 3.4 Testing the Data

while ($line=<GUESTBOOK>) {
    ($gbname, $gblastname, $gbcolor)=split(':', $line);
    if (($gbname=~/^$name/i) && ($gblastname=~/^$lastname/i)) {
        print "You are already in the guestbook, $name!\n";
        close (GUESTBOOK);
        if ($gbcolor!~/$color/i) {
            print "You have a different favorite color!\n";
            print "Your old favorite color is: $gbcolor\n";
            print "Your new favorite color is: $color\n";
            print "Would you like to change it? ";
            $input=<STDIN>;
            if ($input=~/^y/i) {
                open(GUESTBOOK, "guest.pl");
                undef $/;
                $body=<GUESTBOOK>;
                $/="\n";
                close(GUESTBOOK);
                $body=~s/$line/$newline/;
                open(GUESTBOOK, ">guest.pl");
                print GUESTBOOK $body;
                close(GUESTBOOK);
                exit;
            }
            else {
                exit;
            }
        }
        exit;
    }
}

What's happening here? The first thing you may notice is that we are doing an extra test at the first if statement. Since people may tend to have similar first names, we are now testing that the first and last name match.

The && means that the first and second expressions must be true in order for the if statement to be true. Alternatively, || means the first or the second expression must be true. We next check to see if the color is the same. If it is, we just exit. If it isn't, we alert the user, and ask them if they want to change their color choice. We get a line from STDIN as usual, and check to see if it starts with y or Y. If it doesn't, we exit. If it does, that's when the fun starts.

We open guest.pl to read, as normal, but then we undef (undefine) a system variable $/. This variable is the one used to determine where lines end when you read them in from a file. It is normally set to "\n", so you get one line per line in the file. By undefining it, we will now read the entire file (newlines and all) into the variable $body. Once we have the whole thing, we can replace the line with the old color ($line) with the line with the new color ($newline). This is done by using the =~ operator again, but notice that there is an s in front of the first /. This means we are doing a substitution. The expression between the first two /'s will be replaced by the expression between the second two /'s, if it exists. As with all regular expressions, you can use any delimiter you like, so

$body =~ s#$line#$newline#;

would have been equivalent. Also, the i directive to ignore case that comes after the last slash can apply here as well. Once we have replaced the line we want, we open the guestbook again for writing, and just write the whole file out with a print, and exit. Remember to redefine $/ before you do any more file operations to make sure that you don't mess up your future operations. But back to our regular expressions.
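One way to avoid forgetting to restore $/ is to let Perl do it. The sketch below is an alternative to the undef-then-reassign approach in the listing, not a copy of it: the local command gives $/ a temporary value that is put back automatically when the enclosing block ends. The helper name slurp and the sample data are invented, and an in-memory file (Perl 5.8 or later) stands in for the guestbook.

```perl
use strict;
use warnings;

# slurp is a hypothetical helper name, not part of the chapter's script.
sub slurp {
    my ($fh) = @_;
    local $/;              # undefined inside this sub only
    return scalar <$fh>;   # one read returns the rest of the file
}

# Stand-in for the guestbook file.
my $contents = "John:Smith:magenta\nJane:Doe:teal\n";
open(my $in, '<', \$contents) or die "Cannot open in-memory file: $!";
my $body = slurp($in);
close($in);

# \Q...\E keeps any special characters in the old line from acting as
# regular expression operators during the substitution.
my ($old, $new) = ("John:Smith:magenta\n", "John:Smith:green\n");
$body =~ s/\Q$old\E/$new/;
print $body;
```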

It's probably a good idea to go at each of the separate elements covered with this script so there is no doubt as to what regular expression operators are, and how they work.

Unlike the grep command, which looks at all the lines in the designated file, the if (/crypt/) test shown earlier only looks at one, the line which is in $_. To include all the lines of a file we need to do this:

while (<>)	{
if (/crypt/)	{
print "$_";
}
}

This loop continues until all lines are checked.

Now say you are checking your own scripts for crypt, and you realize that your typing was a little sloppy in places. Sometimes you slipped and spelled crypt with two p's, as cryppt. You can amend your searching script to

while (<>) {
    if (/cryp*t/) {
        print "$_";
    }
}

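The asterisk means "zero or more," so this pattern will find crypt, as well as any spellings of crypt with two or more p's. It will also find cryt, with no p at all; if you want to insist on at least one p, use a plus sign instead, as in /cryp+t/.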
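A quick way to check what the asterisk really accepts is to run the pattern against a few spellings. The sketch below uses a made-up word list and compares the asterisk with the plus sign, which asks for one or more of the previous character.

```perl
use strict;
use warnings;

my @words = ('cryt', 'crypt', 'cryppt', 'crypppt');   # sample spellings (made up)

my (%star, %plus);
for my $word (@words) {
    $star{$word} = $word =~ /cryp*t/ ? 'yes' : 'no';   # zero or more p's
    $plus{$word} = $word =~ /cryp+t/ ? 'yes' : 'no';   # one or more p's
    print "$word: p* $star{$word}, p+ $plus{$word}\n";
}
```

Only cryt is treated differently: the asterisk accepts it, the plus sign does not.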

Once you have matched what you are looking for you might want to replace it with something. To do this, we can use the substitute operator.

You might use this operator if you want to replace one string with another string. The substitute operator has a short-form, s, which looks like this in a statement:

s/crypt/tomb/;

This statement replaces the first occurrence of crypt in $_ with the replacement string tomb.

Regular expressions, as you can now see, are patterns. These patterns can be as big or as small as you need, each with its own peculiarities. Let's look at some more.

Regular Expressions as Patterns

There are various patterns that regular expressions work with: single-character, grouping, and anchoring. Each of these has its own little characteristics that make it work.

The Single-Character Pattern

The most common pattern-matching character is a single character used to match itself. This would be using a letter as a regular expression to match itself; in other words, regular expression "a" looking for character "a" in a string.

The second most common pattern-matching character is a period or dot, "." This character will match any single character with the exception of the newline character, \n.

Moving into larger areas, a character class is formed when a set of square brackets is used to enclose a list of characters:

/[crypt]/

When a character class is used, it matches any one of the characters listed between the brackets, so this pattern finds a match in any string that contains a "c," an "r," a "y," a "p," or a "t." Regular expressions are case-sensitive, so this class will not match a capital "C."

On the other hand, only one of the listed characters has to appear at the corresponding position in the string for a match to occur.

You can designate a range with this operator by inserting a dash between the values. For example,

/[0-5]/

is the same as

/[012345]/

which can be very powerful if you consider that

/[a-zA-Z0-9]/

can search for all letters of the alphabet (both upper- and lowercase) as well as all digits. Not bad for so few keystrokes.

If you want to use the character class in the opposite way (for example, to return those matches which are not in the class), then place a caret (^) after the left bracket, like so:

/[^0-5]/

This expression matches any single character which is not in the range from 0 to 5. There are some common character classes in Perl which are listed in Table 3.3.

Table 3.3 Character Class Contractions

Construct	Equivalent Class	Negated Construct	Equivalent Negated Class
\d (digits)	[0-9]	\D (anything but digits)	[^0-9]
\w (words)	[a-zA-Z0-9_]	\W (anything but words)	[^a-zA-Z0-9_]
\s (space)	[ \r\t\n\f]	\S (anything but space)	[^ \r\t\n\f]
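The ranges, the negated class, and the shorthand classes can all be tried on one sample string. This is only a sketch: the string is made up, and it relies on the g modifier, which the chapter has not covered, purely to collect every match into a list.

```perl
use strict;
use warnings;

my $text = "Room 42b, floor_3";    # sample text (made up)

# In list context, a match with /g returns every match found in the string.
my @low_digits = $text =~ /[0-5]/g;          # digits 0 through 5
my @not_alnum  = $text =~ /[^a-zA-Z0-9]/g;   # anything that is not a letter or digit
my @digits     = $text =~ /\d/g;             # same as [0-9] for this text
my @non_word   = $text =~ /\W/g;             # "_" counts as a word character

print "low digits: @low_digits\n";
print "not alphanumeric: ", scalar(@not_alnum), " characters\n";
print "digits: @digits\n";
print "non-word: ", scalar(@non_word), " characters\n";
```

The two counts differ by one because the underscore is rejected by [^a-zA-Z0-9] but accepted as a word character by \w.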

Before we get into any more of the guts of Perl, let's apply what we've covered so far. We should also start to make note of the little differences between UNIX Perl and Perl for Windows NT, or WinPerl.

One big difference is that while most Perl scripts you will find begin with the line

#!/usr/local/bin/perl

or something similar, this is unnecessary with WinPerl. This line in UNIX lets the operating system know where to find the Perl interpreter. With Windows NT, you need to associate the .pl file extension with perl.exe for your script to function. Associating the .cgi extension is probably a good idea, too, since most of these files are also written in Perl.

The Grouping Pattern

There are several grouping patterns to understand: sequence, multipliers, parentheses, and alternation. By using grouping patterns you can give your script the ability to put conditions on your regular expression matching. For example, look for six of this, or look for two or more of these.

The Sequence Grouping Pattern

We're already familiar with this: It's where a regular expression matches a string exactly, like

/crypt/

where the regular expression looks for the same sequence of the characters: c r y p t.

The Multipliers Grouping Pattern

We already met one of these with the asterisk, which designates a match of "zero or more" of the previous character. The "+" symbol designates a match of one or more of the previous character, and the question mark, "?", designates a match of "zero or one" of the previous character. Each of these multipliers is greedy: when several matches are possible from the same starting point, it takes the longest one.

If you want to stipulate how many characters these grouping patterns are to match, you can use a general multiplier, whose format is

/a{2,4}/

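where a is the regular expression we are trying to match, and 2 and 4 give the range of repetitions that will satisfy the match. A run of two, three, or four a's ("aa," "aaa," or "aaaa") matches, but a single "a" does not. Note that the pattern is not anchored, so it will still find a match inside "aaaaa" by taking the first four a's; to reject longer runs, the pattern has to be tied to the surrounding text, for example with the anchors described later in this chapter.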

When the general multiplier has the second number absent, as with

/a{3,}/

it tells the match to look for three or more of the letter a. If the comma is absent, as with

/a{3}/

it tells the match to find exactly three a's. To look for three or fewer a's, a zero is used in the range field, like this:

/a{0,3}/

If you want to set a condition on the distance between two characters you might try

/a.{3}x/

which will make the regular expression look for the letter a, followed by exactly three non-newline characters, followed by the letter x.
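The general multiplier is easy to misread when the pattern is free to match anywhere in the string, so it is worth trying by hand. The sketch below uses made-up test strings and compares an unanchored pattern with one tied to the start (^) and end ($) of the string.

```perl
use strict;
use warnings;

my %loose;    # results for the unanchored pattern
my %exact;    # results for the anchored pattern

for my $s ('a', 'aa', 'aaaa', 'aaaaa') {
    $loose{$s} = $s =~ /a{2,4}/   ? 'match' : 'no match';
    $exact{$s} = $s =~ /^a{2,4}$/ ? 'match' : 'no match';
    print "$s: unanchored $loose{$s}, anchored $exact{$s}\n";
}

# a, then exactly three non-newline characters, then x
my $three_apart = 'a123x' =~ /a.{3}x/ ? 'match' : 'no match';
my $two_apart   = 'a12x'  =~ /a.{3}x/ ? 'match' : 'no match';
print "a123x: $three_apart\n";
print "a12x: $two_apart\n";
```

The unanchored pattern still matches aaaaa, because four of its five a's satisfy the range; only the anchored version turns it away.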

The Parentheses Grouping Pattern

You can use a pair of open and close parentheses to enclose any part of an expression match you need to have remembered. The part of the expression that is held by the parentheses is the part of the match that will be kept in memory.

To use this remembered match, you use a backslash followed by an integer, like this:

/moose(.)kiss\1/;

This regular expression will match any occurrence of the string "moose," followed by any one non-newline character, followed by the string "kiss," followed by that same character again. The regular expression remembers which non-newline character it matched after "moose" and looks for the same one after "kiss." For example,

mooseqkissq

is a match, but

mooseqkissw

is not. This differs from the regular expression

/moose.kiss./;

which will match any two non-newline characters in those positions, whether they are the same or not. The "1" after the backslash refers to what's in the parentheses. If there is more than one set of parentheses, the number after the backslash indicates which remembered match you want, counting the opening parentheses from left to right. An example might look like this:

/a(.)p(.)e\1s/;

The first character is "a," followed by the #1 non-newline character, followed by "p," followed by the #2 non-newline character, followed by "e," followed by whatever the #1 non-newline character was, followed by "s." This will match

aqpdeqs

where the different non-newline characters only have to match their designation, and not each other. To add the ability to match more than a single character with the referenced part, just add an asterisk to the expression, as

/a(.*)p\1e/;

This expression would match "a," followed by any number of non-newline characters, followed by "p," followed by that same series of non-newline characters and then "e." A match might be

aplanetpplanete

but not

aqqpqqqe
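The sample strings above can be checked against their patterns in a short script. This is a sketch only; it stores each pattern with the qr operator, which the chapter has not introduced, simply so the patterns can sit in a list next to the strings they are tested on.

```perl
use strict;
use warnings;

# Each entry pairs a test string with the pattern it is checked against.
my @checks = (
    [ 'mooseqkissq',     qr/moose(.)kiss\1/ ],
    [ 'mooseqkissw',     qr/moose(.)kiss\1/ ],
    [ 'aqpdeqs',         qr/a(.)p(.)e\1s/   ],
    [ 'aplanetpplanete', qr/a(.*)p\1e/      ],
    [ 'aqqpqqqe',        qr/a(.*)p\1e/      ],
);

my @results;
for my $check (@checks) {
    my ($string, $pattern) = @$check;
    my $verdict = $string =~ $pattern ? 'match' : 'no match';
    push @results, $verdict;
    print "$string: $verdict\n";
}
```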

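You can also use the memory grouping pattern to replace portions of a string. Inside the replacement, the remembered match is written $1 rather than \1. Statements like

$_ = "a peas p corn e squash";
s/ p (.*) e / b $1 c /;

leave $_ with the new string value of

a peas b corn c squash

where the stand-alone "p" and "e" were replaced with "b" and "c," but what was in between remains unchanged. The spaces in the pattern matter: without them, the match would begin at the "p" in "peas."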
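Substitutions that combine memory with .* are worth testing, because the match starts at the first possible place and then stretches as far as it can. The sketch below, using variable names of my own choosing, runs the substitution once without spaces around the letters and once with them.

```perl
use strict;
use warnings;

# Without spaces, the match starts at the "p" in "peas" and runs to the last "e".
my $greedy = "a peas p corn e squash";
$greedy =~ s/p(.*)e/b$1c/;
print "$greedy\n";    # prints "a beas p corn c squash"

# With spaces, only the stand-alone "p" and "e" can take part in the match.
my $spaced = "a peas p corn e squash";
$spaced =~ s/ p (.*) e / b $1 c /;
print "$spaced\n";    # prints "a peas b corn c squash"
```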

The Alternation Grouping Pattern

The general format for alternation is

a|p|e

where the regular expression is asked to match only one of the designated alternatives, "a," "p," or "e." You can apply alternation to more than one character, so

ape|gorilla|monkey

would be equally valid.
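As a small illustration, alternation can be wrapped in parentheses so that the script can report which alternative was found. The strings here are made up, and $1 holds whatever the parentheses remembered.

```perl
use strict;
use warnings;

my %found;
for my $s ('the ape house', 'a gorilla', 'a lemur') {
    if ($s =~ /(ape|gorilla|monkey)/) {
        $found{$s} = $1;           # the alternative that matched
        print "$s: found $1\n";
    }
    else {
        $found{$s} = '';
        print "$s: no match\n";
    }
}
```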

The Anchoring Pattern

To anchor a pattern there are four special notations available. You would want to anchor your regular expression search if you don't want to turn up every instance of a string. For example, when searching for the string "the," you don't want to also get "then," "there," "their," or "them." To do this you might use the word boundary anchor \b:

/the\b/;

so that only those words ending with "the" are matched. But this doesn't stop a string like "absinthe" from being matched, so you can add a word boundary anchor to the front of the regular expression

/\bthe\b/;

so that only the exact matches of "the" are returned.

If, on the other hand, you wanted to match only those instances where the string is followed by more letters of the same word, and not the word itself, you would use the \B anchor

/the\B/;

to return the matches "thee," "these," "there," and "then," but not "the" or "absinthe," where "the" sits at the end of the word.

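The next anchor, ^, matches the start of a string when it is in a place that makes sense to match (inside square brackets it negates the character class instead), as with

/^the/;

which matches only those strings which start with "the."

The final anchor, $, works in a similar way but on the end of a string, so

/the$/;

will match any occurrence of "the" which appears at the end of a string. Preceding either character with a backslash removes its special meaning, so \^ and \$ match a literal caret and a literal dollar sign.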
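The four anchors can be compared side by side by running each one against the same handful of words. In this sketch the word list is made up, the start and end anchors are written as a bare ^ and $, and the qr operator (not covered in this chapter) is used only to keep each pattern next to a printable label.

```perl
use strict;
use warnings;

# Each entry pairs a printable label with the compiled pattern.
my @patterns = (
    [ 'the\b',   qr/the\b/   ],
    [ '\bthe\b', qr/\bthe\b/ ],
    [ 'the\B',   qr/the\B/   ],
    [ '^the',    qr/^the/    ],
    [ 'the$',    qr/the$/    ],
);

my %hits;
for my $word ('the', 'then', 'absinthe', 'other') {
    # Keep the labels of every pattern that matches this word.
    my @labels = map { $_->[0] } grep { $word =~ $_->[1] } @patterns;
    $hits{$word} = "@labels";
    print "$word: @labels\n";
}
```

Each output line lists the patterns that matched that word, which makes it easy to see that only "the" itself satisfies /\bthe\b/.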

Pattern Precedence

As with operators, both grouping and anchoring patterns have an order of precedence to follow. Table 3.4 gives you a quick rundown.

Table 3.4 Pattern Precedence from Highest to Lowest

Name	Representation
Parentheses	()
Multipliers	+ * ? {m,n}
Sequence and Anchoring	ape \b \B ^ $
Alternation	|

Remember, if you use parentheses to clarify a regular expression because they have the highest precedence, you will also be employing their memory of the matched string. These examples should explain the differences in matches caused by the use of parentheses.

ape*

will match ap, ape, apee, apeee, etc.

whereas

(ape)*

will match "", ape, apeape, apeapeape, etc.

and

^a|b

will match "a" at the start of the line, or "b" anywhere in the line. Yet

^(a|b)

will match either "a" or "b" at the start of the line, and

a|pe|s

matches "a" or "pe" or "s." If you apply parentheses

(a|pe)(pe|s)

you'll match ape, as, pepe, and pes. These parentheses can be used to find related words like

(soft|hard)wood

where either instances of softwood or hardwood are returned as matches.
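The effect of precedence is easiest to see when the same string is tried with and without parentheses. This sketch uses made-up strings, anchors the first three patterns with ^ and $ so that the whole string has to fit, and again leans on qr only to pair each pattern with a label.

```perl
use strict;
use warnings;

# Each entry: test string, printable label, compiled pattern.
my @checks = (
    [ 'apee',     '^ape*$',          qr/^ape*$/          ],
    [ 'apeape',   '^ape*$',          qr/^ape*$/          ],
    [ 'apeape',   '^(ape)*$',        qr/^(ape)*$/        ],
    [ 'cab',      '^a|b',            qr/^a|b/            ],
    [ 'cab',      '^(a|b)',          qr/^(a|b)/          ],
    [ 'hardwood', '(soft|hard)wood', qr/(soft|hard)wood/ ],
);

my @results;
for my $check (@checks) {
    my ($string, $label, $pattern) = @$check;
    my $verdict = $string =~ $pattern ? 'match' : 'no match';
    push @results, $verdict;
    print "$string =~ /$label/: $verdict\n";
}
```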

A possible use for the matching operators might be a script that checks a user's reply to decide how to respond. You can use the "=~" operator to do this. If you remember, this operator makes the pattern match against the expression on its left instead of against $_. Say you have already filled $_ with a value you need later in the script. Then you could use =~ to match against something else without disturbing $_. The =~ operator acts like this:

print "Will you be needing anything else? ";
if (<STDIN> =~ /^[Yy]/) {    # true if the reply begins with 'Y' or 'y'
    print "And what would that be? ";
    <STDIN>;                 # read the reply and throw it away
    print "I'm sorry, that's just not possible.\n";
}

where no matter what the user inputs, the response will be the same.

Other Matching Operator Tidbits

There are some other ways to modify your regular expressions. Perl uses the "i" modifier, placed after the closing slash, to tell a regular expression to ignore case in matching. Using the format

/string_characters/i

you could amend a line from our last example script from this:

if (<STDIN> =~ /^[Yy]/)

to this:

if (<STDIN> =~ /^y/i)

so that the case of the response no longer affects whether it matches.
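To confirm that the two forms behave the same way, both can be run against a few canned replies instead of real keyboard input. The replies below are made up, and each ends in a newline the way a line read from STDIN would.

```perl
use strict;
use warnings;

my (%by_class, %by_flag);
for my $reply ("yes\n", "Yup\n", "no\n") {
    $by_class{$reply} = $reply =~ /^[Yy]/ ? 1 : 0;   # character class version
    $by_flag{$reply}  = $reply =~ /^y/i   ? 1 : 0;   # ignore-case version
    chomp(my $shown = $reply);                       # drop the newline for printing
    print "$shown: class $by_class{$reply}, modifier $by_flag{$reply}\n";
}
```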

If you need to use a regular expression to search through filepaths, you will need to include slashes in the expression. Because the slash is also the delimiter, each slash has to be preceded by a backslash so that it is treated as an ordinary character in the pattern:

/^\/usr\/bin\/perl/

(and the regular expression starts to look like a divoted golf course!)
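Since any delimiter can be used, one way around the backslashes is to pick a delimiter that does not appear in the path. When the delimiter is not a slash, the match has to be introduced with an explicit m. The path in this sketch is just a sample value.

```perl
use strict;
use warnings;

my $path = '/usr/bin/perl -w';    # sample value (made up)

# Slash delimiters: every slash in the pattern needs a backslash.
my $escaped   = $path =~ /^\/usr\/bin\/perl/ ? 1 : 0;

# Hash-mark delimiters with a leading m: the slashes can stay as they are.
my $delimited = $path =~ m#^/usr/bin/perl#   ? 1 : 0;

print "escaped slashes: $escaped\n";
print "alternate delimiter: $delimited\n";
```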

Chapter in Review

In this chapter we started out discussing various Perl control structures, like the statement block used to define a specific script action, the if/unless statement, and the different kinds of loops, like the for/foreach loop. Loops can be used to have Perl repeat an action as many times as necessary for the script's operation.

We also covered associative arrays, demonstrating how they differ from arrays by having not just a single value in each element, but a key/value pair. Associative arrays are worked with through different operators, like the keys, values, each, and delete operators.

The chapter finished by defining regular expressions as a pattern-matching tool used by Perl. Now that you have a general understanding of what regular expressions are, how they are defined between two slashes, and how they match those patterns against the data your script specifies, you can start solving some more interesting tasks with Perl. In the next chapter, we'll marry this guestbook script to a CGI output for the user and look at how Perl interacts with HTML.