It seems like every time you turn around, you run into some code that uses environment variables. Environment variables are certainly integral to making your CGI program work. In this chapter, you will learn all about CGI environment variables and become familiar with the types of environment variables on your server. In addition, you will learn about two programs that let you see the environment variables with which your CGI program is working.
In particular, you will learn about these topics:
How does my program figure out how much data to read? Can I tell what type of browser is calling my CGI program? How can I get the name of the person who called my Web page? What do all these environment variables mean? What are environment variables? STOP!
That one is a good place to start.
You're familiar with variables by now; they are the placeholders for data that can change and data that you want to reference again elsewhere in your program. Well, that's what environment variables are, with one extra feature. That extra feature has to do with a term called scope.
When you set a variable in your CGI program, only your CGI program knows about that variable. In fact, by using the local function in Perl, you can limit the "knowledge" of a variable to the block of code in which you are executing. Just add the local(variable list); command between any enclosing curly braces ({}), and you get variables whose values only the code in those enclosing braces sees. Any code outside the block of code or curly braces sees the value the variable had before the block started. (Strictly speaking, local gives a global variable a temporary value that lasts until the block ends, so subroutines called from inside the block see the temporary value too; in Perl 5, my declares a variable that is truly private to the block.)
If you take the program fragment in Listing 6.1 as an example, the print statement on line 4 prints
Mozilla/1.1N (Windows; I; 16bit)
and the print statement on line 6 prints testing scope. The rules of block scope can be summed up as follows: whatever is defined with the local command is limited in scope to the enclosing code block.
Listing 6.1. A program fragment illustrating block scope.
    1: $browser = "testing scope";
    2: {
    3:     local($browser) = $ENV{'HTTP_USER_AGENT'};
    4:     print "$browser \n" ;
    5: }
    6: print "$browser \n" ;
Why would you want to do this? Well, the most common application is for subroutine parameter passing. By assigning the incoming parameter list to a local variable list, you change from a call by reference to a call by value paradigm. This means that your CGI code can modify the input parameters and not affect the code that called your subroutine. The best advice I can give you is to use local variables, especially in subroutines. You'll find that you save a lot of debugging time as you develop your CGI programs.
Let's get back to environment variables. Remember that the difference we're talking about is between program variables and environment variables, and the scope of those environment variables. The scope of environment variables is the process in which your program executes.
This means that environment variables start out the same for every process started within the same executing shell. Did I lose you with that sentence? I'll try to restate it; I'm trying to avoid the use of the word environment to describe environment variables. Every process or program you start has an environment of data with which it begins. Part of the data the program starts with is the environment variable data. Every process or program you start from the same shell receives its own copy of that shell's environment variables.
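If you want to watch a process hand its environment variables to a program it starts, here is a minimal sketch. The variable name CGIBOOK_DEMO is made up for this illustration, and the sketch assumes a Perl 5 interpreter that can start a second copy of itself.

```perl
use strict;
use warnings;

# CGIBOOK_DEMO is a made-up variable name used only for this illustration.
$ENV{'CGIBOOK_DEMO'} = 'set by the parent';

# $^X is the path of the running perl. The child is started without a shell
# and simply prints the variable it inherited.
open(my $child, '-|', $^X, '-e', 'print $ENV{"CGIBOOK_DEMO"}')
    or die "Can't start child perl: $!";
my $seen_by_child = <$child>;
close($child);
print "Child process saw: $seen_by_child\n";

# The child works on its own copy, so a change it makes never reaches the parent.
system($^X, '-e', '$ENV{"CGIBOOK_DEMO"} = "changed by the child"');
print "Parent still has: $ENV{'CGIBOOK_DEMO'}\n";
```

The second half is the point to remember: environment variables flow from parent to child when the child starts, never back again.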
So enough with explanations. Let's talk some details. If I type env at the UNIX command line, what do I get? The simple answer is that I get the environment variables available to my program when executing from the command line. But first, you might be asking, "Why do I care about what type of environment variables are available from the command line?" You care because you should be testing your CGI programs by first executing them at the command line. This at least gets rid of all the syntax errors.
When you run your CGI program from the command line, however, not all the environment variables your program may need are available. So this is only the beginning of testing your program. In addition to being aware of what is available to your program at the command line, you need to understand what the differences are between command-line environment variables and when someone calls your CGI program from a Web page.
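One way to narrow that gap while testing is to have your program supply stand-in values for the CGI variables it needs when they are missing. The following is only a sketch; the variable names are standard CGI names, but the values are invented for testing.

```perl
use strict;
use warnings;

# Stand-in values for command-line testing; the values are invented.
my %test_defaults = (
    'REQUEST_METHOD' => 'GET',
    'QUERY_STRING'   => 'first=Eric&last=Herrmann',
    'SERVER_NAME'    => 'localhost',
);

# Fill in only what is missing, so values set by a real server are never overwritten.
foreach my $name (sort keys %test_defaults) {
    $ENV{$name} = $test_defaults{$name} unless defined $ENV{$name};
}

print "REQUEST_METHOD is $ENV{'REQUEST_METHOD'}\n";
print "QUERY_STRING is $ENV{'QUERY_STRING'}\n";
```

Run from the command line, the program sees the stand-in values; run by the Web server, it sees whatever the server set.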
Listing 6.2 shows the environment variables available to my CGI programs from the command line. Probably the most important variable that is different between the command line variables and the CGI environment variables is the Path variable.
Listing 6.2. The environment variables from a user logon.
    TERM=vt102
    HOME=/usr/u/y/yawp
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11/bin:/usr/andrew/bin:/usr/openwin/bin:/usr/games:.
    SHELL=/bin/tcsh
    MAIL=/var/spool/mail/yawp
    LOGNAME=yawp
    SHLVL=1
    PWD=/usr/u/y/yawp
    USER=yawp
    HOST=langley
    HOSTTYPE=i386-linux
    OPENWINHOME=/usr/openwin
    MANPATH=/usr/local/man:/usr/man/preformat:/usr/man:/usr/X11/man:/usr/openwin/man
    MINICOM=-c on
    HOSTNAME=langley.io.com
    LESSOPEN=|lesspipe.sh %s
    LS_COLORS=:
    LS_OPTIONS=-8bit -color=tty -F -T 0
    WWW_HOME=lynx_bookmarks.html
You can find the Path environment variable in Listings 6.2 and 6.3, as well as Figures 6.1 through 6.3 (and it's different for each figure). This is very important to you! The Path environment variable defines how your CGI program finds other programs on your server. When you run a system command or another executable program from within your CGI program, the system searches the directories named in the Path environment variable to find it. (Files that your Perl program pulls in with require are found through Perl's own search list, the @INC array, rather than through Path.) The Path environment variable tells the system how and where to look for programs outside your CGI program.
Figure 6.1 : The CGI environment variables as printed by the Print Environment Variables program.
Figure 6.2 : The CGI environment variables as printed by the Print Environment Variables program.
Figure 6.3 : The CGI environment variables as printed by the Print Environment Variables program.
Let's use the Path environment variable in Listing 6.2 as an example. When you execute a program from the command line, UNIX looks at the Path environment variable. This variable tells UNIX in which directories to look for executable programs. UNIX reads the Path environment variable from left to right, so it starts looking in the first directory in the path defined in Listing 6.2. The first directory is /usr/local/bin. If UNIX can't find what it is looking for there, it looks in the next directory, /bin. Each new directory is separated by the colon (:) symbol. Let's skip everything in the middle and move to the last directory. You might have missed this one, and it's one of the most important. The period (.) at the end of the Path environment variable line is not a grammatical end of sentence; it is a directory name that means something special to the UNIX system. The period, in this context, tells UNIX to look in the current directory. For a CGI program, the current directory usually is the directory in which your CGI program resides.
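The left-to-right search just described can be sketched in a few lines of Perl. The subroutine name find_in_path is my own invention for this illustration; it mimics the search and is not how the operating system itself is written.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first place in a colon-separated search
# path where an executable file called $program is found.
sub find_in_path {
    my ($program, $path) = @_;
    foreach my $entry (split(/:/, $path)) {
        # An empty entry, like a period, means the current directory.
        my $dir = $entry eq '' ? '.' : $entry;
        my $candidate = "$dir/$program";
        return $candidate if -f $candidate && -x $candidate;
    }
    return;    # not found in any directory
}

my $found = find_in_path('env', $ENV{'PATH'} || '');
print defined $found ? "env found at $found\n" : "env is not on the search path\n";
```

Because the loop returns at the first match, a program with the same name in an earlier directory always wins over one in a later directory.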
It's not always desirable to look in the current directory last. If the server begins its search elsewhere first, it might find a program that has the same name as yours and run it instead of your CGI program. Also, it's slower. If the program you want to run is in the current directory and the server has to search through every directory in the Path environment variable before it finds it in the current directory, that's time wasted! Take a look at the Server Side Include Path environment variable in Listing 6.3. Suppose that you're executing a CGI program that uses another CGI program that's in the same directory. The server has to search through every directory until it finds the current directory (.). That's 33 searches before it finds the correct path. Remember that the Path environment variable is used by your operating system to find the programs and data your CGI programs need to execute.
Getting the environment variables on your server is not very difficult. The SSI environment variables in Listing 6.3 are from a single SSI command:
<!--#exec cmd="env" -->
You would think that running an SSI would be the same as running a command from the command line. Obviously, it's not! This is a clear example where you can see the difference between running your command from the command line and running it from within your CGI program.
Listing 6.3. The environment variables from an SSI.
    DOCUMENT_NAME=env.shtml
    SCRIPT_FILENAME=/usr/local/business/http/accn.com/cgibook/chap6/env.shtml
    SERVER_NAME=www.accn.com
    DOCUMENT_URI=/cgibook/chap6/env.shtml
    REMOTE_ADDR=199.170.89.42
    TERM=dumb
    HTTP_COOKIE=s=dialup-3240811768697386
    HOSTTYPE=i386
    PATH=/home/c/cloos/bin:/usr/local/gnu/bin:/usr/local/staff/bin:/usr/local/X11R5/bin:/usr/X11/bin:/etc:/sbin:/usr/sbin:/usr/local/bin:/usr/contrib/bin:/usr/games:/usr/ingres/bin:/usr/ucb:/home/c/cloos/bin:/usr/local/gnu/bin:/usr/local/staff/bin:/usr/local/X11R5/bin:/usr/X11/bin:/etc:/sbin:/usr/sbin:/usr/local/bin:/usr/contrib/bin:/usr/games:/usr/ingres/bin:/usr/ucb:/usr/local/bin:/bin:/usr/bin:/usr/X11/bin:/usr/andrew/bin:/usr/openwin/bin:/usr/games:.:/sbin:/usr/sbin:/usr/local/sbin:/usr/X11/bin:/usr/andrew/bin:/usr/openwin/bin:/usr/games:.
    SHELL=/bin/tcsh
    SERVER_SOFTWARE=Apache/0.8.13
    DATE_GMT=Friday, 22-Sep-95 13:56:58 CST
    REMOTE_HOST=dialup-4.austin.io.com
    LAST_MODIFIED=Friday, 22-Sep-95 08:55:11 CDT
    SERVER_PORT=80
    DATE_LOCAL=Friday, 22-Sep-95 08:56:58 CDT
    DOCUMENT_ROOT=/usr/local/business/http/accn.com
    OSTYPE=Linux
    HTTP_USER_AGENT=Mozilla/1.1N (Windows; I; 16bit)
    HTTP_ACCEPT=*/*, image/gif, image/x-xbitmap, image/jpeg
    DOCUMENT_PATH_INFO=
    SHLVL=1
    SERVER_ADMIN=webmaster@accn.com
    _=/usr/bin/env
The next question you should be asking is, "Are the SSI environment variables different from the environment variables available to my CGI program?" Figures 6.1 through 6.3 show listings of the environment variables available when I run a CGI program on my server. Listing 6.4 shows the CGI program for printing these environment variables.
Listing 6.4. A CGI program for printing environment variables.
    01: #!/usr/local/bin/perl
    02: push(@INC, "/cgi-bin");
    03: require("cgi-lib.pl");
    04:
    05: print &PrintHeader;
    06:
    07: print "<html>\n";
    08: print "<head> <title> Environment Variables </title> </head>\n";
    09: print "<body>\n";
    10:
    11: print <<"EOF";
    12: <center>
    13: <table border=2 cellpadding=10 cellspacing=10>
    14: <th align=left><h3>Environment Variable</h3>
    15: <th align=left> <h3>Contents </h3><tr>
    16: EOF
    17: foreach $var (sort keys(%ENV))
    18: {
    19:     print "<td> $var <td> $ENV{$var}<tr>";
    20: }
    21: print <<"EOF"
    22: </table>
    23: </body>
    24: </html>
    25: EOF
This CGI program is a simple little script that you now should be comfortable reading and understanding. It has a few functions in it that I haven't talked about yet. Because both these functions are useful for lots of other purposes, I'll use this program to introduce them to you. The Print Environment Variables CGI program uses the Perl sort function and the Perl keys function (I mentioned the keys function in previous chapters). Both these functions are handy tools to have available in your programming toolbox. The keys function enables you to determine how your associative array is indexed, and the sort function puts the array of indexes returned from keys into alphabetical order.
As you can see, the environment variables available to your CGI program are even different from the environment variables available to your SSI programs.
Why is there such a difference? As I said earlier, environment variables are based on the process from which your program executes. The command line, SSIs, and CGI program all have different process environments. The command-line environment is based on your initial logon environment. From the command line, you get a custom environment that you can customize through startup scripts.
Because it is started by your Web server, the SSI environment starts with the environment available to a CGI program. When it executes a UNIX command like "env", however, it also gets the environment available at the command line. This happens because the SSI command must open a command-line process in order to run. So it gets the existing CGI environment variables plus the new environment variables available when it opened the command-line process.
Your CGI program gets its environment from your Web server, in this case Apache/0.8.13.
Because each method of printing these environment variables starts with a different executing environment, the environment variables available to each are different.
The keys function is solely for use with Perl's associative arrays. Remember that associative arrays are indexed by strings. This can make programming painful when you are trying to get data out and you are not sure what's in the array. This is clearly the case with the ENV array. You really don't know what's in it. For one thing, the same environment variables are not always available to your CGI program. I'll talk about that in more detail later in this chapter. Of course, Perl makes things easy rather than hard. So there must be a simple way to get the data out of an associative array, even if you don't know what the indexes are.
Anyway, the keys function returns an array or a list (arrays and lists are the same thing as far as Perl is concerned) of the indexes to an associative array. The order of the returned indexes looks arbitrary; it depends on how Perl stores the associative array internally, not on the order in which you added the entries. You can control the order in which your program sees the returned values by using the sort function, however.
The Perl sort function sorts an input array. This means that the array returned from keys is passed to sort. Sort does not modify the array it is given; it returns a new array sorted in standard string order, from a to z (with uppercase letters ahead of lowercase letters). You can invert the sort order, from z to a, by using the reverse function.
The Print Environment Variables program uses the keys and sort functions on line 17 of Listing 6.4. The keys function is passed the associative %ENV array. It returns a list of all the indexes or keys to the %ENV array. The sort function then sorts the list in alphabetical order.
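Here is a small sketch of keys, sort, and reverse working together. The associative array %sample is a stand-in for %ENV, so that the result does not depend on your server; its values are borrowed from the listings in this chapter.

```perl
use strict;
use warnings;

# A small associative array standing in for %ENV.
my %sample = (
    'SERVER_PORT'     => '80',
    'REMOTE_ADDR'     => '199.170.89.42',
    'HTTP_USER_AGENT' => 'Mozilla/1.1N (Windows; I; 16bit)',
);

my @ascending  = sort keys %sample;            # a to z
my @descending = reverse sort keys %sample;    # z to a

print "Ascending:  @ascending\n";
print "Descending: @descending\n";

# The same loop shape as line 17 of Listing 6.4, applied to the stand-in array.
foreach my $name (@ascending) {
    print "$name = $sample{$name}\n";
}
```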
So far, you've seen how to send environment variables back to you through your Web browser, but what if you want to save those variables on your local computer? You could just use the File Save As function on your browser, of course, but that doesn't format the data in a very usable manner. The other option is to save the data to a local file on your server. That might present a couple of problems for you, though. First, you might not have the privileges you need to write a file to your server. I hope this isn't the case, and I suggest changing servers when you can if you encounter this situation. Not all Server Administrators are as helpful as mine, though.
Second, and more likely, you don't want to have to deal with reading the file on a UNIX system. Heck, you probably would have to Telnet in and then use some arcane editor like emacs or vi.
Instead of this headache, you can use the program in Listing 6.5 to mail your environment variables back to your user account. This program was written by Matthew D. Healy and is available at this URI:
http://paella.med.yale.edu/~healy/perltest
This example has lots of useful potential for you. First, it shows you how to use the mail program. I go into detail on mailers in Chapter 11, "Using Internet Mail with Your Web Page," but this is a nice introduction. Second, this program shows you your environment variables URI encoded and decoded. This makes a great reference for the future. Third, you obviously can adapt this program to other purposes.
As you go through this program, you will learn about Perl subroutines and how they receive and return variables, call-by-reference and call-by-value parameter passing, the Perl special variables $_ and @_, and the pipe symbol (|).
Listing 6.5. A CGI program for mailing environment variables.
    #!/usr/local/bin/perl

    #perltest.p
    #for testing cgi-bin interface
    # Put this in your cgi-bin directory, changing the e-mail address below...

    #sub to remove cgi-encoding
    sub unescape {
        local ($_)=@_;
        tr/+/ /;
        s/%(..)/pack("c",hex($1))/ge;
        $_;
    }

    # -------------------------------------------------------------------------
    # The escape and unescape functions are taken from the wwwurl.pl package
    # developed by Roy Fielding <fielding@ics.uci.edu> as part of the Arcadia
    # project at the University of California, Irvine. It is distributed
    # under the Artistic License (included with your Perl distribution
    # files).
    # -------------------------------------------------------------------------

    #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    #.PURPOSE Encodes a string so it doesn't cause problems in URL.
    #
    #.REMARKS
    #
    #.RETURNS The encoded string
    #--------------------------------------------------------------------------

    sub cgi_encode
    {
        local ($str) = @_;
        $str = &escape($str,'[\x00-\x20"#%/+;<>?\x7F-\xFF]');
        $str =~ s/ /+/g;
        return( $str );
    }

    # =========================================================================
    # escape(): Return the passed string after replacing all characters
    # matching the passed pattern with their %XX hex escape chars.
    # Note that the caller must be sure not to escape reserved URL
    # characters (e.g. / in path names, ':' between address and port,
    # etc.) and thus this routine can only be applied to each URL part separately. E.g.
    #
    # $escname = &escape($name,'[\x00-\x20"#%/;<>?\x7F-\xFF]');
    #
    sub escape
    {
        local($str, $pat) = @_;

        $str =~ s/($pat)/sprintf("%%%02lx",unpack('C',$1))/ge;
        return($str);
    }

    #now the main program begins

    #testing environment variables passed via URL...
    print "Content-type: text/plain","\n";
    print "\n";

    # The backslash in name\@foo.edu keeps Perl 5 from reading @foo as an array.
    open (MAIL,"| mail name\@foo.edu") ||
        die "Error: Can't start mail program - Please report this error to name\@foo.edu";


    print MAIL "Matt's New cgi-test script report","\n";
    print MAIL "\n";
    print MAIL "\n";
    print MAIL "Environment variables" ,"\n";
    print MAIL "\n";

    foreach(sort keys %ENV) #list all environment variables
    {
        $MyEnvName=$_;
        $MyEnvValue=$ENV{$MyEnvName};
        $URLed = &cgi_encode($MyEnvValue);
        $UnURLed = &unescape($MyEnvValue);
        print MAIL $MyEnvName,"\n";
        print MAIL "Value: ",$MyEnvValue,"\n";
        print MAIL "URLed: ",$URLed,"\n";
        print MAIL "UnURLed: ",$UnURLed,"\n";
        print MAIL "\n";
    }

    if ($ENV{'REQUEST_METHOD'} eq "POST")
    {#POST data

        print MAIL "POST data \n";

        for ($i = 0; $i < $ENV{'CONTENT_LENGTH'}; $i++)
        {
            $MyBuffer .= getc;
        }

        print MAIL "Original data: \n";
        print MAIL $MyBuffer,"\n";
        print MAIL "unURLed: \n";
        print MAIL &unescape($MyBuffer), "\n\n";

        @MyBuffer = split(/&/,$MyBuffer);

        foreach $i (0 .. $#MyBuffer)
        {
            print MAIL $MyBuffer[$i],"\n";
            print MAIL "FName:",&unescape($MyBuffer[$i]),"\n";
        }
    }


    close ( MAIL );

    print "\n";
    print "Thanks for filling out this form !\n";
    print "It has been sent to name\@foo.edu\n<p>\n";
The program in Listing 6.5 is nicely segmented into several smaller subroutines. Subroutines break your logic up into smaller reusable pieces. You've seen this with the ReadParse function. It is a good habit to get into, and I highly recommend it.
This program has all its subroutines defined first, followed by the main program statements. The convention of declaring subroutines first comes from using compilers that require you to declare and/or define subroutines before you use them. You do not have to do this in Perl.
I prefer to define all my subroutines last. That way, the main program logic is always at the top of the file and easy to find. Anyway, if you use Perl, a subroutine can be defined anywhere in your CGI program. Perl treats the subroutine definition as a non-executable statement and just doesn't care where it finds it in your program.
When your program is compiled into memory, Perl builds a cross-reference table so that it can find all the subroutines you have defined. You therefore can call your subroutines regardless of where you define them.
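A tiny sketch shows this in action. The subroutine greeting is made up for the illustration; notice that the main program calls it several lines before its definition appears.

```perl
use strict;
use warnings;

# Main program logic first; greeting() is a made-up subroutine defined below.
my $message = greeting('CGI programmer');
print "$message\n";

# Perl has already found this definition by the time the call above runs.
sub greeting {
    my ($who) = @_;
    return "Hello, $who";
}
```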
All the parameters passed to your subroutine are in the special Perl variable @_. This array actually references the locations of the passed-in variables. So, if you change something in the @_ array, you are changing the contents of the passed-in parameters. This type of parameter passing is called pass by reference because any use of the variables in your subroutine actually references and modifies the passed parameters.
Usually, it is considered a smart idea to use another form of parameter passing: pass by value.
With this form of parameter passing, all the modifications to your subroutine's parameters are local to the subroutine. This means that the parameters have a scope local to the subroutine.
A convention has developed with Perl that simulates pass by value. If you use the local function, you create variables in which the scope is local to the subroutine. You often will see the first line of a subroutine as the local call. Then the subroutine operates on the variables defined in the local command. Each of the subroutines in this mail program contains a local command.
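To see the difference between the two forms of parameter passing, compare the two subroutines in this sketch. Both names, trim_in_place and trim_copy, are invented for the illustration, and use strict is left off so that local can be used just as it is in this chapter's listings.

```perl
use warnings;
# 'use strict' is left off so that local() can be used as in the chapter's listings.

# Works directly on @_, so the caller's variable is changed (pass by reference).
sub trim_in_place {
    $_[0] =~ s/^\s+|\s+$//g;
}

# Copies the parameter into a local variable first, so the caller's
# variable is left alone (simulated pass by value).
sub trim_copy {
    local($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

$first = '  Eric  ';
trim_in_place($first);
print "After trim_in_place: [$first]\n";

$second = '  Herrmann  ';
$result = trim_copy($second);
print "After trim_copy: original [$second], result [$result]\n";
```

The first call changes $first itself; the second call leaves $second untouched and hands back a cleaned-up copy.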
Finally, Perl subroutines act differently than most other languages in one important way. The result of the last line evaluated in the subroutine is returned automatically to the calling routine.
As you can see, the last line of the subroutine unescape, repeated in Listing 6.6, takes advantage of this by having Perl evaluate the $_ variable. The side effect of this is that the local copy of $_ is returned to the calling subroutine. If you want to explicitly state the return value, you can do so by using a return statement.
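The two return styles can be compared side by side in a short sketch. Both subroutine names are made up for this illustration.

```perl
use strict;
use warnings;

# The value of the last expression evaluated becomes the return value.
sub implicit_double {
    my ($number) = @_;
    $number * 2;
}

# The same result, stated explicitly with a return statement.
sub explicit_double {
    my ($number) = @_;
    return $number * 2;
}

my $implicit = implicit_double(21);
my $explicit = explicit_double(21);
print "implicit: $implicit, explicit: $explicit\n";
```

Both calls hand back the same value; the explicit form just makes your intent easier to spot when you reread the code later.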
Listing 6.6. The subroutine unescape.
    1: #sub to remove cgi-encoding
    2: sub unescape {
    3:     local ($_)=@_;
    4:     tr/+/ /;
    5:     s/%(..)/pack("c",hex($1))/ge;
    6:     $_;
    7: }
Okay, let's take a closer look at the subroutines in this program. The subroutine unescape decodes its URL-encoded input parameter, much like ReadParse. The tr function is a built-in function and works much like the built-in s function. The tr stands for translate, and s stands for substitute.
The tr function translates all occurrences of the characters found in the search pattern to those found in the replacement list. So, in this case, it replaces every plus sign (+) with a space.
Substitute can do exactly the same job, but in its own way: s/\+/ /g has the same effect here as tr/+/ /. I discussed substitute earlier, and I don't think it deserves a rehash here.
Perl has lots of different functions in it. Some of your choices are based on familiarity. In this case, using tr in unescape or s in ReadParse is not significantly different.
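If you want to convince yourself that the two functions give the same result here, this short sketch runs both on the same made-up string.

```perl
use strict;
use warnings;

my $with_tr = 'Dripping+Springs+Texas';
my $with_s  = $with_tr;

# tr returns the number of characters it translated.
my $count = ($with_tr =~ tr/+/ /);

# The plus sign must be escaped in s because it is special in a pattern.
$with_s =~ s/\+/ /g;

print "tr changed $count characters: $with_tr\n";
print "s produced: $with_s\n";
```

One small difference is worth noting: tr works on a plain list of characters, so the plus sign needs no backslash there, while s treats its first part as a pattern.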
Line 5 of Listing 6.6,
s/%(..)/pack("c",hex($1))/ge;
is the same as in ReadParse. The difference you might notice about this function is the use of the $_ variable. A lot of people find using the $_ variable confusing, at least initially. In case you are confused about what these functions are modifying, it is the $_ variable. This variable is the underlying variable or default for lots of Perl functions.
On line 3, this code makes its own local copy of the globally scoped $_ variable, filled in from the input array @_, and then returns the local copy on the last line.
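One final note about subroutines: If you call a subroutine with the ampersand form and no parentheses (&name;), the subroutine sees the @_ array of the code that called it. If you call it with an empty parameter list (&name();), the @_ array is empty.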
Now let's take a brief look at the cgi_encode subroutine, repeated in Listing 6.7 for convenience. It passes that strange-looking parameter with all the xs and pound signs (#) in it. What is it doing? Well, it's telling the escape routine to look for all the characters whose hexadecimal codes fall between 00 and 20 or between 7F and FF. These characters are outside the boundaries of normal, printable ASCII characters. It also says to look for special characters like percent signs (%), double quotation marks ("), question marks (?), and so on.
Listing 6.7. The subroutine cgi_encode.
    sub cgi_encode
    {
        local ($str) = @_;
        $str = &escape($str,'[\x00-\x20"#%/+;<>?\x7F-\xFF]');
        # escape already turned every space into %20, so this finds nothing to change
        $str =~ s/ /+/g;
        return( $str );
    }
The escape routine does the opposite of the unescape routine. It just converts all these special characters to their hexadecimal number equivalents. It does this using the substitute function and the unpack function. Unpack just works like a reverse pack function. (The pack function was covered in Chapter 5, "Decoding Data Sent to Your CGI Program.")
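A round trip makes the relationship between the two routines easy to see. The sketch below is a simplified variant written for this illustration, not the code from Listing 6.5: it uses ord and chr in place of unpack and pack, and it decodes only valid two-digit hexadecimal escapes.

```perl
use strict;
use warnings;

# Replace every character matching $pattern with its %XX hexadecimal escape.
sub escape {
    my ($str, $pattern) = @_;
    $str =~ s/($pattern)/sprintf("%%%02x", ord($1))/ge;
    return $str;
}

# Turn plus signs back into spaces and %XX escapes back into characters.
sub unescape {
    my ($str) = @_;
    $str =~ tr/+/ /;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# The same character class that cgi_encode passes to escape.
my $pattern  = '[\x00-\x20"#%/+;<>?\x7F-\xFF]';
my $original = 'Mozilla/1.1N (Windows; I; 16bit)';

my $encoded = escape($original, $pattern);
my $decoded = unescape($encoded);

print "Original: $original\n";
print "Encoded:  $encoded\n";
print "Decoded:  $decoded\n";
```

Decoding the encoded string gives back exactly what you started with, which is a handy check whenever you change either routine.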
Now that you understand all the subroutines, the main program is a snap. I have repeated the main program in Listing 6.8 so that you don't have to switch back and forth between pages. This means that most of the program was duplicated, but I personally like seeing the entire program in a book. That way, when I look at the program, I can see how everything fits together.
Listing 6.8. The main program for mailing environment variables.
    01: #now the main program begins
    02: #testing environment variables passed via URL...
    03: print "Content-type: text/plain","\n";
    04: print "\n";
    05:
    06: open (MAIL,"| mail name\@foo.edu") ||
    07:     die "Error: Can't start mail program - Please report this error to name\@foo.edu";
    08:
    09: print MAIL "Matt's New cgi-test script report","\n";
    10: print MAIL "\n";
    11: print MAIL "\n";
    12: print MAIL "Environment variables" ,"\n";
    13: print MAIL "\n";
    14:
    15: foreach(sort keys %ENV) #list all environment variables
    16: {
    17:     $MyEnvName=$_;
    18:     $MyEnvValue=$ENV{$MyEnvName};
    19:     $URLed = &cgi_encode($MyEnvValue);
    20:     $UnURLed = &unescape($MyEnvValue);
    21:     print MAIL $MyEnvName,"\n";
    22:     print MAIL "Value: ",$MyEnvValue,"\n";
    23:     print MAIL "URLed: ",$URLed,"\n";
    24:     print MAIL "UnURLed: ",$UnURLed,"\n";
    25:     print MAIL "\n";
    26: }
    27:
    28: if ($ENV{'REQUEST_METHOD'} eq "POST")
    29: {#POST data
    30:     print MAIL "POST data \n";
    31:     for ($i = 0; $i < $ENV{'CONTENT_LENGTH'}; $i++)
    32:     {
    33:         $MyBuffer .= getc;
    34:     }
    35:
    36:     print MAIL "Original data: \n";
    37:     print MAIL $MyBuffer,"\n";
    38:     print MAIL "unURLed: \n";
    39:     print MAIL &unescape($MyBuffer), "\n\n";
    40:     @MyBuffer = split(/&/,$MyBuffer);
    41:     foreach $i (0 .. $#MyBuffer)
    42:     {
    43:         print MAIL $MyBuffer[$i],"\n";
    44:         print MAIL "FName:",&unescape($MyBuffer[$i]),"\n";
    45:     }
    46: }
    47:
    48: close ( MAIL );
    49: print "\n";
    50: print "Thanks for filling out this form !\n";
    51: print "It has been sent to name\@foo.edu\n<p>\n";
Don't forget that the first line of code executed by Perl for the entire program begins after the comment about testing environment variables. Printing the content type with two newlines is the first code output by the program.
The rest seems kind of anticlimactic. A filehandle is opened. The filehandle is named MAIL. From this point, every print command that names the MAIL filehandle sends data to the UNIX mail program.
Each of the environment variables is encoded and decoded and then mailed to your username. You get to see the environment variable in each of its three formats: the original value (Value), the URL-encoded form (URLed), and the URL-decoded form (UnURLed).
Next, on lines 28-34 of Listing 6.8, you can see how to check for and read Post data.
This is a simple for loop. It reads one character at a time, using the getc function, reading from the STDIN filehandle. Remember that Post data always is available at STDIN. You saw this handled differently in the ReadParse function. ReadParse read the entire input string in one line:
read(STDIN,$in,$ENV{'CONTENT_LENGTH'});
Using a for loop and reading one character at a time works also, though, and it looks a lot more like traditional coding languages. The Post data then is encoded and decoded just like the environment data.
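The two reading styles can be compared without a Web server by reading from a string instead of STDIN. This is a sketch only: both subroutine names are invented, and opening a filehandle on a string is a Perl 5.8 feature used here as a stand-in for the real Post data.

```perl
use strict;
use warnings;

# Read $length characters one at a time, as the mail program does.
sub read_with_getc {
    my ($handle, $length) = @_;
    my $buffer = '';
    for (my $i = 0; $i < $length; $i++) {
        my $char = getc($handle);
        last unless defined $char;    # stop early if the input runs out
        $buffer .= $char;
    }
    return $buffer;
}

# Read the same data with a single call, as ReadParse does.
sub read_in_one_call {
    my ($handle, $length) = @_;
    my $buffer = '';
    read($handle, $buffer, $length);
    return $buffer;
}

# A string standing in for the data a server would supply on STDIN.
my $post_data = 'first=Eric&last=Herrmann';

open(my $first_handle,  '<', \$post_data) or die "Can't open string: $!";
open(my $second_handle, '<', \$post_data) or die "Can't open string: $!";

my $one_at_a_time = read_with_getc($first_handle, length($post_data));
my $all_at_once   = read_in_one_call($second_handle, length($post_data));

print "getc loop: $one_at_a_time\n";
print "read call: $all_at_once\n";
```

In a real CGI program, you would pass STDIN and $ENV{'CONTENT_LENGTH'} in place of the stand-in filehandle and length.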
This stuff actually becomes pretty easy to understand if you just step through it one line at a time.
There is one bit of Perl magic here that I want to bring out. It's the vertical bar (|) used in the open statement. The vertical bar (|) used in an open command before the filename tells Perl that you want to send all your output data to a system command and not a file.
This makes your job of sending mail messages easy and, in this program, safe. By opening the mail program with the parameter name@foo.edu, you told the mail program where you wanted to send the data. Anything sent to the mail program after the initial open statement is sent in the body of the mail message. Because the address is fixed in the program and everything else is sent in the body of the mail message, any offensive hacker commands can never reach the command line. Be careful if you ever build the command string from form data, however; the string in a piped open can be handed to the UNIX shell, so user input must never be placed in it unchecked.
Don't forget to close your filehandle MAIL. This flushes the output buffer and initiates the sending of the mail.
Remember to change the line that opens up the mail account to point to your mailbox name; name@foo.edu should be replaced with your e-mail address.
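You can try a piped open without sending any mail by piping to a harmless command instead. In this sketch, a second copy of Perl stands in for the mail program and simply counts the lines it receives. The subroutine name pipe_to_command is made up, and the list form of open used here requires a reasonably recent Perl 5.

```perl
use strict;
use warnings;

# Hypothetical helper: send $text to the standard input of a command.
# The command is given as a list, so no shell is involved.
sub pipe_to_command {
    my ($text, @command) = @_;
    open(my $pipe, '|-', @command) or die "Can't start @command: $!";
    print $pipe $text;
    # close flushes the output and waits for the command to finish;
    # it returns true only if the command exited successfully.
    return close($pipe);
}

# Stand-in for the mail program: a child perl that counts its input lines.
my $ok = pipe_to_command(
    "Environment variables\nSERVER_PORT\nValue: 80\n",
    $^X, '-ne', 'END { print "received $. lines\n" }',
);
print $ok ? "The command finished normally\n" : "The command reported a problem\n";
```

Checking the value returned by close is a cheap way to find out whether the command on the other end of the pipe actually succeeded.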
When I used this program, accessing it through a registration form, it returned the data shown in Listing 6.9.
Listing 6.9. CGI environment variables returned by the Mail Environment Variables program.
    Matt's New cgi-test script report


    Environment variables

    DOCUMENT_ROOT
    Value: /usr/local/business/http/accn.com
    URLed: %2fusr%2flocal%2fbusiness%2fhttp%2faccn.com
    UnURLed: /usr/local/business/http/accn.com

    GATEWAY_INTERFACE
    Value: CGI/1.1
    URLed: CGI%2f1.1
    UnURLed: CGI/1.1

    HTTP_ACCEPT
    Value: */*, image/gif, image/x-xbitmap, image/jpeg
    URLed: *%2f*,%20image%2fgif,%20image%2fx-xbitmap,%20image%2fjpeg
    UnURLed: */*, image/gif, image/x-xbitmap, image/jpeg

    HTTP_COOKIE
    Value: s=dialup-7207812894493652
    URLed: s=dialup-7207812894493652
    UnURLed: s=dialup-7207812894493652

    HTTP_REFERER
    Value: http://www.accn.com/cgibook/chap6/call-mail.html
    URLed: http:%2f%2fwww.accn.com%2fcgibook%2fchap6%2fcall-mail.html
    UnURLed: http://www.accn.com/cgibook/chap6/call-mail.html

    HTTP_USER_AGENT
    Value: Mozilla/1.1N (Windows; I; 16bit)
    URLed: Mozilla%2f1.1N%20(Windows%3b%20I%3b%2016bit)
    UnURLed: Mozilla/1.1N (Windows; I; 16bit)

    PATH
    Value: /usr/local/bin:/usr/bin/:/bin:/usr/local/sbin:/usr/sbin:/sbin
    URLed: %2fusr%2flocal%2fbin:%2fusr%2fbin%2f:%2fbin:%2fusr%2flocal%2fsbin:%2fusr%2fsbin:%2fsbin
    UnURLed: /usr/local/bin:/usr/bin/:/bin:/usr/local/sbin:/usr/sbin:/sbin

    QUERY_STRING
    Value: first=Eric+&last=Herrmann&street=255+S.+Canyonwood+Dr.&city=Dripping+Springs&state=Texas&zip=78620&phone=%28999%29+999-9999&simple=+Submit+Registration+
    URLed: first=Eric%2b&last=Herrmann&street=255%2bS.%2bCanyonwood%2bDr.&city=Dripping%2bSprings&state=Texas&zip=78620&phone=%2528999%2529%2b999-9999&simple=%2bSubmit%2bRegistration%2b
    UnURLed: first=Eric &last=Herrmann&street=255 S. Canyonwood Dr.&city=Dripping Springs&state=Texas&zip=78620&phone=(999) 999-9999&simple= Submit Registration

    REMOTE_ADDR
    Value: 199.170.89.45
    URLed: 199.170.89.45
    UnURLed: 199.170.89.45

    REMOTE_HOST
    Value: dialup-7.austin.io.com
    URLed: dialup-7.austin.io.com
    UnURLed: dialup-7.austin.io.com

    REQUEST_METHOD
    Value: GET
    URLed: GET
    UnURLed: GET

    SCRIPT_FILENAME
    Value: /usr/local/business/http/accn.com/cgibook/chap6/perltest.cgi
    URLed: %2fusr%2flocal%2fbusiness%2fhttp%2faccn.com%2fcgibook%2fchap6%2fperltest.cgi
    UnURLed: /usr/local/business/http/accn.com/cgibook/chap6/perltest.cgi

    SCRIPT_NAME
    Value: /cgibook/chap6/perltest.cgi
    URLed: %2fcgibook%2fchap6%2fperltest.cgi
    UnURLed: /cgibook/chap6/perltest.cgi

    SERVER_ADMIN
    Value: webmaster@accn.com
    URLed: webmaster@accn.com
    UnURLed: webmaster@accn.com

    SERVER_NAME
    Value: www.accn.com
    URLed: www.accn.com
    UnURLed: www.accn.com

    SERVER_PORT
    Value: 80
    URLed: 80
    UnURLed: 80

    SERVER_PROTOCOL
    Value: HTTP/1.0
    URLed: HTTP%2f1.0
    UnURLed: HTTP/1.0

    SERVER_SOFTWARE
    Value: Apache/0.8.13
    URLed: Apache%2f0.8.13
    UnURLed: Apache/0.8.13
Not all environment variables are created equal. Why is it that you don't always know what's in the environment variables' associative array? Environment variables are the server's way of communicating with your CGI program, and each communication is unique.
The uniqueness of each communication with your CGI program is based on the request headers sent by the Web page client when it calls your CGI program. If your Web page client is responding to a WWW-Authenticate response header from the server, for example, it sends Authorization request headers. Because the request headers define a number of your environment variables, you can never be sure which environment variables are available.
Some of the environment variables always are set for you and are not dependent on the CGI request. These environment variables typically define the server on which your CGI program runs. The environment variables discussed in the following subsections are based on your server type and always should be available to your CGI program.
The environment variable GATEWAY_INTERFACE is the version of the CGI specification your server is using. The CGI specification is defined at
http://hoohoo.ncsa.uiuc.edu/cgi/
This is an excellent site for further information about CGI. At the time of this writing, CGI is at revision 1.1. You can see this in Figure 6.1. The format of the variable is
CGI/revision number
The environment variable SERVER_ADMIN should be the e-mail address of the Web guru on your server. When you can't figure out the answer yourself, this is the person to e-mail. Be careful, though. These people usually are very busy. You want to establish a good relationship early so that your Web guru will respond to your requests in the future. Make sure that you have tried all the simple things (everything you know) before you ask this person questions. This is definitely an area in which "crying wolf" can have a negative effect on your ability to get your CGI programs working. When you have a tough problem that no one seems able to figure out, you want your Server Administrator to respond to your questions. So don't overload her with simple problems that you should be able to figure out on your own.
The environment variable SERVER_NAME contains the domain name of your server. If a domain name is not available, it will be the Internet protocol (IP) number of your server. This should be in the same URI format as that in which your CGI program was called.
The environment variable SERVER_SOFTWARE contains the type of server under which your CGI program is running. You can use this variable to figure out what type of security methods are available to you and whether SSIs are even possible. This way, you don't have to ask your Webmaster these simple questions.
This next set of environment variables gives your CGI program information about what is happening during this call to your program. These environment variables are defined when the server receives the request headers from a Web page. Some of these variables should look very familiar because they are directly related to the HTTP headers discussed in Chapter 2, "Understanding How the Server and Browser Communicate."
The AUTH_TYPE environment variable defines the protocol-specific authentication method used to validate the user who is accessing your CGI program. The AUTH_TYPE usually is Basic, because this is the primary method for authentication on the Net right now. I discuss how to set up a user-password authentication scheme in Chapter 12, "Guarding Your Server Against Unwanted Guests." In the next chapter, you will use request headers and environment variables to perform user authentication.
The CONTENT_LENGTH environment variable specifies the number of bytes of data attached to the end of the request headers. This data is available at STDIN and is sent with the Post or Put method.
The CONTENT_TYPE environment variable defines the type of data attached with the request method. If no data is sent, this field is left blank. The content type will be
application/x-www-form-urlencoded
when posting data from a form.
The REQUEST_METHOD environment variable is the HTTP request method converted to an environment variable. You might remember that the following request methods are possible: Get, Post, Head, Put, Delete, Link, and Unlink. Get and Post certainly are the most common for your CGI program and define where incoming data is available to your CGI program. If the method is Get, the data is available in the QUERY_STRING environment variable. If it is Post, the data is available at STDIN, and the length of the data is defined by the environment variable CONTENT_LENGTH. The Head request method normally is used by robots searching the Web for page links. The other methods are not quite as common and tell the server to modify a URL or file on the server.
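As a rough sketch of how a program might act on the request method, the fragment below returns the raw form data from whichever place the method says it should be. The subroutine name read_form_data and the idea of passing in the input filehandle are my own choices for illustration, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw, still URI-encoded form data
# for the current request.
sub read_form_data {
    my ($input) = @_;                  # filehandle to read Post data from
    my $method = uc($ENV{'REQUEST_METHOD'} || 'GET');
    if ($method eq 'POST') {
        my $length = $ENV{'CONTENT_LENGTH'} || 0;
        my $data   = '';
        # Read no more than CONTENT_LENGTH bytes from the input.
        read($input, $data, $length) if $length > 0;
        return $data;
    }
    # Get (and anything else): the data, if any, is in the query string.
    return defined $ENV{'QUERY_STRING'} ? $ENV{'QUERY_STRING'} : '';
}

my $form_data = read_form_data(\*STDIN);
print "Content-type: text/plain\n\n";
print "Raw form data: $form_data\n";
```

Passing the filehandle in, rather than reading STDIN directly, is only a convenience that makes the fragment easy to try from the command line.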
The PATH environment variable is not strictly considered a CGI environment variable. This is because it actually includes information about your UNIX system path. This was discussed in "The Path Environment Variable," earlier in this chapter.
The PATH_INFO environment variable is set only when there is data after the CGI program (URI) and before the beginning of the QUERY_STRING variable. Remember that the query string begins after the question mark (?) on the link URI or Action field URI. PATH_INFO can be used to pass any type of data to your CGI program, but it usually sends information about finding files or programs on the server. The server strips everything after it finds the target CGI program (URI) and before it finds the first question mark. This information is URI-decoded and then placed in the PATH_INFO variable.
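To make the idea concrete, here is a minimal sketch that breaks the extra path information into its pieces. The subroutine name path_info_parts and the sample URI in the comment are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: break the extra path information into its
# directory-style components, ignoring empty pieces.
sub path_info_parts {
    my ($path_info) = @_;
    return () unless defined $path_info;
    return grep { length } split(/\//, $path_info);
}

# For a made-up request such as
#   /cgibook/chap6/perltest.cgi/reports/summary?first=Eric
# the server would set PATH_INFO to "/reports/summary".
my @parts = path_info_parts($ENV{'PATH_INFO'});
print "Extra path component: $_\n" for @parts;
```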
The PATH_TRANSLATED environment variable is a combination of the PATH_INFO variable and the DOCUMENT_ROOT variable. It is an absolute path from the root directory of the server to the directory defined by the extra path information added from PATH_INFO. This type of path often is used when your CGI program moves in and out of different directories or different shell environments. As long as your server doesn't change, you can use the absolute path regardless of where you put or move your CGI program. Sometimes absolute paths are considered bad because you cannot move your CGI program to another server. You have to decide which is more likely: moving your CGI program around on one server or moving it to another server.
The QUERY_STRING environment variable contains everything included on the URI after the question mark. The setup for a query string normally is performed by your browser when it builds the request headers. You can create the data for your own query string by including a question mark in your hypertext reference and then URI-encoding any data included after the question mark. This is just one more way to send data to your program. Two big drawbacks to using QUERY_STRING are the YUK! factor and the size of the input buffer. The YUK! factor means that your data is displayed back to your client in the Location field. The size problem means that you have a limitation on how much data you can send to your program using this method. The amount of data you can send without exceeding the input buffer is server specific, so I can't give you any hard rules. But you should try to limit all data you send using this method to less than 1,024 bytes.
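The decoding side can be sketched in a few lines of Perl. This is only one way to do it; the subroutine name parse_query_string is my own, and the fragment keeps just the last value when a name appears more than once.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a URI-encoded query string into a hash.
sub parse_query_string {
    my ($query) = @_;
    my %form;
    foreach my $pair (split(/&/, defined $query ? $query : '')) {
        my ($name, $value) = split(/=/, $pair, 2);
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        foreach ($name, $value) {
            tr/+/ /;                                # plus signs are spaces
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %28 becomes "("
        }
        $form{$name} = $value;
    }
    return %form;
}

my %form = parse_query_string($ENV{'QUERY_STRING'});
foreach my $name (sort keys %form) {
    print "$name = $form{$name}\n";
}
```

The plus signs have to be translated before the %XX sequences are decoded; otherwise, a literal plus sign sent as %2B would be turned into a space.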
The REMOTE_ADDR environment variable has the numeric IP address of the browser or remote computer calling your CGI program. Read the REMOTE_ADDR from right to left. The furthest right number defines today's connection to the remote server. Or, at least, this is the case when your Web browser client connects from a modem to a commercial server.
The REMOTE_HOST environment variable contains the domain name of the client accessing your CGI program. You can use this information to help figure out how your script was called. If the domain name is unavailable to your server, this field is left empty. If this field is empty, the REMOTE_ADDR environment variable is filled in. Your program can read this environment variable from right to left. There can be more than one subhierarchy after the first period (.), so be sure to write your code to deal with more than one level of domain hierarchy to the left of the period.
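A small sketch of reading the host name from right to left follows. The subroutine name domain_levels is made up for this example; it simply splits on the periods, so any number of levels is handled.

```perl
use strict;
use warnings;

# Hypothetical helper: return the labels of a host name ordered from
# the most general (top-level domain) to the most specific.
sub domain_levels {
    my ($host) = @_;
    return () unless defined $host && length $host;
    return reverse split(/\./, lc $host);
}

my @levels = domain_levels($ENV{'REMOTE_HOST'});
if (@levels) {
    print "Top-level domain: $levels[0]\n";
    print "All levels, general to specific: @levels\n";
} else {
    # REMOTE_HOST was empty; fall back to the numeric address.
    my $addr = defined $ENV{'REMOTE_ADDR'} ? $ENV{'REMOTE_ADDR'} : 'unknown';
    print "No host name available; address is $addr\n";
}
```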
The REMOTE_IDENT environment variable is set only if the remote username is retrieved from the server using the IDENTD method. This occurs only if your Web server is running the IDENTD identification daemon. This is a protocol to identify the user connecting to your CGI program. Just having your system running IDENTD is not sufficient, however; the remote server making the HTTP request also must be running IDENTD.
The REMOTE_USER environment variable identifies the caller of your CGI program. This value is available only if server authentication is turned on. This is the username authenticated by the username/password response to a response status of Unauthorized Access (401) or Authorization Refused (411).
The SCRIPT_FILENAME environment variable gives the full path to the CGI program. You do not want to use this variable when building a self-referencing URI. Remember that the server is making some assumptions about how you will access your CGI program. The full pathname would be appended to the server's full pathname, thereby totally confusing your poor server. The server starts with the server name, and from there it determines the document root; then it adds the path to your CGI program.
The SCRIPT_NAME environment variable gives you the path and name of the CGI program that was called. The path is a relative path starting at the document root path. You can use this variable to build self-referencing URLs. Suppose that you want to return a Web page and you want to generate HTML that includes a link to the called CGI program. The print string would look like this:
print "<a href=\"http://$ENV{'SERVER_NAME'}$ENV{'SCRIPT_NAME'}\">This is a link to the CGI program you just called</a>";
The SERVER_PORT environment variable defines the TCP port to which the request headers were sent. As discussed in Chapter 2, the port is like the telephone number used to call the server. The default port for server communications is 80. When you see a number appended to the domain name of the server, this is the port number to which the request was sent (for example, www.io.com:80). Because the default port is 80, it generally is not necessary to include the port number when making URI links.
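If your server does run on a port other than 80, a self-referencing link needs the port number. The following sketch shows one way to handle both cases; the subroutine name self_url and the fallback values of localhost and / are assumptions made only so that the fragment runs outside a server.

```perl
use strict;
use warnings;

# Hypothetical helper: build a self-referencing URI, adding the port
# only when it is not the default of 80.
sub self_url {
    my ($name, $port, $script) = @_;
    my $host = $name;
    $host .= ":$port" if defined $port && $port ne '' && $port != 80;
    return "http://$host$script";
}

my $url = self_url($ENV{'SERVER_NAME'} || 'localhost',
                   $ENV{'SERVER_PORT'},
                   $ENV{'SCRIPT_NAME'} || '/');
print "<a href=\"$url\">This is a link to the CGI program you just called</a>\n";
```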
The SERVER_PROTOCOL environment variable defines the protocol and version number being used by this server. For the time being, this should be HTTP/1.0. The HTTP protocol is the only server protocol used for the WWW at the moment. But, like most good designs, this environment variable is designed to allow CGI programs to operate on servers that support other communications protocols.
"How can I tell who is using my Web site?" This question is asked over and over again. It is asked by professionals and amateurs. It's natural to want to know who is using your Web site. In the next several pages, you will take a look at this question and see how close you can come to answering it. You'll start with the easier problems and work up to the harder problem of who is visiting your Web site.
Before you get started on this topic, let me give you the standard Net advice. The Internet is most loved for its anarchy and anonymity. People can cruise the Net and feel like they are doing it anonymously. Don't abuse the capability to get people's names or links, or you will find your Web site quickly blacklisted and abandoned. News travels quickly on the Net, and bad news about your Web site travels even faster.
Let's start with an easy one first. Suppose that your only goal is to figure out how your Web site is getting called. Where are all these hits coming from? Well, the environment variable with that answer is HTTP_REFERER.
Notice that this environment variable is prefixed with HTTP_. All the request headers sent by the browser are turned into environment variables by your server: the header name is converted to uppercase, any hyphens become underscores, and the result is prefixed with HTTP_. This is both good and bad. Because not all browsers are created equal, you cannot depend on getting the same request headers with every call. In other words, not all browsers will send the Referer request header, so you might not have the HTTP_REFERER environment variable available. On the other hand, because all browsers tell the server what type of client they are, you can write your code to work with the browsers that send you the HTTP_REFERER environment variable. There are two ways to handle this, and I'll show you both methods.
First, you could check for the browser type. You did this back in Chapter 2. The browser type is in the environment variable HTTP_USER_AGENT. Listing 6.10 shows a code fragment for getting out Netscape's Mozilla and version number. This actually is probably the harder method. But if you want to do specific things based on the HTTP_USER_AGENT type, this is the way to go. You might want to build a table with all the different HTTP_USER_AGENTs you're interested in, and then you could loop through the table to look for valid HTTP_USER_AGENTs.
Listing 6.10. A program fragment for decoding HTTP_USER_AGENT.
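# Split "Mozilla/1.1N (Windows; I; 16bit)" at the slash
@user_agent = split(/\//, $ENV{'HTTP_USER_AGENT'});
if ($user_agent[0] eq "Mozilla") {
    # The first space-separated field after the slash is the version, "1.1N"
    @version = split(/ /, $user_agent[1]);
    # Keep only the first three characters, "1.1"
    $version_number = substr($version[0], 0, 3);
}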
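The table idea can be sketched as a simple list that you loop through. The browser names in @known_agents and the subroutine name known_agent are only examples; fill the table with whatever browsers matter to your program.

```perl
use strict;
use warnings;

# Hypothetical table of browser names this program cares about.
my @known_agents = ('Mozilla', 'Lynx');

# Return the first known browser name that starts the User-Agent
# string, or an empty string when none of them match.
sub known_agent {
    my ($user_agent) = @_;
    return '' unless defined $user_agent;
    foreach my $agent (@known_agents) {
        return $agent if index($user_agent, "$agent/") == 0;
    }
    return '';
}

my $agent = known_agent($ENV{'HTTP_USER_AGENT'});
if ($agent ne '') {
    print "Recognized browser: $agent\n";
} else {
    print "Unrecognized or missing browser type\n";
}
```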
If you just want to make sure that the HTTP_REFERER environment variable is defined, use the Perl defined function. Because all you are trying to do is determine whether the HTTP_REFERER environment variable is set, this seems like a more straightforward approach.
Use the Perl fragment
if (defined($ENV{'HTTP_REFERER'}))
to determine whether HTTP_REFERER is set and then perform a specific operation. From here, you can open a file or send yourself mail.
Back to HTTP_REFERER. This environment variable contains the full URI reference to the calling Web page. Just save the value to a file, and you've got the link back to the calling Web page.
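Saving the value might look something like the following sketch. The subroutine name log_referer and the log filename /tmp/referer.log are placeholders I picked for illustration; use a file your server is allowed to write to.

```perl
use strict;
use warnings;

# Hypothetical helper: append the referring URI to a log file.
# Returns 1 if a line was written, 0 if there was nothing to log.
sub log_referer {
    my ($log_file) = @_;
    return 0 unless defined $ENV{'HTTP_REFERER'};
    open(my $log, '>>', $log_file) or die "Cannot open $log_file: $!";
    print $log "$ENV{'HTTP_REFERER'}\n";
    close($log);
    return 1;
}

# The filename is an assumption for this example.
log_referer('/tmp/referer.log');
```

Opening the file with >> appends each new referrer to the end of the file instead of overwriting what is already there.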
That's the easy one. Now take a look at what is and isn't possible with some other environment variables that contain more specific information about your Web site visitor. First, the two that are the most likely to have information in them: the REMOTE_HOST and the REMOTE_ADDR variables.
The REMOTE_HOST environment variable usually is filled in. It contains the domain name of your Web site visitor's server as you normally would type it in the Location field of your Web browser. You can use this field to begin getting some ideas on how your Web site is linked around the Net. Or, you might have a list of trusted sites that you compare the REMOTE_HOST environment variable with to determine who you want to allow access to your Web page.
If you want more specific information about where in the country the calling Web site is located, use the InterNIC whois command. Telnet into your server and type whois followed by the value of the REMOTE_HOST environment variable. Figure 6.4 shows an example of the whois command. As you can see, there is quite a bit of information provided here about what type of server is calling you. You might find this handy to use if you are having problems with a robot from this site and the 'bot does not contain an HTTP_FROM environment variable. With this information, you can go to the registered administrative contact and resolve your problems with the errant robot.
Figure 6.4: Using the whois command to identify REMOTE_HOST.
Even if the REMOTE_HOST environment variable is not filled in, the REMOTE_ADDR always will be set. This variable contains the IP address of the calling Web page's server. You can use the whois command with this environment variable also. You are likely to get a different set of information back, however. The whois command used on the IP address returns the main server. You might find that your REMOTE_HOST name is only a subpart of an existing server. You normally will want to ignore the far right field in the IP address. InterNIC does not give registration information beyond the first three dotted decimal IP address numbers. You can see the results of the whois command in Figure 6.5. I have performed all these tasks manually, but you easily could add to the script fragment in Listing 6.11 to handle this type of work for you.
Figure 6.5: Using the whois command to identify REMOTE_ADDR.
Before you save REMOTE_ADDR, you should clean up the IP address. The IP address should be limited to the first three IP numeric registration levels. So, if the address in the REMOTE_ADDR environment variable is 199.170.89.45, you only want 199.170.89. The Perl fragment in Listing 6.11 performs this work for you.
Listing 6.11. Cleaning up REMOTE_ADDR.
# Split the dotted address into its first three numbers and the rest
($part1, $part2, $part3, $the_rest) = split(/\./, $ENV{'REMOTE_ADDR'}, 4);
$address = $part1 . '.' . $part2 . '.' . $part3;
# OUTPUT_FILE must already be open for writing
print OUTPUT_FILE "$address\n";
So far, you have been able to tell where the links to your Web site are originating from and to get information about the server where those links are connected.
Now let's look at the three environment variables that are supposed to contain the name of your Web site visitor: REMOTE_IDENT, HTTP_FROM, and REMOTE_USER.
First, let's deal with and then ignore the environment variable REMOTE_IDENT. This is a lousy means of confirming who is visiting your Web site. It only works if both the client and the server are running the IDENTD process. Even if your server is doing everything correctly, REMOTE_IDENT still can fail, because you are dependent on the client's server also performing correctly. Even when everything works, the process requires extra communication between the server and the client, and that can really slow things down.
In the best of worlds, you are in charge of the server and you can turn on IDENTD yourself. But, more than likely, you are not the owner of the server and you would have to convince someone to turn on the IDENTD daemon. And you still must deal with the fact that your clients can come from any server in the world. There is no way you can force them to run IDENTD.
This all just seems like way too much work to me, so I suggest that you avoid the REMOTE_IDENT environment variable as a solution to validating users. In the next chapter, you will learn how to set up basic user authentication using a username/password scheme. That methodology is much more reliable than the REMOTE_IDENT environment variable.
So let's take a look at the last two environment variables: HTTP_FROM and REMOTE_USER.
HTTP_FROM is supposed to be set to the e-mail address of your Web site visitor. This has become an issue on the Net, though. People are afraid of unscrupulous Web sites getting their electronic name and address and selling it or using it for other commercial purposes. If junk e-mail isn't a problem for you yet, I'm betting it will be some time in the future.
So, to prevent themselves from getting a bad reputation, most browsers no longer support this feature. Or, if they do, they allow users to turn off this identification method. So, unfortunately for us, this environment variable is best used only as a default value for a return e-mail address.
Well, we are down to the last environment variable that can help us: the REMOTE_USER environment variable. Will this one tell you who is accessing your Web site? Yes, but you won't like the way it is set. This environment variable is set only if an authentication scheme is being used between the browser and the server.
This isn't quite as hard as you might expect it to be. In order to set up user authorization, you need to set protections on your files or directories and create a password file for validated users. In Chapter 7, you will build an entire application that includes registering users, building a password file, and validating a user. So don't despair; I will cover how to do this in detail in the next chapter.
Unfortunately, I haven't given you any easy answers for how to get the name of someone visiting your site. It certainly is possible, and you can gather some information with existing environment variables. But in the long run, unless you want to validate every user, you are going to have to make do with less than you probably wanted to. At least now you have the full picture.
I have saved the dessert for last. The cookie, as it is fondly called, is one of the most powerful of the HTTP environment variables. I saved this variable for last for three reasons. First, it's only implemented for Netscape browsers. Second, it can really enhance your ability to treat a Web site visit as if a customer just entered your place of business. Third, it requires some detailed explanation.
One of the problems with building applications on the Internet is writing programs that remember what they were doing with customer X. When you cruise the Internet, each new link is a new connection to the server. It doesn't have any way of knowing what happened during the last connection. This means that each time your CGI program is invoked, you don't know what happened the last time.
Why do you care? Well, I expect online catalogs to be a major new programming application on the Internet, for example. But the first problem you run into is keeping track of what each customer is selecting for his purchases.
Imagine that you have three Web page customers at one time. Each of them is clicking on products, and your job is to keep track of who gets what. Just storing the data in a file isn't enough. If you have three customers, each making purchases, then you are going to need three separate files-one for each customer. How do you decide who is making the next purchase? Especially if they happen to be coming from the same server? Do you need to get the customer's name each time she makes a new selection? Yes! In some way, you must be able to separate your customers. Well, the Netscape cookie was built to help you solve that problem.
The Netscape cookie shows up in your environment variables only if the browser accessing your Web page is a Netscape browser. The environment variable is HTTP_COOKIE, and it is a marvelous tool for maintaining state.
Remember that your browser sends a request header to your server, and then the server turns that request header into an environment variable. This means that after your CGI program sends the cookie to the browser, the browser is responsible for keeping track of it and returning it as a request header. So, each time your client submits one of your forms, you get a cookie that tells you which client it is.
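Reading the cookie back in your CGI program could look like the following sketch. The subroutine name parse_cookies is my own, and the fragment assumes that the browser returns its cookies as name=value pairs separated by semicolons.

```perl
use strict;
use warnings;

# Hypothetical helper: split an HTTP_COOKIE value such as
# "customer=Jessica-Herrmann; book=cgi" into name/value pairs.
sub parse_cookies {
    my ($cookie_string) = @_;
    my %cookies;
    return %cookies unless defined $cookie_string;
    foreach my $pair (split(/;\s*/, $cookie_string)) {
        my ($name, $value) = split(/=/, $pair, 2);
        next unless defined $name && length $name;
        $cookies{$name} = defined $value ? $value : '';
    }
    return %cookies;
}

my %cookies = parse_cookies($ENV{'HTTP_COOKIE'});
if (exists $cookies{'customer'}) {
    print "Returning customer: $cookies{'customer'}\n";
} else {
    print "No customer cookie yet; this looks like a first visit\n";
}
```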
Cookies are passed back and forth between the client and the server to identify a particular Web client. How does this chain of cookies get started?
When your Web site client first visits your Web page, he connects to your server and probably requests your home page. Unless your home page is a CGI program, no cookies are exchanged yet. When your Web client submits to your CGI program the first time, no cookie exists. Your CGI program responds to the submittal with some type of Set-Cookie response header. You can generate a cookie based on the domain IP number and the current time. You then can send this cookie to the submitting browser as part of the normal response headers. In Perl, printing this Set-Cookie response header might look like this:
print "Set-Cookie: customer=" . $ENV{'REMOTE_ADDR'} . time() . ";\n";
This generates a unique cookie that the browser will send you the next time your Web client clicks on any Web page within your server root. You now can identify this client every time he accesses any Web page on your server root because the browser always will send this unique cookie, and your CGI program that previously saved the cookie can compare the cookie the browser sent with the saved cookie. The idea is that the requested URI will get only cookies that it knows how to interpret.
The format of the Netscape cookie is not very complex. The server sends to the browser a Set-Cookie response header, which is made up of several fields. The only required field in the Set-Cookie response header is the name of the cookie and the value to assign to that cookie. So a valid Set-Cookie response header is
Set-Cookie: customer=Jessica-Herrmann;
The Set-Cookie response header has several fields. Each field can be used only once per Set-Cookie response header. If you need to send more than one name=value pair back to the client browser, it is okay to send multiple Set-Cookie response headers in a single response header chain.
If all the fields of the Set-Cookie response header are used, the cookie looks like this:
Set-Cookie: customer=Steve-Herrmann; expires=$ENV{'DATE'} + 2 HOURS ; domain=www.practical-inet.com; path=/cgibook ;
The semicolon (;) is used to separate the cookie fields.
The Name=Value field is required and defines the uniqueness of a cookie to the browser. Don't be confused by this and the name/value pairs of forms. The name in this field should be set to a variable name that you will use in your CGI program-for example, customer or book. The value probably will be based on something your customer submits. You can send only one name=value pair per Set-Cookie response header. You can send multiple Set-Cookie response headers, however.
The Name field is the only required field of the Set-Cookie response header.
The Expires=Date field is a command to the browser. It tells the browser to remember this cookie only until the expiration date given in the Expires field. When the expiration time is reached, the cookie is forgotten and is not sent to the server on any further connections.
This field is not required; if it is not set, the browser remembers the cookie throughout one Internet connection. So you can browse for hours, change Web pages, and return; as long as you don't close Netscape, it remembers your cookie.
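Here is a rough sketch of building the date for the Expires field two hours into the future. It assumes the Wdy, DD-Mon-YYYY HH:MM:SS GMT layout from Netscape's cookie specification, which this chapter does not spell out, and the subroutine name cookie_expires is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: format an epoch time as a cookie expiration
# date in Greenwich Mean Time, e.g. "Thu, 01-Jan-1970 02:00:00 GMT".
sub cookie_expires {
    my ($epoch) = @_;
    my @days   = qw(Sun Mon Tue Wed Thu Fri Sat);
    my @months = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf("%s, %02d-%s-%04d %02d:%02d:%02d GMT",
                   $days[$wday], $mday, $months[$mon], $year + 1900,
                   $hour, $min, $sec);
}

# A cookie that the browser should forget two hours from now.
my $expires = cookie_expires(time() + 2 * 60 * 60);
print "Set-Cookie: customer=Steve-Herrmann; expires=$expires; path=/cgibook\n";
```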
The Domain=Domain_Name field should be set to the domain name of the server from which the URI is fetched. So, if your form is submitted to
www.practical-inet.com/cgibook/chap6/test-cookie.cgi
the Domain field should be
Domain=www.practical-inet.com
The Domain field is not required and defaults to the server that generated the Set-Cookie response header.
The Path=Path field is used to limit the URIs with which the cookie can be used. So, if I want a cookie to match only if you stay in my chap6 directory, I can send a Set-Cookie response header with a path of /cgibook/chap6.
The path is not required, and if it is not included, it is set to the path to the URI sending the Set-Cookie response header.
When the browser is deciding which cookies to send with the request headers, it looks at the domain name it is accessing and matches all those cookies. Then, it looks at the URI and the path and matches any cookies that have a path matching the path of the URI.
This works because the match is from most general to specific. If the path is / (the server root), everything from the server root and below matches. If the path is /cgibook/chap6/, everything in the chap6 directory and below is a path and URI match, and the browser sends that cookie.
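The browser's decision can be modeled in a few lines. This is a deliberately simplified sketch of the matching just described, not the browser's actual code: the subroutine name cookie_applies is invented, and it compares the whole host name rather than handling partial domain names.

```perl
use strict;
use warnings;

# Hypothetical, simplified model of the browser's decision: send the
# cookie when the host matches and the cookie path is a prefix of the
# requested URI path.
sub cookie_applies {
    my ($cookie_domain, $cookie_path, $host, $uri_path) = @_;
    return 0 unless lc($host) eq lc($cookie_domain);
    return index($uri_path, $cookie_path) == 0 ? 1 : 0;
}

foreach my $uri ('/cgibook/chap6/test-cookie.cgi', '/index.html') {
    my $sent = cookie_applies('www.practical-inet.com', '/cgibook',
                              'www.practical-inet.com', $uri);
    print "$uri: ", ($sent ? "cookie sent" : "cookie not sent"), "\n";
}
```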
Think of a cookie as a ticket. A ticket is given each time your browser accesses a URI that sends a Set-Cookie response header. The ticket has information on it about who should get a copy of the ticket. The browser's job is to look at each ticket it has in memory each time it accesses a URI. If the information on the ticket says that this URI should get a copy of the ticket, the browser sends a copy along with its regular request headers.
Your code can look at the ticket and then determine from the Name=Value field to which customer the ticket belongs. Then you can go to the files that contain customer session information. Compare the cookie with the cookies in each file until you find a match. Or use the cookie to create a unique filename and get the correct file without performing a search.
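Using the cookie to build a filename might be sketched like this. The subroutine name session_file, the /tmp/sessions directory, and the .session extension are all assumptions for the example. Because the cookie comes back from the browser, the fragment replaces any character that could change the directory.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a cookie value into a session filename.
# Anything other than letters, digits, dots, and hyphens is replaced,
# so the value cannot point outside the session directory.
sub session_file {
    my ($directory, $cookie_value) = @_;
    (my $safe = $cookie_value) =~ s/[^A-Za-z0-9.-]/_/g;
    return "$directory/$safe.session";
}

print session_file('/tmp/sessions', 'Jessica-Herrmann'), "\n";
```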
In this "Learning Perl" section, you will learn about managing files and some of Perl's more important special variables. You will use files throughout your CGI programs, so it's a good idea to have a strong foundation in dealing with files and filehandles. Later in this section, in "Using Perl's Special Variables," you'll learn about a group of special variables; these can make your coding task easier, but they also make your programming more cryptic. Use Perl's special variables as you need them, but use them with care.
You've already seen several examples of reading and writing to files. During this exercise, you'll learn about some of Perl's built-in functions for manipulating files.
In the programming world, just like in any other profession, the experts seem to forget that they didn't understand everything when they started programming. I try not to be guilty of this, but I'm sure there are times when more explanation would be helpful. The goal of this exercise is to remove any barriers to understanding how a program reads and writes to files.
Let's start with the basic concepts of a filename (which also is referred to as a file variable) and a filehandle.
The filename is the actual name of the file your program is trying to read from your hard disk into computer memory or write from computer memory to your hard disk. If the file your program is trying to read from or write to can be in a different directory than the directory from which the program was started, you should supply the full path to the file in your program. The path to your file is called the pathname. The pathname to the file should start at the root directory. If you are using a UNIX platform, this means starting your pathname with a forward slash (/). If you are using a Windows/DOS platform, this usually means starting the path with the disk drive letter and then a backward slash (C:\).
On a UNIX platform, if you were reading a file from your home directory, it might be expressed as this:
/export/home/usr/herrmann/input_data.txt
The filename is input_data.txt. The pathname is /export/home/usr/herrmann/. You can use this filename and pathname in your program by just referencing it inside double quotation marks like this:
"/export/home/usr/herrmann/input_data.txt"
I recommend that you save this pathname and filename to a variable for use throughout your program, as shown here:
$inFile = "/export/home/usr/herrmann/input_data.txt";
The $inFile variable is referred to as a file variable. You can use either format to open a file for reading or writing. As far as Perl is concerned, they are exactly the same thing.
The filehandle is not the same thing as a filename or a file variable. The filehandle has special meaning to the Perl interpreter; it is Perl's attempt to find the filename you passed to Perl using the open command. If Perl is successful at finding the file, it creates a special link to the file in computer memory. This link remains in effect until you use the close command on the filehandle or you use the same filehandle in another open command.
After you open a file, especially for writing, it is very important to close the filehandle when you are done working with the file. If you are writing to a file, it's likely that not all the data has been written to your file when your program executes the print or write statement. Writing to files or any input/output (I/O) operation is usually much slower than the speed of your CPU. Your operating system usually tries to help by collecting a group of file output operations before actually performing the output. This is called output buffering. Usually, the final contents of the output buffer are not written to the file until you close a filehandle. Emptying the output buffer by closing the file or by using some other means is called flushing the buffer.
Tip: Things usually will work out okay if you don't close your file. But programming is not about usually. I guarantee that if you do not close all the files you open after you are done with them, you will have problems with your programs. The problems created by not closing your files will be the most irritating types of problems. They won't happen all the time, and they won't have the same results each time they happen. You will save yourself countless headaches and lost hours in program debugging if you always close open filehandles after you are done manipulating the file.
Always remember to open a file before trying to read it. Doesn't that sound silly? Yet it's a common mistake to try to read a file without opening it. The computer doesn't have x-ray vision any more than you do. You can't read a book until you've opened the cover, and a computer can't read a file until you open the file for it. The syntax for the open command is quite simple:
open(FILEHANDLE,"filename");
The filename also can be a file variable. If you are using a filename, remember to use double quotation marks around the filename.
Closing a file is even easier than opening a file. The syntax of the close command is
close(FILEHANDLE);
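As a minimal sketch of the open and close syntax working together, the fragment below writes two lines to a file, closes it, and then reopens it for reading. The scratch file name under /tmp and the filehandle names OUTFILE and INFILE are assumptions made for this illustration, and the "|| die" checks are an addition that simply reports a failed open.

```perl
#!/usr/bin/perl
# Hypothetical scratch file; $$ is the process ID, which keeps the name unique.
$inFile = "/tmp/filehandle_demo_$$.txt";

# Open for writing; the leading ">" creates the file or empties an existing one.
open(OUTFILE, ">$inFile") || die "cannot open $inFile for writing: $!";
print OUTFILE "first line\n";
print OUTFILE "second line\n";
close(OUTFILE);     # flushes the output buffer and releases the filehandle

# Open the same file for reading, using the file variable instead of a quoted name.
open(INFILE, $inFile) || die "cannot open $inFile for reading: $!";
@lines = <INFILE>;  # in list context, <INFILE> returns every line of the file
close(INFILE);

print "read ", scalar(@lines), " lines\n";
unlink($inFile);    # remove the scratch file
```

Because the file is closed before it is reopened, both lines are guaranteed to be on disk by the time they are read back.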
This exercise is a minor rewrite of Exercise 5.1, Using ARGV, to illustrate the use of filehandles. Take a careful look at the two programs; they produce identical results. Listing 6.12 contains the program you should type in for this exercise.
Listing 6.12. Using filehandles.
01: #!/usr/local/bin/perl
02: if ($#ARGV < 2)
03: {
04:     print<<"end_tag";
05:
06: # $0 opens a file for reading and changes a name in the file
07: # use: $0 OLD_NAME NEW_NAME FILE_LIST
08: # param 1 is the old value
09: # param 2 is the new value
10: # param +2 is file list. There is no programmatic limit to the number of files processed
11: # the original file will be copied into a .bak file
12: # the original file will be overwritten with the substitution
13: # the script assumes the file(s) to be modified are
14: # in the directory that the script was started from
15: # SYMBOLIC LINKS are NOT followed
16: end_tag
17:     exit(1);
18: }
19:
20: $OLD = shift; # dump arg(0)
21: $NEW = shift; # dump arg(1)
22: # now argv has just the file list in it.
23:
24: select(OUTFILE);
25: while (<>)
26: {
27:     next if -l $ARGV; #skip this file if it is a sym link
28:     $count++ ;
29:     print STDOUT "." if (($count % 10) == 0);
30:
31:     if ($ARGV ne $oldargv) #have we saved this file ?
32:     {
33:         close(OUTFILE);
34:         print STDOUT "\nprocessing $ARGV ...";
35:         $count = 0 ;
36:         rename($ARGV, $ARGV . '.bak'); #mv the file to a backup copy
37:         $oldargv = $ARGV ;
38:         open (OUTFILE, ">$ARGV");# open the file for writing
39:     }
40:     s/$OLD/$NEW/go;# perform substitution
41:     # o - only interpret the variables once
42:     print; #dump the file back into itself with changes
43: }
44: close(OUTFILE);
45: select(STDOUT);
On line 24,
select(OUTFILE);
the default filehandle is changed from STDOUT to OUTFILE. The select command selects the default filehandle used by the print command. I find it interesting that OUTFILE can be used as a filehandle before it actually is associated with an open file. Perl trusts you to do the right thing. So you'd better, or your program is really going to get confused.
Line 25,
while(<>)
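replaces the double while loop of Exercise 5.1. This while conditional expression takes each filename in @ARGV in turn, opens that file, stores its name in $ARGV, and reads the next line of input into $_. The expression becomes false, and the loop ends, only after the last line of the last file has been read.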
You should notice that you had to move lines 34 and 35 of Listing 5.13 inside the block of statements following the if statement on line 31. You need to do this so that these lines will be executed only when a new file is opened. This was accomplished in Listing 5.13 because the inner while loop executed until each file was completely read, and only then was a new file opened for reading.
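The behavior of while (<>) can be tried without a real command line. The sketch below is only an illustration: it creates two hypothetical scratch files under /tmp, assigns their names to @ARGV as if they had been typed on the command line, and uses the same $ARGV comparison as Listing 6.12 to notice when a new file starts.

```perl
#!/usr/bin/perl
# Two hypothetical scratch files stand in for a real FILE_LIST.
@names = ("/tmp/argv_demo_a_$$.txt", "/tmp/argv_demo_b_$$.txt");
foreach $name (@names) {
    open(TMP, ">$name") || die "cannot create $name: $!";
    print TMP "line one\nline two\n";
    close(TMP);
}

@ARGV = @names;     # pretend the names came from the command line
$oldargv = "";
$fileCount = 0;
$lineCount = 0;
while (<>) {                    # reads the next line of the current file into $_
    if ($ARGV ne $oldargv) {    # $ARGV changed, so <> has moved on to a new file
        $fileCount++;
        $oldargv = $ARGV;
    }
    $lineCount++;
}
print "$lineCount lines read from $fileCount files\n";
unlink(@names);     # remove the scratch files
```

A single loop visits every line of every file, which is why the work that must happen once per file has to sit inside the if block.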
Line 29,
print STDOUT "." if (($count % 10) == 0);
illustrates using STDOUT as a filehandle. Have you figured out what happens if you forget to include STDOUT in the print statement? Your output goes to the selected filehandle, which is your file. Try it and see.
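A small experiment makes the difference visible. This is a sketch under assumed names (the scratch file and the CHECK filehandle are hypothetical): one print statement relies on the selected filehandle, and the other names STDOUT explicitly.

```perl
#!/usr/bin/perl
$outName = "/tmp/select_demo_$$.txt";    # hypothetical scratch file
open(OUTFILE, ">$outName") || die "cannot open $outName: $!";

$previous = select(OUTFILE);            # select returns the old default filehandle
print "goes to the file\n";             # no filehandle given: uses the selected one
print STDOUT "goes to the screen\n";    # an explicit filehandle overrides select
select($previous);                      # put the old default back
close(OUTFILE);

# Read the file back to see which line ended up in it.
open(CHECK, $outName) || die "cannot reopen $outName: $!";
@written = <CHECK>;
close(CHECK);
unlink($outName);
print "the file holds: @written";
```

Saving the return value of select and restoring it afterward is one way to avoid hard-coding STDOUT as the filehandle to return to.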
Line 33,
close(OUTFILE);
seems just as out of place as line 24. The first time through the code, there isn't any open file. But you should get in the habit of closing your filehandles before opening a new file. This close takes care of closing the open filehandle for the remaining times through the loop when the filehandle is open.
Perl has lots of special variables to help make your programming task easier. For the novice, however, these special variables can make life very confusing. All kinds of neat things seem to be happening in the code, but you can't figure out what makes the code work. In this section, you will learn about some of the more common special variables. Perl has more special variables than are listed here, but this list includes the variables I think you'll see most of the time.
The input and output special variable ($|) affects when your print and write statements actually send data to your file. According to the Perl manual, it only affects the selected filehandle, so you first must use
select(FILEHANDLE);
before setting the input and output special variable ($|).
The input and output special variable ($|) can have an impact on your HTML and CGI programs. If you are printing to the default selected filehandle, which is STDOUT, and $| equals 0, your output is held in memory until Perl decides that it has enough output data to bother with. This is called output buffering and is an efficient method of managing printing. Printing is typically a very slow operation as far as the computer is concerned, so the computer tries to limit the number of times it prints by doing a bunch of printing at a time.
You normally don't care about this, but if you are sending HTML through a CGI program and you also are doing some other processing with that CGI program, you probably want the HTML to go to your user as soon as it's ready. Your computer may buffer that data until your program is done unless you tell it not to.
To make the computer send your data (HTML) as soon as it executes the print command, set $| to 1:
$|=1;
To let the computer buffer your data for efficiency, set $| to 0:
$|=0;
Remember that $| only affects the selected filehandle. If you want to be sure that you're affecting STDOUT or a particular file, always select the file before setting $|.
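The select-then-set pattern might look like the following sketch. The log file name and the LOG filehandle are hypothetical, chosen only for this illustration. The file size is checked before close to show that an unbuffered print has already reached the file.

```perl
#!/usr/bin/perl
$logName = "/tmp/flush_demo_$$.txt";     # hypothetical scratch file
open(LOG, ">$logName") || die "cannot open $logName: $!";

$previous = select(LOG);    # make LOG the selected filehandle
$| = 1;                     # turn off buffering for LOG only
select($previous);          # reselect the old default (normally STDOUT)
$| = 1;                     # turn off buffering for STDOUT, as a CGI program might

print LOG "request started\n";
$sizeBeforeClose = -s $logName;     # bytes already in the file, before any close
close(LOG);
unlink($logName);
print "bytes written before close: $sizeBeforeClose\n";
```

With $| left at 0 for LOG, the size check would most likely report an empty file, because the line would still be sitting in the output buffer.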
The global special variable ($_ ) is the stealth special variable. You never see it, even when it's used in action, unless it wants you to see it. This is probably one of the more popular and well-known special variables. The global special variable ($_ ) has different meanings based on how it is used in your program. That makes it even more confusing to the unwary. You'll think you understand this variable, because you've seen it used to print file data. But that's not its only meaning, and it only means this when reading files. For the sake of your own sanity, I suggest that you think of the global special variable ($_ ) as two separate variables.
First, when the global special variable ($_ ) is used in its input context, it is the default variable for data storage. This means that if you're using the angle brackets (<>) as an input symbol for reading from a file, each line you read from that file is placed, one line at a time, into $_. Read that sentence one more time, please. Don't get confused. The global special variable ($_ ) does not contain every line of the file you just read in. It contains the last line you read in from your file.
So, when you write
while (<>){...}
each line of the file is being read into $_ each time the conditional expression of the while loop is executed.
You also could write
while($line = <>){...}
and the line from your file would be stored in the variable $line.
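Both loop forms can be compared side by side. The sketch below assumes a hypothetical three-line scratch file and reads it twice, once through $_ and once through $line.

```perl
#!/usr/bin/perl
$name = "/tmp/default_var_demo_$$.txt";  # hypothetical scratch file
open(TMP, ">$name") || die "cannot create $name: $!";
print TMP "alpha\nbeta\ngamma\n";
close(TMP);

# First pass: each trip through the loop replaces the contents of $_.
open(IN, $name) || die "cannot open $name: $!";
while (<IN>) {
    $last = $_;     # only the most recently read line is in $_
    $seen++;
}
close(IN);

# Second pass: the same loop with an explicit variable.
open(IN, $name) || die "cannot open $name: $!";
while ($line = <IN>) {
    chop($line);            # remove the trailing newline
    push(@words, $line);
}
close(IN);
unlink($name);

print "$seen lines read; the last one was $last";
print "words: @words\n";
```

After the first loop, $last holds only the final line, which shows that $_ never contained the whole file.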
When you print something, the global special variable ($_ ) is used if you don't give the print command any data to print. The chop function follows the same rule: called without an argument, it removes the last character from $_.
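The second way to view the global special variable ($_ ) is as the default variable in Perl functions that operate on data. Specifically, the pattern-matching command, the substitution command, and the split function all operate on $_ when you don't give them any other data.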
Honestly, there are other Perl functions that use $_, but these are the most common ones I think you'll see. When you see these functions/commands and you don't see them operating on any specific data, they are using the global special variable ($_ ). And the global special variable ($_ ) had better have been set by something earlier in your code, or these functions are not going to work very well.
The pattern-matching command generally is used inside the if conditional expression:
if (/Pattern/){...}
In this case, Pattern is being matched against the global special variable ($_ ).
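Here is a short sketch of a pattern being matched against $_ without $_ ever being named. The sample NAME=value strings are hypothetical, and the foreach loop is used only as a convenient way to load each string into $_.

```perl
#!/usr/bin/perl
# Hypothetical sample settings in NAME=value form.
@settings = ("HTTP_USER_AGENT=Mozilla/1.1N",
             "SERVER_NAME=www.example.com",
             "HTTP_ACCEPT=text/html");

foreach (@settings) {       # a foreach with no loop variable also uses $_
    if (/^HTTP_/) {         # same as writing: if ($_ =~ /^HTTP_/)
        push(@requestHeaders, $_);
    }
}
print "request header variables: @requestHeaders\n";
```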
The substitution command is used quite frequently in this context:
$newdata = s/$OLD/$NEW/g ;
or even
s/$OLD/$NEW/g;
In both cases, if $OLD can be found in the global special variable ($_ ), each occurrence is replaced with $NEW and the changed text is stored back into the global special variable ($_ ). The substitution command itself returns the number of replacements it made, not the changed string, so in the first case $newdata receives that count. In the second case, the count is simply discarded.
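The following sketch shows what the substitution command leaves in $_ and what it returns. The values of $OLD and $NEW and the variable names $count and $original are hypothetical. The last statement shows a common Perl idiom for keeping the original text and placing the changed text in another variable.

```perl
#!/usr/bin/perl
$OLD = "cat";   # hypothetical values for illustration
$NEW = "dog";

$_ = "cat and cat";
$count = s/$OLD/$NEW/g;     # changes $_ in place; returns the number of replacements
print "$_ ($count replacements)\n";

# To leave the original alone, copy it first and bind the substitution
# to the copy with the =~ operator.
$original = "cat and cat";
($newdata = $original) =~ s/$OLD/$NEW/g;
print "$original -> $newdata\n";
```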
Split is one of my favorite functions. When you see it used without a variable as input, the global special variable ($_ ) is the default variable on which split operates. This means that the following code is equivalent:
split(/\s+/);
split(/\s+/,$_);
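To see the equivalence in practice, the sketch below splits a hypothetical request line both ways and keeps the resulting lists in two arrays whose names are chosen only for this example.

```perl
#!/usr/bin/perl
$_ = "GET /index.html HTTP/1.0";    # hypothetical line of input

@implicit = split(/\s+/);           # no second argument: split works on $_
@explicit = split(/\s+/, $_);       # the same call with $_ written out

print "method=$implicit[0] path=$implicit[1]\n";
print "both calls agree\n" if ("@implicit" eq "@explicit");
```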
You probably won't use the multiline special variable ($*) very often, but like most special tools, when you need it, you'll be glad you knew about it. $* changes the pattern-matching operators so that they match on multiple lines of input. Normally, each match is performed on just one line. As soon as a newline character is found, the match or substitution operator thinks it is done. Sometimes you want to read in several lines of data and match even if a newline character (\n) is in the middle of the line. When you want to do this type of matching, set $* to 1:
$*=1;
The default for $* is 0, which means to match only on one line at a time.
Remember to set the multiline special variable ($*) back to 0 when you're done using it for your special case:
$*=0;
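The $* variable belongs to older releases of Perl; it was removed in Perl 5.10. On a Perl 5 interpreter, the /m and /s pattern modifiers give the same kind of multiline matching on a per-pattern basis, so there is nothing to reset afterward. The following is a minimal sketch of that alternative, with hypothetical sample text and variable names, not a reproduction of how $* itself behaves.

```perl
#!/usr/bin/perl
$text = "first line\nsecond line\n";    # two lines held in one string

# Without a modifier, ^ matches only at the very start of the string.
$plain = ($text =~ /^second/)  ? 1 : 0;

# With /m, ^ also matches just after each embedded newline.
$multi = ($text =~ /^second/m) ? 1 : 0;

# Without /s, a dot will not match a newline; with /s, it will.
$dotPlain = ($text =~ /first.*second/)  ? 1 : 0;
$dotSpan  = ($text =~ /first.*second/s) ? 1 : 0;

print "plain=$plain multi=$multi dotPlain=$dotPlain dotSpan=$dotSpan\n";
```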
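The special variables ARGV, $ARGV, @ARGV, $#ARGV, and $0 are all closely related and tied to the command line. ARGV is the filehandle that the empty angle brackets (<>) read from; $ARGV holds the name of the file currently being read through <>; @ARGV is the array of command-line arguments; $#ARGV is the index of the last element of @ARGV, so it is -1 when there are no arguments; and $0 is the name of the Perl script being run.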
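The command-line variables can be explored without typing a real command line. In the sketch below, @ARGV is filled in by hand with hypothetical arguments in the OLD_NAME NEW_NAME FILE_LIST order used by Listing 6.12; the variable names $lastIndex and $argCount are likewise assumptions for this example.

```perl
#!/usr/bin/perl
# Pretend the script was started as: script OLD_NAME NEW_NAME file1.txt file2.txt
@ARGV = ("OLD_NAME", "NEW_NAME", "file1.txt", "file2.txt");

$lastIndex = $#ARGV;    # index of the last argument
$argCount  = @ARGV;     # an array in scalar context gives the number of elements

$OLD = shift(@ARGV);    # remove and return the first argument
$NEW = shift(@ARGV);    # remove and return the next one

# $0 is the name of this script; @ARGV now holds only the file list.
print "$0: replacing $OLD with $NEW in @ARGV\n";
print "last index was $lastIndex, argument count was $argCount\n";
```

Because $#ARGV is an index rather than a count, the test $#ARGV < 2 in Listing 6.12 is true whenever fewer than three arguments are supplied.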
In this chapter, you learned that there are three types of environment variables; the ones you get at the command line, within your CGI program, and for SSI commands are each different. This happens because the scope of environment variables is at the process level, and the process environment is different for each.
You learned that scope defines the area within which a variable can be used and that you can limit the scope of a variable to the enclosing code block (enclosed in curly braces) by using the Perl local function.
This chapter discussed the two types of CGI environment variables: the server environment variables and the environment variables based on HTTP request headers. The server environment variables always are available for your CGI program, but the set of HTTP request header environment variables differs with every client connection.
This chapter also covered how you can use the HTTP request header environment variables to get a lot of information about each visitor to your Web site, but getting the name of that visitor often is difficult. Finally, you learned that the Netscape cookie is an excellent means of maintaining information about each client who connects to your Web site.
In this chapter, you told us that the PATH environment variable is used for searching for programs. In the last chapter, you said this was done with the @INC array. What gives?
Would you believe me if I told you that I told you the truth both times? Well, I did. The difference is who or what is doing the looking. The @INC array is another of Perl's special variables, so it must be used by Perl. And, of course, it is. It is used only when you use the require function. The require function tells Perl to add whatever Perl code is in the require parameter list to the list of code it will execute. The require function uses only the list of directories in the @INC array as a search path. But when you try to execute a system command or another CGI program from within your CGI program, the PATH variable is used by the UNIX operating system to search for the command you requested.
If I modify my environment variables, will they be there when I try to use them the next time?
No. Environment variables have process scope. This means that they are available to every executing program within that process. As soon as your CGI program stops executing, however, the process that enclosed it ends. So any environment variables that you set end with that process. When your CGI program is started again, even if from exactly the same connection, an entirely new process is started with an entirely new set of environment variables.
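Process scope can be observed directly from a Perl script on a UNIX system. In this sketch, DEMO_VAR is a hypothetical variable name. The script sets it, lets a child shell read it, and then lets a second child shell change its own copy. It assumes a Bourne-compatible /bin/sh is available.

```perl
#!/usr/bin/perl
$ENV{'DEMO_VAR'} = "set by parent";     # hypothetical variable name

# A child process starts with a copy of the parent's environment.
$childSees = `echo \$DEMO_VAR`;         # the backslash leaves $DEMO_VAR for the shell
chop($childSees);                       # remove the trailing newline

# A second child changes its own copy and then exits.
system("DEMO_VAR=changed; export DEMO_VAR");

print "child saw: $childSees\n";
print "parent still has: $ENV{'DEMO_VAR'}\n";
```

The change made by the second child disappears when that child exits, which is the same thing that happens to any variable your CGI program sets once the program finishes.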