Chapter 1 explained what CGI is and how it fits into publishing on the World Wide Web. This chapter describes how data is passed between your Web server and your CGI script. You will discover where to look for the data you need and how to pass the results of your script back to the Web server. You will even learn how to send data from your CGI script directly back to the Web browser, bypassing the server altogether.

Remember, the Common Gateway Interface is simply a way of passing data between your Web browser/server and your CGI script. This chapter presents all of the tools you need to begin writing CGI scripts. So, let's start with how the Web server passes data to your CGI script and then move on to returning your results to the Web browser.
Passing data to your CGI script
You don't need to do much to ensure that your CGI script receives the necessary data from your Web server. The CGI specification already defines how this is done, and the task is performed automatically every time your Web server executes a CGI script. All of the relevant data sent to the server from the Web browser, such as form input, plus the HTTP request headers, is sent from the server to the CGI script either in environment variables or via standard input (stdin), which is the default location at which your program receives input. Because this task is done for you, all you have to know is where to look for the information you need.
Environment Variables
When a Web browser requests a CGI script from a Web server, the server starts the CGI program in what is termed a stateless environment. This means that each CGI script runs in its own environment and does not inherit values from the environment that the Web server is running under. This is important because many Web browsers can request the same CGI script at the same time, and the Web server can start many copies of the same script. Each copy of the script that is running concurrently must run independently of all the others; otherwise conflicts may arise. Because the Web server sets up a new environment for your CGI script, it places almost all of the information available to the script in environment variables. Table 2.1 lists the CGI environment variables.
Table 2.1: CGI Environment Variables

Variable: Meaning
AUTH_TYPE: Contains the authentication method used to validate the Web browser, if any is used. An example of an authentication method is a username/password scheme.
CONTENT_LENGTH: The length of the user-provided content from the Web page requesting the CGI script, which is sent via the user's Web browser. Because the user-provided content is passed to the CGI script as a string, this value is in bytes, with each byte representing one character.
CONTENT_TYPE: Contains the type of the data that accompanies the browser's request for the CGI script. Examples are text/html or image/jpeg.
GATEWAY_INTERFACE: Holds the version of the Common Gateway Interface being used. For version 1.1 of the CGI specification, this variable would be CGI/1.1.
PATH_INFO: Holds additional path information for the CGI script. This is usually the virtual path to another document in the document root that the CGI script will use. This value is set from the information appended to the URL requesting the CGI script. See PATH_TRANSLATED for an example.
PATH_TRANSLATED: Holds the value of PATH_INFO after the Web server has translated it from a virtual path into a physical path on the server's file system. For example, if /docs/info.html were appended to the URL requesting the CGI script, PATH_INFO would be /docs/info.html and PATH_TRANSLATED would be the full file-system path of that document under the server's document root. The exact directory depends on how the server is configured.
QUERY_STRING: Contains the user-provided data when the request method is GET. This data is appended along with a question mark to the referenced URL. For example, in the URL http://www.robertm.com/cgi-bin/answer.pl?state=CA, the QUERY_STRING would be state=CA.
REMOTE_ADDR: Stores the IP address of the machine running the Web browser requesting the CGI script.
REMOTE_HOST: Stores the domain name of the machine running the Web browser requesting the CGI script. If this information is unavailable to the Web server, REMOTE_ADDR will be set and REMOTE_HOST will not be set.
REMOTE_IDENT: Stores the user's login name only if the Web server supports identification.
REMOTE_USER: Stores the username the Web browser specified for authentication. This is only set if the server supports authentication and the CGI script is protected.
REQUEST_METHOD: Contains the request method used to request the CGI script. This can contain any of the valid HTTP request methods such as GET, HEAD, POST, PUT, and so on.
SCRIPT_NAME: Stores the virtual path and name of the CGI script being executed. This is used for self-referencing URLs.
SERVER_NAME: Contains the name, either domain name or IP address, of the machine running the Web server.
SERVER_PORT: Contains the port number on which the Web browser sent the request to the Web server.
SERVER_PROTOCOL: Contains the name and version of the protocol being used to make the request for the CGI script. In most cases, this will be the HTTP protocol and will look something like HTTP/1.0.
SERVER_SOFTWARE: Stores the name and version of the Web server software that executed the CGI script. For example, for the Netscape Communications Server version 1.1, the variable would be set to Netscape-Communications/1.1.
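To make the idea of a self-referencing URL a little more concrete, here is a minimal sketch that assembles one from SERVER_NAME, SERVER_PORT, and SCRIPT_NAME. The subroutine name self_url and the fallback values are my own assumptions for illustration; they are not part of the CGI specification.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Build a URL that points back at the running script.
# self_url is a hypothetical helper; it takes the environment as a
# list of name/value pairs so that it can be tried outside a server.
sub self_url {
    my (%env) = @_;

    # Fallback values are assumptions used only when a variable is unset.
    my $host = defined $env{'SERVER_NAME'} ? $env{'SERVER_NAME'} : 'localhost';
    my $port = defined $env{'SERVER_PORT'} ? $env{'SERVER_PORT'} : 80;
    my $path = defined $env{'SCRIPT_NAME'} ? $env{'SCRIPT_NAME'} : '';

    # Leave the port out of the URL when it is the HTTP default of 80.
    my $url = "http://$host";
    $url .= ":$port" if $port != 80;
    return $url . $path;
}

my $url = self_url(%ENV);
print "Content-type: text/html\n\n";
print "<A HREF=\"$url\">Run this script again</A>\n";
```

Passing the environment in as an argument, rather than reading %ENV inside the subroutine, is simply a convenience that makes the helper easy to try from the command line.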
In addition to the CGI environment variables, the Web server makes available all the HTTP request headers received from the Web browser. These are also placed in environment variables, all of which have the prefix HTTP_. Table 2.2 lists the HTTP request header environment variables.
Table 2.2: HTTP Request Header Environment Variables

HTTP Request Header: Meaning
HTTP_ACCEPT: Contains a comma-separated list of media types the browser can accept in response from the Web server. Examples are audio/basic, image/gif, text/*, */*. The last two examples contain the wildcard *, which is a stand-in for any string of characters. text/* means that all forms of text can be accepted; */* means that the browser will accept any content type.
HTTP_ACCEPT_ENCODING: Contains the valid encoding methods the browser can receive in response from the Web server. Examples are x-zip, x-stuffit, and x-tar.
HTTP_ACCEPT_LANGUAGE: Contains the browser's preferred language for a response from the Web server. However, responses in any language not specified in this variable are allowed. An example is en_UK, which is the English of the United Kingdom.
HTTP_AUTHORIZATION: Contains authorization information from the Web browser. Its value is used for the browser to authenticate itself with the Web server. There is not a single specific format for possible values of this field, and new formats may be added. One example is the user/password scheme, where the value, in my case, would be user robertm:mypassword.
HTTP_CHARGE_TO: Formats for this field are still undetermined. However, it is available to contain information for the account that is to be charged for the costs of receiving the requested data.
HTTP_FROM: Contains the name of the requesting user as supplied by the Web browser in an e-mail address format. Some examples are robertm@deltanet.com and rmcdanie@primenet.com.
HTTP_IF_MODIFIED_SINCE: Can contain a value specified in a valid ARPANET date standard, such as Weekday, DD-Mon-YY HH:MM:SS TIMEZONE. This field can be used in conjunction with the GET method to return the requested document only if it has changed since the date specified.
HTTP_PRAGMA: Holds the value of any special directives for the Web server. For instance, a proxy Web server has one valid value for a pragma request header, no-cache, which means that the proxy server should always request the document from the real Web server instead of returning a nonexpired cached copy.
HTTP_REFERER: Contains the URI (uniform resource identifier, which is a superset of URLs) of the document that contained the link to the currently requested document. An example would be http://www.thepalace.com/web-pages.html.
HTTP_USER_AGENT: Contains the name of the Web browser software that requested the document. An example is Mozilla/2.0 (Win95; I), which would be the user agent for the Netscape 2.0 browser for Windows 95.
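As a rough illustration of how these variables might be used, the sketch below shows two small helpers. The first derives the environment variable name from a request header name (uppercase, hyphens changed to underscores, prefixed with HTTP_). The second checks whether a comma-separated HTTP_ACCEPT list allows a given media type, honoring the * wildcard. Both subroutine names are hypothetical, and the matching deliberately ignores any parameters that follow a semicolon.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Convert an HTTP request header name such as User-Agent into the
# name of the environment variable that holds it (HTTP_USER_AGENT).
sub header_to_env_name {
    my ($header) = @_;
    my $name = uc $header;    # the variable name is uppercase
    $name =~ tr/-/_/;         # hyphens become underscores
    return "HTTP_$name";
}

# Return 1 if the comma-separated accept list allows $type, else 0.
# Simplification (an assumption of this sketch): parameters after a
# semicolon are dropped rather than interpreted.
sub accepts_type {
    my ($accept, $type) = @_;
    my ($major) = split(/\//, $type);
    foreach my $entry (split(/,/, $accept)) {
        $entry =~ s/;.*$//;           # drop any parameters
        $entry =~ s/^\s+|\s+$//g;     # trim surrounding spaces
        return 1 if $entry eq $type
                 || $entry eq "$major/*"
                 || $entry eq '*/*';
    }
    return 0;
}

my $var    = header_to_env_name('Accept');
my $accept = defined $ENV{$var} ? $ENV{$var} : '*/*';
my $ok     = accepts_type($accept, 'image/gif');

print "Content-type: text/plain\n\n";
print "$var = $accept\n";
print "image/gif is ", ($ok ? "" : "not "), "acceptable\n";
```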
Clearly there are many environment variables available to your CGI script. For the most part, you will only use a few of these. Of course, your objective will determine which variables you need for your project. Listing 2.1 shows a CGI script that displays the values of the CGI and HTTP request header environment variables.
Listing 2.1: The display.pl CGI Script

    #!/usr/local/bin/perl
    print "Content-type: text/html\n\n";
    print "AUTH_TYPE = $ENV{'AUTH_TYPE'}<BR>\n";
    print "CONTENT_LENGTH = $ENV{'CONTENT_LENGTH'}<BR>\n";
    print "CONTENT_TYPE = $ENV{'CONTENT_TYPE'}<BR>\n";
    print "GATEWAY_INTERFACE = $ENV{'GATEWAY_INTERFACE'}<BR>\n";
    print "PATH_INFO = $ENV{'PATH_INFO'}<BR>\n";
    print "PATH_TRANSLATED = $ENV{'PATH_TRANSLATED'}<BR>\n";
    print "QUERY_STRING = $ENV{'QUERY_STRING'}<BR>\n";
    print "REMOTE_ADDR = $ENV{'REMOTE_ADDR'}<BR>\n";
    print "REMOTE_HOST = $ENV{'REMOTE_HOST'}<BR>\n";
    print "REMOTE_IDENT = $ENV{'REMOTE_IDENT'}<BR>\n";
    print "REMOTE_USER = $ENV{'REMOTE_USER'}<BR>\n";
    print "REQUEST_METHOD = $ENV{'REQUEST_METHOD'}<BR>\n";
    print "SCRIPT_NAME = $ENV{'SCRIPT_NAME'}<BR>\n";
    print "SERVER_NAME = $ENV{'SERVER_NAME'}<BR>\n";
    print "SERVER_PORT = $ENV{'SERVER_PORT'}<BR>\n";
    print "SERVER_PROTOCOL = $ENV{'SERVER_PROTOCOL'}<BR>\n";
    print "SERVER_SOFTWARE = $ENV{'SERVER_SOFTWARE'}<BR>\n";
    print "HTTP_ACCEPT = $ENV{'HTTP_ACCEPT'}<BR>\n";
    print "HTTP_ACCEPT_ENCODING = $ENV{'HTTP_ACCEPT_ENCODING'}<BR>\n";
    print "HTTP_ACCEPT_LANGUAGE = $ENV{'HTTP_ACCEPT_LANGUAGE'}<BR>\n";
    print "HTTP_AUTHORIZATION = $ENV{'HTTP_AUTHORIZATION'}<BR>\n";
    print "HTTP_CHARGE_TO = $ENV{'HTTP_CHARGE_TO'}<BR>\n";
    print "HTTP_FROM = $ENV{'HTTP_FROM'}<BR>\n";
    print "HTTP_IF_MODIFIED_SINCE = $ENV{'HTTP_IF_MODIFIED_SINCE'}<BR>\n";
    print "HTTP_PRAGMA = $ENV{'HTTP_PRAGMA'}<BR>\n";
    print "HTTP_REFERER = $ENV{'HTTP_REFERER'}<BR>\n";
    print "HTTP_USER_AGENT = $ENV{'HTTP_USER_AGENT'}<BR>\n";
Note: Once again, to run this program on a Windows machine, remove the line #!/usr/local/bin/perl.
Place this code in a file called display.pl in your cgi-bin directory. You can then run it from your Web browser by requesting a URL of the form http://www.robertm.com/cgi-bin/display.pl. Remember, www.robertm.com is specific to my machine. In its place you need to specify the domain name or IP address of the machine running your Web server. Also, try running this script with different Web browsers and on different machines, or maybe even create an HTML page that has a link to the script, and notice how the values of the environment variables change.
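If you would rather not list every variable by hand, a more compact variant is sketched below. It loops over the whole %ENV associative array, so it also shows any other variables the server happens to set, and it escapes characters that have special meaning in HTML before printing them. The subroutine name escape_html is my own choice for this sketch.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Escape the characters that have a special meaning in HTML so that
# a value such as a user agent string is displayed literally.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print "Content-type: text/html\n\n";

# Print every environment variable, sorted by name.
foreach my $name (sort keys %ENV) {
    print "$name = ", escape_html($ENV{$name}), "<BR>\n";
}
```

Because this version prints everything in the environment, you may prefer the explicit list in Listing 2.1 on a server where the environment holds values you do not want to display.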
Standard Input
Under most circumstances, all the information your script needs will be contained in the environment variables. However, in some cases the Web server passes data to your CGI script by using standard input. When a Web browser requests a CGI script from a Web server with the request method of POST, which is most often used with forms in HTML, the user-provided data, if any, is sent via standard input. The Web server still assigns values to most of the environment variables discussed earlier. In fact, when the user-provided data is sent via standard input, you should always check the value of CONTENT_LENGTH and read only that many bytes, because the Web server does not send an EOF (end of file) at the end of the data.
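One way to follow that advice is sketched below: the script looks at REQUEST_METHOD, and for POST it reads exactly CONTENT_LENGTH bytes from standard input instead of waiting for an end of file. The subroutine name read_user_string and its two parameters (a reference to the environment and a filehandle) are assumptions made for this example.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Return the raw, still URL-encoded user data for the current request.
# $env is a reference to the environment hash and $in is the filehandle
# to read POST data from. Both are parameters (an assumption of this
# sketch) so the routine can be tried without a Web server.
sub read_user_string {
    my ($env, $in) = @_;
    my $method = defined $env->{'REQUEST_METHOD'}
               ? $env->{'REQUEST_METHOD'} : 'GET';

    if ($method eq 'POST') {
        my $length = defined $env->{'CONTENT_LENGTH'}
                   ? $env->{'CONTENT_LENGTH'} : 0;
        my $data = '';
        # Read exactly CONTENT_LENGTH bytes; no EOF will arrive.
        read($in, $data, $length) if $length > 0;
        return $data;
    }

    # For GET, the data is in the QUERY_STRING environment variable.
    return defined $env->{'QUERY_STRING'} ? $env->{'QUERY_STRING'} : '';
}

my $user_string = read_user_string(\%ENV, \*STDIN);
print "Content-type: text/plain\n\n";
print "Raw user data: $user_string\n";
```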
URL Encoding
Whether the Web server sends the user-provided data via standard input or by assigning it to the QUERY_STRING environment variable, the data is always sent as one long string of name/value pairs that is URL encoded. This encoding consists of changing all spaces to plus signs (+) and converting certain special characters into a percent sign (%) followed by the character's two-digit hexadecimal code. Before working with the data, you need to decode the string and separate the name/value pairs.
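To see what the encoded form looks like, it can help to run the encoding yourself. The following is only a sketch of the idea: the subroutine name url_encode and the exact set of characters it leaves untouched (letters, digits, underscore, period, and hyphen) are my assumptions, and it expects single-byte characters. A real browser may leave a slightly different set of characters unencoded.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# URL encode one field name or value.
# Assumption: everything except letters, digits, _ . - and the space
# is replaced by % followed by its two-digit hexadecimal code.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.\- ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;    # spaces become plus signs
    return $text;
}

my $name  = url_encode('first name');
my $value = url_encode('Robert & Co.');
print "$name=$value\n";    # prints first+name=Robert+%26+Co.
```

Notice that the & typed as part of the value is turned into %26, so it cannot be mistaken for the & that separates one name/value pair from the next.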
Each name/value pair consists of a field name and value separated by an equal sign (=). The field name is usually taken from the NAME attribute in one of the <INPUT>, <TEXTAREA>, or <SELECT> tags of an HTML form, and the value is usually data entered by the user submitting the form. The name/value pairs are separated by an ampersand (&). In Perl, a useful function called split separates a string into substrings at the delimiter that you specify. Listing 2.2 shows how you can split the name/value pairs. Each name/value pair is first placed into an array. Then the name and value are separated and placed into an associative array, with the name acting as the key and the value being assigned to the array element. By the way, an associative array is an array that is indexed by strings rather than integers. For associative arrays, the index is referred to as the key. So, for $name{'first'} = "Robert", the array is name, the key is first, and the value is Robert.
Listing 2.2 splits up the name/value pairs, but remember that the query string is URL encoded as well. You must decode the contents of the string in addition to splitting the name/value pairs. Listing 2.3 adds a line before the split that changes all plus signs (+) to spaces, and code within the foreach loop that replaces hexadecimal codes with their character equivalents.
Listing 2.2: Perl Code to Split Name/Value Pairs

    # This line places each name/value pair as a separate
    # element in the name_value_pairs array.
    @name_value_pairs = split(/&/, $user_string);

    # This loops over each element in the name_value_pairs
    # array, splits it on the = sign, and places the value
    # into the user_data associative array with the name as the
    # key.
    foreach $name_value_pair (@name_value_pairs) {
        ($name, $value) = split(/=/, $name_value_pair);

        # If the name value pair has already been given a value,
        # as in the case of multiple items being selected, then
        # separate the items with a " : ".
        if (defined($user_data{$name})) {
            $user_data{$name} .= " : " . $value;
        } else {
            $user_data{$name} = $value;
        }
    }
Listing 2.3: Perl Code to URL Decode User-Provided Data

    # This line changes the + signs to spaces.
    $user_string =~ s/\+/ /g;

    # This line places each name/value pair as a separate
    # element in the name_value_pairs array.
    @name_value_pairs = split(/&/, $user_string);

    # This loops over each element in the name_value_pairs
    # array, splits it on the = sign, and places the value
    # into the user_data associative array with the name as the
    # key.
    foreach $name_value_pair (@name_value_pairs) {
        ($name, $value) = split(/=/, $name_value_pair);

        # These two lines decode the values from any URL
        # hexadecimal encoding. The first section searches for a
        # hexadecimal number and the second part converts the
        # hex number to decimal and returns the character
        # equivalent.
        $name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/ge;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/ge;

        # If the name value pair has already been given a value,
        # as in the case of multiple items being selected, then
        # separate the items with a " : ".
        if (defined($user_data{$name})) {
            $user_data{$name} .= " : " . $value;
        } else {
            $user_data{$name} = $value;
        }
    }
You might wonder why the hexadecimal codes are not decoded across the entire string before it is split, even though plus signs are replaced with spaces at that stage. The reason is that the +, &, and = signs are among the special characters converted to hexadecimal when they appear in user-provided data. If the hexadecimal codes were decoded before the plus signs were converted or before the string was split, a +, &, or = typed by the user would be indistinguishable from one used as a separator, which could alter where a value is split or what value is actually displayed. This is why hexadecimal encoding is done: it enables your CGI script to distinguish a symbol typed by the user from one being used for a special purpose. For example, the & that separates name/value pairs is not changed to hexadecimal, whereas an & typed by the user is.

The code samples in both Listings 2.2 and 2.3 are not complete, ready-to-execute CGI scripts. Rather, they are just examples of the lines of code that perform the URL decoding. Chapter 4 will incorporate these lines of code into a subroutine for use within the examples in this book.
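If you want to experiment with the decoding order before then, the sketch below wraps the same steps in a small subroutine and runs it on a sample string. The name parse_user_string, the sample data, and the limit of 2 passed to split are my own choices for this illustration; this is not the subroutine developed in Chapter 4.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Decode a URL-encoded string of name/value pairs into a hash.
# The order of the steps follows Listing 2.3: plus signs first,
# then the split, and only then the hexadecimal codes.
sub parse_user_string {
    my ($user_string) = @_;
    my %user_data;

    $user_string =~ s/\+/ /g;                  # + signs to spaces
    foreach my $pair (split(/&/, $user_string)) {
        next if $pair eq '';                   # skip empty pairs
        my ($name, $value) = split(/=/, $pair, 2);
        $value = '' unless defined $value;     # a field left empty

        # Decode %XX codes in both the name and the value.
        foreach ($name, $value) {
            s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/ge;
        }

        # Join repeated names with " : ", as in Listing 2.3.
        if (defined $user_data{$name}) {
            $user_data{$name} .= " : " . $value;
        } else {
            $user_data{$name} = $value;
        }
    }
    return %user_data;
}

my %data = parse_user_string('company=Smith+%26+Sons&formula=1%2B1%3D2');
foreach my $name (sort keys %data) {
    print "$name = $data{$name}\n";
}
# prints:
#   company = Smith & Sons
#   formula = 1+1=2
```

Had the hexadecimal codes been decoded first, the %26 in the company name would have become an & before the split and the value would have been cut in two.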
Returning the results
Whenever a CGI script is called, it needs to return a result to the Web server, which then sends it to the Web browser that requested it. A CGI script also has the option of bypassing the Web server and returning the result directly to the Web browser. Whether the results are being sent to the Web server or directly to the Web browser, the CGI script must specify a valid header.
When a CGI script completes execution, it typically sends its results back to the Web server via standard output. The Web server receives the results, formats the proper HTTP response header, and returns all of the data to the Web browser. The first thing the CGI script must return to the Web server is a parsed header.
Parsed Headers
Every CGI script must precede any data returned to the Web server with a parsed header. A parsed header consists of the header lines output by your CGI script that are parsed by the Web server. It is in the same format as an HTTP header and can contain any of the HTTP response headers listed in Table 2.4. A parsed header must always be immediately followed by a blank line. Any lines in the parsed header that are not directives to the Web server are sent back to the Web browser as part of the HTTP response header. The current version of CGI, version 1.1, specifies three server directives, which are shown in Table 2.3.
Table 2.3: Server Directives for Parsed Headers

Directive: Meaning
Content-type: Specifies to the Web server the MIME type of the data being returned by the CGI script.
Location: Contains either the virtual path or the URL of a document that your CGI script wants returned to the Web browser requesting your script.
Status: Returns to the Web server an HTTP status line, which will then be returned to the Web browser. Status lines consist of a three-digit status code and the reason string. Examples are 404 Not Found and 403 Forbidden.
Here's an example of a parsed header being returned in a CGI script:

    #!/usr/local/bin/perl
    print "Content-type: text/html\n\n";
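The other two directives work the same way. As a minimal sketch, the hypothetical helper parsed_header below joins directive/value pairs into header lines and appends the required blank line; the script then uses it to return a Status directive together with a Content-type. The helper name and the message text are assumptions made for this example.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Build a parsed header from a list of directive/value pairs.
# The final "\n" is the blank line that must follow the header.
sub parsed_header {
    my (@fields) = @_;
    my $header = '';
    while (@fields) {
        my $name  = shift @fields;
        my $value = shift @fields;
        $header .= "$name: $value\n";
    }
    return $header . "\n";
}

# Tell the server to return a 404 status line to the browser,
# followed by a short HTML explanation.
print parsed_header('Status', '404 Not Found',
                    'Content-type', 'text/html');
print "<H1>Document not found</H1>\n";
```

To send the browser to another document instead, the script could print parsed_header('Location', '/index.html') and nothing else, where /index.html stands in for a virtual path on your own server.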
Bypassing the Server
Most Web servers allow you to send the output from your CGI script directly back to the Web browser rather than through the Web server. For the Netscape Communications server, you can activate this feature by preceding the name of your CGI script with nph-.
When your CGI script sends its output directly back to the Web browser, it has to specify a nonparsed header that must contain the proper HTTP response headers. Table 2.4 lists the HTTP response headers.
Table 2.4: HTTP Response Headers

HTTP Response Header: Meaning
ALLOWED: Specifies to the requesting browser which request methods are allowed. Examples are GET, HEAD, and PUT.
CONTENT-ENCODING: Specifies which encoding method is used. Examples are x-zip, x-stuffit, and x-tar.
CONTENT-LANGUAGE: Specifies the language the returning document is in. An example is en, which is English in one of its forms.
CONTENT-LENGTH: Specifies the size in bytes of the returning data.
CONTENT-TRANSFER-ENCODING: Specifies the encoding of the data between the Web server and the Web browser. The default is binary.
CONTENT-TYPE: Contains the type of the data being transferred. Examples are text/html and image/gif.
COST: Will contain the cost of the retrieval of the object being requested. The format of this header has not yet been specified.
DATE: Contains a creation date of the requested object in a valid ARPANET format.
DERIVED-FROM: Can contain a version number for the requested object, allowing for version control of editable documents.
EXPIRES: Contains an expiration date for the requested information, after which the document should be retrieved again. This header is used primarily for caching mechanisms and is in an ARPANET date format.
LAST-MODIFIED: Contains the date when the requested object was last modified. This header is in an ARPANET date format.
LINK: Holds information about the document being returned. You can use it to specify information such as the inclusion of another URL within the returned document or the creator of the returned object.
MESSAGE-ID: Contains a unique identifier for the HTTP message.
PUBLIC: Fairly similar to the ALLOWED response header. However, it specifies the request methods that anyone can use, not just the requesting browser. Examples for this header are GET, HEAD, and TEXTSEARCH.
TITLE: Contains the title of the document being returned. For an HTML file, this is equivalent to the value contained within the <TITLE></TITLE> tags.
URI: Gives the URI (uniform resource identifier) where the requested object can be found. This will not always be the URL the user entered in the Web browser requesting the returned object. However, it will point to an object that should be the same as the one being returned, with some degree of variance. An example is http://www.robertm.com/Group-one/section1.html; vary=language, version, which gives a URI for the same document, which might vary in language or version.
VERSION: Defines the version of an object that can be changed. Its format is currently undefined.
You do not need to provide every HTTP response header to have a valid nonparsed header. For example, a CGI script with a valid nonparsed header would look like this:

    #!/usr/local/bin/perl
    print "HTTP/1.0 200 OK\n";
    print "Server: Netscape-Communications/1.1\n";
    print "Content-type: text/html\n\n";
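Rather than hard-coding the protocol and server name, a script could take them from the SERVER_PROTOCOL and SERVER_SOFTWARE environment variables described in Table 2.1. The following is a sketch under that assumption. The subroutine name nonparsed_header is hypothetical, the fallback to HTTP/1.0 is my own choice, and the header lines end with \n to match the example above.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Build a nonparsed header: a status line, an optional Server line,
# a Content-type line, and the blank line that ends the header.
# $env is a reference to the environment hash.
sub nonparsed_header {
    my ($env, $status, $content_type) = @_;

    # Fall back to HTTP/1.0 when SERVER_PROTOCOL is not set
    # (an assumption of this sketch).
    my $protocol = defined $env->{'SERVER_PROTOCOL'}
                 ? $env->{'SERVER_PROTOCOL'} : 'HTTP/1.0';

    my $header = "$protocol $status\n";
    $header .= "Server: $env->{'SERVER_SOFTWARE'}\n"
        if defined $env->{'SERVER_SOFTWARE'};
    $header .= "Content-type: $content_type\n\n";
    return $header;
}

$| = 1;    # do not buffer output on its way to the browser
print nonparsed_header(\%ENV, '200 OK', 'text/html');
print "<H1>Sent without being parsed by the server</H1>\n";
```

Remember that the server only treats the output this way if the script has been set up as a nonparsed-header script, for example by giving it the nph- prefix described earlier.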