Chapter 10 Site Administration

by Shelley Powers

CONTENTS

Working with Web Server Log Files
- Generating HTML Output from a Log File
- Reviewing the AccessWatch Log Analyzer
Understanding File Maintenance
Understanding Robots and the Robot-Exclusion Standard
Configuring Some Common Web Servers
Examining Some Other File- and Site-Administration Issues
From Here

Previous chapters discussed creating both static and dynamic content as well as security. Two facts arise from this file creation: Files are generated, and maintenance must be maintained on your site if it is to perform at peak efficiency.

Depending on your Web server, you need to perform some configuration when you install the server, and you need to perform periodic maintenance. A detailed description of what is involved is beyond the scope of this book, but some maintenance tasks can be performed by Perl programs. This chapter discusses configuring the Apache Web server (from the Apache Group) and NCSA's httpd Web server.

The most useful tool for understanding how and when your Web site pages and applications are being accessed is the log file that your Web server generates. This log file can show, among other things, which pages are being accessed, by whom (usually, in a generic sense), and when.

Additionally, if your site runs Common Gateway Interface (CGI) or other applications, you most likely need applications that remove orphaned files from processes that the Web page reader began but never finished.

Your site may be visited by something other than humans. Web robots-also known as Web bots, spiders, and wanderers-may visit your site. This technique is how search engines, such as WebCrawler (http://www.webcrawler.com), search for sites to add to their collections. Sometimes, these visitors take a quick peek around and leave quietly, and sometimes, they don't.

Working with Web Server Log Files

Each Web server provides some form of log file that records who and what accesses a specific HTML page or graphic. A terrific site called WebCompare (http://www.webcompare.com/) provides an overall comparison of the major Web servers. From this site, you can see which Web servers follow the CERN/NCSA common log format, which is detailed next. In addition, you can find out which sites can customize log files or write to multiple log files. You may be surprised by the number of Web servers that are on the market.

Most major Web servers provide certain information in their access log files. You can find the format for this information at http://www.w3.org/pub/WWW/Daemon/User/Config/Logging.html#common-logfile-format. That site contains the following line:


remotehost rfc931 authuser [date] "request" status bytes

The items listed in the preceding line are:

remotehost: host name or IP address of remote host
rfc931: remote log name of the user
authuser: user name (if authentication occurs)
date: date and time of request
request: full HTTP request, including any data associated with the request
status: HTTP status code
bytes: content length of requested document

Following is an example of a log-file listing from a log file generated by O'Reilly's WebSite Web server in Windows NT:


204.31.113.138 www.yasd.com - [03/Jul/1996:06:56:12 -0800] 

"GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593

Figure 10.1 shows an example of a log file generated by the Apache Web server.

Figure 10.1 : This log file was created by the NCSA Apache Web server.

Both Web servers provide the date and time when the HTTP request was made, the HTTP request, and the status. The first example does not have access to DNSLookup, which would pull up the DNS alias for the IP address, if available. The second example shows the DNS alias. In addition, the first example displays the site that is accessed (in this case, www.yasd.com). The second example would display the remote log name if the Web server could access it; because it cannot, it displays unknown. Finally, because none of the HTTP requests were made to a secure site, there is no authorized user name that would have displayed where the dash (-) is.

Each HTTP request is logged. The first request is for an HTML document, and the second is for a JPEG-format graphic. If a site has several graphics and pages, the log file can get rather large. In addition, pulling useful information from the log file is difficult if you try to read the file as it is.

To pull useful information out of log files, most people use one of the existing log-file analyzers or create their own. These utilities can generate a text-file analysis based in HTML and even display results in graphic form. A good place to look for existing freeware, shareware, or commercial log-analysis tools is the Yahoo subdirectory http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/.

The following two sections provide samples of Perl code that can access a log file and generate two types of output: an HTML document and a VRML (Virtual Reality Modeling Language) document.

Generating HTML Output from a Log File

Regardless of the type of output, you must open a log file and read in the entries. You can read an entry into one variable for processing, or you can split the entry into its components. To read an entry as-is in Perl, you use the following code sample:


open(LOG_FILE, "< " . $file_name) || die "Could not open log file.";

foreach $line(<LOG_FILE>) {

     // do some processing

  .

  .

  .

}

This code opens the log file for reading and accesses the file one line at a time, loading the line into the variable $line. To split the contents of the line, use the following code, which is the same as the preceding code sample except for the addition of a split command:


open(LOG_FILE, "< " . $file_name) || die "Could not open log file.";

foreach $line(<LOG_FILE>) {

     // do some processing

     ($dns, $rfcuser,$authuser,$dt1,$dt2,$commethod,$comnd,$stat,$lnth) = split(' ',$line);

  .

  .

}

The preceding code splits the access log entry in either of the log-file examples shown in "Working with Web Server Log Files" earlier in this chapter. You can also load the entry elements directly into an array, as follows:


open(LOG_FILE, "< " . $file_name) || die "Could not open log file.";

foreach $line(<LOG_FILE>) {

     // do some processing

     ($dns, $rfcuser,$authuser,$dt1,$dt2,$commethod,$comnd,$stat,$lnth) = split(' ',$line);

  .

  .

}

When you have access to the log entries, you can use the values to generate HTML, based on several factors. If you want to generate an HTML file that lists the number of accesses by document, you can code something like the following:


#!/usr/local/bin/perl

.

.

use CGI;

$query = new CGI;

.

.

open(LOG_FILE, "< " . $file_name) || die "Could not open log file.";

foreach $line(<LOG_FILE>) {

     // do some processing

     ($dns, $rfcuser,$authuser,$dt1,$dt2,$commethod,$comnd,$stat,$lnth) = split(' ',$line);

  .

  .

     if (index($comnd, "somedoc.html") >= 0) {

          $counter++;

       .

       .

     }

}

Then you can output the variables by using standard HTML output procedures, as follows:


print $query->header;

print $query->start_html('The Access Page');

print $query->h1("Accesses Per Page");

.

.

print "<p> Page somedoc.html was accessed " . $counter ." times";

An alternative method that can provide some graphics output is to print an asterisk (*) for each access. This method provides output similar to that shown in figure 10.2.

Figure 10.2 : This figure shows the log-file analysis results.

To do something like creating a text-based graphic, you should modify the code to output the results by using an HTML table, as in the following example:


print $query->header;

print $query->start_html('The Access Page');

print $query->h1("Accesses Per Page");

.

.

print "<table cellpadding=5>";

print "<tr><td> somedoc.html </td><td>";

for (i = 1; I <= $counter; i++) {

     print "*";

     }

print "</td></tr>";

.

.

print "</table>";

Reviewing the AccessWatch Log Analyzer

Several excellent log-analysis tools, written in a variety of programming languages, are available free or for a small fee. A particular favorite of mine is AccessWatch, by Dave Maher. Access Watch-a simple-to-use, easy-to-understand Perl application that provides sophisticated output with minimal complex, convoluted coding-is accessible on the CD-ROM that comes with this book. This tool is a favorite of mine not only because of its unusual and colorful output (see figs. 10.3, 10.4, and 10.5), but also because of how well the author documented the installation and configuration procedures.

Figure 10.3 : This figure shows AccessWatch's summary statistics.

Figure 10.4 : This figure shows AccessWatch's hourly statistics.

Figure 10.5 : This figure shows AccessWatch's page demand.

AccessWatch is not a CGI application; it is meant to be run manually or set to run as a cron job (more on cron later in this chapter). The application generates an HTML document called index.html, which can then be accessed with a Web browsing tool. AccessWatch analyzes the current day's accesses and provides statistics such as the number of accesses by hour and a projection of the total count for the day based on previous access patterns. In addition, the application displays a graphic representing the number of accesses for each target file; you can display the detailed access information, if you want.

One innovative aspect of AccessWatch is the graphics. Other applications use one technique or another to generate a graphics-based log analysis. The program 3DStats, written in C, generates VRML commands to create a 3-D model of log accesses. You can access this application at http://www.netstore.de/Supply/3Dstats/. Another program, Getgraph, is a Perl application that uses tools such as GIFTrans and gnuplot to create GIF files for display. You can find Getgraph at http://www.tcp.chem.tue.nl/stats/script/. Other log-analysis tools that provide graphical output are gwstat (http://dis.cs.umass.edu/stats/gwstat.html), which uses Xmgr; and Raytraced Access Stats (http://web.sau.edu/~mkruse/www/scripts/access3.html), which uses the POV-Ray raytracer.

AccessWatch creates small GIF files that form the bars of the display. The application includes a subroutine that generates the HTML to display the appropriate GIF file as a vertical bar (see Listing 10.1) or as a horizontal bar (see Listing 10.2).

Listing 10.1 Displaying a GIF File as a Vertical Bar (accesswatch.pl: PrintVertBar)


#----------------------------------------------------------------------------#

#  AccessWatch function - PrintBarVert

#    Purpose  : Prints a vertical bar with height as specified by argument.

#----------------------------------------------------------------------------#

sub PrintBarVert {

    local($pct) = $_[0];

    local($colorbar) = $vertbar{$_[1]};



    local($scale) = 0;

    $scale = $pct/$stat{'maxhouraccess'} * 200 if ($stat{'maxhouraccess'});



    print OUT "<IMG SRC=\"$colorbar\" ";

    printf OUT ("HEIGHT=%d WIDTH=10 BORDER=1 ALT=\"\">", $scale);

}

Listing 10.2 Displaying a GIF File as a Horizontal Bar (accesswatch.pl: PrintBarHoriz)


#----------------------------------------------------------------------------#

#  AccessWatch function - PrintBarHoriz

#    Purpose  : Prints a horizontal bar with width as specified by argument.

#----------------------------------------------------------------------------#

sub PrintBarHoriz {

    local($pct) = $_[0];

    local($colorbar) = $horizbar{$_[1]};

    local($scale) = 1;



    $scale = ($pct*8)/log $pct + 1 if ($pct > 0);

    print OUT "<IMG SRC=\"$colorbar\" ALT=\"";

    print OUT "*" x ($pct/3 + 1) . "\" ";

    printf OUT ("HEIGHT=15 WIDTH=%d BORDER=1>", $scale);

}

To use the PrintBarVert subroutine, the height is calculated and passed as an argument. This process is demonstrated in the subroutine PrintTableHourlyStats (see Listing 10.3), which prints the hourly statistics.

Listing 10.3 HTML Generating the Hourly Access Statistics (accesswatch.pl: PrintTableHourlyStats)


#----------------------------------------------------------------------------#

#  AccessWatch function - PrintTableHourlyStats

#    Purpose  : Prints bar graph of accesses over the course of the current

#                   day. Thanks very much to Paul Blackman for his work on

#                   this function.

#----------------------------------------------------------------------------#

sub PrintTableHourlyStats {



local($hourBar) = "img/hourbar.gif";

local($hour, $pct);



    print OUT <<EOM;

<TABLE BORDER=1 WIDTH=100%>

<TR><TH COLSPAN=3><HR SIZE=5>Hourly Statistics<HR SIZE=5></TH></TR>

<TR>

EOM

    print OUT "<TD ROWSPAN=11>";

    foreach $hour ('00'..'23') {

     if ($stat{'hr'.$hour} > 0.9*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 9);

     }

     elsif ($stat{'hr'.$hour} > 0.8*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 8);

     }

     elsif ($stat{'hr'.$hour} > 0.7*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 7);

     }

     elsif ($stat{'hr'.$hour} > 0.6*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 6);

     }

     elsif ($stat{'hr'.$hour} > 0.5*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 5);

     }

     elsif ($stat{'hr'.$hour} > 0.4*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 4);

     }

     elsif ($stat{'hr'.$hour} > 0.3*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 3);

     }

     elsif ($stat{'hr'.$hour} > 0.2*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 2);

     }

     elsif ($stat{'hr'.$hour} > 0.1*$stat{'maxhouraccess'}) {

         &PrintBarVert($stat{'hr'.$hour}, 1);

     }

     elsif ($stat{'hr'.$hour} > 0) {

         &PrintBarVert($stat{'hr'.$hour}, 0);

     }

     else {

         &PrintBarVert(0, -1);

     }

    }



    print OUT <<EOM;

<BR>

<IMG SRC="$hourBar" WIDTH=288 HEIGHT=22 BORDER=0 HSPACE=0 VSPACE=0 ALT="">

</TD>

<TD COLSPAN=2><TABLE BORDER=1 WIDTH=100%>

<TR><TH ALIGN=RIGHT>Avg Accesses/Hour</TH><TD ALIGN=RIGHT>$stat{'accessesPerHour'}</TD></TR>

<TR><TH ALIGN=RIGHT>Max Accesses/Hour</TH><TD ALIGN=RIGHT>$stat{'maxhouraccess'}</TD></TR>

<TR><TH ALIGN=RIGHT>Min Accesses/Hour</TH><TD ALIGN=RIGHT>$stat{'minhouraccess'}</TD></TR>

<TR><TH ALIGN=RIGHT>Accesses/Day</TH><TD ALIGN=RIGHT>$stat{'accessesPerDay'}</TD></TR>

</TABLE></TD></TR>

EOM



foreach $pct (0..9) {

     $img = 9 - $pct;

     print OUT "<TR><TD ALIGN=LEFT><IMG SRC=\"$vertbar{$img}\" 

           HEIGHT=8 WIDTH=10 BORDER=1 ALT=\"\"> &gt ";

     printf OUT ("%d%%</TD>", (9 - $pct)*10);

     printf OUT ("<TD ALIGN=RIGHT>%d accesses</TD></TR>\n", 

           (1 - $pct/10) * $stat{'maxhouraccess'});

    }



    print OUT <<EOM;

</TABLE><P>

EOM



}

The value of $hour that is passed to the PrintBarVert subroutine is captured in the RecordStats subroutine, shown in Listing 10.4.

Listing 10.4 RecordStats Stores from One Log Input Line (accesswatch.pl)


#----------------------------------------------------------------------------#

#  AccessWatch function - RecordStats

#    Purpose  : Takes a single access as input, and updates the appropriate

#                   counters and arrays.

#----------------------------------------------------------------------------#

sub RecordStats {

    #tally server information, such as domain extensions, total accesses,

    # and page information



    local($hour, $minute, $second, $remote, $page) = @_;



    $remote =~ tr/[A-Z]/[a-z]/;



    if ($remote !~ /\./) { $remote .= ".$orgdomain"; }

      #takes care of those internal accesses that do not get fully

      # qualified in the log name -> name.orgname.ext

    local($domainExt) = &GetDomainExtension($remote, 1);



    $stat{'accesses'}++;

    $domains{$domainExt}++;

    $hosts{$remote}++;

    $pages{$page}++;

    $stat{"hr".$hour}++;



    push (@accesses, "$hour $min $sec $remote $page") if $details;



}

The rest of the code for this application is on the CD-ROM that comes with this book, in the zipped file ACCESSWATCH_TAR.GZ. You can open the file in UNIX and use WinZip in Windows 95 and NT.

Understanding File Maintenance

Unless a Web site is very simple, containing only one level of HTML documents and no CGI or other applications, you need to establish procedures and probably create code for file maintenance. The preceding section demonstrated some techniques for analyzing the log files that are appended by the Web server. If an application has CGI applications that generate file output, you also need to manage those files, in addition to any database files with which the site may interact.

In Chapter 7 "Dynamic and Interactive HTML Content in Perl and CGI," you learned how to create a simplified version of a shopping-cart application. One side effect of this application is the creation of a temporary file to hold the contents of the shopping cart while the cart user is accessing items. When the user finishes the shopping process, the file is deleted. What happens if the shopping-cart user exits the site before reaching the finishing stage? The way that the application is written, it would leave this file on the system, which will eventually fill any free space allocated for the file.

Additionally, a CGI application may create a file that needs to be moved to a protected subsite for other forms of processing. The CGI application cannot move the file, because it could be running under the standard user name nobody-the user that most Web servers assign for Web-page access. This "user" does not have permission to move a file to a restricted area.

The most popular way to handle these types of file-management issues is to use a scheduler that performs maintenance activities at predefined times. In the UNIX environment, this daemon is cron. (A version of cron also is available for the Macintosh.) In Windows NT, you can use at. Alternatively, you can use NTCRND21, which is available at http://www.omen.com.au/Files/disk12/a04fa.html.

Using the UNIX version as an example, the site administrator can access or create a shell script that will access the date and time when a file was last accessed. (If the file is older than a specified age, the script removes it.) In the case in which the file is being moved, the script could access a particular subdirectory; move its contents (or only the files that have a certain extension); and kick off another application that will process them when they have been moved.

After you create the script, you need to set it up as a cron job. In UNIX, you accomplish this task by using the crontab, batch, or at command. The crontab command schedules a job to be run at a regular time for every specified period-such as once a day, week, month, or year. The at and batch commands are for batch jobs and are not used as commonly as crontab is.

For more information on schedulers, check your operating-system documentation, and check with the system administrators at your site.

Understanding Robots and the Robot-Exclusion Standard

A Web robot (also known as a wanderer or spider) is an automated application that moves about the Web, either on your local site or in a broader domain, by accessing a document and then following any URLs that the document contains. A well-known example of this type of robot is WebCrawler, which traverses the Web to add documents to its search engine.

Robots can be handy little beasties; they can perform functions such as testing the links in all the HTML documents on a specific Web site and printing a report of the links that are no longer valid. As a Web page reader, you can understand how frustrating it can be to access a link from a site, only to get the usual HTTP/1.0 404 Object Not Found error.

Robots also can be little nightmares if you have one that is not well written or intentionally not well-behaved. A robot can access a site faster than the site can handle the access and overwhelm the system. Or a robot can get into a recursive loop and gradually overwhelm a system's resources or slow the system until it is virtually unusable.

In 1994, the participants in the robots mailing list (robots-request@nexor.co.uk) reached a consensus to create a standard robot-exclusion policy. This policy allows for a file called ROBOTS.TXT, which is placed at the local URL /robots.txt. The file lists a user agent (which is a particular robot) and then lists the agent's disallowed URLs. The following forbids all robot entry to any site whose URL begins with /main/:


# robots.txt for http://www.somesite.com/



User-agent: *

Disallow: /main/

When the preceding format is used, any robot that honors the robot-exclusion standard knows that it cannot traverse any site whose URL begins with /main/.

Following is an example that excludes all robots except a particular robot with the user agent someagent:


# robots.txt for http://www.somesite.com/



User-agent: *

Disallow: /main/



# someagent

User-agent: someagent

Disallow:

The preceding code forbids entry to /main/ to any robot that honors the robot-exclusion standard except someagent. Using the term disallow with no URL would remove any previous disallow statements.

Finally, to forbid access to any robot that honors the robot-exclusion standard, you would use the following:


# go away

User-agent: *

Disallow: /

You can see when a robot that honors the robot-exclusion standard accesses your site, because you will have a recorded HTTP entry similar to the following:


204.162.99.205 www.yasd.com - [04/Jul/1996:15:30:43 -0800] "GET /robots.txt HTTP/1.0" 404 0

This entry is from an actual log file. Using the Windows Sockets Ping client application (which you can download from http://www.vietinfo.com/resource/html/networks.html), I found that the robot was from the DNS alias backfire.ultraseek.com. From my browser, I accessed http://www.ultraseek.com/ and found that the company maintains the search engine of InfoSeek, which is available at http://www.infoseek.com/. The fact that the robot attempted to access the ROBOTS.TXT file shows that this robot program is complying with the no-robots exclusion standard, and because my site performance has never degraded when this robot visits, I can also assume that it is a well-behaved robot.

Following is another entry in the log file for the same month:


204.62.245.168 www.yasd.com - [11/Jul/1996:19:33:47 -0800] "GET /robots.txt HTTP/1.0" 404 0

Again using the Ping program, I found that the IP address had the DNS alias crawl3.atext.com. Using this alias as a URL, I accessed http://www.atext.com and found that the robot belongs to the people who bring us the Excite search engine (http://www.excite.com/). The people at Excite also have a nice, clean, and easy-to-traverse Web site and maintain city.net, a knowledge base of information about communities around the world (http://www.city.net/).

In the past few paragraphs, I have mentioned those robots that comply with the robot-exclusion standard. This standard is not enforced. A robot does not have to access this ROBOTS.TXTfile.

Following are some of the well-known robots that support the exclusion standard:

The Ahoy Homepage Finder (http://metacrawler.cs.washington.edu:6060/doc/home.html) searches out personal home pages on the Web.
The FunnelWeb Search Agent (http://funnelweb.net.au/) provides searches in the South Pacific, including Australia and New Zealand.
The ht://Dig application (http://htdig.sdsu.edu/) provides search capabilities and an index for intranet use only.
The Hyper-Decontextualizer tool (http://www.tricon.net/Comm/synapse/spider/) takes the words that you enter and links them to some random site. The tool is fun to play with, and the site is worth a visit.
The InfoSeek robot (http://www.infoseek.com), mentioned previously in this section, is a general Web-search service tool.
The Inktomi Slurp (http://inktomi.berkeley.edu/) has an irresistible name and is a well-known search engine.

The list could go on and on. You can see these and other sites listed in the Web Robots Database (http://info.webcrawler.com/mak/projects/robots/active.html), which is maintained by WebCrawler.

Configuring Some Common Web Servers

Other site-administration and site-maintenance tasks have to do with the configuration of the Web server. You may need to create permissions for users, start servers running, kill processes that are causing problems, and perform other administrative tasks. Additionally, you need to perform upgrades not only on the Web server software, but probably also on all the supporting software (compilers, databases, and so on).

The following sections discuss some of the installation and configuration tasks involved in creating Web applications (particularly with Perl) for some common Web servers.

O'Reilly's WebSite

WebSite (http://website.ora.com/) is a popular Windows NT and Windows 95 32-bit Web server, due to its features and price. After you install this application, a tabbed property sheet allows you to configure such aspects as CGI access, user access, mapping, and logging. The details on setting up the site for CGI applications are provided in a book that comes with the installation software.

NCSA httpd

NCSA httpd is a popular UNIX-based Web server; you can download it for free from http://hoohoo.ncsa.uiuc.edu/. After installation, a subdirectory called CONF contains the configuration file HTTPD.CONF, which you access and change to customize the installation.

The configuration file contains several directives, including the following:

AccessConfig: global access configuration file. The access configuration file establishes what server-side includes (SSI) are allowed and which directive controls can be overridden by a subdirectory-based access control file.
ErrorLog: file in which the Web server will log errors.
AuthUserFile: file that contains users and passwords.
AuthName: file that sets the authorization realm.
AuthType: authorization type.
AuthGroupFile: user groups for authentication.
VirtualHost: multiple responses for multiple IP addresses.

Several other directives are allowable for the configuration file; you can review them at http://hoohoo.ncsa.uiuc.edu/docs/setup/httpd/Overview.html.

In addition to the server-configuration file, you'll find a file for configuring the server resources (SRM.CONF). This file contains the AddType directive, which adds MIME types for the server. Without this directive, the server does not know how to process a file that has a certain extension. Another important directive is the AddEncoding directive, which allows you to add file-encoding types, such as x-gzip encoding for the .GZ file extension.

Access for the Web server is maintained in the global access configuration file and in individual access files that are created for specific directories.

Setting up CGI for a NCSA httpd Web server is a simple process. First, you define which subdirectory contains scripts that will be executed by means of the ScriptAlias server directive. As documented by NCSA, the disadvantage of using this technique is that everyone would need to access and use the same subdirectory. In a virtual-host situation, this situation is highly unlikely.

Another technique for defining CGI executables is to define the MIME types for the CGI applications as executable by using the AddType resource directive and the extension, as in the following example:


AddType application/x-httpd-cgi .cgi

This code instructs the Web server to execute the file instead of attempting to read it when a Web page reader accesses a file that has this extension. My UNIX-based virtual Web site uses this technique.

To learn more about installing and configuring an NCSA httpd Web server, go to http://hoohoo.ncsa.uiuc.edu/docs/Overview.html.

Apache

The Apache Group's Web server, Apache, is available at http://www.apache.org/. Apache has a configuration setup that is very similar to that of NCSA httpd. Three files are used to configure the Apache Web server: SRM.CONF, ACCESS.CONF, and HTTPD.CONF.

The directives that Apache supports are listed in Table 10.1. Reading the directives in the table (and accessing more information about them at http://www.apache.org/docs/directives.html) is a demystifying experience. If you are a Web-application developer but not necessarily a Webmaster, the information in this table allows you to communicate with your Webmaster in a more meaningful manner. If NCSA httpd has a corresponding directive, Y appears in the NCSA httpd column; if not, N appears in that column; if unclear, ? appears in the column.

Table 10.1 Apache Web Server Configuration Directives

Directive	Purpose	NCSA httpd?
`AccessConfig`	Access configuration file name	Y
`AccessFileName`	Local access file name	Y
`Action`	Action to activate CGI script for a specific MIME type	?
`AddDescription`	Description of file if `FancyIndexing` is set	Y
`AddEncoding`	Allows the Webmaster to add file encoding types, such as x-gzip encoding	Y
`AddHandler`	Maps handler to file extension	?
`AddIcon`	Icon to display if `FancyIndexing` is set	Y
`AddIconByEncoding`	Icon to display next to encoded files with `FancyIndexing`	Y
`AddIconByType`	Icon for MIME type files if `FancyIndexing` is set	Y
`AddLanguage`	Adds file extension to describe the language content	?
`AddType`	Adds MIME-type extension	Y
`AgentLog`	File in which `UserAgent` requests are logged	Y
`Alias`	Allows alias path	Y
`allow`	Which hosts can access what directories	?
`AllowOverride`	Indicates whether local access file can override previous access file information	?
`Anonymous`	User name that is allowed access without password verification	?
`Anonymous_ Authorative`	Must match `Anonymous` directive, or access will be forbidden	?
`Anonymous_LogEmail`	Indicates whether anonymous password is logged	?
`Anonymous_NoUserID`	Can leave out user name and password	?
`Anonymous_ VerifyEmain`	Indicates whether verification of anonymous password occurs	?
`AuthDBMGroupFile`	DBM file containing user groups for authentication	?
`AuthDBMUserFile`	DBM file containing users and passwords	?
`AuthDigestFile`	Digest authentication file containing users and passwords	?
`AuthGroup`	File containing user groups for user authentication	Y
`AuthName`	Authorization realm name	Y
`AuthType`	Authorization type (basic only)	Y
`AuthUserFile`	File containing names and passwords for user authentication	Y
`BindAddress`	`*` for all IP addresses or a specific IP address	Y
`CacheDefaultExpire`	Expire time default if document is fetched via protocol that does not support expire times	?
`CacheGcInterval`	Time factor for determining whether files need to be deleted due to space constraints	?
`CacheLastModified Factor`	Factor for expiration calculation	?
`CacheMaxExpire`	Maximum time that cached documents will be retained	?
`CacheNegotiatedDocs`	Allows content-negotiated documents to be cached by proxy servers	?
`CacheRoot`	Directory for cached files	?
`CacheSize`	Space use for cache	?
`CookieLog`	Allows for Netscape cookies	?
`DefaultIcon`	Icon to display by default when `FancyIndexing` is set	Y
`DefaultType`	For handling unknown MIME types	Y
`deny`	Indicates which host is denied access to specific directories	?
`<directory>`	Encloses a group of directives	?
`DirectoryIndex`	Indicates which documents to look for when requester does not specify a document	?
`DocumentRoot`	Directory where httpd will serve files	Y
`ErrorDocument`	Document to display when a specific error occurs	N
`ErrorLog`	Log in which server will log errors	Y
`FancyIndexing`	Indicates whether fancy indexing is set for a directory	Y
`Group`	Group where server will answer requests	Y
`HeaderName`	File inserted at top of listing	Y
`IdentityCheck`	Enables logging of remote user name	Y
`ImapBase`	Default base for image-map files	?
`ImapDefault`	Sets default used in image maps if coordinates have no match	?
`ImapMenu`	Action if no valid coordinates are in image map	?
`IndexIgnore`	Files to ignore when listing a directory	Y
`IndexOptions`	Options for directory indexing	Y
`KeepAlive`	Number of requests to maintain persistent connection from one TCP connection	Y
`KeepAliveTimeout`	Seconds to wait for additional request	Y
`LanguagePriority`	Precedence of languages	?
`Limit`	Enclosing directive for HTTP method	?
`Listen`	Indicates whether to listen to more than one port or IP address	?
`LoadFile`	Links in files or libraries on load	?
`LoadModule`	Links to library and adds module	?
`Location`	Provides for access control by URL	?
`LogFormat`	Indicates format of log file	Y
`MaxClients`	Number of simultaneous client accesses	?
`MaxRequestsPerChild`	Number of requests for child server	?
`MaxSpareServers`	Number of idle child processes	Y
`MetaDir`	Directory containing meta information	?
`MetaSuffix`	File suffix of file containing meta information	?
`MinSpareServers`	Minimum number of idle child processes	?
`NoCache`	List of hosts and domains that are not cached by proxy servers	?
`Options`	Indicates which server features are available in which directory	?
`order`	Order of `allow` and `deny` directives	?
`PassEnv`	Passes CGI environment variable to scripts	?
`PidFile`	File in which the server records the process ID of the daemon	Y
`Port`	Network port to which the server listens	Y
`ProxyPass`	Maps remote proxy servers into local address space	?
`ProxyRemote`	Defines remote proxies to proxy	?
`ProxyRequests`	Indicates whether the server functions as a proxy server	?
`ReadmeName`	Name of file appended to end of listing	Y
`Redirect`	Maps old URL to new one	Y
`RefererIgnore`	Adds to strings to ignore in the headings of referers (sites that contain your site as a link and refer a Web page reader to your site)	Y
`RefererLog`	Name of file in which the server will log referer headings	?
`Require`	Indicates which users can access a directory	?
`ResourceConfig`	Name of file to read after HTTPD.CONF file	Y
`Script`	Action that activates `cgi-script` after specific method	?
`ScriptAlias`	Same as `Alias`; marks directory as `cgi-script`	Y
`ServerAdmin`	Sets the e-mail that the server includes in any error message	Y
`ServerAlias`	Alternative names for the host	?
`ServerName`	Host name of the server	Y
`ServerRoot`	Directory in which the server lives	Y
`ServerType`	Value of `inetd` or `standalone`	Y
`SetEnv`	Sets environment variable passed to CGI scripts	?
`SetHandler`	Forces matching files to be passed through a handler	?
`StartServers`	Number of child server processes started at startup	Y
`TimeOut`	Maximum that time server will wait for completion and receipt of a request	Y
`TransferLog`	File in which incoming requests are logged	Y
`TypesConfig`	Location of MIME-type configuration file	Y
`User`	User ID for which the server will answer requests	Y
`UserDir`	Sets real directory to use when processing a document for a user	Y
`VirtualHost`	Groups directives for a specific virtual host	Y
`XBitHack`	Controls parsing of HTML documents	?

Reading through this table, a Web-application developer can see several directives that affect what she or he can do, as well as what actions result in what behavior.

Examining Some Other File- and Site-Administration Issues

In addition to the files that are created and maintained by the Web server, the site needs to have access to the tools required for running the Web applications. On a UNIX site, this requirement could mean having access to C and C++ compilers and to any run-time libraries that your code may access, if you use either of those languages. If your site does not have a C++ compiler, you can access a GNU C++ compiler, g++, at ftp://ftp.cygnus.com/pub/g++/. To find out more about the GNU CC compiler, go to http://www.cl.cam.ac.uk:80/texinfodoc/gcc_1.html.

If you are using Perl (and I assume that you are, or this book would not have much appeal to you), you need to have access to the Perl executable, as well as to any Perl support files, such as CGI.pm. Appendix A, "Perl Acquisition and Installation," provides instructions on accessing Perl, and Appendix B, "Perl Web Reference," lists several sites from which you can access support files.

If you are working with any database access, you need to have the files and permissions to make this type of access. If your site does not contain a database, and you want to have access to a relational database access, visit the Hughes Technologies site at http://Hughes.com.au/. This site has a relational database engine, called mSQL or Mini SQL, that is very inexpensive and that has a large amount of support and utilities for the UNIX environment.

For Windows NT or Windows 95, you may need run-time files, such as those required by Microsoft's Visual Basic.

If you are using Java, you need to have the Java tools in the environment in which you will compile your application into byte code, but you do not need to have the Java development environment on your server. The byte code will be interpreted by a Java-compatible browser (such as Netscape's Navigator 2.0 and later, and Microsoft's Internet Explorer 3.0 and later). To download the Java Developers Kit, go to http://java.sun.com/products/JDK/.

If you are implementing security, you need to set up password security in whatever directories need to be secure. For some Web servers, additional security, installation, and configuration issues may arise. Check with your Webmaster on those issues.

From Here...

If you are planning to develop Web applications and are not the Webmaster for your site, you should discuss your options with the Webmaster before you do any coding. You also should view the configuration files, if possible, to better understand what options you have when programming.

In addition, if you are developing Web applications that create files, you must provide some mechanism to clean up after the applications, or your site will quickly get full. Using cron to schedule a job that periodically performs cleanup operations is an effective solution.

The log files that your Web server generates are your most useful tool for understanding which documents are being accessed and by whom. If you have a page that is rarely accessed, you may want to drop it or provide access to it from a more prominent location.

For more information, you can check the following chapters:

Chapter 7 "Dynamic and Interactive HTML Content in Perl and CGI," discusses source code that creates and accesses several dynamic files.
Chapter 8 "Understanding Basic User Authentication," discusses some of the security issues for users accessing the site.
Chapter 9 "Understanding CGI Security," discusses some of the security issues related to CGI applications.