Chapter 3

Tracking Hit Counts



Now that you've started building "the ultimate site," you'll probably want to sate your ego by knowing exactly how many or how few people are bothering to stop by your little corner of the Web. If you're running a commercial site and get funding from advertisers, you'll need demographic information to prove to the people paying you that their money is well spent. In short, you'll need to track hits, or accesses, to your Web pages.

Server Access Logs

When someone's browser requests your Web page, that page is said to be hit, or accessed. Web servers track these hits to varying degrees and store the information in an access log somewhere within the server's directory structure. From the information in the access log, you can identify which pages on your site have been requested, how many times, and by whom.

Hit or Miss?
In current Web parlance, the term hit has a somewhat broader definition than access. While hit corresponds to the loading of a page and all the embedded objects it may contain, access corresponds to the loading of one object within a page.
For example, if you have a page with 10 graphics, a hit on that particular page generates 11 access entries in the log: one for the page itself and one for each graphic. For highly complex pages with multiple graphics, frames, server-side includes, and so on, the total access count inflates by the number of individual objects within the page. If 100 people visit that 10-graphic page, you have 100 hits but 1,100 accesses. Naturally, from an advertising standpoint, talking about accesses makes a site sound much more popular than it actually is.
In previous years, this was a closely guarded secret, known only to the Web administrators. It allowed unscrupulous administrators and marketers to claim incredible activity on their sites just by listing the accesses instead of individual user visits. However, in recent years the word's gotten out, advertisers are more savvy, CGI scripters have gotten smarter, and access statistics are more in line with actual user visits.

The easiest way to count hits is to utilize the log files kept by your Web server. As with any other piece of software, what the file is named and where it's located varies. For example, the NCSA servers create a log file called access_log, which is, by default, stored in a logs/ subdirectory off the server's root. A sample of the information written to access_log is shown in listing 3.1, although the exact amount of information maintained can be configured. Consult the documentation for your server for more information on how to do this.


Listing 3.1  Sample from access_log
px1.mel.aone.net.au - - [24/Jun/1996:00:02:46 -0500]
"GET /~sjwalter/javascript/ HTTP/1.0" 200 5245
px1.mel.aone.net.au - - [24/Jun/1996:00:02:50 -0500]
"GET /~sjwalter/javascript/nn/index.html HTTP/1.0" 200 2793
px1.mel.aone.net.au - - [24/Jun/1996:00:02:55 -0500]
"GET /~sjwalter/javascript/nn/index2.html HTTP/1.0" 200 666
px1.mel.aone.net.au - - [24/Jun/1996:00:02:57 -0500]
"GET /~sjwalter/javascript/nn/que.html HTTP/1.0" 200 523
px1.mel.aone.net.au - - [24/Jun/1996:00:02:58 -0500]
"GET /~sjwalter/javascript/nn/index.html HTTP/1.0" 200 2293

As you can see, the amount of information available in the access log is rather extensive. The request times in this sample are close together because this is a framed site: each individual HTML document in the frameset generates its own access entry. If more of the log were printed here, you'd also see an access entry for each graphic displayed on each page.

One thing worth noting is the GET request in the first line:

GET /~sjwalter/javascript/ HTTP/1.0

No file is specified because the user accessed the main page through aliasing. If the server is configured for it, specifying a URL with only a path and no file name causes a default file (often default.htm, index.htm, or index.html) to be handed back to the browser. For example:

http://www.visi.com/~sjwalter/

Because of this, if you want to scan the access log for hits, you need to look both for a specific page (your home page, for instance) and for an alias reference. To actually scan the log, use the UNIX grep command, which searches one or more files for a particular string. The general syntax for grep is:

grep pattern fileName

The string to search for is pattern, and the file to search is fileName. By default, grep prints every matching line it finds in the specified file. With the addition of the -c parameter, you can instruct grep to suppress that output and print just a count of the matching lines. The simple CGI program in listing 3.2 takes advantage of this and counts the number of home-page accesses recorded in the log file. The listing assumes that the home page is named index.html.
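
For example, running grep by hand against the sample log from listing 3.1 might look like the following; the log file name is assumed to be access_log in the current directory, and the second line is grep's output:

grep -c "GET /~sjwalter/javascript/nn/index.html" access_log
2

The count of 2 reflects the two entries for that page in the sample log.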


Listing 3.2  A Simple Access Counter
#!/usr/local/bin/perl

$homePage = "/%7Esjwalter";
$logFile  = "/var/httpd/logs/access.log";

print "Content-type: text/html\n\n";
$num  = `grep -c "GET $homePage/ HTTP"      $logFile`;
$num += `grep -c "GET $homePage/index.html" $logFile`;
print "$num\n";

NOTE
In listing 3.2, the $homePage variable contains the sequence %7E. This is the URL-encoded form of the tilde (~); special characters like the tilde are often encoded, meaning they are written out as a percent sign followed by the character's ASCII value in hexadecimal.
This is different from escaping text, where a character is preceded by a backslash (\) so that it's treated as a literal character instead of a Perl metacharacter.

To use this program, invoke it from your home page with a server-side include (SSI) directive, which listing 3.3 demonstrates. The result is a count similar to that shown at the bottom of figure 3.1.

Figure 3.1 : A simple CGI script can be implemented to create a text-based access counter.


Listing 3.3  Implementing a Simple Counter in HTML
<html>
<head><title>Welcome to My Home Page</title></head>
<body>
This page has been accessed
<!--#exec cgi="access1.cgi" -->
times.
</body>
</html>

Lack of efficiency is the problem with grepping the server's access log. Reading through the log file of an active server can take several seconds, and most visitors don't want to wait those extra seconds just to see an access count. A more efficient technique is to maintain a separate file on the server that contains only the access count.

More Efficient Counting

To circumvent the overhead of scanning the entire server log file on every access, store the access count in a separate file on the server. Then, using a slightly different script, follow these steps (a minimal sketch appears after the list):

  1. Open the file.
  2. Read the current counter value from the file.
  3. Increment the counter value.
  4. Write the new value back to the file, overwriting the old value.
  5. Close the file.
  6. Write the new value back through the server so it appears on the page in the user's browser.
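
Here is a minimal sketch of these steps in Perl. The counter file name is hypothetical, the file is simply opened twice (once to read, once to write), and the locking that listing 3.4 adds later in this section is omitted for clarity:

#!/usr/local/bin/perl

$counterFile = "counter.dat";     # hypothetical name for the counter file

open(COUNT, "<$counterFile");     # steps 1-2: open the file and read the
$count = <COUNT>;                 # current value; if the file doesn't exist
close(COUNT);                     # yet, $count stays empty and the ++ below
chomp($count);                    # turns it into 1

$count++;                         # step 3: increment the counter

open(COUNT, ">$counterFile");     # step 4: write the new value back,
print COUNT "$count\n";           # overwriting the old one
close(COUNT);                     # step 5: close the file

print "Content-type: text/html\n\n";
print "$count\n";                 # step 6: send the new value back through the server

Like listing 3.2, this script is meant to be pulled in through a server-side include, so it prints nothing but the count itself.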

The process of opening, reading, writing, and closing a file is easily done with Perl. One additional factor, however, now comes into play. Because more than one person can be visiting your site at any given time, the counter file can be accessed by different connections simultaneously. If 10 users hit your page at the same moment, each one could read the same access count, and whoever writes the file out last sets the value seen by the next visitor, so those 10 hits might be recorded as only 1. To keep the count accurate, each simultaneous access must be registered as a separate hit, which means some form of locking mechanism is necessary.

Sometimes referred to as a semaphore, a lock is a file whose presence signals that something is happening. In the case of the counter, whichever process opens the counter file first writes out a lock file. Any other process attempting to access the counter has to wait until the lock file is deleted (a matter of moments). Once the lock file disappears, the next process in line establishes its own lock, updates the counter, and removes its lock in turn, so every access gets counted. An example of simple locking is shown in listing 3.4.


Listing 3.4  Implementing File Locking
$lockFile = "counter.lock";       # name of the lock file (any writable path will do)

while (-e $lockFile) {
   select(undef, undef, undef, 0.1);   # pause for a tenth of a second, then check again
}

open(LOCKFILE, ">$lockFile");
... # retrieve and increment the counter
close(LOCKFILE);
unlink($lockFile);

TIP
The while() loop in listing 3.4 keeps checking for the existence of the lock file and, if it's found (meaning that someone else is updating the counter), performs a four-argument select() with a fractional timeout, which pauses for only a tenth of a second. This means the loop cycles very quickly, minimizing the wait a user encounters on a busy site.
However, because this loop wakes up so frequently, it also takes its toll on system response, especially if your site is extremely busy. An alternative loop that isn't as hard on the system would be:
while (-e $lockFile) {
   sleep(1);
}
This puts the process (the script, in this case) to sleep for one second at a time, using virtually no system resources. Although this loop cycles more slowly (once each second), a one-second wait goes unnoticed by the typical modem-based user.

Graphic Counters

In the multimedia world of the Web, text-based access counters are somewhat bland. More often than not, the counters you find on pages are graphic, providing a more visually appealing display.

Converting your counter from text-only to graphic is simple. All you need is a collection of 10 image files, one for each digit from 0 through 9. Then, instead of printing out the number you read from the access log or counter file, you step through it digit by digit and "print" the corresponding image file. Listing 3.5 demonstrates a Perl fragment that handles this type of counter in a rather unusual way.


Listing 3.5  A Graphic Access Counter
...
# $count is assumed to hold the current access count
# $imagedir is assumed to hold the URL path to the digit images (0.gif through 9.gif)
print "<TABLE CELLPADDING=0 CELLSPACING=0 BORDER=0>";
print "<TR>";

for ($i = 0; $i < length($count); $i++) {
   $digit = substr($count, $i, 1);
   print "<TD><IMG SRC=\"$imagedir/$digit.gif\"></TD>";
}

print "</TR></TABLE>";

What sets listing 3.5 apart from many graphic counters is that it doesn't construct a bit map dynamically. Rather, it generates HTML code that formats the individual digits into a table and lets the browser do the work of requesting the appropriate digit images from the server.
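
For instance, if $count held 1234 and $imagedir held /images/digits (both values are purely illustrative), the browser would receive HTML along these lines, wrapped here for readability:

<TABLE CELLPADDING=0 CELLSPACING=0 BORDER=0><TR>
<TD><IMG SRC="/images/digits/1.gif"></TD>
<TD><IMG SRC="/images/digits/2.gif"></TD>
<TD><IMG SRC="/images/digits/3.gif"></TD>
<TD><IMG SRC="/images/digits/4.gif"></TD>
</TR></TABLE>

The browser then requests each digit image separately, just as it would any other graphic on the page.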

TIP
While this technique is not as efficient as having your Perl code generate the entire count as a single bit map, it permits you to do special visual tricks with your counter. For example, each individual graphic could be a small animated GIF.
If, however, you'd rather have Perl do all the work in generating your counter, you'll find examples of bit-map construction on the companion CD-ROM.

Generating Server Statistics with wusage

While access counters track the number of hits a page takes, it's often more valuable to analyze those hit counts and look for patterns: what times of day record the most hits, which domains the visitors come from, which pages in a site are hit the most, and so on. For those not interested in writing their own Web activity analysis program from the ground up, there is a wonderfully robust tool for generating server statistics: wusage. Available from http://www.boutell.com/wusage/, wusage generates weekly usage statistics that include the following:

  • Total server usage
  • Response to ISINDEX pages, or "index" usage
  • The top 10 sites by frequency of access
  • The top 10 documents accessed
  • A graph of server usage over many weeks
  • An icon version of the graph for your home page
  • Pie charts showing the usage of your server by domain

The only major requirement is that the program needs to be run on a periodic basis, usually once per week through a server maintenance script. An example of the output wusage can generate is shown in figure 3.2.

Figure 3.2 : The wusage statistics program generates a visual display of the activity of your Web server.

User-Specific Access Tracking

The techniques covered so far in this chapter deal with how many times your site has been accessed. You can also track how many times a particular user has visited by using cookies. While server tracking relies on logs stored on the server, cookies are stored with the user's browser.

Baking Up a Batch of Cookies

Cookies (or Persistent Client State HTTP Objects) are a mechanism by which the server (and the client, through JavaScript) can store and retrieve information on the client side of the connection. Every cookie consists of a name/value pair plus a handful of optional attributes: an expiration date, a domain, a path, and a secure flag.
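
Written out as the HTTP header the server sends, a cookie with all of its parts present might look like the following (the names and values are purely illustrative):

Set-Cookie: visits=12; expires=Wed, 01-Jan-2031 00:00:00 GMT; domain=.visi.com; path=/~sjwalter/; secure

Only the name/value pair (visits=12 here) is required; the rest of the attributes are optional.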

CAUTION
There are several limits imposed on cookies:
  • A maximum of 20 cookies can be created for any given domain. Any attempt to set additional cookies will cause the oldest cookies in the file to be overwritten.
  • A given client (browser) can only store a maximum of 300 cookies. Like the 20-cookie domain limit, exceeding the 300-cookie limit will result in old cookies being overwritten.
  • Each cookie cannot exceed 4K (4096 bytes) in size.

Although cookies were originally limited to server-side manipulation, JavaScript makes accessing them from within the browser a snap. The process for updating a cookie-based counter is similar to that of updating a server-side counter:

  1. Read the cookie value or assume a value of 0 if the particular cookie doesn't exist.
  2. Increment the counter value.
  3. Write the cookie back out.

The full source code for a cookie-based counter is available on the CD-ROM.
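
In the meantime, here is a rough server-side sketch of the same three steps, written as a Perl CGI script rather than in JavaScript; the cookie name (visits) and the expiration date are arbitrary choices for the example:

#!/usr/local/bin/perl
# Read the "visits" cookie from the browser's request, bump it, and send it back.

$visits = 0;                      # step 1: assume 0 if the cookie doesn't exist
if ($ENV{'HTTP_COOKIE'} =~ /visits=(\d+)/) {
   $visits = $1;                  # step 1: otherwise read the existing value
}

$visits++;                        # step 2: increment the counter

# step 3: write the cookie back out, then report the count to the visitor
print "Content-type: text/html\n";
print "Set-Cookie: visits=$visits; expires=Wed, 01-Jan-2031 00:00:00 GMT; path=/\n\n";
print "This is visit number $visits for you.\n";

Because the cookie lives in the browser, the count this script reports is per user rather than per site.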

NOTE
Currently, client-side cookie manipulation (via JavaScript) is supported within Netscape Navigator, but not Internet Explorer.
For a trick that creates "pseudo-cookies" that work in both Navigator and Explorer, check out Chapter 27, "Power Scripting Toolkit."

From Here…

This chapter introduces the principles behind hits and access counters and how to implement them within your site. The techniques discussed here can be extended in a variety of ways, such as customizing your access monitoring to create a special page for the one-millionth access, or using user-specific tracking to display a special message (or link) for the user who visits for the 50th time.