Chapter 21 Tracking Users

Why Do We Need to Track Users?
The Essence of Web Marketing
Parsing Access Logs
- What Is an Access Log?
Environment Variables
Creating a Pseudo Access Log File
Logging Accesses
How to Implement Tracking CGIs
A Simple Web Counter
Calling counter.cgi
Locating Users Geographically
Cookies
Other Methods of Tracking Users
- Fingering Dial-Up Servers
The Ethics of Tracking Users
Accessing This Chapter Online
Summary

There are several different methods you can use to track users. This chapter covers the following:

Parsing Access Logs: How to get at the information your Web server may already have.
Environment Variables: The information your browser is sending, without your knowledge.
Web Counters: The odometers you may have seen on some sites and how to make your own.
Logging Accesses: A more sophisticated means of counting users.
Locating Users Geographically: Where exactly is your audience located?
Cookies: A client/server method of definite user verification.

Why Do We Need to Track Users?

It's easy enough to set up a World Wide Web site for yourself or your organization and gauge its success or failure solely on the amount of response you get via e-mail, phone, or fax. But if you rely on so simplistic a tracking mechanism, you won't get anywhere near the whole picture. Perhaps your site is attracting many visitors, but your response form is hard to find, so very few of them are getting in touch with you. Perhaps many people find your Web site via unrelated searches on Internet search engines and promptly leave. Or perhaps you've optimized your site for Netscape, but the people most interested in your content are using NCSA Mosaic and can't view any of your in-line images! In any of these cases, you could spend a long time waiting for responses while being totally in the dark about why none were arriving.

This illustrates why it's so important to track user information on a constant basis. You can gain valuable insights not only into who is accessing your site, but also how they're finding it and where they might have heard of you. Plus, there's the all-important question of the total number of users visiting your site.

How Search Engines Work

Search engines such as Alta Vista, WebCrawler, InfoSeek, Lycos, and Excite possess vast databases of information, cataloging much of the content on the World Wide Web. Not only is the creation of such a huge database a task more difficult than any group of people could manually accomplish, it's also necessary to update all of the information on an increasingly frequent basis. Thus, the creators of these services designed automatic "robots" that roam the Web and retrieve Web site information for inclusion in the database. While this deals with the speed problem quite nicely, there is a serious problem introduced by this automatic approach: Machines, even ones with so-called artificial intelligence software, are still nowhere near as good as humans at categorizing information (well, at least not into categories that make sense to humans!). When a search engine's robot visits a site, it incorporates all of the text on that site into its database for reference in subsequent user searches. This means that a word inadvertently placed in the text of your Web site can cause people to find your site via searches on that word, thinking that your site might have something to do with that word! Suppose that you've set up a Web site about gardening, and in it you include a personal anecdote about how much your dog loves being outdoors with you. Thousands of dog-lovers might find your site because of that reference to your dog, be surprised that the site is about gardening and not dogs, and promptly leave! There are many other problems associated with the way automatic search engines work, which you'll no doubt discover when your site is added to them.

The Essence of Web Marketing

With the incredible corporate interest in the World Wide Web in the past few years, tracking users helps us get closer to an answer to the most crucial question for most organizations getting on the Web: Does the Web really work? In other words, does their Web site attract visitors, and if so, do those visitors turn into customers? In other media, hard numbers are available as answers to these questions. Newspapers have circulation figures, radio has broadcast ranges, and television has Nielsen ratings. It's surprising how many Web sites have unmonitored access levels since more precise visitor information can be gained on the Internet than through any other medium.

There is one key advantage these other media have over the Web, however: access to demographic information. The reason that accurate demographics (for example, the makeup of the audience by age, sex, income, and so on) are much more readily available for these traditional media is because the level of market penetration is such that a representative sampling of the general population in that area can be extrapolated meaningfully to apply to your whole audience. With the Web, you have several problems in doing this:

Because people self-select their visit to your site, you can reach a very specialized audience, and a sampling of the general population would be completely inaccurate.
The international reach of the Web means that you could be attracting visitors from all over the world, making it much harder to do a survey.

Both of these problems mean that the only way you could get accurate demographics would be while people are actually visiting your Web site. This can come across as somewhat obtrusive, and people accustomed to browsing through Web sites at high speed with little or no thought involved have to be given a very good incentive to spend the time to fill out a survey form for your benefit.

This means that it's all the more crucial to identify whatever hard numbers you can automatically, and this is where the idea of tracking users comes in.

Parsing Access Logs

This section deals with one of the fundamental methods of collecting information about visitors to your Web site: the access log.

What Is an Access Log?

So where do we begin when trying to find out information about visitors to our site? How about on our Web server itself! It's mentioned earlier on in the book that HTTP, the HyperText Transfer Protocol, enables communication between your browser and the Web server via a series of discrete connections that fetch the text of the Web page being retrieved, and then each one of the graphics on that page in sequence. Did you know that every single time one of these requests is made, a record of that request is written to a log file? Here is a sample of the contents of an access log, from the file access-log, produced by NCSA httpd.

ts17-15.slip.uwo.ca - - [09/Jul/1996:01:53:53 -0500] "POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1519
ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:22 -0500] "POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1954
ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:43 -0500] "POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1678
pm107.spots.ab.ca - - [09/Jul/1996:01:59:28 -0500] "GET /pics/asd.gif HTTP/1.0" 304 0
b61022.dial.tip.net - - [09/Jul/1996:02:03:36 -0500] "GET /pics/asd.gif HTTP/1.0" 200 4117
slip11.docker.com - - [09/Jul/1996:02:03:49 -0500] "GET /rcr/ HTTP/1.0" 200 8751
slip11.docker.com - - [09/Jul/1996:02:04:17 -0500] "GET /rcr/guest.html HTTP/1.0" 200 2984
slip11.docker.com - - [09/Jul/1996:02:05:01 -0500] "GET /rcr/store.html HTTP/1.0" 200 34717
port52.annex1.net.ubc.ca - - [09/Jul/1996:02:05:09 -0500] "GET /pics/asd.gif HTTP/1.0" 200 4117
slip11.docker.com - - [09/Jul/1996:02:06:01 -0500] "GET /rcr/regint.html HTTP/1.0" 200 19452

NCSA, CERN, and Apache httpd all produce access logs in very similar formats, and collectively they have the vast majority of Web server market share, so this section will deal with extracting information from those servers. Other Web servers may store information in a different format, and you should consult the documentation that comes with yours to learn how to read it.

Note

You may have heard of the HTTP keep-alive protocol, which allows for a continuous connection to be maintained between the Web server and the Web browser. This doesn't contradict the nature of the discrete connections in HTTP; there are still multiple fetches made from the Web server. The difference is that the connection isn't terminated and restarted between each one while retrieving information on the same Web page.

Now, let's take a look at some of the information that is provided in the access log. The lines all take on a standard format, and, in fact, the entire access log consists of nothing but lines like these. The format of the lines is as follows:

host rfc931 authuser [DD/Mon/YYYY:hh:mm:ss] "request" ddd bbbb "opt_referer" "opt_agent"

Here's a breakdown of the elements included in the lines:

host: Either the DNS name or the IP number of the remote client.
rfc931: Any information returned by identd for this person, or a dash (-) otherwise.
authuser: If user sent a userid for authentication, the username, or a dash otherwise.
DD: Day.
Mon: Month (calendar name).
YYYY: Year.
hh: Hour (24-hour format, the machine's timezone).
mm: Minutes.
ss: Seconds.
request: The first line of the HTTP request as sent by the client.
ddd: The status code returned by the server, or a dash if not available.
bbbb: The total number of bytes sent, not including the HTTP/1.0 header, or a dash if not available.
opt_referer: The referer field if given and if LogOptions is Combined.
opt_agent: The user agent field if given and if LogOptions is Combined.

Note that the last two fields are not enabled on most systems, and thus our sample program won't process them. It's easy enough to modify it so that it does, however.
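As a quick illustration of the field layout described above, here is a minimal sketch that pulls the seven standard fields out of one log line with a single sscanf() call. It is not the approach Listing 21.1 takes (that program walks the line with strchr()); the names clf_entry and parse_clf_line, and the buffer sizes, are assumptions made up for this example. The status and byte count are kept as text because either may be a dash.

```c
#include <stdio.h>

/* One parsed access log entry. The field sizes are arbitrary choices
   for this sketch, not limits imposed by any server. */
struct clf_entry {
    char host[256];
    char rfc931[64];
    char authuser[64];
    char timestamp[64];   /* the text between [ and ] */
    char request[512];    /* the text between the double quotes */
    char status[8];       /* kept as text because it may be "-" */
    char bytes[16];       /* kept as text because it may be "-" */
};

/* Returns 1 if all seven fields were found, 0 otherwise.
   Every conversion carries a width so no buffer can overflow. */
int parse_clf_line(const char *line, struct clf_entry *e)
{
    int n = sscanf(line,
                   "%255s %63s %63s [%63[^]]] \"%511[^\"]\" %7s %15s",
                   e->host, e->rfc931, e->authuser,
                   e->timestamp, e->request, e->status, e->bytes);
    return n == 7;
}
```

Feeding it the first line of the sample log should yield the host ts17-15.slip.uwo.ca, a timestamp of 09/Jul/1996:01:53:53 -0500, and the status 200. A line with an empty request string is rejected, which is a limitation of the scanset approach rather than of the log format.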

With a line not only for each Web page access, but in fact for each graphic on each Web page as well, you might be able to imagine why access log files can grow to become several megabytes in size very quickly. If your Web server has a limited amount of hard drive space, the access log's growth might even risk crashing it!

One solution to this problem is to delete the access log on a regular basis, after creating a summary of the information in it. So how exactly do you create a summary? Good question! This is where we get into our first program for this chapter, an httpd access log parser. The individual lines in the access log file, while providing a fairly detailed amount of information, aren't terribly useful when viewed in their raw form. However, they can be used as the basis for all kinds of reports you can create with software that summarizes the information into various categories. An example of such a program, the Access Log Summary program, is included in Listing 21.1, and its output is shown in Figure 21.1. This program reads in the server access log file and generates an HTML document as output. The document summarizes all of the raw information presented in the access log into useful categories.

Figure 21.1: The output from the access log summary program.

Listing 21.1. Source code for the Access Log Summary program.

// accsum.cpp -- ACCESS LOG SUMMARY PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program reads in the server access log file and generates an HTML
// document as output. The document summarizes all of the raw information
// presented in the access log into useful categories
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// The categories it summarizes information for:
//
//   * # of hits by domain
//   * # of hits by file path
//   * # of hits by day
//   * # of hits by hour
//
// GENERAL ALGORITHM
//
// 1. For each domain and file path, dynamically create a linked list
//    for each value, and add 1 to the hit count each time.
//
// 2. Create a linked list for each date, as well as each hour also.
//
// 3. Send the output to stdout.

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include "linklist.h"      // Linked List Header Files
#include "linklist.cpp"    // Linked List Source Code

// DEFINES AND STRUCTURES *********************************************

#define MAX_STRING  256
#define DATE_STRING 32
#define HOUR_STRING 5
#define LOG_FILE    "./test-access-log"

typedef struct { char hostname[MAX_STRING]; int num_access; } sHOSTNAME;
typedef struct { char filename[MAX_STRING]; int num_access; } sFILENAME;
typedef struct { char hour[HOUR_STRING];    int num_access; } sHOUR;
typedef struct { char date[DATE_STRING];    int num_access; } sDATE;

// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void ProcessLine( char * line );
void CopyField( char * dest, int dest_size, char * left, char * right );
void PrintOutput( void );
void InitAll(void);
void DestroyAll(void);

// GLOBAL VARIABLES ***************************************************

sLINK * link_hostname;
sLINK * link_filename;
sLINK * link_hour;
sLINK * link_date;

// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    // Opens the access log file, parses the information into a linked list
    // internal data representation, then sends the summary of the output to
    // stdout.

    printf("Content-type: text/html\n\n");
    printf("<HTML><TITLE>Access Log Summary</TITLE><BODY>\n");
    printf("<H1>Access Log Summary</H1>\n");

    FILE * fp;
    fp = fopen( LOG_FILE, "r" );            // open the access log file

    if( ! fp )
    {
        printf("ERROR: Couldn't load log file!");   // abort painlessly
    }
    else                                    // if able to load file...
    {
        char line[512];

        InitAll();
        for(;;)
        {
            // fetch lines until EOF encountered
            if( fgets( line, sizeof(line), fp ) == NULL ) break;
            ProcessLine( line );            // extract the important information
        }
        fclose( fp );
        PrintOutput();                      // send the output to stdout
        DestroyAll();                       // only destroy lists that were created
    }

    printf("</BODY></HTML>\n");             // end the HTML file
    return(0);                              // terminate gracefully
}

void InitAll(void)
{
    // Initialize the heads for each of the linked lists
    InitHead( &link_hostname );
    InitHead( &link_filename );
    InitHead( &link_hour );
    InitHead( &link_date );
}

void DestroyAll(void)
{
    // Destroy each of the linked lists (to free memory)
    DestroyList( &link_hostname );
    DestroyList( &link_filename );
    DestroyList( &link_hour );
    DestroyList( &link_date );
}

void CopyField( char * dest, int dest_size, char * left, char * right )
{
    // Copy the text between left and right into dest, truncating it if
    // necessary so that it always fits and is always null-terminated
    int len = (int) (right - left);

    if( len > dest_size - 1 ) len = dest_size - 1;
    memcpy( dest, left, len );
    dest[len] = '\0';
}

void ProcessLine( char * line )
{
    // Parse a single line of a standard web server access log

    sHOSTNAME hn;
    sFILENAME fn;
    sHOUR hr;
    sDATE dt;
    char * left, * right;
    sLINK * l;

    left = line;
    right = strchr( left, ' ' );            // find the first space
    if( ! right ) return;                   // bad entry
    CopyField( hn.hostname, MAX_STRING, left, right );

    // Compare strlen + 1 bytes so that the terminator is included and one
    // value can't match another that merely begins with the same text
    l = FindNode( link_hostname, (void *) &hn, 0, (int) strlen( hn.hostname ) + 1 );
    if( ! l )
    {
        hn.num_access = 1;
        AddNode( link_hostname, (void *) &hn, sizeof( sHOSTNAME ) );
    }
    else
    {
        ((sHOSTNAME *) l->data)->num_access++;
    }

    left = right+1;                         // skip the space
    right = strchr( left, ' ');             // find the next space (rfc931)
    if( ! right ) return;                   // bad entry

    left = right+1;                         // skip the space
    right = strchr( left, ' ');             // find the next space (authuser)
    if( ! right ) return;                   // bad entry

    left = right+1;                         // skip the space
    right = strchr( left, ':');             // find the colon (date delimiter)
    if( ! right ) return;                   // bad entry
    left++;                                 // skip the leading '['
    CopyField( dt.date, DATE_STRING, left, right );

    l = FindNode( link_date, (void *) &dt, 0, (int) strlen( dt.date ) + 1 );
    if( ! l )
    {
        dt.num_access = 1;
        AddNode( link_date, (void *) &dt, sizeof( sDATE ) );
    }
    else
    {
        ((sDATE *) l->data)->num_access++;
    }

    left = right+1;                         // skip the colon
    right = strchr( left, ':');             // find the next colon (hour delimiter)
    if( ! right ) return;                   // bad entry
    CopyField( hr.hour, HOUR_STRING, left, right );

    l = FindNode( link_hour, (void *) &hr, 0, (int) strlen( hr.hour ) + 1 );
    if( ! l )
    {
        hr.num_access = 1;
        AddNode( link_hour, (void *) &hr, sizeof( sHOUR ) );
    }
    else
    {
        ((sHOUR *) l->data)->num_access++;
    }

    left = strchr( line, '\"' );            // find the beginning of the request
    if( ! left ) return;                    // bad entry
    right = strchr( left, ' ' );            // find the first space (Query Type)
    if( ! right ) return;                   // bad entry

    left = right+1;                         // skip the space
    right = strchr( left, ' ' );            // find the next space (filename with path)
    if( ! right ) return;                   // bad entry
    CopyField( fn.filename, MAX_STRING, left, right );

    l = FindNode( link_filename, (void *) &fn, 0, (int) strlen( fn.filename ) + 1 );
    if( ! l )
    {
        fn.num_access = 1;
        AddNode( link_filename, (void *) &fn, sizeof( sFILENAME ) );
    }
    else
    {
        ((sFILENAME *) l->data)->num_access++;
    }
}

void PrintOutput( void )
{
    // Send the output from the program to stdout

    sLINK * l;

    printf("<H2>By Date</H2>\n");
    printf("<ul>\n");
    for( l = link_date; l && l->data; l = l->next )
    {
        printf("<li> <B>%s :</B> %d\n",
               ((sDATE *) (l->data))->date,
               ((sDATE *) (l->data))->num_access );
    }
    printf("</ul>\n");

    printf("<H2>By Hour</H2>\n");
    printf("<ul>\n");
    for( l = link_hour; l && l->data; l = l->next )
    {
        printf("<li> <B>%s :</B> %d\n",
               ((sHOUR *) (l->data))->hour,
               ((sHOUR *) (l->data))->num_access );
    }
    printf("</ul>\n");

    printf("<H2>By Hostname</H2>\n");
    printf("<ul>\n");
    for( l = link_hostname; l && l->data; l = l->next )
    {
        printf("<li> <B>%s :</B> %d\n",
               ((sHOSTNAME *) (l->data))->hostname,
               ((sHOSTNAME *) (l->data))->num_access );
    }
    printf("</ul>\n");

    printf("<H2>By Filename</H2>\n");
    printf("<ul>\n");
    for( l = link_filename; l && l->data; l = l->next )
    {
        printf("<li> <B>%s :</B> %d\n",
               ((sFILENAME *) (l->data))->filename,
               ((sFILENAME *) (l->data))->num_access );
    }
    printf("</ul>\n");
}

This program makes use of linked lists, which aren't supported directly in C as associative arrays are in Perl. Thus, there are some support routines that are needed in order to make the program function properly, and they are included here, in Listings 21.2 and 21.3.
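The pattern that Listing 21.1 repeats for every field ("look the value up; if it's there, add one to its count; if not, add it with a count of one") is what an associative array gives you for free in Perl. As a rough sketch of that idea in isolation, the following uses a small fixed-size table instead of a linked list. The names tally, tally_hit, MAX_KEYS, and KEY_LEN are invented for this example, and the capacity is an arbitrary assumption; the chapter's own program uses the linked list routines instead.

```c
#include <string.h>

#define MAX_KEYS 128    /* arbitrary capacity chosen for this sketch */
#define KEY_LEN  256    /* longest key stored, including the terminator */

struct tally {
    char keys[MAX_KEYS][KEY_LEN];
    int  hits[MAX_KEYS];
    int  used;           /* number of slots in use; start it at 0 */
};

/* Record one hit for key. Returns the key's new hit count,
   or -1 if the key is new and the table is already full. */
int tally_hit(struct tally *t, const char *key)
{
    int i;

    /* Find: compare whole strings (up to the stored length limit), so
       "a.com" never matches "a.community.org". */
    for (i = 0; i < t->used; i++) {
        if (strncmp(t->keys[i], key, KEY_LEN - 1) == 0)
            return ++t->hits[i];
    }

    /* Add: the key was not seen before. */
    if (t->used == MAX_KEYS)
        return -1;
    strncpy(t->keys[t->used], key, KEY_LEN - 1);
    t->keys[t->used][KEY_LEN - 1] = '\0';
    t->hits[t->used] = 1;
    t->used++;
    return 1;
}
```

A linear search like this is fine for a few hundred distinct values. For a busy site with many thousands of hostnames, a hash table or a sorted structure would likely be a better fit, whichever container holds the counts.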

Listing 21.2. The header file for the linked list routines.

// linklist.h -- The Header file for the Linked List Routines
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// By Shuman Ghosemajumder, Anadas Software Development

// STRUCTURES *********************************************************

typedef struct linked_list
{
    struct linked_list * next;
    void * data;
} sLINK;

// LINKED LIST FUNCTION PROTOTYPES ************************************

void InitHead( sLINK * * head );
void DestroyList( sLINK * * head );
int CountNodes( sLINK * head );
sLINK * GetNext( sLINK * l );
sLINK * AddNode( sLINK * head, void * data, int data_size );
sLINK * FindNode( sLINK * head, void * data, int offset, int data_size );

Listing 21.3. Source code for the linked list functions.

// linklist.cpp -- Linked List Functions
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// By Shuman Ghosemajumder, Anadas Software Development

#include <stdio.h>      // printf
#include <stdlib.h>     // malloc, free, exit
#include <string.h>     // memcpy, memcmp

void InitHead( sLINK * * head )
{
    // Initialize the head pointer of a linked list
    *head = (sLINK *) malloc( sizeof(sLINK) );   // allocate memory
    if( ! *head )
    {
        printf("Memory allocation error.\n");
        exit(-1);
    }
    (*head)->data = NULL;                   // no data yet
    (*head)->next = NULL;                   // no next pointer yet
}

void DestroyList( sLINK * * head )
{
    // Destroy an entire linked list, freeing every node (including the
    // last one) along with its data
    sLINK * l = *head;
    sLINK * temp;

    while( l )
    {
        temp = l->next;                     // remember where to go next
        if( l->data ) free( l->data );
        free( l );
        l = temp;
    }
    *head = NULL;                           // destroy the head pointer
}

sLINK * AddNode( sLINK * head, void * data, int data_size )
{
    // Add a node to the linked list
    sLINK * next = head;
    sLINK * last;

    do
    {
        last = next;
        next = GetNext( next );
    } while( next );                        // go to the end of the list

    // next == NULL, therefore last == the last node
    if( last->data == NULL )
    {
        next = last;                        // reuse the empty head node
    }
    else
    {
        next = (sLINK *) malloc( sizeof(sLINK) );
        if( ! next )
        {
            printf("Memory allocation error.\n");
            exit(-1);
        }
        last->next = next;
    }

    next->data = (void *) malloc( data_size );
    if( ! next->data )
    {
        printf("Memory allocation error.\n");
        exit(-1);
    }
    memcpy( next->data, data, data_size );
    next->next = NULL;
    return next;
}

int CountNodes( sLINK * head )
{
    // Return the total number of nodes in the linked list that hold data
    // (an empty list, whose head node has no data yet, counts as zero)
    int count = 0;

    for( ; head; head = head->next )
    {
        if( head->data ) count++;
    }
    return count;
}

sLINK * GetNext( sLINK * l )
{
    // Given one node of the list, return a pointer to the next node if it
    // exists, or NULL if it doesn't.
    return l->next;
}

sLINK * FindNode( sLINK * head, void * data, int offset, int data_size )
{
    // Compare "data" to the value at "offset" in the data structure portion
    // of the linked list, and return a pointer to the node which contains
    // this value if there is one.
    for( ; head && head->data; head = head->next )
    {
        if( memcmp( (char *) head->data + offset, (char *) data, data_size ) == 0 )
        {
            return head;
        }
    }
    return NULL;
}

This program is a good starting point, but ideally you'd like the summary to be generated automatically. As mentioned before, access logs are often several megabytes (some can be several hundred megabytes!) in size, so generating these kinds of statistics in real time, every time a user accesses the on-line summary page, is unfeasible on most computer systems. The best solution is to have these summaries created in the background of the Web server on a regular basis, so users always get a reasonably current set of information and don't have to wait for several minutes while it processes the access log file. There's a UNIX program called crontab that allows you to schedule events (such as the execution of your program) in the background. Here's how it works. First, you need to ensure that you (and not the Web server process) have access to crontab; contact your UNIX admin to let him or her know of your requirement.

Caution

In general, the Web server process should have access to exactly what it needs access to, nothing more and nothing less. Remember that if a rogue user gains control of the Web server process (via a false crontab file or some other means), then he or she would be able to effectively execute privileged commands with total anonymity, which is never a good situation on a computer system.

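After you've set up your crontab access, you should edit your crontab file and add a line similar to the following. The five leading fields are the minute, hour, day of the month, month, and day of the week, so this entry runs the script once a day at 6:00 a.m. (an asterisk in the minute field would instead run it every minute of that hour):

0 6 * * * /usr/home/big/anadas/cgiunleashed/auto-make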

You should read your system's man page for crontab to ensure that you have your crontab file set up correctly.

Now that you've got crontab set up, you'll need to have an access log summary program that produces a Web-viewable summary.

Environment Variables

The Web server's access log feature functions by recording information about the user who is visiting your server, which is sent from the user's own browser. While the information the access log records is very useful, it is by no means an exhaustive account of everything the browser "tells" the Web server about itself and the user.

Let's take a look at the output of the environment variables program first used in Chapter 12, "Imagemaps" (the program is available on-line at http://www.anadas.com/cgiunleashed/imagemaps/exe/showenv.cgi):

SERVER_SOFTWARE=NCSA/1.5
GATEWAY_INTERFACE=CGI/1.1
DOCUMENT_ROOT=/usr/home/big/anadas
REMOTE_ADDR=199.45.70.220
SERVER_PROTOCOL=HTTP/1.0
REQUEST_METHOD=GET
REMOTE_HOST=tc220.wwdc.com
QUERY_STRING=
HTTP_USER_AGENT=Mozilla/3.0b5a (Win95; I)
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin:/usr/contrib/bin:/usr/X11/bin
HTTP_CONNECTION=Keep-Alive
HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
SCRIPT_NAME=/cgiunleashed/imagemaps/exe/showenv.cgi
SERVER_NAME=www.anadas.com
SERVER_PORT=80
HTTP_HOST=www.anadas.com
SERVER_ADMIN=shuman@anadas.com

This is the complete set of environment variable information for the Web server process on this particular server, when a particular user accessed the script in question. The variables whose names begin with HTTP_ are built by the Web server from the headers the browser sent with its request, and are handed to the script via the CGI interface. The remaining variables are set entirely on the Web server's end, for the benefit of CGI programs that need to know additional information about their environment. So what do these environment variables mean?

SERVER_SOFTWARE: This indicates the actual Web server software, which in this case is NCSA httpd version 1.5.

GATEWAY_INTERFACE: This is the level of CGI compatibility supported by the server, which in this case is 1.1.

DOCUMENT_ROOT: This is also a server-set environment variable. It indicates the directory on the server's file system from which documents are served; here, /usr/home/big/anadas corresponds to http://www.anadas.com/.

REMOTE_ADDR: This environment variable is set by the server from the network connection and indicates the IP address of the machine making the request (the browser's Internet connection, or a proxy sitting in between).

SERVER_PROTOCOL: This environment variable is set by the server and indicates the name and revision of the protocol the request came in with, which in this case is HTTP/1.0.

REQUEST_METHOD: This environment variable is set by the server according to the kind of query the browser has sent. Normal document and file retrievals are classified as GET queries.

REMOTE_HOST: This environment variable is set by the server, which looks up the hostname associated with the client's IP address. If no hostname can be found, the variable is left unset.

QUERY_STRING: This environment variable is set according to the information that is passed by the query. In the case of a GET query, the query string consists of whatever information is after the question mark (?) in the URL.

HTTP_USER_AGENT: This environment variable allows the browser to tell the server what its product name and version number are.

PATH: Every UNIX user has a path associated with his or her login, and the Web server process is no exception.

HTTP_CONNECTION: This environment variable is set by the Web browser to tell the server whether or not it supports a keep-alive connection.

HTTP_ACCEPT: This environment variable allows the Web browser to tell the Web server the different data formats it accepts in-line (plug-ins not included).

SCRIPT_NAME: This environment variable is set by the Web server and identifies the script that is being run.

SERVER_NAME: This environment variable is set by the Web server and identifies the Web server's hostname.

SERVER_PORT: This environment variable is set by the Web server and identifies the port address the server is "listening to" for connections.

HTTP_HOST: This environment variable comes from the Host header sent by the browser and indicates the hostname the browser used to reach the Web server.

SERVER_ADMIN: This environment variable, set by the Web server, indicates the e-mail address of the Web server administrator.

AUTH_TYPE: If the server supports user authentication, and the script is protected, this is the protocol-specific authentication method used to validate the user.

REMOTE_USER: If the server supports user authentication, and the script is protected, this is the username they have authenticated as.

REMOTE_IDENT: If the HTTP server supports RFC 931 identification, this variable will be set to the remote username retrieved from the server.

DOCUMENT_NAME: The current filename.

DOCUMENT_URI: The virtual path to the document.

QUERY_STRING_UNESCAPED: The unescaped version of any search query the client sent, with all shell-special characters escaped with \.

DATE_LOCAL: The current date and local time zone. Subject to the timefmt parameter to the config command.

DATE_GMT: Same as DATE_LOCAL but in Greenwich Mean Time.

LAST_MODIFIED: The last modification date of the current document. Subject to timefmt like the others.

Note that not all of these variables appear in the sample output. This is because different server and browser combinations create different environment variables. Netscape Navigator, Microsoft Internet Explorer, and many other Web browsers each put their own spin on environment variables, and either provide more environment variables or send richer information in the aforementioned variables. For example, Internet Explorer sends the current screen resolution in the browser-type environment variable. This allows dynamically generated Web pages to optimize their appearance for a particular screen size.
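Because the variables available differ from one server and browser to the next, a CGI program is safer treating each one as optional. The following sketch shows one way to do that, along with a simple split of an HTTP_USER_AGENT value such as Mozilla/3.0b5a (Win95; I) into its product name and version. The helper names env_or and split_user_agent are invented for this example, and the sketch assumes the value begins with a single Product/Version token, which not every browser guarantees.

```c
#include <stdlib.h>
#include <string.h>

/* Look up an environment variable, falling back to a default string
   when the server did not set it. */
const char *env_or(const char *name, const char *fallback)
{
    const char *value = getenv(name);
    return value ? value : fallback;
}

/* Split a leading "Product/Version" token into its two halves.
   Returns 1 on success, or 0 if the string does not start with such a
   token or if either half would not fit in its buffer. */
int split_user_agent(const char *ua, char *product, size_t psize,
                     char *version, size_t vsize)
{
    const char *slash = strchr(ua, '/');
    size_t plen, vlen;

    if (!slash || slash == ua)
        return 0;                        /* no product name before a slash */
    plen = (size_t)(slash - ua);
    if (strcspn(ua, " \t") < plen)
        return 0;                        /* whitespace before the slash */
    vlen = strcspn(slash + 1, " \t");    /* version runs to the next blank */
    if (plen >= psize || vlen >= vsize)
        return 0;

    memcpy(product, ua, plen);
    product[plen] = '\0';
    memcpy(version, slash + 1, vlen);
    version[vlen] = '\0';
    return 1;
}
```

A script would typically call split_user_agent(env_or("HTTP_USER_AGENT", ""), ...) and fall back to a generic page whenever it returns 0, rather than assuming any particular browser.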

Can I Get E-Mail Addresses?

One of the questions most often puzzled over by CGI programmers is whether or not they can obtain a user's e-mail address. Creators of browser software are very sensitive to this issue, and the answer is, in most cases, no. There are, however, certain browsers that pass along this information, at least to some extent.

Some browsers that return full e-mail address information are

NCSA Mosaic for Macintosh 2.0a17
NCSA Mosaic for Macintosh 2.0a8
MCom Netscape 0.9 beta (X, Mac, Windows)

A browser that returns the username is:

MCom Netscape 0.9 beta (X only)

The method by which environment variables are extracted in C is presented in Listing 21.4, which is essentially the C version of the showenv.cgi program.

Listing 21.4. Source code for the Web server environment variable printer.

// getenv.cpp -- Web Server Environment Variable Printer
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program displays all of the environment variables available to the
// web server when a user accesses this program via the CGI interface
//
// By Shuman Ghosemajumder, Anadas Software Development

#include <stdio.h>

int main(int argc, char *argv[], char *env[]);

int main(int argc, char *argv[], char *env[])
{
    int count;

    printf("Content-type: text/html\n\n");
    printf("<HTML><TITLE>Environment Variables</TITLE><BODY>\n");
    printf("<H1>Web Server Environment Variables</H1><ul>\n");

    // count is incremented in the loop header rather than inside the
    // printf() argument list, where the order of evaluation is undefined
    for( count = 0; env[count]; count++ )
    {
        printf("<B>Var %d.</B> %s<BR>\n", count, env[count] );
    }

    printf("</ul></BODY></HTML>\n");
    return(0);                              // exit gracefully
}
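One caveat worth keeping in mind with programs like these: values such as HTTP_USER_AGENT, or the file paths recorded in an access log, originate with the client, so they may contain characters that mean something in HTML. The listings in this chapter print them as-is. As a hedged sketch of a precaution you might add, the following copies a string while replacing the four characters that most often cause trouble. The function name html_escape and its return convention are assumptions made for this example; it is not part of the chapter's programs.

```c
#include <string.h>

/* Copy src into dst, replacing &, <, > and " with HTML entities.
   dst is always null-terminated when dsize > 0. Returns 1 if the whole
   string fit, or 0 if the output had to be cut short. */
int html_escape(char *dst, size_t dsize, const char *src)
{
    size_t used = 0;

    if (dsize == 0)
        return 0;

    for (; *src; src++) {
        char one[2];
        const char *rep;
        size_t len;

        switch (*src) {
        case '&': rep = "&amp;";  break;
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '"': rep = "&quot;"; break;
        default:                  /* ordinary character: copy unchanged */
            one[0] = *src;
            one[1] = '\0';
            rep = one;
            break;
        }

        len = strlen(rep);
        if (used + len >= dsize) {  /* no room for this piece plus '\0' */
            dst[used] = '\0';
            return 0;
        }
        memcpy(dst + used, rep, len);
        used += len;
    }
    dst[used] = '\0';
    return 1;
}
```

A program would escape each value into a scratch buffer and print the buffer instead of the raw string. Whole entities are either written or dropped, so a truncated result never ends in half an entity.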

Creating a Pseudo Access Log File

Having the ability to parse ready-made server access logs is wonderful, but what if you don't have access to those logs? As long as you can execute CGI scripts, you can create your own logs dynamically. Listing 21.5 is an example of a program that generates a "Pseudo Access Log File" every time it is loaded. This program creates a log file similar to the server log files, but with richer information.

Listing 21.5. Source code for the make log program.

// makelog.cpp -- MAKE LOG PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program creates a log file similar to the server log files, just
// with richer information.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// GENERAL ALGORITHM
//
// 1. Get the desired environment variables
//
// 2. Write them to a file!

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

// DEFINES AND STRUCTURES *********************************************

#define MAX_STRING  256
#define DATE_STRING 32
#define HOUR_STRING 5
#define LOG_FILE    "./pseudo-log"

// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void SafeGetEnv( const char * env_name, char * * ptr, char * null_string );

// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    char * browser, * hostname, * refer_url;
    char date[32];
    char empty_string[1];
    time_t bintime;

    time(&bintime);
    sprintf( date, "%s", ctime(&bintime) );
    date[24] = '\0';            // exactly 24 chars in length (drops the newline)
    empty_string[0] = '\0';

    SafeGetEnv( "REMOTE_HOST", &hostname, empty_string );
    SafeGetEnv( "HTTP_REFERER", &refer_url, empty_string );
    SafeGetEnv( "HTTP_USER_AGENT", &browser, empty_string );

    FILE * fp;
    fp = fopen( LOG_FILE, "a" );
    if( ! fp ) return (1);      // couldn't open the log file, nothing to write
    fprintf( fp, "%s %s %s %s\n", date, hostname, refer_url, browser );
    fclose( fp );

    return (0);                 // exit gracefully
}

void SafeGetEnv( const char * env_name, char * * ptr, char * null_string )
{
    // Normally a NULL pointer is returned if a certain environment variable
    // doesn't exist and you try to retrieve it. This function sets the value
    // of the pointer to point at an empty string instead.
    char * tmp;

    tmp = getenv( env_name );
    if( ! tmp ) *ptr = null_string;
    else *ptr = tmp;
}
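Listing 21.5 stamps each entry with the output of ctime(). If you would rather have your pseudo log carry the same bracketed timestamp the server's own access log uses, strftime() can produce it. The sketch below is one possible variation, not part of the listing; the function names format_clf_time and format_clf_now are invented here, and it assumes a C library whose strftime() supports the %z conversion for the numeric time zone offset.

```c
#include <time.h>

/* Format a broken-down time as "[DD/Mon/YYYY:hh:mm:ss +zzzz]".
   Returns the number of characters written, or 0 if buf is too small. */
size_t format_clf_time(char *buf, size_t size, const struct tm *tm)
{
    return strftime(buf, size, "[%d/%b/%Y:%H:%M:%S %z]", tm);
}

/* Convenience wrapper: format the current local time. */
size_t format_clf_now(char *buf, size_t size)
{
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);

    if (!tm)
        return 0;
    return format_clf_time(buf, size, tm);
}
```

The month abbreviation produced by %b follows the program's locale, which is the "C" locale (English names) unless the program changes it. A buffer of 32 characters, the size of the date array in the listing, is enough for this format.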

Logging Accesses

Now that we have a program to extract environment variable information, we're in much the same situation we were in when we simply had access to the access log file. We can create a huge log file of the various environment variable information we wish to keep track of, but the raw information isn't very useful unless we summarize it and have the output visible through the Web.

Listing 21.6 is a program that parses the pseudo access log created by the program in Listing 21.5. It reads in the pseudo access log file generated by makelog.cpp and generates an HTML document as output. The document summarizes all of the raw information presented in that access log into useful categories. Figure 21.2 shows some sample output from it.

Figure 21.2: A sample shot of the output from the Pseudo Access Log Summary program

Listing 21.6. Source code listing for the Pseudo Access Log Summary program.

// parselog.cpp -- ACCESS LOG SUMMARY PROGRAM for "MAKE LOG"
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program reads in the pseudo access log file generated by makelog.cpp
// and generates an HTML document as output. The document summarizes all of
// the raw information presented in that access log into useful categories.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// The categories it summarizes information for:
//
// * # of hits by domain
// * # of hits by referrer
// * # of hits by date
// * # of hits by browser
//
// GENERAL ALGORITHM
//
// 1. For each domain and file path, dynamically create a linked list
//    for each value, and add 1 to the hit count each time.
//
// 2. Create a linked list for each date, as well as each hour also.
//
// 3. Send the output to stdout.

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "linklist.h"      // Linked List Header File
#include "linklist.cpp"    // Linked List Functions

// DEFINES AND STRUCTURES *********************************************

#define MAX_STRING 256
#define DATE_STRING 32
#define HOUR_STRING 5
#define LOG_FILE "./pseudo-log"

typedef struct { char refer[MAX_STRING]; int num_access; } sREFER;
typedef struct { char browser[MAX_STRING]; int num_access; } sBROWSER;
typedef struct { char hostname[MAX_STRING]; int num_access; } sHOSTNAME;
typedef struct { char date[DATE_STRING]; int num_access; } sDATE;

// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void ProcessLine( char * line );
void PrintOutput( void );
void InitAll(void);
void DestroyAll(void);

// GLOBAL VARIABLES ***************************************************

sLINK * link_hostname;
sLINK * link_date;
sLINK * link_refer;
sLINK * link_browser;

// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    printf("Content-type: text/html\n\n");
    printf("<HTML><TITLE>Pseudo Access Log Summary</TITLE><BODY>\n");
    printf("<H1>Pseudo Access Log Summary</H1>\n");

    FILE * fp;
    fp = fopen( LOG_FILE, "r" );            // open the access log file
    if( ! fp )
    {
        printf("ERROR: Couldn't load log file!");   // abort painlessly
    }
    else                                    // if able to load file...
    {
        char line[512];
        InitAll();
        for(;;)
        {
            // fetch lines until EOF encountered
            if( fgets( line, 511, fp ) == NULL ) break;
            ProcessLine( line );            // extract the important information
        }
        fclose( fp );
        PrintOutput();                      // send the output to stdout
        DestroyAll();                       // only destroy what InitAll() set up
    }
    printf("</BODY></HTML>\n");             // end the HTML file
    return(0);                              // exit gracefully
}

void InitAll(void)
{
    // Initialize the head pointers
    InitHead( &link_hostname );
    InitHead( &link_refer );
    InitHead( &link_browser );
    InitHead( &link_date );
}

void DestroyAll(void)
{
    // Destroy the linked lists and free memory
    DestroyList( &link_hostname );
    DestroyList( &link_refer );
    DestroyList( &link_browser );
    DestroyList( &link_date );
}

void ProcessLine( char * line )
{
    // Process a single line of the pseudo access log file
    sHOSTNAME hn;
    sREFER rf;
    sBROWSER bs;
    sDATE dt;
    char * left, * right;
    sLINK * l;

    // Line Structure:
    //
    // get the date (24 chars)
    // get a space
    // get the hostname
    // get a space
    // get the referring URL
    // get a space
    // get the browser type (the remainder of the line)

    if( strlen( line ) < 26 ) return;       // bad entry: too short for a date

    left = line;
    right = (char *) left + 10;             // weekday, month and day only
    memcpy( dt.date, left, right-left );
    *(dt.date + (right-left) ) = '\0';
    l = FindNode( link_date, (void *) &dt, 0, strlen( dt.date ) );
    if( ! l )
    {
        dt.num_access = 1;
        AddNode( link_date, (void *) &dt, sizeof(sDATE) );
    }
    else
    {
        ((sDATE *) l->data)->num_access++;
    }

    left = &line[25];                       // skip the hour and the space
    right = strchr( left, ' ' );            // find the next space
    if( ! right ) return;                   // bad entry
    if( right-left >= MAX_STRING ) return;  // field too long for the buffer
    memcpy( hn.hostname, left, right-left );
    *(hn.hostname + (right-left) ) = '\0';
    l = FindNode( link_hostname, (void *) &hn, 0, strlen( hn.hostname ) );
    if( ! l )
    {
        hn.num_access = 1;
        AddNode( link_hostname, (void *) &hn, sizeof( sHOSTNAME ) );
    }
    else
    {
        ((sHOSTNAME *) l->data)->num_access++;
    }

    left = right+1;                         // skip the space
    right = strchr( left, ' ' );            // find the next space (end of the referring URL)
    if( ! right ) return;                   // bad entry
    if( right-left >= MAX_STRING ) return;  // field too long for the buffer
    memcpy( rf.refer, left, right-left );
    *(rf.refer + (right-left) ) = '\0';
    l = FindNode( link_refer, (void *) &rf, 0, strlen( rf.refer ) );
    if( ! l )
    {
        rf.num_access = 1;
        AddNode( link_refer, (void *) &rf, sizeof( sREFER ) );
    }
    else
    {
        ((sREFER *) l->data)->num_access++;
    }

    left = right+1;                         // skip the space
    right = strchr( left, '\n' );           // find the end
    if( ! right ) return;                   // bad entry
    if( right-left >= MAX_STRING ) return;  // field too long for the buffer
    memcpy( bs.browser, left, right-left );
    *(bs.browser + (right-left) ) = '\0';
    l = FindNode( link_browser, (void *) &bs, 0, strlen( bs.browser ) );
    if( ! l )
    {
        bs.num_access = 1;
        AddNode( link_browser, (void *) &bs, sizeof( sBROWSER ) );
    }
    else
    {
        ((sBROWSER *) l->data)->num_access++;
    }
}

void PrintOutput( void )
{
    // Send the output of the program to stdout
    sLINK * l;

    l = link_date;
    printf("<H2>By Date</H2>\n");
    printf("<ul>\n");
    for(;l;)
    {
        if( l->data )
        {
            printf("<li> <B>%s :</B> %d\n",
                   ((sDATE *) (l->data))->date,
                   ((sDATE *) (l->data))->num_access );
            l = l->next;
        }
        else break;
    }
    printf("</ul>\n");

    l = link_hostname;
    printf("<H2>By Hostname</H2>\n");
    printf("<ul>\n");
    for(;l;)
    {
        if( l->data )
        {
            printf("<li> <B>%s :</B> %d\n",
                   ((sHOSTNAME *) (l->data))->hostname,
                   ((sHOSTNAME *) (l->data))->num_access );
            l = l->next;
        }
        else break;
    }
    printf("</ul>\n");

    l = link_refer;
    printf("<H2>By Referer</H2>\n");
    printf("<ul>\n");
    for(;l;)
    {
        if( l->data )
        {
            printf("<li> <B><a href=\"%s\">%s</a> :</B> %d\n",
                   ((sREFER *) (l->data))->refer,
                   ((sREFER *) (l->data))->refer,
                   ((sREFER *) (l->data))->num_access );
            l = l->next;
        }
        else break;
    }
    printf("</ul>\n");

    l = link_browser;
    printf("<H2>By Browser</H2>\n");
    printf("<ul>\n");
    for(;l;)
    {
        if( l->data )
        {
            printf("<li> <B>%s :</B> %d\n",
                   ((sBROWSER *) (l->data))->browser,
                   ((sBROWSER *) (l->data))->num_access );
            l = l->next;
        }
        else break;
    }
    printf("</ul>\n");
}

This program can also be run on a regular basis via crontab, and thus users always have access to relatively current information. If it's critical that users have access to immediate information, you can create an access log program that uses some sort of database management system to find pre-existing "user records" (sorted perhaps on hostname or IP address) and adds information to that "user profile." Thus, the information would always be in a summarized format, and the on-line reader program would simply display the file's contents.
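Whichever storage you choose for those summarized records, the core step is the same: split one pseudo-log line into its fields and bump a counter for the key you care about. The following is a minimal sketch of that step using the standard library instead of the chapter's linked-list routines. The names LogEntry, ParseLogLine, and HitsByHost are hypothetical, and the line layout (24-character date, space, hostname, space, referring URL, space, browser) is assumed from the pseudo access log format described above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One parsed entry of the pseudo access log (hypothetical helper type).
struct LogEntry {
    std::string date;      // first 10 characters of the stamp, e.g. "Mon Jun 10"
    std::string hostname;
    std::string referer;
    std::string browser;
};

// Split one log line into its four fields. Returns false for a malformed line.
// Assumed layout: 24-character date, space, hostname, space, referer, space, browser.
bool ParseLogLine(const std::string& line, LogEntry& out) {
    if (line.size() < 26) return false;                 // too short to hold a date
    out.date = line.substr(0, 10);
    std::string::size_type host_end = line.find(' ', 25);
    if (host_end == std::string::npos) return false;
    out.hostname = line.substr(25, host_end - 25);
    std::string::size_type refer_end = line.find(' ', host_end + 1);
    if (refer_end == std::string::npos) return false;
    out.referer = line.substr(host_end + 1, refer_end - host_end - 1);
    out.browser = line.substr(refer_end + 1);
    // Drop the trailing newline, if the reader left one in place.
    if (!out.browser.empty() && out.browser[out.browser.size() - 1] == '\n')
        out.browser.erase(out.browser.size() - 1);
    return true;
}

// Count hits per hostname; malformed lines are skipped.
std::map<std::string, int> HitsByHost(const std::vector<std::string>& lines) {
    std::map<std::string, int> hits;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        LogEntry entry;
        if (ParseLogLine(lines[i], entry)) ++hits[entry.hostname];
    }
    return hits;
}
```

Keying the map on entry.referer or entry.browser instead gives the other summaries. A map keeps its keys sorted, so the report comes out in alphabetical order without an extra sorting pass.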

How to Implement Tracking CGIs

Up until now, you may not have given much thought to exactly how your Web server was allowing you to run CGIs. But consider that the programs you've seen so far in this chapter deal with user information that the regular visitor to your Web site would most likely never see. Surely you're not going to make them visit a URL they have no interest in visiting simply so you can store their information! Yet that's exactly what you'd be forced to do if you called your tracking CGIs via a URL that references a program in the /cgi-bin/ directory. Clearly, it's important for the tracking process to be completely transparent to the users yet still work just as efficiently for you. There's more than one way you can accomplish this.

index.cgi

Your Web server is probably set up so that if a directory contains a file named index.html (or perhaps home.html), the server sends that file to the browser when the user requests a URL that names the directory but not a specific file. Just about every Web server has an option (set in the srm.conf file on NCSA httpd-compatible servers) that allows index.cgi to be the default file instead. This lets you run a CGI script every time a user accesses the base document in any directory, while the user sees an HTML file as usual! The easiest way to accomplish this is to make index.cgi a shell script such as

#!/bin/sh
./logapp
echo Content-type: text/html
echo
cat real-home.html

First, the logging program (logapp) is called to store the user information into a file. The log program doesn't actually produce any output, and it has full access to the environment variable information that any explicitly called CGI script would. Then, the two echo commands send the HTTP Content-type header and the blank line that must follow it, telling the Web browser that an HTML document is coming, after which the actual home document for that directory is sent to the browser. This is the preferred method because it allows you the greatest degree of control, with the ability not only to execute CGI applications, but also to send HTTP headers directly.

index.shtml

If your server has server-side includes enabled, you can create a .shtml (server-parsed HTML) file, which allows you to call a CGI from within the HTML file. You can use the following syntax to invoke a CGI this way (shown here with the logapp program from the previous section):

<!--#exec cmd="./logapp" -->

Or, if you must execute programs from cgi-bin, use

<!--#exec cgi="/cgi-bin/logapp" -->

Including CGIs in Images

If your server has support for neither index.cgi nor index.shtml, you can still create a user-tracking CGI application that is automatically executed when you access a Web site, but it is slightly more limited. You can create a CGI shell script in your cgi-bin directory that looks something like this:

#!/bin/sh
./logapp
echo Content-type: image/gif
echo
cat image.gif

This program sends an image on the Web server to the browser but first executes the user logging application transparently. You would execute this script by including its URL in the Web page you wanted to monitor as an image. For example:

<img src="http://www.anadas.com/cgi-bin/log-image.cgi">

This would display an image on the Web browser, while your logging application would get executed every time the page was loaded, totally transparent to visitors to your site.

A Simple Web Counter

The idea of sending an image to the Web browser while "secretly" running a logging application need not be so secret. In fact, many logging applications prefer to return a custom image file that displays information such as the current number of hits to that Web page. You may have seen odometer-like images on some Web sites and wondered how you might create your own. You could certainly use one of the services on the Internet such as www.digits.com, which allows you to use their CGI application to both log your hits and display the fancy graphic, but you now have the tools to create your own such counter.

Listing 21.7 is an example of a simple Web counter. Its output is depicted in Figure 21.3.

Figure 21.3: Sample screen shot of the output from the graphical Web counter.

Listing 21.7. Source code listing for the graphical Web counter script.

// counter.cpp -- a graphical counter for a web page, to be included through
//                an IMG tag in an HTML document
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// Written by Shuman Ghosemajumder, Anadas Software Development
//
// General Algorithm:
//
// 1. Determine the filename to be read from / written to.
// 2. Update the counter data.
// 3. Convert the current count to an X-bitmap.
// 4. Output that X-bitmap to stdout

// INCLUDE FILES ************************************************************

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// DEFINES / PROTOTYPES *****************************************************

#define DIGIT_WIDTH 8
#define DIGIT_HEIGHT 12
#define NUM_DIGITS 6
#define DATA_FILENAME "counter.dat"

int main(int argc, char *argv[], char *env[]);

// GLOBAL VARIABLES *********************************************************

const char *xbmp_digits[10][12] = {
    {"0x7e", "0x7e", "0x66", "0x66", "0x66", "0x66",
     "0x66", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
    {"0x18", "0x1e", "0x1e", "0x18", "0x18", "0x18",
     "0x18", "0x18", "0x18", "0x18", "0x7e", "0x7e"},
    {"0x3c", "0x7e", "0x66", "0x60", "0x70", "0x38",
     "0x1c", "0x0c", "0x06", "0x06", "0x7e", "0x7e"},
    {"0x3c", "0x7e", "0x66", "0x60", "0x70", "0x38",
     "0x38", "0x70", "0x60", "0x66", "0x7e", "0x3c"},
    {"0x60", "0x66", "0x66", "0x66", "0x66", "0x66",
     "0x7e", "0x7e", "0x60", "0x60", "0x60", "0x60"},
    {"0x7e", "0x7e", "0x02", "0x02", "0x7e", "0x7e",
     "0x60", "0x60", "0x60", "0x66", "0x7e", "0x7e"},
    {"0x7e", "0x7e", "0x66", "0x06", "0x06", "0x7e",
     "0x7e", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
    {"0x7e", "0x7e", "0x60", "0x60", "0x60", "0x60",
     "0x60", "0x60", "0x60", "0x60", "0x60", "0x60"},
    {"0x7e", "0x7e", "0x66", "0x66", "0x7e", "0x7e",
     "0x66", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
    {"0x7e", "0x7e", "0x66", "0x66", "0x7e", "0x7e",
     "0x60", "0x60", "0x60", "0x66", "0x7e", "0x7e"}
};

int main(int argc, char *argv[], char *env[])
{
    char filename[256];               // the data filename
    int xbmp_count[NUM_DIGITS+1];     // the image buffer for the counter
    unsigned long count;              // the variable to store the counter
    int i, j;                         // Looping variables

    if ( argc >= 2 )
    {
        // if there is a command line parameter (passed after the ? operator
        // in the GET query), then the filename to store the data in should
        // be that parameter plus a ".dat" extension. The precision keeps an
        // overlong parameter from overflowing the filename buffer.
        sprintf( filename, "%.200s.dat", argv[1] );
    }
    else
    {
        // Otherwise, use the default filename
        strcpy( filename, DATA_FILENAME );
    }

    FILE * fp;
    if( ! (fp = fopen( filename, "rb" )) )     // try to open the file
    {
        count = 0;                             // if failure, reset counter
    }
    else                                       // if success,
    {
        // read in the counter; treat a short read as an empty counter
        if( fread( &count, sizeof(unsigned long), 1, fp ) != 1 ) count = 0;
        fclose(fp);
    }

    count++;                                   // count this access
    if( (fp = fopen( filename, "wb" )) )
    {
        fwrite( &count, sizeof(unsigned long), 1, fp );   // update counter
        fclose(fp);
    }

    printf("Content-type: image/x-xbitmap\n\n");   // the HTTP header

    // Separate the digits of the current counter value
    xbmp_count[NUM_DIGITS] = 0;
    for( i=0; i < NUM_DIGITS; i++ )
    {
        j = count % 10;
        xbmp_count[NUM_DIGITS-1-i] = j;
        count /= 10;
    }

    printf("#define counter_width %d\n", NUM_DIGITS*DIGIT_WIDTH);
    printf("#define counter_height %d\n\n", DIGIT_HEIGHT);

    // send the X-Bitmap information to stdout
    printf("static char counter_bits[] = {\n");
    for( i=0; i < DIGIT_HEIGHT; i++ )
    {
        for( j=0; j < NUM_DIGITS; j++ )
        {
            printf("%s", xbmp_digits[xbmp_count[j]][i] );
            if( (i < DIGIT_HEIGHT-1 ) || ( j < NUM_DIGITS-1 ) )
            {
                printf(", ");
            }
        }
        printf("\n");
    }
    printf("};\n");
    return(0);
}

Calling counter.cgi

Listing 21.8 shows sample HTML code for a page that invokes counter.cgi through an IMG tag to display the graphical Web counter.

Listing 21.8. HTML source code for a hypertext page.

<HTML>
<!-- http://www.anadas.com/cgiunleashed/trackuser/counter.html
     By Shuman Ghosemajumder, Anadas Software Development -->
<TITLE>Graphical Web Counter</TITLE>
<BODY>
<H1>Graphical Web Counter</H1>
<ul><li><B> This page has been accessed
<img src="http://www.anadas.com/cgiunleashed/trackuser/counter.cgi?counter.html">
times. </B></ul>
</BODY></HTML>

Locating Users Geographically

So far you've noticed that we're able to keep track of a great deal of information about visitors to our sites, but most of it is very "computer-related" rather than "real-world." In other words, it's great to know what their IP address, their hostname, and their HTTP-acceptance parameters are, but it's even better to know where they're dialed-in from, or even better, their name. It should already be quite clear that determining a user's name or e-mail address is very near impossible to do on anything remotely resembling a consistent basis, so any such notions are purely fanciful. Determining their general geographic location, however, is a piece of real-world information that is much more realistically attainable.

Discussion of Feasibility

The location from which a user is dialed-in (or directly connected to the Internet) is a piece of information that is most definitely not passed through any kind of environment variable. In fact, the vast majority of Web browsing programs probably don't have a clue as to where they're running from; one hard drive is just the same as any other to a freshly downloaded copy of Netscape or Internet Explorer, for example. There are two pieces of information you can use to determine geographic information, however: the hostname and the IP address.

The hostname can immediately provide some important, and almost guaranteed correct, geographic information via the first-level domain. Internet domains work from right to left, so that the first-level domain is represented by the rightmost string, the second-level domain is represented by the value to the left of that, and so on. For example, in the address www.anadas.com, com is the first-level (or top-level) domain, while anadas.com is the second-level domain. The first-level domains are decidedly finite in number and determine either the geographical location or the nature of the organization. For example, .com denotes a commercial organization, while a .ca extension denotes an organization in Canada. The various first-level domains are as follows:

Code	Country	Code	Country
AD	Andorra	LS	Lesotho
AE	United Arab Emirates	LT	Lithuania Ex-USSR
AF	Afghanistan	LU	Luxembourg
AG	Antigua and Barbuda	LV	Latvia
AI	Anguilla	LY	Libya
AL	Albania	MA	Morocco
AM	Armenia Ex-USSR	MC	Monaco
AN	Netherland Antilles	MD	Moldavia Ex-USSR
AO	Angola	MG	Madagascar
AQ	Antarctica	MH	Marshall Islands
AR	Argentina	ML	Mali
AS	American Samoa	MM	Myanmar
AT	Austria	MN	Mongolia
AU	Australia	MO	Macau
AW	Aruba	MP	Northern Mariana Isl.
AZ	Azerbaidjan Ex-USSR	MQ	Martinique (Fr.)
BA	Bosnia-Herzegovina Ex-Yugoslavia	MR	Mauritania
BB	Barbados	MS	Montserrat
BD	Bangladesh	MT	Malta
BE	Belgium	MU	Mauritius
BF	Burkina Faso	MV	Maldives
BG	Bulgaria	MW	Malawi
BH	Bahrain	MX	Mexico
BI	Burundi	MY	Malaysia
BJ	Benin	MZ	Mozambique
BM	Bermuda	NA	Namibia
BN	Brunei Darussalam	NC	New Caledonia (Fr.)
BO	Bolivia	NE	Niger
BR	Brazil	NF	Norfolk Island
BS	Bahamas	NG	Nigeria
BT	Bhutan	NI	Nicaragua
BV	Bouvet Island	NL	Netherlands
BW	Botswana	NO	Norway
BY	Belarus Ex-USSR	NP	Nepal
BZ	Belize	NR	Nauru
CA	Canada	NT	Neutral Zone
CC	Cocos (Keeling) Isl.	NU	Niue
CF	Central African Rep.	NZ	New Zealand
CG	Congo	OM	Oman
CH	Switzerland	PA	Panama
CI	Ivory Coast	PE	Peru
CK	Cook Islands	PF	Polynesia (Fr.)
CL	Chile	PG	Papua New Guinea
CM	Cameroon	PH	Philippines
CN	China	PK	Pakistan
CO	Colombia	PL	Poland
CR	Costa Rica	PM	St. Pierre & Miquelon
CS	Czechoslovakia	PN	Pitcairn
CU	Cuba	PT	Portugal
CV	Cape Verde	PR	Puerto Rico (US)
CX	Christmas Island	PW	Palau
CY	Cyprus	PY	Paraguay
CZ	Czech Republic	QA	Qatar
DE	Germany	RE	Reunion (Fr.)
DJ	Djibouti	RO	Romania
DK	Denmark	RU	Russian Federation Ex-USSR
DM	Dominica	RW	Rwanda
DO	Dominican Republic	SA	Saudi Arabia
DZ	Algeria	SB	Solomon Islands
EC	Ecuador	SC	Seychelles
EE	Estonia Ex-USSR	SD	Sudan
EG	Egypt	SE	Sweden
EH	Western Sahara	SG	Singapore
ES	Spain	SH	St. Helena
ET	Ethiopia	SI	Slovenia Ex-Yugoslavia
FI	Finland	SJ	Svalbard & Jan Mayen Isl.
FJ	Fiji	SK	Slovak Republic
FK	Falkland Isl.(Malvinas)	SL	Sierra Leone
FM	Micronesia	SM	San Marino
FO	Faroe Islands	SN	Senegal
FR	France	SO	Somalia
FX	France (European Ter.)	SR	Suriname
GA	Gabon	ST	St. Tome and Principe
GB	Great Britain	SU	Soviet Union
GD	Grenada	SV	El Salvador
GE	Georgia Ex-USSR	SY	Syria
GH	Ghana	SZ	Swaziland
GI	Gibraltar	TC	Turks & Caicos Islands
GL	Greenland	TD	Chad
GP	Guadeloupe (Fr.)	TF	French Southern Terr.
GQ	Equatorial Guinea	TG	Togo
GF	Guyana (Fr.)	TH	Thailand
GM	Gambia	TJ	Tadjikistan Ex-USSR
GN	Guinea	TK	Tokelau
GR	Greece	TM	Turkmenistan Ex-USSR
GT	Guatemala	TN	Tunisia
GU	Guam (US)	TO	Tonga
GW	Guinea Bissau	TP	East Timor
GY	Guyana	TR	Turkey
HK	Hong Kong	TT	Trinidad & Tobago
HM	Heard & McDonald Isl.	TV	Tuvalu
HN	Honduras	TW	Taiwan
HR	Croatia Ex-Yugoslavia	TZ	Tanzania
HT	Haiti	UA	Ukraine Ex-USSR
HU	Hungary	UG	Uganda
ID	Indonesia	UK	United Kingdom
IE	Ireland	UM	US Minor outlying isl.
IL	Israel	US	United States
IN	India	UY	Uruguay
IO	British Indian O. Terr.	UZ	Uzbekistan Ex-USSR
IQ	Iraq	VA	Vatican City State
IR	Iran	VC	St. Vincent & Grenadines
IS	Iceland	VE	Venezuela
IT	Italy	VG	Virgin Islands (British)
JM	Jamaica	VI	Virgin Islands (US)
JO	Jordan	VN	Vietnam
JP	Japan	VU	Vanuatu
KE	Kenya	WF	Wallis & Futuna Islands
KG	Kirgistan Ex-USSR	WS	Samoa
KH	Cambodia	YE	Yemen
KI	Kiribati	YU	Yugoslavia
KM	Comoros	ZA	South Africa
KN	St. Kitts Nevis Anguilla	ZM	Zambia
KP	Korea (North)	ZR	Zaire
KR	Korea (South)	ZW	Zimbabwe
KW	Kuwait	ARPA	Old-style Arpanet
KY	Cayman Islands	COM	Commercial
KZ	Kazachstan Ex-USSR	EDU	Educational
LA	Laos	GOV	Government
LB	Lebanon	INT	International
LC	Saint Lucia	MIL	US Military
LI	Liechtenstein	NATO	NATO
LK	Sri Lanka	NET	Network
LR	Liberia	ORG	Non-Profit Organization

If you're lucky enough to get a user whose hostname contains one of the geographical top-level domains, you can easily match the extension against the preceding table and determine which country he or she is from. However, the vast majority of users on the Internet are likely going to be accessing your site from a .com, .org, .edu, or .net domain. These domains are administered by InterNIC and can be given to organizations and institutions all over the world. Thus, the domain name alone doesn't provide us with their geographical location.
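Pulling the top-level domain out of a hostname is a small string operation. The sketch below shows one way to do it; the function names TopLevelDomain and IsCountryCode are hypothetical, and the two-letter test simply reflects the fact that every country entry in the preceding table is a two-letter code, while the organizational domains are longer.

```cpp
#include <cctype>
#include <string>

// Return the top-level domain of a hostname in upper case
// ("www.anadas.com" -> "COM"). Returns an empty string when there is
// no usable label, for example when the "hostname" is a numeric IP address.
std::string TopLevelDomain(const std::string& hostname) {
    std::string host = hostname;
    // Tolerate a fully qualified name written with a trailing dot.
    if (!host.empty() && host[host.size() - 1] == '.') host.erase(host.size() - 1);
    std::string::size_type dot = host.rfind('.');
    if (dot == std::string::npos || dot + 1 >= host.size()) return "";
    std::string tld = host.substr(dot + 1);
    for (std::string::size_type i = 0; i < tld.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(tld[i]);
        if (!std::isalpha(c)) return "";      // digits mean an address, not a name
        tld[i] = static_cast<char>(std::toupper(c));
    }
    return tld;
}

// True when the code has the two-letter form used by the country entries.
bool IsCountryCode(const std::string& tld) {
    return tld.size() == 2;
}
```

A logging program could call TopLevelDomain on REMOTE_HOST, look the result up in the table when IsCountryCode is true, and fall back to the whois techniques described next for COM, ORG, EDU, and NET hosts.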

Introduction to NSLOOKUP and WHOIS

This is where the InterNIC database itself comes in. Whenever InterNIC assigns a domain name to an organization, it keeps a record of various information about that organization on its own computer system. InterNIC is kind enough to allow the public access to this information, and it can be retrieved quickly and easily. The InterNIC whois database can be accessed with the following command:

whois -h rs.internic.net [domain name]

where [domain name] is the name of the domain you want further information on. Remember that in order to be able to find any information in InterNIC's database on a domain, that domain must have been directly administered by InterNIC. Thus, trying to access information on a .ca domain (which is administered by the CA domain registration committee in Canada) is quite futile. Here is an example of the output from a whois query on the domain name anadas.com:

Anadas Software Development (ANADAS-DOM)
   38 Grasmere Crescent
   London, Ontario N6G 4N8
   CANADA

   Domain Name: ANADAS.COM

   Administrative Contact, Billing Contact:
      Ghosemajumder, Shuman (SG331) shuman@ANADAS.COM
      (519) 858-0021
   Technical Contact, Zone Contact:
      Dice, Richard (RD78) rdice@ANADAS.COM
      (519) 858-0021

   Record last updated on 10-Jun-96.
   Record created on 15-Jul-95.

   Domain servers in listed order:

   NS.ANADAS.COM        199.45.70.4
   NS.UUNET.CA          142.77.1.1
   AUTH01.NS.UU.NET     198.6.1.81

Note that originally all we had was the hostname (anadas.com), yet now we have the company's country of origin, their province, and even their street address! In addition, we have contact names and even phone numbers! Of course, there's no guarantee that the individual user at the given address is going to be one of the InterNIC registration contacts; in fact, for most organizations, the odds are quite against it. But we do know the country associated with this organization, so we can record it as an access from Canada.

In many cases, the information on a particular hostname may be difficult to find on InterNIC's whois server because the domain is administered by a parent organization. Or perhaps you might have a numerical IP address that is sent as the hostname field. In these instances, you must do a whois lookup on the IP address itself, another query format supported by InterNIC's whois server.

In the case of a domain that is administered by a parent organization, it's useful to use nslookup to determine the IP address of the actual machine. The format for calling nslookup is

nslookup [hostname]

In this case, doing a lookup on www.anadas.com yields the following output:

Name:    www.anadas.com
Address: 199.45.70.165

The IP address will always have four numbers separated by three periods. For our purposes the fourth number can generally be ignored, because it identifies an individual machine within the registered network rather than the network itself. So we then do a whois query on 199.45.70, which yields the same information as before (or the information for the controlling organization we're looking for). Note that if this information is not available, we can strip off the next number and do a lookup on 199.45, which will give an even more generalized answer.
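The stripping procedure just described is easy to automate. The following sketch produces the successive query keys for an address, most specific first. WhoisNetworkKeys is a hypothetical name, and the function does no validation: it assumes it is handed a well-formed dotted address and only splits on the periods.

```cpp
#include <string>
#include <vector>

// Build whois query keys from a dotted address by dropping trailing
// numbers one at a time: "199.45.70.165" -> "199.45.70", then "199.45".
// Stops while two numbers remain, the most general lookup described above.
std::vector<std::string> WhoisNetworkKeys(const std::string& ip) {
    std::vector<std::string> keys;
    std::string rest = ip;
    for (;;) {
        std::string::size_type dot = rest.rfind('.');
        if (dot == std::string::npos) break;               // nothing left to strip
        rest.erase(dot);                                   // drop the last number
        if (rest.find('.') == std::string::npos) break;    // only one number left
        keys.push_back(rest);
    }
    return keys;
}
```

A caller would try each key in turn and stop at the first one for which the whois server returns a record.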

The information returned by InterNIC is in a relatively standardized format that is easily machine-parsable to allow you to create programs that automatically log additional information based on the hostname or IP address.
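As a small illustration of that kind of parsing, the sketch below pulls the value of a labeled field, such as "Domain Name:", out of the text of a whois record. WhoisField is a hypothetical helper, and it assumes only what the sample record shows: the label and its value sit on the same line. Unlabeled parts of the record, such as the address block that carries the country, would need their own handling.

```cpp
#include <string>

// Return the text that follows `label` on the first line containing it,
// with surrounding whitespace trimmed. Returns an empty string when the
// label does not appear in the record.
std::string WhoisField(const std::string& record, const std::string& label) {
    std::string::size_type pos = record.find(label);
    if (pos == std::string::npos) return "";
    pos += label.size();
    std::string::size_type end = record.find('\n', pos);
    std::string value = (end == std::string::npos)
                            ? record.substr(pos)
                            : record.substr(pos, end - pos);
    std::string::size_type first = value.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";             // label with no value
    std::string::size_type last = value.find_last_not_of(" \t\r");
    return value.substr(first, last - first + 1);
}
```

Given the anadas.com record shown earlier, WhoisField(record, "Domain Name:") would yield ANADAS.COM, which a logging program could store alongside the hostname.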

Limitations of Tracking Users Through IP Addresses

Tracking users' geographical locations by using the IP address or hostname as the basis for an InterNIC whois query works in most cases, but certainly not in all. Consider the case of an Internet Service Provider (ISP) based in Houston, which may have points of presence in New York and Los Angeles. The New York users would still have an IP address registered to the company in Houston, but recording their visit as a visit from a person in Houston would be quite erroneous. An example of this, on a much bigger scale, is the case of major on-line services like CompuServe and America Online. These services now provide access to the Internet, but it's all done through proxy servers connected to their centralized network. This means that users all over North America would be reported as connecting from the headquarters of the on-line service they were using rather than where they were really connecting from!

A work-around is to attempt to identify the major on-line services and organizations and build in contingency routines for users from those sites. But in the end, there are no totally definite methods of determining the geographic location of a user when given only an ambiguous IP address or hostname.
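One simple form such a contingency routine could take is a check against a list of domains known to route users through central proxies, so that hits from those hosts are recorded as "location unknown" instead of being assigned to the service's headquarters. The sketch below assumes a two-entry list for illustration only; the names kProxyDomains, InDomain, and LocationIsUnreliable are hypothetical, and a real list would have to be maintained by hand.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Illustrative list only: services whose users reach the Web through
// centralized proxies, so the registered address says little about location.
static const char* const kProxyDomains[] = { "compuserve.com", "aol.com" };

// True when `host` equals `domain` or is a subdomain of it.
bool InDomain(const std::string& host, const std::string& domain) {
    if (host.size() < domain.size()) return false;
    std::string::size_type start = host.size() - domain.size();
    if (host.compare(start, domain.size(), domain) != 0) return false;
    return start == 0 || host[start - 1] == '.';   // match whole labels only
}

// True when the hostname belongs to one of the listed proxy-based services.
bool LocationIsUnreliable(const std::string& hostname) {
    std::string host = hostname;
    for (std::string::size_type i = 0; i < host.size(); ++i)
        host[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(host[i])));
    for (std::size_t i = 0; i < sizeof(kProxyDomains) / sizeof(kProxyDomains[0]); ++i)
        if (InDomain(host, kProxyDomains[i])) return true;
    return false;
}
```

The whole-label check in InDomain matters: without it, an unrelated host whose name merely ends in the same letters would be misclassified.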

Cookies

Until now, we've been discussing methods of determining information about users prior to their visiting your Web site. Details such as their browser type, geographic location, and e-mail address exist before they ever visit your Web site. However, it's often very useful to be able to determine information about users after they've visited your Web site for the first time.

This is an excellent application for cookies. When a user initially visits your site, a cookie is assigned to their browser, which then sends it back to your Web server on each subsequent connection to your site. Thus, you can track information about how many "repeat visitors" your site gets, plus how these repeat visitors use the content on your site.
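On the server side, the cookie arrives in the HTTP_COOKIE environment variable as a list of name=value pairs separated by semicolons. The following sketch shows one cautious way to pick a single value out of that string and turn it into a visit count. The helper names GetCookieValue and NextVisitCount are hypothetical, and the cookie name COUNT is the one used by the counter program in this section.

```cpp
#include <cstdlib>
#include <string>

// Look up `name` in a cookie string such as "A=1; COUNT=7; B=x".
// Returns true and stores the value when an exactly matching name is found.
bool GetCookieValue(const std::string& header, const std::string& name,
                    std::string& value) {
    std::string::size_type pos = 0;
    while (pos < header.size()) {
        // Skip the separators between pairs.
        while (pos < header.size() && (header[pos] == ' ' || header[pos] == ';')) ++pos;
        std::string::size_type end = header.find(';', pos);
        if (end == std::string::npos) end = header.size();
        std::string::size_type eq = header.find('=', pos);
        if (eq != std::string::npos && eq < end &&
            header.compare(pos, eq - pos, name) == 0) {
            value = header.substr(eq + 1, end - eq - 1);
            return true;
        }
        pos = end;
    }
    return false;
}

// First visit (no COUNT cookie) yields 0; otherwise the stored count plus one.
int NextVisitCount(const std::string& header) {
    std::string stored;
    if (!GetCookieValue(header, "COUNT", stored)) return 0;
    return std::atoi(stored.c_str()) + 1;
}
```

Comparing whole names avoids a trap in plain substring searching: a cookie named XCOUNT would otherwise be mistaken for COUNT. The function also copes with a COUNT pair at the very end of the string, where no trailing semicolon follows the value.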

Listing 21.9 shows an example of a program that tracks users' visits through the use of cookies. Its output is depicted in Figure 21.4.

Figure 21.4: Sample screen shot of the output from the cookie-based counter.

Listing 21.9. Source code listing for the cookie counter script.

// set-cookie.cpp -- SET COOKIE PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program uses cookies to track the number of times a specific user
// has visited the script.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// GENERAL ALGORITHM
//
// 1. Check whether or not a cookie was passed.
// 2. If one was, increment the counter. If not, create a blank cookie.
// 3. Re-send the new cookie, blank or otherwise, to the browser.
// 4. Display the relevant output to stdout
//
// Notes: This program uses META HTTP-EQUIV rather than an actual HTTP
//        directive to ensure maximum compatibility. Certain servers seem
//        to have problems with cookies, but this should work across most
//        platforms.

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void SafeGetEnv( const char * env_name, char * * ptr, char * null_string );

// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    char * cookie;
    char empty_string[1];
    char * p;
    int val=0;

    empty_string[0] = '\0';
    SafeGetEnv( "HTTP_COOKIE", &cookie, empty_string );

    printf("Content-type: text/html\n\n");
    printf("<HTML><HEAD>");
    printf("<META HTTP-EQUIV=\"Set-Cookie\" ");

    p = strstr( cookie, "COUNT=" );
    if( ! p )
    {
        printf("Content=\"COUNT=0; expires=01-Jan-99 GMT; path=/cgiunleashed/trackuser; domain=.anadas.com\">\n");
    }
    else
    {
        p += strlen("COUNT=");
        char * ps;
        ps = strchr( p, ';');
        if( ps ) *ps = '\0';       // COUNT may be the last pair, with no ';' after it
        val = atoi( p );
        val++;
        printf("Content=\"COUNT=%d; expires=01-Jan-99 GMT; path=/cgiunleashed/trackuser; domain=.anadas.com\">\n", val);
    }

    printf("<TITLE>Cookie Test</TITLE></HEAD>\n");
    printf("<BODY>\n");
    printf("<H1>Cookie Test!</H1><HR><P>\n");
    if( val > 0 )
    {
        printf("<H3>You have been here %d times!</H3>\n", val );
    }
    else
    {
        printf("<H3>You have now been assigned a cookie!</H3>\n");
    }
    printf("</BODY></HTML>\n");
    return(0);                     // exit gracefully
}

void SafeGetEnv( const char * env_name, char * * ptr, char * null_string )
{
    // Normally a NULL pointer is returned if a certain environment variable
    // doesn't exist and you try to retrieve it. This function sets the value
    // of the pointer to point at an empty string instead.
    char * tmp;
    tmp = getenv( env_name );
    if( ! tmp ) *ptr = null_string;
    else *ptr = tmp;
}

Other Methods of Tracking Users

We've discussed several general methods of tracking information about any visitor to our Web site. But what about specific users? The markets for most successful Web sites that aren't incredibly general-purpose themselves (such as search engines or total Internet directories like Yahoo!) are usually very specifically targeted. This means that you already know certain things about the majority of your users, which can give you an advantage in tracking additional information about them.

For example, if you were creating a site for doctors and other health care professionals, you could use a database of all the major hospitals in North America to determine which hostnames and IP addresses correspond to which health care centers.

Fingering Dial-Up Servers

Earlier in the chapter, I stated that you couldn't get a general user's e-mail address on any consistent basis. While this is true, when you have a highly targeted Web site that generates hits from a limited audience, there is the possibility of determining the user's e-mail address if, and only if, you have the name of the machine where their actual login takes place, and that machine has a publicly accessible finger daemon configured and running.

If you think this sounds like a very specific set of circumstances, you're right. Fortunately, the vast majority of ISPs (Internet Service Providers) and even most standard servers are set up in this manner. The format for the finger command in this case is

finger @hostname

Keep in mind that the hostname is not necessarily the hostname they are accessing your site from. In the case of dial-up users, the hostname they are accessing you from refers to a specific SLIP or PPP port while you're looking for the server that contains the catalog of all SLIP or PPP connections. In the case that the user is accessing your site from a terminal on the reported hostname, you may have better luck. If you do manage to determine the hostname of the server you're looking for, the output will be something like this:

[dialup.anadas.com]
USER       TTY  FROM   LOGIN@   IDLE  WHAT
tsuki      00   borg   11:55AM  54    -su (tcsh)
rxm43      p0   pm66   9:43AM   0     -tcsh (tcsh)
ayondey    p1   alice  11:36AM  30    -su (tcsh)
challaday  p2   tc248  1:07PM   59    -tcsh (tcsh)
damian     p3   lorne  2:17PM   19    /bin/sh /usr/local/bin/mm (mm)
shuman     p4   sky    1:35PM   1     netscape &
rsilver    p5   pm81   2:34PM   0     w

Notice that we're given a complete list of users who are currently on the system in question. We would then determine which of these users was our visitor by looking at the WHAT field to see which user was running a Web browser at the time of our lookup. In this case, we see that user shuman was running Netscape Navigator, so he is most likely the one who was accessing our site.
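Scanning the WHAT field by eye can be automated along the following lines. This is only a sketch: it assumes the six-column layout of the listing above (real finger daemons format their output in many different ways), and the list of browser names is my own guess at what to look for. The names parse_finger, browser_users, and BROWSERS are hypothetical.

```python
SAMPLE = """\
[dialup.anadas.com]
USER       TTY  FROM   LOGIN@   IDLE  WHAT
tsuki      00   borg   11:55AM  54    -su (tcsh)
rxm43      p0   pm66   9:43AM   0     -tcsh (tcsh)
ayondey    p1   alice  11:36AM  30    -su (tcsh)
challaday  p2   tc248  1:07PM   59    -tcsh (tcsh)
damian     p3   lorne  2:17PM   19    /bin/sh /usr/local/bin/mm (mm)
shuman     p4   sky    1:35PM   1     netscape &
rsilver    p5   pm81   2:34PM   0     w
"""

# Assumed list of program names that indicate a running Web browser.
BROWSERS = ("netscape", "mosaic", "lynx")


def parse_finger(text):
    """Split finger output into one dict per logged-in user.
    Assumes the columns USER TTY FROM LOGIN@ IDLE WHAT, in that order."""
    rows = []
    for line in text.splitlines():
        # Split on whitespace at most 5 times so WHAT keeps its spaces.
        fields = line.strip().split(None, 5)
        if len(fields) < 6 or fields[0] == "USER":
            continue  # skip the [hostname] banner, header, and blank lines
        user, tty, origin, login, idle, what = fields
        rows.append({"user": user, "tty": tty, "from": origin,
                     "login": login, "idle": idle, "what": what})
    return rows


def browser_users(rows, browsers=BROWSERS):
    """Return the names of users whose WHAT field mentions a browser."""
    return [row["user"] for row in rows
            if any(name in row["what"].lower() for name in browsers)]
```

For the sample listing, browser_users(parse_finger(SAMPLE)) picks out shuman. Treat any such match as circumstantial: it shows that someone on the machine was running a browser at the moment of the lookup, not that this person requested your page.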

Caution

This example provides a great deal of information about the user who has accessed your site and will work under only the right, "lucky" circumstances. Nonetheless, acquiring e-mail addresses and then sending junk e-mail (or any other kind of unsolicited e-mail) is considered to be a grievous breach of etiquette and is a practice that should never be adopted.

The Ethics of Tracking Users

This chapter has revealed some very powerful techniques by which you can determine a great deal of information about the visitors to your site. However, as the saying goes, "With great power comes great responsibility," and this topic is no exception. Privacy is one of the most important issues that people must address when using the Internet. As Web developers, we must never compromise the privacy of our audience, for the benefit of the industry as a whole. People use the Internet exactly as much as they trust it, and no more. A single case of one user's privacy being compromised can reduce the level of trust of all users immeasurably.

Some excellent on-line resources on these topics include the following:

http://www.yahoo.com/Government/Law/Privacy/
http://www.anu.edu.au/people/Roger.Clarke/DV/
http://www.uiuc.edu/~ejk/WWW-privacy.html

Accessing This Chapter Online

You can access all of the code listings in this chapter, with accompanying executables, by visiting

http://www.anadas.com/cgiunleashed/trackuser/

The site is shown in Figure 21.5.

Figure 21.5: Screen shot of the Web site that contains the listings for this chapter.

Summary

The methods presented in this chapter will allow you to track just about every piece of information that is available about the users who access your Web site. Only you can determine which bits of data are the most useful to you, and you will most likely want to concentrate on tracking those. Note that summarizing raw data is the key to creating truly useful demographic reports. While the number of types of raw data is finite, there are many more ways in which you can summarize the data into cumulative categories, emphasizing the interrelationships within the data over the bare facts themselves. In other words, this is only the beginning. Good luck!
