Chapter 21

Tracking Users


There are several different methods you can use to track users.

Why Do We Need to Track Users?

It's easy enough to set up a World Wide Web site for yourself or your organization and gauge its success or failure solely on the amount of response you get via e-mail, phone, or fax. But if you rely on so simplistic a tracking mechanism, you won't get anywhere near the whole picture. Perhaps your site is attracting many visitors, but your response form is hard to find, so very few of them are getting in touch with you. Perhaps many people find your Web site via unrelated searches on Internet search engines and promptly leave. Or perhaps you've optimized your site for Netscape, but the people most interested in your content are using NCSA Mosaic and can't view any of your in-line images! In any of these cases, you could spend a long time waiting for user responses while being totally in the dark about why none were arriving.

This illustrates why it's so important to track user information on a constant basis. You can gain valuable insights not only into who is accessing your site, but also how they're finding it and where they might have heard of you. Plus, there's the all-important question of the total number of users visiting your site.

How Search Engines Work
Search engines such as Alta Vista, WebCrawler, InfoSeek, Lycos, and Excite possess vast databases of information, cataloging much of the content on the World Wide Web. Not only is the creation of such a huge database a task more difficult than any group of people could manually accomplish, it's also necessary to update all of the information on an increasingly frequent basis. Thus, the creators of these services designed automatic "robots" that roam the Web and retrieve Web site information for inclusion in the database. While this deals with the speed problem quite nicely, there is a serious problem introduced by this automatic approach: Machines, even ones with so-called artificial intelligence software, are still nowhere near as good as humans at categorizing information (well, at least not into categories that make sense to humans!). When a search engine's robot visits a site, it incorporates all of the text on that site into its database for reference in subsequent user searches. This means that a word inadvertently placed in the text of your Web site can cause people to find your site via searches on that word, thinking that your site might have something to do with that word! Suppose that you've set up a Web site about gardening, and in it you include a personal anecdote about how much your dog loves being outdoors with you. Thousands of dog-lovers might find your site because of that reference to your dog, be surprised that the site is about gardening and not dogs, and promptly leave! There are many other problems associated with the way automatic search engines work, which you'll no doubt discover when your site is added to them.

The Essence of Web Marketing

With the incredible corporate interest in the World Wide Web in the past few years, tracking users helps us get closer to an answer to the most crucial question for most organizations getting on the Web: Does the Web really work? In other words, does their Web site attract visitors, and if so, do those visitors turn into customers? In other media, hard numbers are available as answers to these questions. Newspapers have circulation figures, radio has broadcast ranges, and television has Nielsen ratings. It's surprising how many Web sites have unmonitored access levels since more precise visitor information can be gained on the Internet than through any other medium.

There is one key advantage these other media have over the Web, however: access to demographic information. The reason that accurate demographics (for example, the makeup of the audience by age, sex, income, and so on) are much more readily available for these traditional media is because the level of market penetration is such that a representative sampling of the general population in that area can be extrapolated meaningfully to apply to your whole audience. With the Web, you have several problems in doing this:

Both of these problems mean that the only way you could get accurate demographics would be while people are actually visiting your Web site. This can come across as somewhat obtrusive, and people accustomed to browsing through Web sites at high-speed with little or no thought involved have to be given a very good incentive to spend the time to fill out a survey form for your benefit.

This means that it's all the more crucial to identify whatever hard numbers you can automatically, and this is where the idea of tracking users comes in.

Parsing Access Logs

This section deals with one of the fundamental methods of collecting information about visitors to your Web site: the access log.

What Is an Access Log?

So where do we begin when trying to find out information about visitors to our site? How about on our Web server itself! As mentioned earlier in the book, HTTP, the HyperText Transfer Protocol, enables communication between your browser and the Web server via a series of discrete connections that fetch the text of the Web page being retrieved, and then each one of the graphics on that page in sequence. Did you know that every single time one of these requests is made, a record of that request is written to a log file? Here is a sample of the contents of an access log, from the file access-log, produced by NCSA httpd.

    ts17-15.slip.uwo.ca - - [09/Jul/1996:01:53:53 -0500] "POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1519
    ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:22 -0500] "POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1954
    ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:43 -0500] "POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1678
    pm107.spots.ab.ca - - [09/Jul/1996:01:59:28 -0500] "GET /pics/asd.gif HTTP/1.0" 304 0
    b61022.dial.tip.net - - [09/Jul/1996:02:03:36 -0500] "GET /pics/asd.gif HTTP/1.0" 200 4117
    slip11.docker.com - - [09/Jul/1996:02:03:49 -0500] "GET /rcr/ HTTP/1.0" 200 8751
    slip11.docker.com - - [09/Jul/1996:02:04:17 -0500] "GET /rcr/guest.html HTTP/1.0" 200 2984
    slip11.docker.com - - [09/Jul/1996:02:05:01 -0500] "GET /rcr/store.html HTTP/1.0" 200 34717
    port52.annex1.net.ubc.ca - - [09/Jul/1996:02:05:09 -0500] "GET /pics/asd.gif HTTP/1.0" 200 4117
    slip11.docker.com - - [09/Jul/1996:02:06:01 -0500] "GET /rcr/regint.html HTTP/1.0" 200 19452

NCSA, CERN, and Apache httpd all produce access logs in very similar formats, and collectively they have the vast majority of Web server market share, so this section will deal with extracting information from those servers. Other Web servers may store information in a different format, and you should consult the documentation that comes with yours to learn how to read it.

Note
You may have heard of the HTTP keep-alive protocol, which allows for a continuous connection to be maintained between the Web server and the Web browser. This doesn't contradict the nature of the discrete connections in HTTP; there are still multiple fetches made from the Web server. The difference is that the connection isn't terminated and restarted between each one while retrieving information on the same Web page.

Now, let's take a look at some of the information that is provided in the access log. The lines all take on a standard format, and, in fact, the entire access log consists of nothing but lines like these. The format of the lines is as follows:

host rfc931 authuser [DD/Mon/YYYY:hh:mm:ss] "request" ddd bbbb "opt_referer" "opt_agent"

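Here's a breakdown of the elements included in the lines:

host: The hostname (or IP address, if no name is available) of the machine making the request.
rfc931: The remote username as reported by an RFC 931 identification lookup, or - if it isn't available.
authuser: The username the visitor authenticated as, or - if the document isn't password-protected.
[DD/Mon/YYYY:hh:mm:ss]: The date and time of the request, followed by the server's offset from GMT (-0500 in the sample).
"request": The request line sent by the browser: the method, the path requested, and the protocol version.
ddd: The three-digit HTTP status code returned by the server (200 for success, 304 for not modified, and so on).
bbbb: The number of bytes sent in the response.
"opt_referer": Optionally, the URL of the page that referred the visitor to this one.
"opt_agent": Optionally, the name and version of the visitor's browser.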

Note that the last two fields are not usually enabled on most systems, and thus our sample program won't process them. It's easy enough to modify it so that it does, however.
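To make the line format concrete, here is a minimal sketch in C of one way to split a single log line into its fields with sscanf(). It is separate from Listing 21.1, which walks the line with strchr() instead; the struct log_entry type and the parse_log_line() function are names invented for this illustration, and the sketch assumes the two optional trailing fields are not enabled.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for this sketch; not part of Listing 21.1. */
struct log_entry {
    char host[256];       /* host */
    char ident[64];       /* rfc931 */
    char authuser[64];    /* authuser */
    char timestamp[40];   /* the text between [ and ] */
    char request[512];    /* the text between the double quotes */
    int  status;          /* ddd */
    long bytes;           /* bbbb, or -1 when the server logged "-" */
};

/* Parse one access log line. Returns 0 on success, or -1 if the line
   does not match the expected layout. Field widths in the format
   string keep sscanf() from overrunning the buffers. */
int parse_log_line(const char *line, struct log_entry *e)
{
    char bytes_field[32];
    int n;

    n = sscanf(line, "%255s %63s %63s [%39[^]]] \"%511[^\"]\" %d %31s",
               e->host, e->ident, e->authuser, e->timestamp,
               e->request, &e->status, bytes_field);

    if (n != 7) return -1;                      /* malformed entry */

    if (strcmp(bytes_field, "-") == 0)
        e->bytes = -1;                          /* no byte count logged */
    else if (sscanf(bytes_field, "%ld", &e->bytes) != 1)
        return -1;

    return 0;
}
```

A caller would read each line with fgets() and pass it to parse_log_line(), skipping any line for which it returns -1.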

With a line not only for each Web page access, but in fact for each graphic on each Web page as well, you might be able to imagine why access log files can grow to become several megabytes in size very quickly. If your Web server has a limited amount of hard drive space, the access log's growth might even risk crashing it!
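Because every graphic produces its own log line, a summary program may want to tell page requests apart from graphic requests. The following sketch shows one simple way to do that by looking at the end of the requested path. The function name is_graphic_request() and the list of extensions are assumptions made for this example; it also assumes the path carries no query string.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if the requested path ends in a common image extension
   (compared without regard to case), or 0 otherwise. The extension
   list is only an illustration; adjust it to suit your own site. */
int is_graphic_request(const char *path)
{
    static const char *exts[] = { ".gif", ".jpg", ".jpeg", ".xbm" };
    size_t plen = strlen(path);
    size_t i, j;

    for (i = 0; i < sizeof exts / sizeof exts[0]; i++) {
        size_t elen = strlen(exts[i]);
        const char *tail;

        if (plen < elen) continue;          /* path shorter than extension */

        tail = path + (plen - elen);        /* last elen characters of path */

        for (j = 0; j < elen; j++)
            if (tolower((unsigned char) tail[j]) != exts[i][j]) break;

        if (j == elen) return 1;            /* every character matched */
    }
    return 0;
}
```

Counting only the lines for which this function returns 0 gives a figure closer to the number of pages viewed than the raw number of hits.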

One solution to this problem is to delete the access log on a regular basis, after creating a summary of the information in it. So how exactly do you create a summary? Good question! This is where we get into our first program for this chapter, an httpd access log parser. The individual lines in the access log file, while fairly detailed, aren't terribly useful when viewed in their raw form. However, they can serve as the basis for all kinds of reports, created with software that summarizes the information into various categories. An example of such a program, the Access Log Summary program, is included in Listing 21.1, and its output is shown in Figure 21.1. The program reads in the server access log file and generates an HTML document that summarizes the raw information into useful categories.

Figure 21.1: The output from the access log summary program.


Listing 21.1. Source code for the Access Log Summary program.
// accsum.cpp -- ACCESS LOG SUMMARY PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program reads in the server access log file and generates an HTML
// document as output.  The document summarizes all of the raw information
// presented in the access log into useful categories
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// The categories it summarizes information for:
//
// * # of hits by domain
// * # of hits by file path
// * # of hits by day
// * # of hits by hour
//
// GENERAL ALGORITHM
//
// 1. For each domain and file path, dynamically create a linked list
//    for each value, and add 1 to the hit count each time.
//
// 2. Create a linked list for each date, as well as each hour also.
//
// 3. Send the output to stdout.

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include "linklist.h"           // Linked List Header Files

#include "linklist.cpp"         // Linked List Source Code

// DEFINES AND STRUCTURES *********************************************

#define MAX_STRING  256
#define DATE_STRING 32
#define HOUR_STRING 5

#define LOG_FILE "./test-access-log"

typedef struct
{   char hostname[MAX_STRING];
    int num_access;
} sHOSTNAME;

typedef struct
{   char filename[MAX_STRING];
    int num_access;
} sFILENAME;

typedef struct
{   char hour[HOUR_STRING];
    int num_access;
} sHOUR;

typedef struct
{   char date[DATE_STRING];
    int num_access;
} sDATE;


// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void ProcessLine( char * line );
void PrintOutput( void );
void InitAll(void);
void DestroyAll(void);

// GLOBAL VARIABLES ***************************************************

sLINK * link_hostname;
sLINK * link_filename;
sLINK * link_hour;
sLINK * link_date;


// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    // Opens the access log file, parses the information into a linked list
    // internal data representation, then sends the summary of the output to
    // stdout.

    printf("Content-type: text/html\n\n");
    printf("<HTML><TITLE>Access Log Summary</TITLE><BODY>\n");
    printf("<H1>Access Log Summary</H1>\n");

    FILE * fp;

    fp = fopen( LOG_FILE, "r" );                // open the access log file

    if( ! fp )
    {   printf("ERROR: Couldn't load log file!");    // abort painlessly
    }
    else                                         // if able to load file...
    {   char line[512];

        InitAll();

        for(;;)
        {   // fetch lines until EOF encountered

            if( fgets( line, sizeof(line), fp ) == NULL ) break;

            ProcessLine( line );        // extract the important information
        }

        fclose( fp );               // finished with the log file

        PrintOutput();              // send the output to stdout

        DestroyAll();               // the lists exist only if the file opened
    }

    printf("</BODY></HTML>\n");    // end the HTML file

    return(0);    // terminate gracefully
}

void InitAll(void)
{
    // Initialize the heads for each of the linked lists

    InitHead( &link_hostname );
    InitHead( &link_filename );
    InitHead( &link_hour );
    InitHead( &link_date );
}

void DestroyAll(void)
{
    // Destroy each of the linked lists (to free memory)

    DestroyList( &link_hostname );
    DestroyList( &link_filename );
    DestroyList( &link_hour );
    DestroyList( &link_date );
}


void ProcessLine( char * line )
{
    // Parse a single line of a standard web server access log

    sHOSTNAME hn;
    sFILENAME fn;
    sHOUR hr;
    sDATE dt;
    char * left, * right;
    sLINK * l;

    left = line;

    right = strchr( left, ' ' );        // find the first space

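    if( ! right ) return;               // bad entry
    if( right-left >= MAX_STRING ) return;      // too long for the buffer

    memcpy( hn.hostname, left, right-left );    // get the hostname
    *(hn.hostname + (right-left) ) = '\0';

    // compare through the '\0' so a name can't match a longer name it prefixes
    l = FindNode( link_hostname, (void *) &hn, 0, strlen( hn.hostname ) + 1 );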

    if( ! l )
    {   hn.num_access = 1;

        AddNode( link_hostname, (void *) &hn, sizeof( sHOSTNAME ) );
    }
    else
    {   ((sHOSTNAME *) l->data)->num_access++;
    }

    left = right+1;                 // skip the space
    right = strchr( left, ' ');     // find the next space (rfc931)
    if( ! right ) return;           // bad entry

    left = right+1;                 // skip the space
    right = strchr( left, ' ');     // find the next space (authuser)
    if( ! right ) return;           // bad entry

    left = right+1;                 // skip the space
    right = strchr( left, ':');     // find the colon (date delimiter)
    if( ! right ) return;           // bad entry

    left++;                         // skip the leading '['

    if( right < left || right-left >= DATE_STRING ) return;  // won't fit

    memcpy( dt.date, left, right-left );    // get the date
    *(dt.date + (right-left) ) = '\0';

    l = FindNode( link_date, (void *) &dt, 0, strlen( dt.date ) + 1 );

    if( ! l )
    {   dt.num_access = 1;

        AddNode( link_date, (void *) &dt, sizeof( sDATE ) );
    }
    else
    {   ((sDATE *) l->data)->num_access++;
    }

    left = right+1;                 // skip the colon
    right = strchr( left, ':');     // find the next colon (hour delimiter)
    if( ! right ) return;           // bad entry
    if( right-left >= HOUR_STRING ) return;     // too long for the buffer

    memcpy( hr.hour, left, right-left );    // get the hour
    *(hr.hour + (right-left) ) = '\0';

    l = FindNode( link_hour, (void *) &hr, 0, strlen( hr.hour ) + 1 );

    if( ! l )
    {   hr.num_access = 1;
        AddNode( link_hour, (void *) &hr, sizeof( sHOUR ) );
    }
    else
    {   ((sHOUR *) l->data)->num_access++;
    }

    left = strchr( line, '\"' );    // find the beginning of the request
    if( ! left ) return;            // bad entry

    right = strchr( left, ' ' );    // find the first space (Query Type)
    if( ! right ) return;           // bad entry

    left = right+1;                 // skip the space
    right = strchr( left, ' ' );    // find the next space (filename with path)
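    if( ! right ) return;           // bad entry
    if( right-left >= MAX_STRING ) return;      // too long for the buffer

    memcpy( fn.filename, left, right-left );    // get the filename with path
    *(fn.filename + (right-left) ) = '\0';

    // compare through the '\0' so "/rcr/" can't match "/rcr/guest.html"
    l = FindNode( link_filename, (void *) &fn, 0, strlen( fn.filename ) + 1 );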

    if( ! l )
    {   fn.num_access = 1;
        AddNode( link_filename, (void *) &fn, sizeof( sFILENAME ) );
    }
    else
    {   ((sFILENAME *) l->data)->num_access++;
    }
}


void PrintOutput( void )
{
    // Send the output from the program to stdout

    sLINK * l;

    l = link_date;

    printf("<H2>By Date</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sDATE *) (l->data))->date,
                                ((sDATE *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");

    l = link_hour;

    printf("<H2>By Hour</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sHOUR *) (l->data))->hour,
                                ((sHOUR *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");

    l = link_hostname;

    printf("<H2>By Hostname</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sHOSTNAME *) (l->data))->hostname,
                                ((sHOSTNAME *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");

    l = link_filename;

    printf("<H2>By Filename</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sFILENAME *) (l->data))->filename,
                                ((sFILENAME *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");
}

This program makes use of linked lists, which aren't supported directly in C as associative arrays are in Perl. Thus, there are some support routines that are needed in order to make the program function properly, and they are included here, in Listings 21.2 and 21.3.


Listing 21.2. The header file for the linked list routines.
// linklist.h  -- The Header file for the Linked List Routines
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// By Shuman Ghosemajumder, Anadas Software Development

// STRUCTURES *********************************************************

typedef struct linked_list
{   struct linked_list * next;
    void * data;
} sLINK;

// LINKED LIST FUNCTION PROTOTYPES ************************************

void InitHead( sLINK * * head );
void DestroyList( sLINK * * head );
int CountNodes( sLINK * head );
sLINK * GetNext( sLINK * l );
sLINK * AddNode( sLINK * head, void * data, int data_size );
sLINK * FindNode( sLINK * head, void * data, int offset, int data_size );


Listing 21.3. Source code for the linked list functions.
// linklist.cpp -- Linked List Functions
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// By Shuman Ghosemajumder, Anadas Software Development


void InitHead( sLINK * * head )
{
    // Initialize the head pointer of a linked list

    *head = (sLINK *) malloc( sizeof(sLINK) );      // allocate memory

    if( ! *head )
    {   printf("Memory allocation error.\n");
        exit(-1);
    }

    (*head)->data = NULL;                           // no data yet
    (*head)->next = NULL;                           // no next pointer yet
}


void DestroyList( sLINK * * head )
{
    // Destroy an entire linked list

    sLINK * l = *head;
    sLINK * temp;

    while( l )                                      // loop to destroy
    {   temp = l->next;                             // each node of the list,
        if( l->data )   free( l->data );            // including the last one,
        free( l );                                  // thus freeing memory
        l = temp;
    }

    *head = NULL;                                   // destroy the head pointer
}


sLINK * AddNode( sLINK * head, void * data, int data_size )
{
    // Add a node to the linked list

    sLINK * next = head;
    sLINK * last;

    do
    {   last = next;
        next = GetNext( next );
    }   while( next );                      // go to the end of the list

    // next == NULL, therefore last == the last node

    if( last->data == NULL )
    {   next = last;
    }
    else
    {   next = (sLINK *) malloc( sizeof(sLINK) );

        if( ! next )
        {   printf("Memory allocation error.\n");
            exit(-1);
        }
        last->next = next;
    }

    next->data = (void *) malloc( data_size );

    if( ! next->data )
    {   printf("Memory allocation error.\n");
        exit(-1);
    }

    memcpy( next->data, data, data_size );

    next->next = NULL;

    return ((sLINK *) next);
}

int CountNodes( sLINK * head )
{
    // Return the total number of nodes in the linked list

    int count = 0;

    do
    {   head = GetNext( head );
        count++;
    }   while( head );

    return count;
}

sLINK * GetNext( sLINK * l )
{
    // Given one node of the list, return a pointer to the next node if it
    // exists, or NULL if it doesn't.

    if( l->next != NULL ) return ((sLINK *) l->next);
    else                  return NULL;
}


sLINK * FindNode( sLINK * head, void * data, int offset, int data_size )
{
    // Compare "data" to the value at "offset" in the data structure portion
    // of the linked list, and return a pointer to the node which contains
    // this value if there is one.

    for(;;)
    {   if( head->data != NULL )
        {   if( memcmp( (char *) head->data + offset, (char *) data, data_size ) == 0 )
            {   return ( (sLINK *) head );
            }
            if( head->next )    head = head->next;
            else                return NULL;
        }
        else
        {   return NULL;
        }
    }
}

This program is a good starting point, but ideally you'd like the summary to be compiled automatically. As mentioned before, access logs are often several megabytes in size (some can be several hundred megabytes!), so generating these statistics in real time every time a user accesses the on-line summary page is unfeasible on most computer systems. The best solution is to have the summaries created in the background on the Web server on a regular basis, so users always get a reasonably current set of information and don't have to wait several minutes while the access log file is processed. There's a UNIX program called crontab that allows you to schedule events (such as the execution of your program) in the background. Here's how it works. First, you need to ensure that you (and not the Web server process) have access to crontab; contact your UNIX administrator to let him or her know of your requirement.

Caution
In general, the Web server process should have access to exactly what it needs and nothing more. Remember that if a rogue user gains control of the Web server process (via a false crontab file or some other means), he or she could effectively execute privileged commands with total anonymity, which is never a good situation on a computer system.

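After you've set up your crontab access, you should edit your crontab file and add a line similar to the following, which runs the auto-make script once a day at 6:00 a.m. The first field is the minute and the second is the hour; an asterisk in the minute field would run the script every minute of that hour.

0 06 * * * /usr/home/big/anadas/cgiunleashed/auto-make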

You should read your system's man page for crontab to ensure that you have your crontab file set up correctly.

Now that you've got crontab set up, you'll need to have an access log summary program that produces a Web-viewable summary.
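One detail worth considering when a scheduled job rewrites the summary page is that a visitor might request the page while it is only half written. A common approach, sketched below, is to write the new summary to a temporary file and then rename it over the old one. The function name publish_summary() and both file names are hypothetical; the sketch assumes a UNIX system where both files live in the same directory, so that rename() replaces the old page in a single step.

```c
#include <stdio.h>

/* Write html to tmp_path, then move it into place as final_path.
   Returns 0 on success or -1 on failure, in which case any partial
   temporary file is removed and the old summary is left untouched. */
int publish_summary(const char *final_path, const char *tmp_path,
                    const char *html)
{
    FILE *fp = fopen(tmp_path, "w");

    if (!fp) return -1;                     /* couldn't create temp file */

    if (fputs(html, fp) == EOF) {           /* write failed part-way */
        fclose(fp);
        remove(tmp_path);
        return -1;
    }

    if (fclose(fp) != 0) {                  /* flush to disk failed */
        remove(tmp_path);
        return -1;
    }

    if (rename(tmp_path, final_path) != 0) { /* swap the new page in */
        remove(tmp_path);
        return -1;
    }

    return 0;
}
```

A summary program run from crontab could build its HTML in memory, or write it to the temporary file directly, and finish with the rename() step.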

Environment Variables

The Web server's access log feature functions by recording information about the user who is visiting your server, which is sent from the user's own browser. While the information the access log records is very useful, it is by no means an exhaustive account of everything the browser "tells" the Web server about itself and the user.

Let's take a look at the output of the environment variables program first used in Chapter 12, "Imagemaps" (program is available on-line at http://www.anadas.com/cgiunleashed/imagemaps/exe/showenv.cgi):

SERVER_SOFTWARE=NCSA/1.5
GATEWAY_INTERFACE=CGI/1.1
DOCUMENT_ROOT=/usr/home/big/anadas
REMOTE_ADDR=199.45.70.220
SERVER_PROTOCOL=HTTP/1.0
REQUEST_METHOD=GET
REMOTE_HOST=tc220.wwdc.com
QUERY_STRING=
HTTP_USER_AGENT=Mozilla/3.0b5a (Win95; I)
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin:/usr/contrib/bin:/usr/X11/bin
HTTP_CONNECTION=Keep-Alive
HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
SCRIPT_NAME=/cgiunleashed/imagemaps/exe/showenv.cgi
SERVER_NAME=www.anadas.com
SERVER_PORT=80
HTTP_HOST=www.anadas.com
SERVER_ADMIN=shuman@anadas.com

This is the complete set of environment variables available to a CGI program on this particular server when a particular user accessed the script in question. Many of these variables are derived from the headers the browser sends with its request, and the Web server passes them to the script via the CGI interface. Note, however, that some of the variables are set entirely on the Web server's end, for the benefit of CGI programs that need to know additional information about their environment. So what do these environment variables mean?

SERVER_SOFTWARE: This indicates the actual Web server software, which in this case is NCSA httpd version 1.5.
GATEWAY_INTERFACE: This is the level of CGI compatibility supported by the server, which in this case is 1.1.
DOCUMENT_ROOT: This is also a server-set environment variable. It indicates the directory on the server that holds the root of the Web server's document tree (the files served as http://www.anadas.com/).
REMOTE_ADDR: This environment variable is set by the server from the incoming connection and indicates the IP address of the browser's Internet connection.
SERVER_PROTOCOL: This environment variable is set by the server according to the browser's request and indicates the HTTP compatibility level.
REQUEST_METHOD: This environment variable reflects the kind of query the browser has sent to the Web server. Normal document and file retrievals are classified as GET queries.
REMOTE_HOST: This environment variable is set by the server and indicates the hostname associated with the browser's IP address, if the server was able to look one up.
QUERY_STRING: This environment variable is set according to the information that is passed by the query. In the case of a GET query, the query string consists of whatever information is after the question mark (?) in the URL.
HTTP_USER_AGENT: This environment variable allows the browser to tell the server what its product name and version number are.
PATH: Every UNIX user has a path associated with his or her login, and the Web server process is no exception.
HTTP_CONNECTION: This environment variable is set by the Web browser to tell the server whether or not it supports a keep-alive connection.
HTTP_ACCEPT: This environment variable allows the Web browser to tell the Web server the different data formats it accepts in-line (plug-ins not included).
SCRIPT_NAME: This environment variable is set by the Web server and identifies the script that is being run.
SERVER_NAME: This environment variable is set by the Web server and identifies the Web server's hostname.
SERVER_PORT: This environment variable is set by the Web server and identifies the port address the server is "listening to" for connections.
HTTP_HOST: This environment variable carries the hostname the browser asked for in its request, which is normally the hostname of the Web server's host.
SERVER_ADMIN: This environment variable, set by the Web server, indicates the e-mail address of the Web server administrator.
AUTH_TYPE: If the server supports user authentication, and the script is protected, this is the protocol-specific authentication method used to validate the user.
REMOTE_USER: If the server supports user authentication, and the script is protected, this is the username they have authenticated as.
REMOTE_IDENT: If the HTTP server supports RFC 931 identification, this variable will be set to the remote username retrieved from the server.
DOCUMENT_NAME: The current filename.
DOCUMENT_URL: The virtual path to the document.
QUERY_STRING_UNESCAPED: The unescaped version of any search query the client sent, with all shell-special characters escaped with \.
DATE_LOCAL: The current date and local time zone. Subject to the timefmt parameter to the config command.
DATE_GMT: Same as DATE_LOCAL but in Greenwich Mean Time.
LAST_MODIFIED: The last modification date of the current document. Subject to timefmt like the others.

Note that not all of these variables appear in the sample output. This is because different server and browser combinations create different environment variables. Netscape Navigator, Microsoft Internet Explorer, and many other Web browsers each put their own spin on environment variables, and either provide more environment variables or send richer information in the aforementioned variables. For example, Internet Explorer sends the current screen resolution in the browser-type environment variable. This allows dynamically generated Web pages to optimize their appearance for a particular screen size.
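As a small illustration of putting one of these variables to work, the sketch below splits a browser identification string such as the Mozilla/3.0b5a (Win95; I) value shown in the sample output into a product name and a version. The function name split_user_agent() is invented for this example, and the sketch assumes the common product/version layout at the start of the string; browsers that don't follow it are simply reported as unparseable.

```c
#include <string.h>

/* Split "Product/Version (comment)" into its product and version parts.
   Returns 0 on success, or -1 if there is no slash, the product name is
   empty, or either part would not fit in the caller's buffer. */
int split_user_agent(const char *ua,
                     char *product, size_t psize,
                     char *version, size_t vsize)
{
    const char *slash = strchr(ua, '/');
    size_t plen, vlen;

    if (!slash || slash == ua) return -1;   /* no slash, or empty product */

    plen = (size_t) (slash - ua);           /* characters before the slash */
    vlen = strcspn(slash + 1, " ");         /* version runs to first space */

    if (plen >= psize || vlen >= vsize) return -1;

    memcpy(product, ua, plen);
    product[plen] = '\0';

    memcpy(version, slash + 1, vlen);
    version[vlen] = '\0';

    return 0;
}
```

In a CGI program, the string passed in would come from getenv("HTTP_USER_AGENT"), after checking that the result isn't NULL.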

Can I Get E-Mail Addresses?
One of the questions most often puzzled over by CGI programmers is whether or not they can obtain a user's e-mail address. Creators of browser software are very sensitive to this issue, and the answer is, in most cases, no. However, certain browsers do pass along this information, at least to some extent.
Some browsers that return full e-mail address information are:
  • NCSA Mosaic for Macintosh 2.0a17
  • NCSA Mosaic for Macintosh 2.0a8
  • MCom Netscape 0.9 beta (X, Mac, Windows)
A browser that returns the username is:
  • MCom Netscape 0.9 beta (X only)

The method by which environment variables are extracted in C is presented in Listing 21.4, which is essentially the C version of the showenv.cgi program.


Listing 21.4. Source code for the Web server environment variable printer.
// getenv.cpp -- Web Server Environment Variable Printer
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program displays all of the environment variables available to the
// web server when a user accesses this program via the CGI interface
//
// By Shuman Ghosemajumder, Anadas Software Development

#include <stdio.h>

int main(int argc, char *argv[], char *env[]);

int main(int argc, char *argv[], char *env[])
{
    int count;

    printf("Content-type: text/html\n\n");

    printf("<HTML><TITLE>Environment Variables</TITLE><BODY>\n");

    printf("<H1>Web Server Environment Variables</H1><ul>\n");

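    for(count=0;env[count];count++)
    {   // increment in the loop header: using count and count++ as
        // arguments of the same call has no defined evaluation order
        printf("<B>Var %d.</B> %s<BR>\n", count, env[count] );
    }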

    printf("</ul></BODY></HTML>\n");

    return(0);   // exit gracefully
}

Creating a Pseudo Access Log File

Having the ability to parse ready-made server access logs is wonderful, but what if you don't have access to those logs? As long as you can execute CGI scripts, you can create your own logs dynamically. Listing 21.5 is an example of a program that generates a "Pseudo Access Log File" every time it is loaded. This program creates a log file similar to the server log files, but with richer information.


Listing 21.5. Source code for the make log program.
// makelog.cpp -- MAKE LOG PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program creates a log file similar to the server log files, just
// with richer information.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// GENERAL ALGORITHM
//
// 1. Get the desired environment variables
//
// 2. Write them to a file!

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

// DEFINES AND STRUCTURES *********************************************

#define MAX_STRING  256
#define DATE_STRING 32
#define HOUR_STRING 5

#define LOG_FILE "./pseudo-log"

// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void SafeGetEnv( char * env_name, char * * ptr, char * null_string );

// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    char * browser,
         * hostname,
         * refer_url;
    char date[32];
    char empty_string[1];
    time_t bintime;

    time(&bintime);
    sprintf( date, "%s", ctime(&bintime) );
    date[24] = '\0';      // ctime() gives 24 characters plus '\n'; drop the '\n'

    empty_string[0] = '\0';

    SafeGetEnv( "REMOTE_HOST", &hostname, empty_string );
    SafeGetEnv( "HTTP_REFERER", &refer_url, empty_string );
    SafeGetEnv( "HTTP_USER_AGENT", &browser, empty_string );

    FILE * fp;

    fp = fopen( LOG_FILE, "a" );

    if( fp )    // skip logging quietly if the log file can't be opened
    {   fprintf( fp, "%s %s %s %s\n", date, hostname, refer_url, browser );

        fclose( fp );
    }

    return (0); // exit gracefully
}

void SafeGetEnv( char * env_name, char * * ptr, char * null_string )
{
    // Normally a NULL pointer is returned if a certain environment variable
    // doesn't exist and you try to retrieve it.  This function sets the value
    // of the pointer to point at an empty string instead.

     char * tmp;

     tmp = getenv( env_name );

     if( ! tmp )  *ptr = null_string;
     else         *ptr = tmp;
}

Logging Accesses

Now that we have a program to extract environment variable information, we're in much the same situation we were in when we simply had access to the access log file. We can create a huge log file of the various environment variable information we wish to keep track of, but the raw information isn't very useful unless we summarize it and have the output visible through the Web.

Listing 21.6 is a program that parses the pseudo access log created by the program in Listing 21.5. It reads in the pseudo access log file generated by makelog.cpp and generates an HTML document as output. The document summarizes all of the raw information presented in that access log into useful categories. Figure 21.2 shows some sample output from it.

Figure 21.2: A sample shot of the output from the Pseudo Access Log Summary program


Listing 21.6. Source code listing for the Pseudo Access Log Summary program.
// parselog.cpp -- ACCESS LOG SUMMARY PROGRAM for "MAKE LOG"
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program reads in the pseudo access log file generated by makelog.cpp
// and generates an HTML document as output.  The document summarizes all of
// the raw information presented in that access log into useful categories.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// The categories it summarizes information for:
//
// * # of hits by domain
// * # of hits by referrer
// * # of hits by date
// * # of hits by browser
//
// GENERAL ALGORITHM
//
// 1. For each hostname, referrer, and browser value, add a node to the
//    matching linked list, or add 1 to its hit count if it already exists.
//
// 2. Do the same for each date.
//
// 3. Send the output to stdout.


// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include "linklist.h"           // Linked List Header File

#include "linklist.cpp"         // Linked List Functions

// DEFINES AND STRUCTURES *********************************************

#define MAX_STRING  256
#define DATE_STRING 32
#define HOUR_STRING 5

#define LOG_FILE "./pseudo-log"

typedef struct
{   char refer[MAX_STRING];
    int num_access;
} sREFER;

typedef struct
{   char browser[MAX_STRING];
    int num_access;
} sBROWSER;

typedef struct
{   char hostname[MAX_STRING];
    int num_access;
} sHOSTNAME;

typedef struct
{   char date[DATE_STRING];
    int num_access;
} sDATE;


// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void ProcessLine( char * line );
void PrintOutput( void );
void InitAll(void);
void DestroyAll(void);

// GLOBAL VARIABLES ***************************************************

sLINK * link_hostname;
sLINK * link_date;
sLINK * link_refer;
sLINK * link_browser;


// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    printf("Content-type: text/html\n\n");
    printf("<HTML><TITLE>Pseudo Access Log Summary</TITLE><BODY>\n");
    printf("<H1>Pseudo Access Log Summary</H1>\n");

    FILE * fp;

    fp = fopen( LOG_FILE, "r" );                // open the access log file

    if( ! fp )
    {   printf("ERROR: Couldn't load log file!");    // abort painlessly
    }
    else                                         // if able to load file...
    {   char line[512];

        InitAll();

        for(;;)
        {   // fetch lines until EOF encountered

            if( fgets( line, 511, fp ) == NULL ) break;

            ProcessLine( line );        // extract the important information
        }

        PrintOutput();              // send the output to stdout

        DestroyAll();               // free the linked lists
        fclose( fp );
    }

    printf("</BODY></HTML>\n");    // end the HTML file

    return(0); // exit gracefully
}

void InitAll(void)
{
    // Initialize the head pointers

    InitHead( &link_hostname );
    InitHead( &link_refer );
    InitHead( &link_browser );
    InitHead( &link_date );
}

void DestroyAll(void)
{
    // Destroy the linked lists and free memory

    DestroyList( &link_hostname );
    DestroyList( &link_refer );
    DestroyList( &link_browser );
    DestroyList( &link_date );
}


void ProcessLine( char * line )
{
    // Process a single line of the pseudo access log file

    sHOSTNAME hn;
    sREFER rf;
    sBROWSER bs;
    sDATE dt;

    char * left, * right;
    sLINK * l;

    // Line Structure:
    //
    // get the date (24 chars)
    // get a space
    // get the hostname
    // get a space
    // get the referring URL
    // get a space
    // get the browser type (the remainder of the line)

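    if( strlen( line ) < 26 ) return;   // too short to hold a date and a hostname

    left = line;

    right = (char *) left + 10;         // first 10 chars: weekday, month, day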

    memcpy( dt.date, left, right-left );
    *(dt.date + (right-left) ) = '\0';

    l = FindNode( link_date, (void *) &dt, 0, strlen( dt.date ) );

    if( ! l )
    {   dt.num_access = 1;

        AddNode( link_date, (void *) &dt, sizeof(sDATE) );
    }
    else
    {   ((sDATE *) l->data)->num_access++;
    }

    left = &line[25];     // skip the rest of the date (time, year) and the space

    right = strchr( left, ' ' );        // find the next space

    if( ! right || right-left >= MAX_STRING ) return;   // bad or oversized entry

    memcpy( hn.hostname, left, right-left );    // copy the hostname
    *(hn.hostname + (right-left) ) = '\0';

    l = FindNode( link_hostname, (void *) &hn, 0, strlen( hn.hostname ) );

    if( ! l )
    {   hn.num_access = 1;

        AddNode( link_hostname, (void *) &hn, sizeof( sHOSTNAME ) );
    }
    else
    {   ((sHOSTNAME *) l->data)->num_access++;
    }

    left = right+1;                 // skip the space
    right = strchr( left, ' ' );    // find the next space (end of referring URL)
    if( ! right || right-left >= MAX_STRING ) return;   // bad or oversized entry

    memcpy( rf.refer, left, right-left );    // copy the referring URL
    *(rf.refer + (right-left) ) = '\0';

    l = FindNode( link_refer, (void *) &rf, 0, strlen( rf.refer ) );

    if( ! l )
    {   rf.num_access = 1;
        AddNode( link_refer, (void *) &rf, sizeof( sREFER ) );
    }
    else
    {   ((sREFER *) l->data)->num_access++;
    }

    left = right+1;                 // skip the space
    right = strchr( left, '\n' );   // find the end
    if( ! right || right-left >= MAX_STRING ) return;   // bad or oversized entry

    memcpy( bs.browser, left, right-left );    // copy the browser type
    *(bs.browser + (right-left) ) = '\0';

    l = FindNode( link_browser, (void *) &bs, 0, strlen( bs.browser ) );

    if( ! l )
    {   bs.num_access = 1;
        AddNode( link_browser, (void *) &bs, sizeof( sBROWSER ) );
    }
    else
    {   ((sBROWSER *) l->data)->num_access++;
    }
}


void PrintOutput( void )
{
    // Send the output of the program to stdout

    sLINK * l;

    l = link_date;

    printf("<H2>By Date</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sDATE *) (l->data))->date,
                                ((sDATE *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");

    l = link_hostname;

    printf("<H2>By Hostname</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sHOSTNAME *) (l->data))->hostname,
                                ((sHOSTNAME *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");

    l = link_refer;

    printf("<H2>By Referer</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B><a href=\"%s\">%s</a> :</B> %d\n",
                    ((sREFER *) (l->data))->refer,
                    ((sREFER *) (l->data))->refer,
                    ((sREFER *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");

    l = link_browser;

    printf("<H2>By Browser</H2>\n");
    printf("<ul>\n");

    for(;l;)
    {   if( l->data )
        {   printf("<li> <B>%s :</B> %d\n", ((sBROWSER *) (l->data))->browser,
                                ((sBROWSER *) (l->data))->num_access );
            l = l->next;
        }
        else    break;
    }
    printf("</ul>\n");
}

This program can also be run on a regular basis via crontab, and thus users always have access to relatively current information. If it's critical that users have access to immediate information, you can create an access log program that uses some sort of database management system to find pre-existing "user records" (sorted perhaps on hostname or IP address) and adds information to that "user profile." Thus, the information would always be in a summarized format, and the on-line reader program would simply display the file's contents.
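
The "always summarized" idea can be sketched without a full database management system. The following is a minimal illustration rather than part of the chapter's listings: it assumes the line layout written by Listing 21.5 (a 24-character date, a space, then the hostname, referrer, and browser), and the names HitSummary and recordLine are hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical running summary: hit counts keyed by hostname.
typedef std::map<std::string, int> HitSummary;

// Parse one pseudo-log line and add its hostname to the summary.
// Returns false (and changes nothing) if the line is too short or malformed.
bool recordLine(const std::string &line, HitSummary &summary)
{
    const std::string::size_type kHostStart = 25;   // 24-char date + 1 space

    if (line.size() <= kHostStart) return false;

    std::string::size_type end = line.find(' ', kHostStart);
    if (end == std::string::npos) return false;     // no referrer field follows

    std::string host = line.substr(kHostStart, end - kHostStart);
    if (host.empty()) host = "(unknown)";           // REMOTE_HOST was not set

    ++summary[host];    // operator[] starts a new host at zero
    return true;
}
```

A logging CGI built this way would call recordLine once per access and write the map back out, so that the on-line reader program only has to display it. How the summary is stored between runs is left open here.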

How to Implement Tracking CGIs

Up until now, you may not have given much thought to exactly how your Web server was allowing you to run CGIs. But consider that the programs you've seen so far in this chapter deal with user information that the regular visitor to your Web site would most likely never see. Surely you're not going to make them visit a URL they have no interest in visiting simply so you can store their information! Yet that's exactly what you'd be forced to do if you called your tracking CGIs via a URL that references a program in the /cgi-bin/ directory. Clearly, it's important for the tracking process to be completely transparent to the users yet still work just as efficiently for you. There's more than one way you can accomplish this.

index.cgi

Your Web server is probably set up so that if a directory contains a file named index.html (or perhaps home.html), the server sends that file whenever a user requests a URL that names the directory but not a specific file. Just about every Web server has an option (set in the srm.conf file on NCSA httpd-compatible servers) that allows index.cgi to be the default file instead. This lets you run a CGI script every time a user accesses the base document in a directory, while the user sees an HTML file as usual. The easiest way to accomplish this is to make index.cgi a shell script such as

#!/bin/sh
./logapp
echo Content-type: text/html
echo
cat real-home.html

First, the logging program (logapp) is called to store the user information in a file. The logging program produces no output of its own, and it has full access to the same environment variables that any explicitly called CGI script would. Then the two echo commands send the Content-type header and the blank line that ends the HTTP headers, telling the browser that an HTML document follows, after which the actual home document for that directory is sent. This is the preferred method because it gives you the greatest degree of control: you can not only execute CGI applications but also send HTTP headers directly.
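
If you would rather fold the logging and the page delivery into one compiled program, the response the shell script produces can be assembled as in the sketch below. This is only an illustration under stated assumptions: the function name buildHtmlResponse is hypothetical, and the logging step itself is omitted.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Build a complete CGI response for an HTML page: the Content-type header,
// the blank line that ends the headers, then the page body unchanged.
std::string buildHtmlResponse(std::istream &page)
{
    std::ostringstream out;
    out << "Content-type: text/html\n\n";   // same two lines the echo commands send
    out << page.rdbuf();                    // copy the page through as-is
    return out.str();
}
```

A program using this would log the access first, open real-home.html with std::ifstream, and write the returned string to standard output.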

index.shtml

If your server has server-side includes enabled, you can create a .shtml (server-parsed HTML file), which allows you to call a CGI from within the HTML file. You can use the following syntax to invoke a CGI this way:

<!--#exec cmd="Application"-->

Or, if you must execute programs from cgi-bin, use

<!--#exec cgi="CGI Program"-->

Including CGIs in Images

If your server has support for neither index.cgi nor index.shtml, you can still create a user-tracking CGI application that is automatically executed when you access a Web site, but it is slightly more limited. You can create a CGI shell script in your cgi-bin directory that looks something like this:

#!/bin/sh
./logapp
echo Content-type: image/gif
echo
cat image.gif

This program sends an image on the Web server to the browser but first executes the user logging application transparently. You would execute this script by including its URL in the Web page you wanted to monitor as an image. For example:

<img src="http://www.anadas.com/cgi-bin/log-image.cgi">

This displays an image in the Web browser, while your logging application is executed every time the page is loaded, completely transparently to visitors to your site.

A Simple Web Counter

The idea of sending an image to the Web browser while "secretly" running a logging application need not be so secret. In fact, many logging applications prefer to return a custom image file that displays information such as the current number of hits to that Web page. You may have seen odometer-like images on some Web sites and wondered how you might create your own. You could certainly use one of the services on the Internet such as www.digits.com, which allows you to use their CGI application to both log your hits and display the fancy graphic, but you now have the tools to create your own such counter.

Listing 21.7 is an example of a simple Web counter. Its output is depicted in Figure 21.3.

Figure 21.3: Sample screen shot of the output from the graphical Web counter.


Listing 21.7. Source code listing for the graphical Web counter script.
// counter.cpp  --  a graphical counter for a web page, to be included through
//                  an IMG tag in an HTML document
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// Written by Shuman Ghosemajumder, Anadas Software Development
//
// General Algorithm:
//
//  1. Determine the filename to be read from / written to.
//  2. Update the counter data.
//  3. Convert the current count to an X-bitmap.
//  4. Output that X-bitmap to stdout

// INCLUDE FILES ************************************************************

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// DEFINES / PROTOTYPES *****************************************************

#define DIGIT_WIDTH 8
#define DIGIT_HEIGHT 12
#define NUM_DIGITS 6
#define DATA_FILENAME "counter.dat"

int main(int argc, char *argv[], char *env[]);

// GLOBAL VARIABLES *********************************************************

const char *xbmp_digits[10][12] =  {
  {"0x7e", "0x7e", "0x66", "0x66", "0x66", "0x66",
           "0x66", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
  {"0x18", "0x1e", "0x1e", "0x18", "0x18", "0x18",
           "0x18", "0x18", "0x18", "0x18", "0x7e", "0x7e"},
  {"0x3c", "0x7e", "0x66", "0x60", "0x70", "0x38",
           "0x1c", "0x0c", "0x06", "0x06", "0x7e", "0x7e"},
  {"0x3c", "0x7e", "0x66", "0x60", "0x70", "0x38",
           "0x38", "0x70", "0x60", "0x66", "0x7e", "0x3c"},
  {"0x60", "0x66", "0x66", "0x66", "0x66", "0x66",
           "0x7e", "0x7e", "0x60", "0x60", "0x60", "0x60"},
  {"0x7e", "0x7e", "0x02", "0x02", "0x7e", "0x7e",
           "0x60", "0x60", "0x60", "0x66", "0x7e", "0x7e"},
  {"0x7e", "0x7e", "0x66", "0x06", "0x06", "0x7e",
           "0x7e", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
  {"0x7e", "0x7e", "0x60", "0x60", "0x60", "0x60",
           "0x60", "0x60", "0x60", "0x60", "0x60", "0x60"},
  {"0x7e", "0x7e", "0x66", "0x66", "0x7e", "0x7e",
           "0x66", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
  {"0x7e", "0x7e", "0x66", "0x66", "0x7e", "0x7e",
           "0x60", "0x60", "0x60", "0x66", "0x7e", "0x7e"}
};


int main(int argc, char *argv[], char *env[])
{
    char filename[256];                 // the data filename
    int xbmp_count[NUM_DIGITS+1];       // the image buffer for the counter
    unsigned long count;                // the variable to store the counter
    int i, j;                           // Looping variables

    if ( argc >= 2 )
    {   // if there is a command line parameter (passed after the ? operator
        // in the GET query), then the filename to store the data in should
        // be that parameter plus a ".dat" extension.

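        // Note: argv[1] comes from the visitor, so a production version
        // should also reject names containing '/' or "..".
        snprintf( filename, sizeof(filename), "%s.dat", argv[1] );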
    }
    else
    {   // Otherwise, use the default filename

        strcpy( filename, DATA_FILENAME );
    }

    FILE * fp;

    if( ! (fp = fopen( filename, "rb" )) )      // try to open the file
    {   count = 0;                              // if failure, reset counter
    }
    else                                        // if success,
    {   if( fread( &count, sizeof(unsigned long), 1, fp ) != 1 )
            count = 0;                          // empty or short file: reset
        fclose(fp);
    }

    ++count;                                    // count this access

    if( (fp = fopen( filename, "wb" )) != NULL )
    {   fwrite( &count, sizeof(unsigned long), 1, fp );   // save the new count
        fclose(fp);
    }

    printf("Content-type:image/x-xbitmap\n\n");             // the HTTP header

    // Separate the digits of the current counter value

    xbmp_count[NUM_DIGITS] = '\0';

    for( i=0;i< NUM_DIGITS;i++)
    {   j = count % 10;
        xbmp_count[NUM_DIGITS-1-i] = j;
        count /= 10;
    }

    printf("#define counter_width %d\n",NUM_DIGITS*DIGIT_WIDTH);
    printf("#define counter_height %d\n\n",DIGIT_HEIGHT);

    // send the X-Bitmap information to stdout

    printf("static char counter_bits[] = {\n");

    for(i=0;i < DIGIT_HEIGHT; i++)
    {
        for(j=0;j < NUM_DIGITS; j++)
        {
            printf("%s", xbmp_digits[xbmp_count[j]][i] );

            if( (i < DIGIT_HEIGHT-1 ) || ( j< NUM_DIGITS-1 ) )
            {   printf(", ");
            }
        }
        printf("\n");
    }
    printf("};\n");

    return(0);  // exit gracefully
}

Calling counter.cgi

Listing 21.8 shows sample HTML for a page that invokes counter.cgi through an IMG tag to display the graphical Web counter.


Listing 21.8. HTML source code for a hypertext page.
<HTML>
<!-- http://www.anadas.com/cgiunleashed/trackuser/counter.html

     By Shuman Ghosemajumder, Anadas Software Development -->

<TITLE>Graphical Web Counter</TITLE>

<BODY>

    <H1>Graphical Web Counter</H1>

    <ul><li><B>

    This page has been accessed
    <img src="http://www.anadas.com/cgiunleashed/trackuser/counter.cgi?counter.html"> times.

    </B></ul>

</BODY></HTML>

Locating Users Geographically

So far, we've been able to keep track of a great deal of information about visitors to our sites, but most of it is "computer-related" rather than "real-world." In other words, it's great to know a visitor's IP address, hostname, and HTTP-acceptance parameters, but it's even better to know where they're dialed in from or, better still, their name. It should already be clear that determining a user's name or e-mail address is nearly impossible to do on anything resembling a consistent basis, so any such notions are purely fanciful. A user's general geographic location, however, is a piece of real-world information that is much more realistically attainable.

Discussion of Feasibility

The location from which a user is dialed-in (or directly connected to the Internet) is a piece of information that is most definitely not passed through any kind of environment variable. In fact, the vast majority of Web browsing programs probably don't have a clue as to where they're running from; one hard drive is just the same as any other to a freshly downloaded copy of Netscape or Internet Explorer, for example. There are two pieces of information you can use to determine geographic information, however: the hostname and the IP address.

The hostname can immediately provide some important, and almost guaranteed correct, geographic information via the first-level domain. Internet domains work from right to left, so that the first-level domain is represented by the rightmost string, the second-level domain is represented by the value to the left of that, and so on. For example, in the address www.anadas.com, com is the first-level (or top-level) domain, while anadas.com is the second-level domain. The first-level domains are decidedly finite in number and determine either the geographical location or the nature of the organization. For example, .com denotes a commercial organization, while a .ca extension denotes an organization in Canada. The various first-level domains are as follows:

Code  Country                           Code  Country
AD    Andorra                           LS    Lesotho
AE    United Arab Emirates              LT    Lithuania Ex-USSR
AF    Afghanistan                       LU    Luxembourg
AG    Antigua and Barbuda               LV    Latvia
AI    Anguilla                          LY    Libya
AL    Albania                           MA    Morocco
AM    Armenia Ex-USSR                   MC    Monaco
AN    Netherlands Antilles              MD    Moldavia Ex-USSR
AO    Angola                            MG    Madagascar
AQ    Antarctica                        MH    Marshall Islands
AR    Argentina                         ML    Mali
AS    American Samoa                    MM    Myanmar
AT    Austria                           MN    Mongolia
AU    Australia                         MO    Macau
AW    Aruba                             MP    Northern Mariana Isl.
AZ    Azerbaidjan Ex-USSR               MQ    Martinique (Fr.)
BA    Bosnia-Herzegovina Ex-Yugoslavia  MR    Mauritania
BB    Barbados                          MS    Montserrat
BD    Bangladesh                        MT    Malta
BE    Belgium                           MU    Mauritius
BF    Burkina Faso                      MV    Maldives
BG    Bulgaria                          MW    Malawi
BH    Bahrain                           MX    Mexico
BI    Burundi                           MY    Malaysia
BJ    Benin                             MZ    Mozambique
BM    Bermuda                           NA    Namibia
BN    Brunei Darussalam                 NC    New Caledonia (Fr.)
BO    Bolivia                           NE    Niger
BR    Brazil                            NF    Norfolk Island
BS    Bahamas                           NG    Nigeria
BT    Bhutan                            NI    Nicaragua
BV    Bouvet Island                     NL    Netherlands
BW    Botswana                          NO    Norway
BY    Belarus Ex-USSR                   NP    Nepal
BZ    Belize                            NR    Nauru
CA    Canada                            NT    Neutral Zone
CC    Cocos (Keeling) Isl.              NU    Niue
CF    Central African Rep.              NZ    New Zealand
CG    Congo                             OM    Oman
CH    Switzerland                       PA    Panama
CI    Ivory Coast                       PE    Peru
CK    Cook Islands                      PF    Polynesia (Fr.)
CL    Chile                             PG    Papua New Guinea
CM    Cameroon                          PH    Philippines
CN    China                             PK    Pakistan
CO    Colombia                          PL    Poland
CR    Costa Rica                        PM    St. Pierre & Miquelon
CS    Czechoslovakia                    PN    Pitcairn
CU    Cuba                              PT    Portugal
CV    Cape Verde                        PR    Puerto Rico (US)
CX    Christmas Island                  PW    Palau
CY    Cyprus                            PY    Paraguay
CZ    Czech Republic                    QA    Qatar
DE    Germany                           RE    Reunion (Fr.)
DJ    Djibouti                          RO    Romania
DK    Denmark                           RU    Russian Federation Ex-USSR
DM    Dominica                          RW    Rwanda
DO    Dominican Republic                SA    Saudi Arabia
DZ    Algeria                           SB    Solomon Islands
EC    Ecuador                           SC    Seychelles
EE    Estonia Ex-USSR                   SD    Sudan
EG    Egypt                             SE    Sweden
EH    Western Sahara                    SG    Singapore
ES    Spain                             SH    St. Helena
ET    Ethiopia                          SI    Slovenia Ex-Yugoslavia
FI    Finland                           SJ    Svalbard & Jan Mayen Isl.
FJ    Fiji                              SK    Slovak Republic
FK    Falkland Isl. (Malvinas)          SL    Sierra Leone
FM    Micronesia                        SM    San Marino
FO    Faroe Islands                     SN    Senegal
FR    France                            SO    Somalia
FX    France (European Ter.)            SR    Suriname
GA    Gabon                             ST    St. Tome and Principe
GB    Great Britain                     SU    Soviet Union
GD    Grenada                           SV    El Salvador
GE    Georgia Ex-USSR                   SY    Syria
GH    Ghana                             SZ    Swaziland
GI    Gibraltar                         TC    Turks & Caicos Islands
GL    Greenland                         TD    Chad
GP    Guadeloupe (Fr.)                  TF    French Southern Terr.
GQ    Equatorial Guinea                 TG    Togo
GF    Guyana (Fr.)                      TH    Thailand
GM    Gambia                            TJ    Tadjikistan Ex-USSR
GN    Guinea                            TK    Tokelau
GR    Greece                            TM    Turkmenistan Ex-USSR
GT    Guatemala                         TN    Tunisia
GU    Guam (US)                         TO    Tonga
GW    Guinea Bissau                     TP    East Timor
GY    Guyana                            TR    Turkey
HK    Hong Kong                         TT    Trinidad & Tobago
HM    Heard & McDonald Isl.             TV    Tuvalu
HN    Honduras                          TW    Taiwan
HR    Croatia Ex-Yugoslavia             TZ    Tanzania
HT    Haiti                             UA    Ukraine Ex-USSR
HU    Hungary                           UG    Uganda
ID    Indonesia                         UK    United Kingdom
IE    Ireland                           UM    US Minor outlying isl.
IL    Israel                            US    United States
IN    India                             UY    Uruguay
IO    British Indian O. Terr.           UZ    Uzbekistan Ex-USSR
IQ    Iraq                              VA    Vatican City State
IR    Iran                              VC    St. Vincent & Grenadines
IS    Iceland                           VE    Venezuela
IT    Italy                             VG    Virgin Islands (British)
JM    Jamaica                           VI    Virgin Islands (US)
JO    Jordan                            VN    Vietnam
JP    Japan                             VU    Vanuatu
KE    Kenya                             WF    Wallis & Futuna Islands
KG    Kirgistan Ex-USSR                 WS    Samoa
KH    Cambodia                          YE    Yemen
KI    Kiribati                          YU    Yugoslavia
KM    Comoros                           ZA    South Africa
KN    St. Kitts Nevis Anguilla          ZM    Zambia
KP    Korea (North)                     ZR    Zaire
KR    Korea (South)                     ZW    Zimbabwe
KW    Kuwait                            ARPA  Old-style Arpanet
KY    Cayman Islands                    COM   Commercial
KZ    Kazachstan Ex-USSR                EDU   Educational
LA    Laos                              GOV   Government
LB    Lebanon                           INT   International
LC    Saint Lucia                       MIL   US Military
LI    Liechtenstein                     NATO  NATO
LK    Sri Lanka                         NET   Network
LR    Liberia                           ORG   Non-Profit Organization

If you're lucky enough to get a user whose hostname contains one of the geographical top-level domains, you can easily match the extension against the preceding table and determine which country he or she is from. However, the vast majority of users on the Internet are likely going to be accessing your site from a .com, .org, .edu, or .net domain. These domains are administered by InterNIC and can be given to organizations and institutions all over the world. Thus, the domain name alone doesn't provide us with their geographical location.
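
Matching a hostname against the table can be sketched as follows. This is a minimal illustration, not code from the chapter's listings: the function names topLevelDomain and countryForHost are hypothetical, and only a four-entry excerpt of the table is included.

```cpp
#include <cctype>
#include <map>
#include <string>

// Return the first-level (top-level) domain of a hostname, uppercased,
// or an empty string if the name has no dot or looks like a numeric IP address.
std::string topLevelDomain(const std::string &hostname)
{
    std::string::size_type dot = hostname.rfind('.');
    if (dot == std::string::npos || dot + 1 >= hostname.size()) return "";

    std::string tld = hostname.substr(dot + 1);
    for (std::string::size_type i = 0; i < tld.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(tld[i]);
        if (!std::isalpha(c)) return "";    // e.g. the last number of an IP address
        tld[i] = static_cast<char>(std::toupper(c));
    }
    return tld;
}

// Look the domain up in a deliberately tiny excerpt of the table above.
// Generic domains such as COM or NET say nothing about location, so they
// come back as "unknown", along with anything not in the excerpt.
std::string countryForHost(const std::string &hostname)
{
    static std::map<std::string, std::string> table;
    if (table.empty())
    {
        table["CA"] = "Canada";
        table["DE"] = "Germany";
        table["JP"] = "Japan";
        table["UK"] = "United Kingdom";
    }

    std::map<std::string, std::string>::const_iterator it =
        table.find(topLevelDomain(hostname));
    if (it == table.end()) return "unknown";
    return it->second;
}
```

With this excerpt, a hostname such as ns.uunet.ca maps to Canada, while www.anadas.com comes back as unknown, which is exactly the case the whois lookup in the next section is meant to handle.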

Introduction to NSLOOKUP and WHOIS

This is where the InterNIC database itself comes in. Whenever InterNIC assigns a domain name to an organization, it keeps a record of various information about that organization on its own computer system. InterNIC allows the public to access this information, and lookups are quick and easy. The InterNIC whois database can be accessed with the following command:

whois -h rs.internic.net [domain name]

where [domain name] is the name of the domain you want further information on. Remember that in order to be able to find any information in InterNIC's database on a domain, that domain must have been directly administered by InterNIC. Thus, trying to access information on a .ca domain (which is administered by the CA domain registration committee in Canada) is quite futile. Here is an example of the output from a whois query on the domain name anadas.com:

Anadas Software Development (ANADAS-DOM)
38 Grasmere Crescent
London, Ontario N6G 4N8
CANADA

Domain Name: ANADAS.COM

Administrative Contact, Billing Contact:
Ghosemajumder, Shuman  (SG331)  shuman@ANADAS.COM
(519) 858-0021
Technical Contact, Zone Contact:
Dice, Richard (RD78) rdice@ANADAS.COM
(519) 858-0021

Record last updated on 10-Jun-96.
Record created on 15-Jul-95.

Domain servers in listed order:

NS.ANADAS.COM                199.45.70.4
NS.UUNET.CA                  142.77.1.1
AUTH01.NS.UU.NET             198.6.1.81

Note that originally all we had was the hostname (anadas.com), yet now we have the company's country of origin, their province, and even their street address! In addition, we have contact names and even phone numbers! Of course, there's no guarantee that the individual user at the given address is going to be one of the InterNIC registration contacts; in fact, for most organizations, the odds are quite against it. But we do know the country associated with this organization, so we can record it as an access from Canada.

In many cases, the information on a particular hostname may be difficult to find on InterNIC's whois server because the domain is administered by a parent organization. Or perhaps you might have a numerical IP address that is sent as the hostname field. In these instances, you must do a whois lookup on the IP address itself, another query format supported by InterNIC's whois server.

In the case of a domain that is administered by a parent organization, it's useful to use nslookup to determine the IP address of the actual machine. The format for calling nslookup is

nslookup [hostname]

In this case, doing a lookup on www.anadas.com yields the following output:

Name:     www.anadas.com
Address:  199.45.70.165

An IP address always consists of four numbers separated by three periods. For a small network like this one, the fourth number usually just identifies an individual machine within the organization's network, so it can be dropped. We then do a whois query on 199.45.70, which yields the same information as before (or the information for the controlling organization we're looking for). If this information is not available, we can strip off the next number and do a lookup on 199.45, which gives an even more generalized answer.
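
The stripping step is easy to automate. The sketch below is an illustration only: the function name whoisPrefixes is hypothetical, and it assumes its input is a well-formed four-number address.

```cpp
#include <string>
#include <vector>

// Given a dotted address such as "199.45.70.165", return the progressively
// shorter prefixes to try in successive whois queries: "199.45.70", "199.45".
std::vector<std::string> whoisPrefixes(const std::string &ip)
{
    std::vector<std::string> prefixes;
    std::string current = ip;

    // Strip one trailing ".number" at a time, at most twice.
    for (int i = 0; i < 2; ++i)
    {
        std::string::size_type dot = current.rfind('.');
        if (dot == std::string::npos) break;    // nothing left to strip
        current.erase(dot);
        prefixes.push_back(current);
    }
    return prefixes;
}
```

A tracking program would try each prefix in order and stop at the first one for which the whois server returns a record.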

The information returned by InterNIC is in a relatively standardized format that is easily machine-parsable to allow you to create programs that automatically log additional information based on the hostname or IP address.
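
As a rough sketch of such parsing, the function below pulls the country out of a record laid out like the anadas.com example, where the first block of lines is the registrant's name and address and its last line is the country. The function name registrantCountry is hypothetical, and the layout is an assumption based on that one sample. Real records may differ.

```cpp
#include <sstream>
#include <string>

// Return the last line of the first paragraph of a whois record, which in
// the sample record shown earlier is the registrant's country.
std::string registrantCountry(const std::string &record)
{
    std::istringstream in(record);
    std::string line, last;

    while (std::getline(in, line))
    {
        // Ignore surrounding whitespace, including a carriage return.
        std::string::size_type begin = line.find_first_not_of(" \t\r");
        if (begin == std::string::npos)
        {
            if (!last.empty()) break;   // blank line: the address block is over
            continue;                   // skip blank lines before the block
        }
        std::string::size_type end = line.find_last_not_of(" \t\r");
        last = line.substr(begin, end - begin + 1);
    }
    return last;
}
```

The returned string could then be recorded alongside the hostname in the pseudo access log.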

Limitations of Tracking Users Through IP Addresses

Tracking users' geographical locations by using the IP address or hostname as the basis for an InterNIC whois query works in most cases, but certainly not in all. Consider the case of an Internet Service Provider (ISP) based in Houston, which may have points of presence in New York and Los Angeles. The New York users would still have an IP address registered to the company in Houston, but recording their visit as a visit from a person in Houston would be quite erroneous. An example of this, on a much bigger scale, is the case of major on-line services like CompuServe and America Online. These services now provide access to the Internet, but it's all done through proxy servers connected to their centralized network. This means that users all over North America would be reported as connecting from the headquarters of the on-line service they were using rather than where they were really connecting from!

A work-around is to attempt to identify the major on-line services and organizations and build in contingency routines for users from those sites. But in the end, there is no fully reliable method of determining the geographic location of a user when given only an ambiguous IP address or hostname.
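
One simple form such a contingency routine could take is a list of domains whose visitors should be logged as "location unknown" rather than looked up. The sketch below is illustrative only: the function names are hypothetical, and the two domains in the list are assumed examples for the on-line services mentioned above, not a vetted list.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if hostname equals domain or ends with "." + domain (case-insensitive).
bool inDomain(const std::string &hostname, const std::string &domain)
{
    if (hostname.size() < domain.size()) return false;

    std::string::size_type offset = hostname.size() - domain.size();
    for (std::string::size_type i = 0; i < domain.size(); ++i)
    {
        if (std::tolower(static_cast<unsigned char>(hostname[offset + i])) !=
            std::tolower(static_cast<unsigned char>(domain[i])))
            return false;
    }
    // Require a label boundary so "notaol.com" does not match "aol.com".
    return offset == 0 || hostname[offset - 1] == '.';
}

// Hypothetical contingency check: visitors from these services arrive
// through central proxies, so a whois-derived location would be misleading.
bool locationIsUnreliable(const std::string &hostname)
{
    static const char *const services[] = { "aol.com", "compuserve.com" };

    for (std::size_t i = 0; i < sizeof(services) / sizeof(services[0]); ++i)
        if (inDomain(hostname, services[i])) return true;
    return false;
}
```

A logging program would call locationIsUnreliable before doing any whois lookup and skip the lookup when it returns true.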

Cookies

Until now, we've been discussing methods of determining information about users prior to their visiting your Web site. Details such as their browser type, geographic location, and e-mail address exist before they ever visit your Web site. However, it's often very useful to be able to determine information about users after they've visited your Web site for the first time.

This is an excellent application for cookies. When a user initially visits your site, a cookie is assigned to their browser, which then sends it back to your Web server on each subsequent connection to your site. Thus, you can track how many "repeat visitors" your site gets, plus how these repeat visitors use the content on your site.

Listing 21.9 shows an example of a program that tracks users' visits through the use of cookies. Its output is depicted in Figure 21.4.

Figure 21.4: Sample screen shot of the output from the cookie-based counter.


Listing 21.9. Source code listing for the cookie counter script.
// set-cookie.cpp -- SET COOKIE PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program uses cookies to track the number of times a specific user
// has visited the script.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// GENERAL ALGORITHM
//
// 1. Check whether or not a cookie was passed.
// 2. If one was, increment the counter.  If not, create a blank cookie.
// 3. Re-send the new cookie, blank or otherwise, to the browser.
// 4. Display the relevant output to stdout
//
// Notes: This program uses META HTTP-EQUIV rather than an actual HTTP
//        directive to ensure maximum compatibility.  Certain servers seem
//        to have problems with cookies, but this should work across most
//        platforms.

// INCLUDES ***********************************************************

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

// FUNCTION PROTOTYPES ************************************************

int main(int argc, char *argv[], char *env[]);
void SafeGetEnv( char * env_name, char * * ptr, char * null_string );

// FUNCTIONS **********************************************************

int main(int argc, char *argv[], char *env[])
{
    char * cookie;
    char empty_string[1];
    char * p;
    int val=0;

    empty_string[0] = '\0';

    SafeGetEnv( "HTTP_COOKIE", &cookie, empty_string );

    printf("Content-type: text/html\n\n");

    printf("<HTML><HEAD>");

    printf("<META HTTP-EQUIV=\"Set-Cookie\" ");

    p = strstr( cookie, "COUNT=" );
    if( ! p )
        printf("Content=\"COUNT=0; expires=01-Jan-99 GMT; path=/cgiunleashed/trackuser; domain=.anadas.com\">\n");
    else
    {   p += strlen("COUNT=");

        char * ps;

        // The COUNT value ends at the next ';', or at the end of the
        // string if COUNT is the last cookie, so ps may be NULL.
        ps = strchr( p, ';');
        if( ps )  *ps = '\0';

        val = atoi( p );
        val++;

        printf("Content=\"COUNT=%d; expires=01-Jan-99 GMT; path=/cgiunleashed/trackuser; domain=.anadas.com\">\n", val);
    }

    printf("<TITLE>Cookie Test</TITLE></HEAD>\n");

    printf("<BODY>\n");

    printf("<H1>Cookie Test!</H1><HR><P>\n");

    if( val > 0 )
    {   printf("<H3>You have been here %d times!</H3>\n", val );
    }
    else
    {   printf("<H3>You have now been assigned a cookie!</H3>\n");
    }

    printf("</BODY></HTML>\n");

    return(0);  // exit gracefully
}

void SafeGetEnv( char * env_name, char * * ptr, char * null_string )
{
    // Normally getenv() returns a NULL pointer if the requested environment
    // variable doesn't exist.  This function sets the caller's pointer to
    // the supplied empty string instead, so it is always safe to read.

    char * tmp;

    tmp = getenv( env_name );

    if( ! tmp )  *ptr = null_string;
    else         *ptr = tmp;
}
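
The parsing step in Listing 21.9 can also be pulled out into small helpers, which makes it easier to test away from the Web server. The following is a minimal sketch and not part of the original listing; the names ReadCount and NextCookieContent are hypothetical. It assumes the browser sends cookies as name=value pairs separated by semicolons, and it additionally checks that COUNT= starts a cookie name, so that a cookie such as DISCOUNT= is not mistaken for the counter.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: returns the number stored in the COUNT cookie,
// or -1 if the cookie string contains no COUNT entry.
int ReadCount(const char *cookie)
{
    assert(cookie != nullptr);
    const char *name = "COUNT=";

    for (const char *p = std::strstr(cookie, name); p;
         p = std::strstr(p + 1, name))
    {
        // Accept the match only at the start of a cookie name.
        if (p == cookie || p[-1] == ' ' || p[-1] == ';')
        {
            // atoi stops at the first non-digit, so a trailing ';'
            // (or the end of the string) ends the value.
            return std::atoi(p + std::strlen(name));
        }
    }
    return -1;
}

// Hypothetical helper: builds the text for the Content attribute used in
// Listing 21.9, with the counter advanced by one (or set to 0 when the
// browser sent no COUNT cookie).
std::string NextCookieContent(const char *cookie)
{
    int next = ReadCount(cookie) + 1;   // -1 (no cookie yet) becomes 0
    return "COUNT=" + std::to_string(next) +
           "; expires=01-Jan-99 GMT; path=/cgiunleashed/trackuser;"
           " domain=.anadas.com";
}
```

With helpers like these, main() would only need to fetch HTTP_COOKIE, call NextCookieContent(), and print the result inside the META tag.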

Other Methods of Tracking Users

We've discussed several general methods of tracking information about any visitor to our Web site. But what about specific users? The markets for most successful Web sites that aren't incredibly general-purpose themselves (such as search engines or total Internet directories like Yahoo!) are usually very specifically targeted. This means that you already know certain things about the majority of your users, which can give you an advantage in tracking additional information about them.

For example, if you were creating a site for doctors and other health care professionals, you could use a database of all the major hospitals in North America to determine which hostnames and IP addresses correspond to which health care centers.
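
One way to apply such a database is to match the domain portion of each visitor's hostname against a table of known organizations. The sketch below is only an illustration under stated assumptions: the DomainTable type and the HostInDomain and LookupOrganization functions are hypothetical names, the table contents would come from your own data, and both the hostname and the table entries are assumed to be lowercase already.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical table: each entry pairs a domain with an organization name.
typedef std::vector<std::pair<std::string, std::string> > DomainTable;

// True if host is exactly the domain, or is a machine inside it
// (that is, host ends with "." followed by the domain).
bool HostInDomain(const std::string &host, const std::string &domain)
{
    assert(!domain.empty());
    if (host == domain) return true;
    if (host.size() <= domain.size()) return false;

    std::size_t start = host.size() - domain.size();
    return host[start - 1] == '.' &&
           host.compare(start, domain.size(), domain) == 0;
}

// Returns the organization for the first matching domain in the table,
// or an empty string if the host is not in any listed domain.
std::string LookupOrganization(const DomainTable &table,
                               const std::string &host)
{
    for (std::size_t i = 0; i < table.size(); ++i)
    {
        if (HostInDomain(host, table[i].first))
            return table[i].second;
    }
    return "";
}
```

Requiring a dot before the domain keeps a host such as notmercy.example from being counted as part of mercy.example (both names are invented for this illustration).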

Fingering Dial-Up Servers

Earlier in the chapter, I stated that you couldn't get a general user's e-mail address on any consistent basis. While this is true, when you have a highly targeted Web site that generates hits from a limited audience, there is the possibility of determining the user's e-mail address, but only if you have the name of the machine where their actual login takes place and that machine has a publicly accessible finger daemon configured and running.

If you think this sounds like a very specific set of circumstances, you're right. Fortunately, the vast majority of ISPs (Internet Service Providers) and even most standard servers are set up in this manner. The format for the finger command in this case is

finger @hostname

Keep in mind that this hostname is not necessarily the hostname the user is accessing your site from. For dial-up users, the hostname reported to your server refers to a specific SLIP or PPP port, whereas you're looking for the server that holds the catalog of all SLIP or PPP connections. If the user is accessing your site from a terminal on the reported hostname itself, you may have better luck. If you do manage to determine the hostname of the server you're looking for, the output will be something like this:

[dialup.anadas.com]
USER       TTY  FROM   LOGIN@   IDLE   WHAT
tsuki      00   borg   11:55AM  54     -su (tcsh)
rxm43      p0   pm66   9:43AM   0      -tcsh (tcsh)
ayondey    p1   alice  11:36AM  30     -su (tcsh)
challaday  p2   tc248  1:07PM   59     -tcsh (tcsh)
damian     p3   lorne  2:17PM   19     /bin/sh /usr/local/bin/mm (mm)
shuman     p4   sky    1:35PM   1      netscape &
rsilver    p5   pm81   2:34PM   0      w

Notice that we're given a complete list of users who are currently on the system in question. We would then determine which of these users was our visitor by looking at the WHAT field to see which user was running a Web browser at the time of our lookup. In this case, we see that user shuman was running Netscape Navigator, so he is most likely the one who was accessing our site.
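
Scanning the WHAT field can be automated once the finger output has been captured as text. The following is a rough sketch, assuming every data row carries all six columns exactly as in the sample above (real finger daemons vary in layout, and a blank IDLE column would defeat this parser). The function name UsersRunningBrowser and the list of browser command names you pass in are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns the login names of users whose WHAT column mentions one of the
// given browser command names (for example "netscape").
std::vector<std::string> UsersRunningBrowser(
    const std::string &finger_output,
    const std::vector<std::string> &browsers)
{
    std::vector<std::string> users;
    std::istringstream lines(finger_output);
    std::string line;

    while (std::getline(lines, line))
    {
        std::istringstream fields(line);
        std::string user, tty, from, login, idle, what;

        // Rows with fewer than five leading fields, such as the
        // "[hostname]" banner, are not data rows.
        if (!(fields >> user >> tty >> from >> login >> idle)) continue;
        if (user == "USER") continue;        // skip the column header

        std::getline(fields, what);          // the rest is the WHAT column

        for (std::size_t i = 0; i < browsers.size(); ++i)
        {
            // An empty name would match every row, so reject it.
            assert(!browsers[i].empty());
            if (what.find(browsers[i]) != std::string::npos)
            {
                users.push_back(user);
                break;
            }
        }
    }
    return users;
}
```

If more than one user comes back, the lookup alone cannot tell you which of them visited your site, so treat the result as a list of candidates and not as a positive identification.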

Caution
This example provides a great deal of information about the user who has accessed your site and will work under only the right, "lucky" circumstances. Nonetheless, acquiring e-mail addresses and then sending junk e-mail (or any other kind of unsolicited e-mail) is considered to be a grievous breach of etiquette and is a practice that should never be adopted.

The Ethics of Tracking Users

This chapter has revealed some very powerful techniques by which you can determine a great deal of information about the visitors to your site. However, as the saying goes, "With great power comes great responsibility," and this topic is no exception. Privacy is one of the most important issues that people must address when using the Internet. As Web developers, we must never compromise the privacy of our audience, for the benefit of the industry as a whole. People use the Internet exactly as much as they trust it, and no more. A single case of one user's privacy being compromised can seriously reduce the level of trust of all users.

Some excellent on-line resources on these topics include the following:

http://www.yahoo.com/Government/Law/Privacy/
http://www.anu.edu.au/people/Roger.Clarke/DV/
http://www.uiuc.edu/~ejk/WWW-privacy.html

Accessing This Chapter Online

You can access all of the code listings in this chapter, with accompanying executables, by visiting

http://www.anadas.com/cgiunleashed/trackuser/

The site is shown in Figure 21.5.

Figure 21.5: Screen shot of the Web site that contains the listings for this chapter.

Summary

The methods presented in this chapter will allow you to track just about every piece of information that is available about the users who access your Web site. Only you will be able to determine which bits of data are the most useful to you, and you will most likely want to concentrate on tracking those. Note that summarizing raw data is the key to creating truly useful demographic reports. While there are only a finite number of types of raw data, there are many more ways to summarize that data into cumulative categories, emphasizing the interrelationships within the data over the bare facts themselves. In other words, this is only the beginning. Good luck!