It's easy enough to set up a World Wide Web site for yourself or your organization and gauge its success or failure solely on the amount of response you get via e-mail, phone, or fax. But if you rely on so simplistic a tracking mechanism, you won't get anywhere near the whole picture. Perhaps your site is attracting many visitors, but your response form is hard to find, so very few of them are getting in touch with you. Perhaps many people find your Web site via unrelated searches on Internet search engines and promptly leave. Or perhaps you've optimized your site for Netscape, but the people most interested in your content are using NCSA Mosaic and can't view any of your in-line images! In any of these cases, you could spend a long time waiting for responses while remaining totally in the dark about why none were arriving.
This illustrates why it's so important to track user information on a constant basis. You can gain valuable insights not only into who is accessing your site, but also how they're finding it and where they might have heard of you. Plus, there's the all-important question of the total number of users visiting your site.
How Search Engines Work
Search engines such as Alta Vista, WebCrawler, InfoSeek, Lycos, and Excite possess vast databases of information, cataloging much of the content on the World Wide Web. Not only is the creation of such a huge database a task more difficult than any group of people could manually accomplish, it's also necessary to update all of the information on an increasingly frequent basis. Thus, the creators of these services designed automatic "robots" that roam the Web and retrieve Web site information for inclusion in the database. While this deals with the speed problem quite nicely, there is a serious problem introduced by this automatic approach: Machines, even ones with so-called artificial intelligence software, are still nowhere near as good as humans at categorizing information (well, at least not into categories that make sense to humans!). When a search engine's robot visits a site, it incorporates all of the text on that site into its database for reference in subsequent user searches. This means that a word inadvertently placed in the text of your Web site can cause people to find your site via searches on that word, thinking that your site might have something to do with that word! Suppose that you've set up a Web site about gardening, and in it you include a personal anecdote about how much your dog loves being outdoors with you. Thousands of dog-lovers might find your site because of that reference to your dog, be surprised that the site is about gardening and not dogs, and promptly leave! There are many other problems associated with the way automatic search engines work, which you'll no doubt discover when your site is added to them.
With the incredible corporate interest in the World Wide Web in the past few years, tracking users helps us get closer to an answer to the most crucial question for most organizations getting on the Web: Does the Web really work? In other words, does their Web site attract visitors, and if so, do those visitors turn into customers? In other media, hard numbers are available as answers to these questions. Newspapers have circulation figures, radio has broadcast ranges, and television has Nielsen ratings. It's surprising how many Web sites go unmonitored, given that more precise visitor information can be gathered on the Internet than through any other medium.
There is one key advantage these other media have over the Web, however: access to demographic information. The reason that accurate demographics (for example, the makeup of the audience by age, sex, income, and so on) are much more readily available for these traditional media is that their market penetration is high enough that a representative sampling of the general population can be extrapolated meaningfully to the whole audience. With the Web, you have two problems in doing this: the Web's market penetration is still far too low for Web users to be a representative sample of the general population, and demographic profiles of Internet users as a whole don't necessarily describe the visitors to your particular site.
Both of these problems mean that the only way to get accurate demographics is to ask people while they are actually visiting your Web site. This can come across as somewhat obtrusive, and people accustomed to browsing through Web sites at high speed with little or no thought involved have to be given a very good incentive to spend the time to fill out a survey form for your benefit.
This means that it's all the more crucial to identify whatever hard numbers you can automatically, and this is where the idea of tracking users comes in.
This section deals with one of the fundamental methods of collecting demographic information about visitors to your Web site: the access log.
So where do we begin when trying to find out information about visitors to our site? How about on our Web server itself! As mentioned earlier in the book, HTTP, the HyperText Transfer Protocol, enables communication between your browser and the Web server via a series of discrete connections that fetch the text of the Web page being retrieved, and then each of the graphics on that page in sequence. Did you know that every single time one of these requests is made, a record of that request is written to a log file? Here is a sample of the contents of an access log, from the file access-log, produced by NCSA httpd.
ts17-15.slip.uwo.ca - - [09/Jul/1996:01:53:53 -0500]
"POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1519
ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:22 -0500]
"POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1954
ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:43 -0500]
"POST /cgiunleashed/shopping/cart.cgi HTTP/1.0" 200 1678
pm107.spots.ab.ca - - [09/Jul/1996:01:59:28 -0500] "GET /pics/asd.gif HTTP/1.0" 304 0
b61022.dial.tip.net - - [09/Jul/1996:02:03:36 -0500] "GET /pics/asd.gif HTTP/1.0" 200 4117
slip11.docker.com - - [09/Jul/1996:02:03:49 -0500] "GET /rcr/ HTTP/1.0" 200 8751
slip11.docker.com - - [09/Jul/1996:02:04:17 -0500] "GET /rcr/guest.html HTTP/1.0" 200 2984
slip11.docker.com - - [09/Jul/1996:02:05:01 -0500] "GET /rcr/store.html HTTP/1.0" 200 34717
port52.annex1.net.ubc.ca - - [09/Jul/1996:02:05:09 -0500] "GET /pics/asd.gif HTTP/1.0" 200 4117
slip11.docker.com - - [09/Jul/1996:02:06:01 -0500] "GET /rcr/regint.html HTTP/1.0" 200 19452
NCSA, CERN, and Apache httpd all produce access logs in very similar formats, and collectively they have the vast majority of Web server market share, so this section will deal with extracting information from those servers. Other Web servers may store information in a different format, and you should consult the documentation that comes with yours to learn how to read it.
Note
You may have heard of the HTTP keep-alive protocol, which allows for a continuous connection to be maintained between the Web server and the Web browser. This doesn't contradict the nature of the discrete connections in HTTP; there are still multiple fetches made from the Web server. The difference is that the connection isn't terminated and restarted between each one while retrieving information on the same Web page.
Now, let's take a look at some of the information that is provided in the access log. The lines all take on a standard format, and, in fact, the entire access log consists of nothing but lines like these. The format of the lines is as follows:
host rfc931 authuser [DD/Mon/YYYY:hh:mm:ss] "request" ddd bbbb "opt_referer" "opt_agent"
Here's a breakdown of the elements included in the lines:
* host: The hostname or IP address of the machine the request came from.
* rfc931: The remote identity of the user as reported by RFC 931 identification, almost always just a hyphen.
* authuser: The username used for HTTP authentication, or a hyphen if none was required.
* [DD/Mon/YYYY:hh:mm:ss]: The date and time of the request, followed by the offset from GMT.
* "request": The request line itself, giving the method (GET or POST, for example), the path requested, and the protocol version.
* ddd: The three-digit HTTP status code returned (200 for success, 304 for not modified, and so on).
* bbbb: The number of bytes transferred in the response.
* "opt_referer" and "opt_agent": The referring URL and the browser identification string, when the server is configured to log them.
Note that the last two fields are not usually enabled on most systems, and thus our sample program won't process them. It's easy enough to modify it so that it does, however.
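If your server does log those two fields, a helper along the following lines could pull them out before the rest of the parsing runs. This is a hypothetical sketch, not part of Listing 21.1; it assumes the referer and agent really do appear as the final two quoted strings on the line, and it relies on <string.h>, which the listing already includes.
// GetQuotedTail -- hypothetical helper for logs that include the two
// optional quoted fields. It works backward from the end of the line,
// NUL-terminates each quoted string in place, and returns pointers to
// them (or NULL pointers if the fields aren't present).
void GetQuotedTail( char * line, char * * referer, char * * agent )
{
    char * start, * end;
    *referer = *agent = NULL;
    end = strrchr( line, '\"' );          // closing quote of opt_agent
    if( ! end ) return;
    *end = '\0';
    start = strrchr( line, '\"' );        // opening quote of opt_agent
    if( ! start ) return;
    *agent = start + 1;
    *start = '\0';
    end = strrchr( line, '\"' );          // closing quote of opt_referer
    if( ! end ) return;
    *end = '\0';
    start = strrchr( line, '\"' );        // opening quote of opt_referer
    if( ! start ) return;
    *referer = start + 1;
}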
With a line not only for each Web page access, but in fact for each graphic on each Web page as well, you might be able to imagine why access log files can grow to become several megabytes in size very quickly. If your Web server has a limited amount of hard drive space, the access log's growth might even risk crashing it!
One solution to this problem is to delete the access log on a regular basis, after creating a summary of the information in it. So how exactly do you create a summary? Good question! This is where we get into our first program for this chapter, an httpd access log parser. The individual lines in the access log file, while fairly detailed, aren't terribly useful when viewed in their raw form. However, they can be used as the basis for all kinds of reports you can create with software that summarizes the information into various categories. An example of such a program, the Access Log Summary program, is included in Listing 21.1; its output is shown in Figure 21.1. This program reads in the server access log file and generates an HTML document as output, summarizing all of the raw information in the access log into useful categories.
Figure 21.1: The output from the access log summary program.
Listing 21.1. Source code for the Access Log Summary program.
// accsum.cpp -- ACCESS LOG SUMMARY PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program reads in the server access log file and generates an HTML
// document as output. The document summarizes all of the raw information
// presented in the access log into useful categories
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// The categories it summarizes information for:
//
// * # of hits by domain
// * # of hits by file path
// * # of hits by day
// * # of hits by hour
//
// GENERAL ALGORITHM
//
// 1. For each domain and file path, dynamically create a linked list
// for each value, and add 1 to the hit count each time.
//
// 2. Create a linked list for each date, as well as each hour also.
//
// 3. Send the output to stdout.
// INCLUDES ***********************************************************
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "linklist.h" // Linked List Header Files
#include "linklist.cpp" // Linked List Source Code
// DEFINES AND STRUCTURES *********************************************
#define MAX_STRING 256
#define DATE_STRING 32
#define HOUR_STRING 5
#define LOG_FILE "./test-access-log"
typedef struct
{ char hostname[MAX_STRING];
int num_access;
} sHOSTNAME;
typedef struct
{ char filename[MAX_STRING];
int num_access;
} sFILENAME;
typedef struct
{ char hour[HOUR_STRING];
int num_access;
} sHOUR;
typedef struct
{ char date[DATE_STRING];
int num_access;
} sDATE;
// FUNCTION PROTOTYPES ************************************************
int main(int argc, char *argv[], char *env[]);
void ProcessLine( char * line );
void PrintOutput( void );
void InitAll(void);
void DestroyAll(void);
// GLOBAL VARIABLES ***************************************************
sLINK * link_hostname;
sLINK * link_filename;
sLINK * link_hour;
sLINK * link_date;
// FUNCTIONS **********************************************************
int main(int argc, char *argv[], char *env[])
{
// Opens the access log file, parses the information into a linked list
// internal data representation, then sends the summary of the output to
// stdout.
printf("Content-type: text/html\n\n");
printf("<HTML><TITLE>Access Log Summary</TITLE><BODY>\n");
printf("<H1>Access Log Summary</H1>\n");
FILE * fp;
fp = fopen( LOG_FILE, "r" ); // open the access log file
if( ! fp )
{ printf("ERROR: Couldn't load log file!"); // abort painlessly
}
else // if able to load file...
{ char line[512];
InitAll();
for(;;)
{ // fetch lines until EOF encountered
if( fgets( line, 511, fp ) == NULL ) break;
ProcessLine( line ); // extract the important information
}
PrintOutput(); // send the output to stdout
}
DestroyAll();
printf("</ul></BODY></HTML>\n"); // end the HTML file
return(0); // terminate gracefully
}
void InitAll(void)
{
// Initialize the heads for each of the linked lists
InitHead( &link_hostname );
InitHead( &link_filename );
InitHead( &link_hour );
InitHead( &link_date );
}
void DestroyAll(void)
{
// Destroy each of the linked lists (to free memory)
DestroyList( &link_hostname );
DestroyList( &link_filename );
DestroyList( &link_hour );
DestroyList( &link_date );
}
void ProcessLine( char * line )
{
// Parse a single line of a standard web server access log
sHOSTNAME hn;
sFILENAME fn;
sHOUR hr;
sDATE dt;
char * left, * right;
sLINK * l;
left = line;
right = strchr( left, ' ' ); // find the first space
if( ! right ) return; // bad entry
memcpy( hn.hostname, left, right-left ); // get the first one
*(hn.hostname + (right-left) ) = '\0';
l = FindNode( link_hostname, (void *) &hn, 0, strlen( hn.hostname ) );
if( ! l )
{ hn.num_access = 1;
AddNode( link_hostname, (void *) &hn, sizeof( sHOSTNAME ) );
}
else
{ ((sHOSTNAME *) l->data)->num_access++;
}
left = right+1; // skip the space
right = strchr( left, ' '); // find the next space (rfc931)
if( ! right ) return; // bad entry
left = right+1; // skip the space
right = strchr( left, ' '); // find the next space (authuser)
if( ! right ) return; // bad entry
left = right+1; // skip the space
right = strchr( left, ':'); // find the colon (date delimiter)
if( ! right ) return; // bad entry
left++; // skip the leading '['
memcpy( dt.date, left, right-left ); // get the first one
*(dt.date + (right-left) ) = '\0';
l = FindNode( link_date, (void *) &dt, 0, strlen( dt.date ) );
if( ! l )
{ dt.num_access = 1;
AddNode( link_date, (void *) &dt, sizeof( sDATE ) );
}
else
{ ((sDATE *) l->data)->num_access++;
}
left = right+1; // skip the colon
right = strchr( left, ':'); // find the next colon (hour delimiter)
if( ! right ) return; // bad entry
memcpy( hr.hour, left, right-left ); // get the first one
*(hr.hour + (right-left) ) = '\0';
l = FindNode( link_hour, (void *) &hr, 0, strlen( hr.hour ) );
if( ! l )
{ hr.num_access = 1;
AddNode( link_hour, (void *) &hr, sizeof( sHOUR ) );
}
else
{ ((sHOUR *) l->data)->num_access++;
}
left = strchr( line, '\"' ); // find the beginning of the request
if( ! left ) return; // bad entry
right = strchr( left, ' ' ); // find the first space (Query Type)
if( ! right ) return; // bad entry
left = right+1; // skip the space
right = strchr( left, ' ' ); // find the next space (filename with path)
if( ! right ) return; // bad entry
memcpy( fn.filename, left, right-left ); // get the first one
*(fn.filename + (right-left) ) = '\0';
l = FindNode( link_filename, (void *) &fn, 0, strlen( fn.filename ) );
if( ! l )
{ fn.num_access = 1;
AddNode( link_filename, (void *) &fn, sizeof( sFILENAME ) );
}
else
{ ((sFILENAME *) l->data)->num_access++;
}
}
void PrintOutput( void )
{
// Send the output from the program to stdout
sLINK * l;
l = link_date;
printf("<H2>By Date</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sDATE *) (l->data))->date,
((sDATE *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
l = link_hour;
printf("<H2>By Hour</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sHOUR *) (l->data))->hour,
((sHOUR *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
l = link_hostname;
printf("<H2>By Hostname</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sHOSTNAME *) (l->data))->hostname,
((sHOSTNAME *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
l = link_filename;
printf("<H2>By Filename</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sFILENAME *) (l->data))->filename,
((sFILENAME *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
}
This program makes use of linked lists, which C doesn't support directly the way Perl supports associative arrays. Thus, some support routines are needed to make the program function properly, and they are included here, in Listings 21.2 and 21.3.
Listing 21.2. Header file for the linked list routines.
// linklist.h -- The Header file for the Linked List Routines
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// By Shuman Ghosemajumder, Anadas Software Development
// STRUCTURES *********************************************************
typedef struct linked_list
{ struct linked_list * next;
void * data;
} sLINK;
// LINKED LIST FUNCTION PROTOTYPES ************************************
void InitHead( sLINK * * head );
void DestroyList( sLINK * * head );
int CountNodes( sLINK * head );
sLINK * GetNext( sLINK * l );
sLINK * AddNode( sLINK * head, void * data, int data_size );
sLINK * FindNode( sLINK * head, void * data, int offset, int data_size );
Listing 21.3. Source code for the linked list functions.
// linklist.cpp -- Linked List Functions
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// By Shuman Ghosemajumder, Anadas Software Development
void InitHead( sLINK * * head )
{
// Initialize the head pointer of a linked list
*head = (sLINK *) malloc( sizeof(sLINK) ); // allocate memory
if( ! *head )
{ printf("Memory allocation error.\n");
exit(-1);
}
(*head)->data = NULL; // no data yet
(*head)->next = NULL; // no next pointer yet
}
void DestroyList( sLINK * * head )
{
// Destroy an entire linked list
sLINK * l = *head;
sLINK * temp;
for(;;)                                  // loop to destroy
{ if( l->data ) free( l->data ); // each node of the list
if( l->next )
{ temp = l;
l = l->next;
free( temp ); // thus freeing memory
}
else break;
}
free( l ); // free the final node, which the loop leaves behind
*head = NULL; // destroy the head pointer
}
sLINK * AddNode( sLINK * head, void * data, int data_size )
{
// Add a node to the linked list
sLINK * next = head;
sLINK * last;
do
{ last = next;
next = GetNext( next );
} while( next ); // go to the end of the list
// next == NULL, therefore last == the last node
if( last->data == NULL )
{ next = last;
}
else
{ next = (sLINK *) malloc( sizeof(sLINK) );
if( ! next )
{ printf("Memory allocation error.\n");
exit(-1);
}
last->next = next;
}
next->data = (void *) malloc( data_size );
if( ! next->data )
{ printf("Memory allocation error.\n");
exit(-1);
}
memcpy( next->data, data, data_size );
next->next = NULL;
return ((sLINK *) next);
}
int CountNodes( sLINK * head )
{
// Return the total number of nodes in the linked list
int count = 0;
do
{ head = GetNext( head );
count++;
} while( head );
return count;
}
sLINK * GetNext( sLINK * l )
{
// Given one node of the list, return a pointer to the next node if it
// exists, or NULL if it doesn't.
if( l->next != NULL ) return ((sLINK *) l->next);
else return NULL;
}
sLINK * FindNode( sLINK * head, void * data, int offset, int data_size )
{
// Compare "data" to the value at "offset" in the data structure portion
// of the linked list, and return a pointer to the node which contains
// this value if there is one.
for(;;)
{ if( head->data != NULL )
{ if( memcmp( (char *) head->data + offset, (char *) data, data_size ) == 0 )
{ return ( (sLINK *) head );
}
if( head->next ) head = head->next;
else return NULL;
}
else
{ return NULL;
}
}
}
This program is a good starting point, but ideally you'd like to have it run automatically. As mentioned before, access logs are often several megabytes (some can be several hundred megabytes!) in size, so generating these kinds of statistics in real time every time a user accesses the on-line summary page is unfeasible on most computer systems. The best solution is to have these summaries created in the background on the Web server on a regular basis, so users always get a reasonably current set of information and don't have to wait several minutes while the access log file is processed. There's a UNIX facility called crontab that allows you to schedule events (such as the execution of your program) in the background. Here's how it works. First, you need to ensure that you (and not the Web server process) have access to crontab; contact your UNIX administrator to let him or her know of your requirement.
Caution
In general, the Web server process should have access to exactly what it needs access to: nothing more and nothing less. Remember that if a rogue user gains control of the Web server process (via a false crontab file or some other means), then he or she would be able to effectively execute privileged commands with total anonymity, something that is never a good situation on a computer system.
After you've set up your crontab access, you should edit your crontab file and add a line similar to the following:
0 6 * * * /usr/home/big/anadas/cgiunleashed/auto-make
The five leading fields specify the minute, hour, day of month, month, and day of week on which the command runs; the line above runs the auto-make script every day at 6:00 a.m. You should read your system's man page for crontab to ensure that you have your crontab file set up correctly.
Now that you've got crontab set up, you'll need to have an access log summary program that produces a Web-viewable summary.
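The auto-make script itself isn't shown in this chapter, but a minimal sketch might look like the one below. The paths, the summary.html filename, and the use of sed to strip the CGI Content-type header (so the output can be served as a static file) are all assumptions, not the book's actual script:
#!/bin/sh
# auto-make -- hypothetical nightly rebuild of the access log summary.
# Assumes accsum (Listing 21.1) has been compiled into this directory.
cd /usr/home/big/anadas/cgiunleashed
# accsum prints a CGI Content-type header plus a blank line first;
# delete those two lines so the result is a plain HTML file.
./accsum | sed '1,2d' > summary.html.new && mv summary.html.new summary.html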
The Web server's access log feature functions by recording information about the user who is visiting your server, which is sent from the user's own browser. While the information the access log records is very useful, it is by no means an exhaustive account of everything the browser "tells" the Web server about itself and the user.
Let's take a look at the output of the environment variables program first used in Chapter 12, "Imagemaps" (program is available on-line at http://www.anadas.com/cgiunleashed/imagemaps/exe/showenv.cgi):
SERVER_SOFTWARE=NCSA/1.5
GATEWAY_INTERFACE=CGI/1.1
DOCUMENT_ROOT=/usr/home/big/anadas
REMOTE_ADDR=199.45.70.220
SERVER_PROTOCOL=HTTP/1.0
REQUEST_METHOD=GET
REMOTE_HOST=tc220.wwdc.com
QUERY_STRING=
HTTP_USER_AGENT=Mozilla/3.0b5a (Win95; I)
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin:/usr/contrib/bin:/usr/X11/bin
HTTP_CONNECTION=Keep-Alive
HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
SCRIPT_NAME=/cgiunleashed/imagemaps/exe/showenv.cgi
SERVER_NAME=www.anadas.com
SERVER_PORT=80
HTTP_HOST=www.anadas.com
SERVER_ADMIN=shuman@anadas.com
This is the complete set of environment variable information for the Web server process on this particular server, when a particular user accessed the script in question. Many of these variables carry information that the browser sends to the Web server, which the server then makes available to CGI programs. Note, however, that some of the variables are set entirely on the Web server's end, for the benefit of CGI programs that need to know additional information about their environment. So what do these environment variables mean?
SERVER_SOFTWARE: This indicates the actual Web server software, which in this case is NCSA httpd version 1.5.
GATEWAY_INTERFACE: This is the level of CGI compatibility supported by the server, which in this case is 1.1.
DOCUMENT_ROOT: This is also a server-set environment variable. It indicates the directory that serves as the root of the Web server's document tree (the directory from which http://www.anadas.com is served).
REMOTE_ADDR: This environment variable is set by the server and indicates the IP address from which the browser is connecting.
SERVER_PROTOCOL: This environment variable indicates the HTTP compatibility level of the request, as given by the browser in its request line.
REQUEST_METHOD: This environment variable is set according to the kind of query the browser has sent to the Web server. Normal document and file retrievals are classified as GET queries.
REMOTE_HOST: This environment variable is set by the server, which looks up the hostname associated with the browser's IP address, if one is available.
QUERY_STRING: This environment variable is set according to the information that is passed by the query. In the case of a GET query, the query string consists of whatever information is after the question mark (?) in the URL.
HTTP_USER_AGENT: This environment variable allows the browser to tell the server what its product name and version number are.
PATH: Every UNIX user has a path associated with his or her login, and the Web server process is no exception.
HTTP_CONNECTION: This environment variable is set by the Web browser to tell the server whether or not it supports a keep-alive connection.
HTTP_ACCEPT: This environment variable allows the Web browser to tell the Web server the different data formats it accepts in-line (plug-ins not included).
SCRIPT_NAME: This environment variable is set by the Web server and identifies the script that is being run.
SERVER_NAME: This environment variable is set by the Web server and identifies the Web server's hostname.
SERVER_PORT: This environment variable is set by the Web server and identifies the port address the server is "listening to" for connections.
HTTP_HOST: This environment variable is sent by the browser and indicates the hostname it used to reach the Web server.
SERVER_ADMIN: This environment variable, set by the Web server, indicates the e-mail address of the Web server administrator.
AUTH_TYPE: If the server supports user authentication, and the script is protected, this is the protocol-specific authentication method used to validate the user.
REMOTE_USER: If the server supports user authentication, and the script is protected, this is the username they have authenticated as.
REMOTE_IDENT: If the HTTP server supports RFC 931 identification, this variable will be set to the remote username retrieved from the server.
DOCUMENT_NAME: The current filename.
DOCUMENT_URL: The virtual path to the document.
QUERY_STRING_UNESCAPED: The unescaped version of any search query the client sent, with all shell-special characters escaped with \.
DATE_LOCAL: The current date and local time zone. Subject to the timefmt parameter to the config command.
DATE_GMT: Same as DATE_LOCAL but in Greenwich Mean Time.
LAST_MODIFIED: The last modification date of the current document. Subject to timefmt like the others.
Note that not all of these variables appear in the sample output. This is because different server and browser combinations create different environment variables. Netscape Navigator, Microsoft Internet Explorer, and many other Web browsers each put their own spin on environment variables, either providing extra variables or sending richer information in the aforementioned ones. For example, Internet Explorer sends the current screen resolution in the browser-type environment variable, which allows dynamically generated Web pages to optimize their appearance for a particular screen size.
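As a sketch of how you might put HTTP_USER_AGENT to use, the hypothetical function below maps the raw agent string onto a friendlier browser name. The substrings it tests for are the conventional identifiers of the day; everything else about the function (its name, the exact strings returned) is an assumption.
// BrowserName -- hypothetical helper that classifies HTTP_USER_AGENT.
// Internet Explorer also identifies itself as "Mozilla" for
// compatibility, so the "MSIE" test must come first.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char * BrowserName( void )
{
    char * agent = getenv( "HTTP_USER_AGENT" );
    if( ! agent ) return "Unknown";
    if( strstr( agent, "MSIE" ) ) return "Microsoft Internet Explorer";
    if( strstr( agent, "Mozilla" ) ) return "Netscape Navigator (or compatible)";
    return agent;   // otherwise, report the raw string
}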
Can I Get E-Mail Addresses?
One of the questions most often puzzled over by CGI programmers is whether or not they can obtain a user's e-mail address. Creators of browser software are very sensitive to this issue, and the answer is, in most cases, no. There are certain browsers that pass along this information, at least to some extent. Some browsers that return full e-mail address information are
A browser that returns the username is:
The method by which environment variables are extracted in C is presented in Listing 21.4, which is essentially the C version of the showenv.cgi program.
Listing 21.4. Source code for the Web server environment variable printer.
// getenv.cpp -- Web Server Environment Variable Printer
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program displays all of the environment variables available to the
// web server when a user accesses this program via the CGI interface
//
// By Shuman Ghosemajumder, Anadas Software Development
#include <stdio.h>
int main(int argc, char *argv[], char *env[]);
int main(int argc, char *argv[], char *env[])
{
int count;
printf("Content-type: text/html\n\n");
printf("<HTML><TITLE>Environment Variables</TITLE><BODY>\n");
printf("<H1>Web Server Environment Variables</H1><ul>\n");
for(count=0;env[count];)
{ printf("<B>Var %d.</B> %s<BR>\n", count, env[count++] );
}
printf("</ul></BODY></HTML>\n");
return(0); // exit gracefully
}
Having the ability to parse ready-made server access logs is wonderful, but what if you don't have access to those logs? As long as you can execute CGI scripts, you can create your own logs dynamically. Listing 21.5 is an example of a program that generates a "Pseudo Access Log File" every time it is loaded. This program creates a log file similar to the server log files, but with richer information.
Listing 21.5. Source code for the make log program.
// makelog.cpp -- MAKE LOG PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program creates a log file similar to the server log files, just
// with richer information.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// GENERAL ALGORITHM
//
// 1. Get the desired environment variables
//
// 2. Write them to a file!
// INCLUDES ***********************************************************
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
// DEFINES AND STRUCTURES *********************************************
#define MAX_STRING 256
#define DATE_STRING 32
#define HOUR_STRING 5
#define LOG_FILE "./pseudo-log"
// FUNCTION PROTOTYPES ************************************************
int main(int argc, char *argv[], char *env[]);
void SafeGetEnv( char * env_name, char * * ptr, char * null_string );
// FUNCTIONS **********************************************************
int main(int argc, char *argv[], char *env[])
{
char * browser,
* hostname,
* refer_url;
char date[32];
char empty_string[1];
time_t bintime;
time(&bintime);
sprintf( date,"%s\0", ctime(&bintime) );
date[24] = '\0'; // exactly 24 chars in length
empty_string[0] = '\0';
SafeGetEnv( "REMOTE_HOST", &hostname, empty_string );
SafeGetEnv( "HTTP_REFERER", &refer_url, empty_string );
SafeGetEnv( "HTTP_USER_AGENT", &browser, empty_string );
FILE * fp;
fp = fopen( LOG_FILE, "a" );
if( fp ) // only write the entry if the log could be opened
{ fprintf( fp, "%s %s %s %s\n", date, hostname, refer_url, browser );
fclose( fp );
}
return (0); // exit gracefully
}
void SafeGetEnv( char * env_name, char * * ptr, char * null_string )
{
// Normally a NULL pointer is returned if a certain environment variable
// doesn't exist and you try to retrieve it. This function sets the value
// of the pointer to point at an empty string instead.
char * tmp;
tmp = getenv( env_name );
if( ! tmp ) *ptr = null_string;
else *ptr = tmp;
}
Now that we have a program to extract environment variable information, we're in much the same situation we were in when we simply had access to the access log file. We can create a huge log file of the various environment variable information we wish to keep track of, but the raw information isn't very useful unless we summarize it and have the output visible through the Web.
Listing 21.6 is a program that parses the pseudo access log created by the program in Listing 21.5. This program reads in the pseudo access log file generated by makelog.cpp and generates an HTML document as output. The document summarizes all of the raw information presented in that access log into useful categories. Figure 21.2 shows some sample output from it.
Figure 21.2: A sample shot of the output from the Pseudo Access Log Summary program.
Listing 21.6. Source code listing for the Pseudo Access Log Summary program.
// parselog.cpp -- ACCESS LOG SUMMARY PROGRAM for "MAKE LOG"
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program reads in the pseudo access log file generated by makelog.cpp
// and generates an HTML document as output. The document summarizes all of
// the raw information presented in that access log into useful categories.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// The categories it summarizes information for:
//
// * # of hits by domain
// * # of hits by referrer
// * # of hits by date
// * # of hits by browser
//
// GENERAL ALGORITHM
//
// 1. For each domain and file path, dynamically create a linked list
// for each value, and add 1 to the hit count each time.
//
// 2. Create a linked list for each date, as well as each hour also.
//
// 3. Send the output to stdout.
// INCLUDES ***********************************************************
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "linklist.h" // Linked List Header File
#include "linklist.cpp" // Linked List Functions
// DEFINES AND STRUCTURES *********************************************
#define MAX_STRING 256
#define DATE_STRING 32
#define HOUR_STRING 5
#define LOG_FILE "./pseudo-log"
typedef struct
{ char refer[MAX_STRING];
int num_access;
} sREFER;
typedef struct
{ char browser[MAX_STRING];
int num_access;
} sBROWSER;
typedef struct
{ char hostname[MAX_STRING];
int num_access;
} sHOSTNAME;
typedef struct
{ char date[DATE_STRING];
int num_access;
} sDATE;
// FUNCTION PROTOTYPES ************************************************
int main(int argc, char *argv[], char *env[]);
void ProcessLine( char * line );
void PrintOutput( void );
void InitAll(void);
void DestroyAll(void);
// GLOBAL VARIABLES ***************************************************
sLINK * link_hostname;
sLINK * link_date;
sLINK * link_refer;
sLINK * link_browser;
// FUNCTIONS **********************************************************
int main(int argc, char *argv[], char *env[])
{
printf("Content-type: text/html\n\n");
printf("<HTML><TITLE>Pseudo Access Log Summary</TITLE><BODY>\n");
printf("<H1>Pseudo Access Log Summary</H1>\n");
FILE * fp;
fp = fopen( LOG_FILE, "r" ); // open the access log file
if( ! fp )
{ printf("ERROR: Couldn't load log file!"); // abort painlessly
}
else // if able to load file...
{ char line[512];
InitAll();
for(;;)
{ // fetch lines until EOF encountered
if( fgets( line, 511, fp ) == NULL ) break;
ProcessLine( line ); // extract the important information
}
PrintOutput(); // send the output to stdout
}
DestroyAll();
printf("</ul></BODY></HTML>\n"); // end the HTML file
return(0); // exit gracefully
}
void InitAll(void)
{
// Initialize the head pointers
InitHead( &link_hostname );
InitHead( &link_refer );
InitHead( &link_browser );
InitHead( &link_date );
}
void DestroyAll(void)
{
// Destroy the linked lists and free memory
DestroyList( &link_hostname );
DestroyList( &link_refer );
DestroyList( &link_browser );
DestroyList( &link_date );
}
void ProcessLine( char * line )
{
// Process a single line of the pseudo access log file
sHOSTNAME hn;
sREFER rf;
sBROWSER bs;
sDATE dt;
char * left, * right;
sLINK * l;
// Line Structure:
//
// get the date (24 chars)
// get a space
// get the hostname
// get a space
// get the referring URL
// get a space
// get the browser type (the remainder of the line)
left = line;
right = (char *) left + 10;
memcpy( dt.date, left, right-left );
*(dt.date + (right-left) ) = '\0';
l = FindNode( link_date, (void *) &dt, 0, strlen( dt.date ) );
if( ! l )
{ dt.num_access = 1;
AddNode( link_date, (void *) &dt, sizeof(sDATE) );
}
else
{ ((sDATE *) l->data)->num_access++;
}
left = &line[25]; // skip the 24-char date and the space
right = strchr( left, ' ' ); // find the next space
if( ! right ) return; // bad entry
memcpy( hn.hostname, left, right-left ); // get the first one
*(hn.hostname + (right-left) ) = '\0';
l = FindNode( link_hostname, (void *) &hn, 0, strlen( hn.hostname ) );
if( ! l )
{ hn.num_access = 1;
AddNode( link_hostname, (void *) &hn, sizeof( sHOSTNAME ) );
}
else
{ ((sHOSTNAME *) l->data)->num_access++;
}
left = right+1; // skip the space
right = strchr( left, ' ' ); // find the next space (filename with path)
if( ! right ) return; // bad entry
memcpy( rf.refer, left, right-left ); // get the first one
*(rf.refer + (right-left) ) = '\0';
l = FindNode( link_refer, (void *) &rf, 0, strlen( rf.refer ) );
if( ! l )
{ rf.num_access = 1;
AddNode( link_refer, (void *) &rf, sizeof( sREFER ) );
}
else
{ ((sREFER *) l->data)->num_access++;
}
left = right+1; // skip the space
right = strchr( left, '\n' ); // find the end
if( ! right ) return; // bad entry
memcpy( bs.browser, left, right-left ); // get the first one
*(bs.browser + (right-left) ) = '\0';
l = FindNode( link_browser, (void *) &bs, 0, strlen( bs.browser ) );
if( ! l )
{ bs.num_access = 1;
AddNode( link_browser, (void *) &bs, sizeof( sBROWSER ) );
}
else
{ ((sBROWSER *) l->data)->num_access++;
}
}
void PrintOutput( void )
{
// Send the output of the program to stdout
sLINK * l;
l = link_date;
printf("<H2>By Date</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sDATE *) (l->data))->date,
((sDATE *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
l = link_hostname;
printf("<H2>By Hostname</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sHOSTNAME *) (l->data))->hostname,
((sHOSTNAME *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
l = link_refer;
printf("<H2>By Referer</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B><a href=\"%s\">%s</a> :</B> %d\n",
((sREFER *) (l->data))->refer,
((sREFER *) (l->data))->refer,
((sREFER *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
l = link_browser;
printf("<H2>By Browser</H2>\n");
printf("<ul>\n");
for(;l;)
{ if( l->data )
{ printf("<li> <B>%s :</B> %d\n", ((sBROWSER *) (l->data))->browser,
((sBROWSER *) (l->data))->num_access );
l = l->next;
}
else break;
}
printf("</ul>\n");
}
This program can also be run on a regular basis via crontab, and thus users always have access to relatively current information. If it's critical that users have access to immediate information, you can create an access log program that uses some sort of database management system to find pre-existing "user records" (sorted perhaps on hostname or IP address) and adds information to that "user profile." Thus, the information would always be in a summarized format, and the on-line reader program would simply display the file's contents.
Up until now, you may not have given much thought to exactly how your Web server was allowing you to run CGIs. But consider that the programs you've seen so far in this chapter deal with user information that the regular visitor to your Web site would most likely never see. Surely you're not going to make them visit a URL they have no interest in visiting simply so you can store their information! Yet that's exactly what you'd be forced to do if you called your tracking CGIs via a URL that references a program in the /cgi-bin/ directory. Clearly, it's important for the tracking process to be completely transparent to the users yet still work just as efficiently for you. There's more than one way you can accomplish this.
Your Web server is probably set up in such a manner that if you have a file named index.html or perhaps home.html in a specific directory, that is the HTML file the server loads and displays when the user requests a URL that specifies the directory name but not an exact file. On just about every Web server, there is an option that can be set (in the srm.conf file on NCSA httpd-compatible Web servers) that allows index.cgi to be the default file that is loaded. This lets you actually run a CGI script every time a user accesses the base document in any directory, while the user sees an HTML file as usual! The easiest way to accomplish this is to make index.cgi a shell script such as
#!/bin/sh
./logapp
echo Content-type: text/html
echo
cat real-home.html
First, the logging program (logapp) is called to store the user information into a file. The log program doesn't actually produce any output, and it has full access to the same environment variable information as any explicitly called CGI script. Then, the two echo commands send the HTTP header telling the Web browser that an HTML document follows, after which the actual home document for that directory is sent to the browser. This is the preferred method because it gives you the greatest degree of control, with the ability not only to execute CGI applications, but also to send direct HTTP commands.
If your server has server-side includes enabled, you can create a .shtml (server-parsed HTML file), which allows you to call a CGI from within the HTML file. You can use the following syntax to invoke a CGI this way:
<!--#exec cmd="Application"-->
Or, if you must execute programs from cgi-bin, use
<!--#exec cgi="CGI Program"-->
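For example, a server-parsed home page along these lines would run the logging application invisibly each time the page is fetched. The filenames here are illustrative, and this assumes your server maps .shtml to server-parsed HTML:
<HTML>
<!-- index.shtml: hypothetical server-parsed page that logs each visit -->
<TITLE>Welcome</TITLE>
<BODY>
<!--#exec cmd="./logapp"-->
<H1>Welcome to our site!</H1>
</BODY></HTML>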
If your server has support for neither index.cgi nor index.shtml, you can still create a user-tracking CGI application that is automatically executed when you access a Web site, but it is slightly more limited. You can create a CGI shell script in your cgi-bin directory that looks something like this:
#!/bin/sh
./logapp
echo Content-type: image/gif
echo
cat image.gif
This program sends an image on the Web server to the browser but first executes the user logging application transparently. You would execute this script by including its URL in the Web page you wanted to monitor as an image. For example:
<img src="http://www.anadas.com/cgi-bin/log-image.cgi">
This would display an image on the Web browser, while your logging application would get executed every time the page was loaded, totally transparent to visitors to your site.
The idea of sending an image to the Web browser while "secretly" running a logging application need not be so secret. In fact, many logging applications prefer to return a custom image file that displays information such as the current number of hits to that Web page. You may have seen odometer-like images on some Web sites and wondered how you might create your own. You could certainly use one of the services on the Internet such as www.digits.com, which allows you to use their CGI application to both log your hits and display the fancy graphic, but you now have the tools to create your own such counter.
Listing 21.7 is an example of a simple Web counter. Its output is depicted in Figure 21.3.
Figure 21.3: Sample screen shot of the output from the graphical Web counter.
Listing 21.7. Source code listing for the graphical Web counter script.
// counter.cpp -- a graphical counter for a web page, to be included through
// an IMG tag in an HTML document
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// Written by Shuman Ghosemajumder, Anadas Software Development
//
// General Algorithm:
//
// 1. Determine the filename to be read from / written to.
// 2. Update the counter data.
// 3. Convert the current count to an X-bitmap.
// 4. Output that X-bitmap to stdout
// INCLUDE FILES ************************************************************
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// DEFINES / PROTOTYPES *****************************************************
#define DIGIT_WIDTH 8
#define DIGIT_HEIGHT 12
#define NUM_DIGITS 6
#define DATA_FILENAME "counter.dat"
int main(int argc, char *argv[], char *env[]);
// GLOBAL VARIABLES *********************************************************
char *xbmp_digits[10][12] = {
{"0x7e", "0x7e", "0x66", "0x66", "0x66", "0x66",
"0x66", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
{"0x18", "0x1e", "0x1e", "0x18", "0x18", "0x18",
"0x18", "0x18", "0x18", "0x18", "0x7e", "0x7e"},
{"0x3c", "0x7e", "0x66", "0x60", "0x70", "0x38",
"0x1c", "0x0c", "0x06", "0x06", "0x7e", "0x7e"},
{"0x3c", "0x7e", "0x66", "0x60", "0x70", "0x38",
"0x38", "0x70", "0x60", "0x66", "0x7e", "0x3c"},
{"0x60", "0x66", "0x66", "0x66", "0x66", "0x66",
"0x7e", "0x7e", "0x60", "0x60", "0x60", "0x60"},
{"0x7e", "0x7e", "0x02", "0x02", "0x7e", "0x7e",
"0x60", "0x60", "0x60", "0x66", "0x7e", "0x7e"},
{"0x7e", "0x7e", "0x66", "0x06", "0x06", "0x7e",
"0x7e", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
{"0x7e", "0x7e", "0x60", "0x60", "0x60", "0x60",
"0x60", "0x60", "0x60", "0x60", "0x60", "0x60"},
{"0x7e", "0x7e", "0x66", "0x66", "0x7e", "0x7e",
"0x66", "0x66", "0x66", "0x66", "0x7e", "0x7e"},
{"0x7e", "0x7e", "0x66", "0x66", "0x7e", "0x7e",
"0x60", "0x60", "0x60", "0x66", "0x7e", "0x7e"}
};
int main(int argc, char *argv[], char *env[])
{
char filename[256]; // the data filename
int xbmp_count[NUM_DIGITS+1]; // the image buffer for the counter
unsigned long count; // the variable to store the counter
int i, j; // Looping variables
if ( argc >= 2 )
{ // if there is a command line parameter (passed after the ? operator
// in the GET query), then the filename to store the data in should
// be that parameter plus a ".dat" extension.
sprintf( filename, "%s.dat", argv[1] );
}
else
{ // Otherwise, use the default filename
strcpy( filename, DATA_FILENAME );
}
FILE * fp;
if( ! (fp = fopen( filename, "rb" )) ) // try to open the file
{ count = 0; // if failure, reset counter
}
else // if success,
{ fread( &count, sizeof(unsigned long), 1, fp ); // read in the counter
fclose(fp);
}
if( fp = fopen( filename, "wb" ) )
{ fwrite( &(++count), sizeof(unsigned long), 1, fp ); // update counter
fclose(fp);
}
printf("Content-type:image/x-xbitmap\n\n"); // the HTTP header
// Separate the digits of the current counter value
xbmp_count[NUM_DIGITS] = '\0';
for( i=0;i< NUM_DIGITS;i++)
{ j = count % 10;
xbmp_count[NUM_DIGITS-1-i] = j;
count /= 10;
}
printf("#define counter_width %d\n",NUM_DIGITS*DIGIT_WIDTH);
printf("#define counter_height %d\n\n",DIGIT_HEIGHT);
// send the X-Bitmap information to stdout
printf("static char counter_bits[] = {\n");
for(i=0;i < DIGIT_HEIGHT; i++)
{
for(j=0;j < NUM_DIGITS; j++)
{
printf("%s", xbmp_digits[xbmp_count[j]][i] );
if( (i < DIGIT_HEIGHT-1 ) || ( j< NUM_DIGITS-1 ) )
{ printf(", ");
}
}
printf("\n");
}
printf("}\n");
}
Listing 21.8 shows some sample HTML code for a hypertext page that uses the graphical Web counter for invoking counter.cgi within a hypertext document.
Listing 21.8. HTML source code for a hypertext page.
<HTML>
<!-- http://www.anadas.com/cgiunleashed/trackuser/counter.html
By Shuman Ghosemajumder, Anadas Software Development -->
<TITLE>Graphical Web Counter</TITLE>
<BODY>
<H1>Graphical Web Counter</H1>
<ul><li><B>
This page has been accessed
<img src="http://www.anadas.com/cgiunleashed/trackuser/Âcounter.cgi?counter.html"> times.
</B></ul>
</BODY></HTML>
So far you've noticed that we're able to keep track of a great deal of information about visitors to our sites, but most of it is very "computer-related" rather than "real-world." In other words, it's nice to know a visitor's IP address, hostname, and HTTP-acceptance parameters, but it's more useful to know where they're dialing in from or, better still, who they are. It should already be quite clear that determining a user's name or e-mail address is nearly impossible to do on anything resembling a consistent basis, so any such notions are purely fanciful. Determining their general geographic location, however, is a piece of real-world information that is much more realistically attainable.
The location from which a user is dialed-in (or directly connected to the Internet) is a piece of information that is most definitely not passed through any kind of environment variable. In fact, the vast majority of Web browsing programs probably don't have a clue as to where they're running from; one hard drive is just the same as any other to a freshly downloaded copy of Netscape or Internet Explorer, for example. There are two pieces of information you can use to determine geographic information, however: the hostname and the IP address.
The hostname can immediately provide some important, and almost guaranteed correct, geographic information via the first-level domain. Internet domains work from right to left, so that the first-level domain is represented by the rightmost string, the second-level domain is represented by the value to the left of that, and so on. For example, in the address www.anadas.com, com is the first-level (or top-level) domain, while anadas.com is the second-level domain. The first-level domains are decidedly finite in number and determine either the geographical location or the nature of the organization. For example, .com denotes a commercial organization, while a .ca extension denotes an organization in Canada. The various first-level domains are as follows:
Domain | Country | Domain | Country
ad | Andorra | ls | Lesotho
ae | United Arab Emirates | lt | Lithuania Ex-USSR
af | Afghanistan | lu | Luxembourg
ag | Antigua and Barbuda | lv | Latvia
ai | Anguilla | ly | Libya
al | Albania | ma | Morocco
am | Armenia Ex-USSR | mc | Monaco
an | Netherland Antilles | md | Moldavia Ex-USSR
ao | Angola | mg | Madagascar
aq | Antarctica | mh | Marshall Islands
ar | Argentina | ml | Mali
as | American Samoa | mm | Myanmar
at | Austria | mn | Mongolia
au | Australia | mo | Macau
aw | Aruba | mp | Northern Mariana Isl.
az | Azerbaidjan Ex-USSR | mq | Martinique (Fr.)
ba | Bosnia-Herzegovina Ex-Yugoslavia | mr | Mauritania
bb | Barbados | ms | Montserrat
bd | Bangladesh | mt | Malta
be | Belgium | mu | Mauritius
bf | Burkina Faso | mv | Maldives
bg | Bulgaria | mw | Malawi
bh | Bahrain | mx | Mexico
bi | Burundi | my | Malaysia
bj | Benin | mz | Mozambique
bm | Bermuda | na | Namibia
bn | Brunei Darussalam | nc | New Caledonia (Fr.)
bo | Bolivia | ne | Niger
br | Brazil | nf | Norfolk Island
bs | Bahamas | ng | Nigeria
bt | Buthan | ni | Nicaragua
bv | Bouvet Island | nl | Netherlands
bw | Botswana | no | Norway
by | Belarus Ex-USSR | np | Nepal
bz | Belize | nr | Nauru
ca | Canada | nt | Neutral Zone
cc | Cocos (Keeling) Isl. | nu | Niue
cf | Central African Rep. | nz | New Zealand
cg | Congo | om | Oman
ch | Switzerland | pa | Panama
ci | Ivory Coast | pe | Peru
ck | Cook Islands | pf | Polynesia (Fr.)
cl | Chile | pg | Papua New Guinea
cm | Cameroon | ph | Philippines
cn | China | pk | Pakistan
co | Colombia | pl | Poland
cr | Costa Rica | pm | St. Pierre & Miquelon
cs | Czechoslovakia | pn | Pitcairn
cu | Cuba | pt | Portugal
cv | Cape Verde | pr | Puerto Rico (US)
cx | Christmas Island | pw | Palau
cy | Cyprus | py | Paraguay
cz | Czech Republic | qa | Qatar
de | Germany | re | Reunion (Fr.)
dj | Djibouti | ro | Romania
dk | Denmark | ru | Russian Federation Ex-USSR
dm | Dominica | rw | Rwanda
do | Dominican Republic | sa | Saudi Arabia
dz | Algeria | sb | Solomon Islands
ec | Ecuador | sc | Seychelles
ee | Estonia Ex-USSR | sd | Sudan
eg | Egypt | se | Sweden
eh | Western Sahara | sg | Singapore
es | Spain | sh | St. Helena
et | Ethiopia | si | Slovenia Ex-Yugoslavia
fi | Finland | sj | Svalbard & Jan Mayen Isl.
fj | Fiji | sk | Slovak Republic
fk | Falkland Isl.(Malvinas) | sl | Sierra Leone
fm | Micronesia | sm | San Marino
fo | Faroe Islands | sn | Senegal
fr | France | so | Somalia
fx | France (European Ter.) | sr | Suriname
ga | Gabon | st | St. Tome and Principe
gb | Great Britain | su | Soviet Union
gd | Grenada | sv | El Salvador
ge | Georgia Ex-USSR | sy | Syria
gh | Ghana | sz | Swaziland
gi | Gibraltar | tc | Turks & Caicos Islands
gl | Greenland | td | Chad
gp | Guadeloupe (Fr.) | tf | French Southern Terr.
gq | Equatorial Guinea | tg | Togo
gf | Guyana (Fr.) | th | Thailand
gm | Gambia | tj | Tadjikistan Ex-USSR
gn | Guinea | tk | Tokelau
gr | Greece | tm | Turkmenistan Ex-USSR
gt | Guatemala | tn | Tunisia
gu | Guam (US) | to | Tonga
gw | Guinea Bissau | tp | East Timor
gy | Guyana | tr | Turkey
hk | Hong Kong | tt | Trinidad & Tobago
hm | Heard & McDonald Isl. | tv | Tuvalu
hn | Honduras | tw | Taiwan
hr | Croatia Ex-Yugoslavia | tz | Tanzania
ht | Haiti | ua | Ukraine Ex-USSR
hu | Hungary | ug | Uganda
id | Indonesia | uk | United Kingdom
ie | Ireland | um | US Minor outlying isl.
il | Israel | us | United States
in | India | uy | Uruguay
io | British Indian O. Terr. | uz | Uzbekistan Ex-USSR
iq | Iraq | va | Vatican City State
ir | Iran | vc | St. Vincent & Grenadines
is | Iceland | ve | Venezuela
it | Italy | vg | Virgin Islands (British)
jm | Jamaica | vi | Virgin Islands (US)
jo | Jordan | vn | Vietnam
jp | Japan | vu | Vanuatu
ke | Kenya | wf | Wallis & Futuna Islands
kg | Kirgistan Ex-USSR | ws | Samoa
kh | Cambodia | ye | Yemen
ki | Kiribati | yu | Yugoslavia
km | Comoros | za | South Africa
kn | St. Kitts Nevis Anguilla | zm | Zambia
kp | Korea (North) | zr | Zaire
kr | Korea (South) | zw | Zimbabwe
kw | Kuwait | arpa | Old-style Arpanet
ky | Cayman Islands | com | Commercial
kz | Kazachstan Ex-USSR | edu | Educational
la | Laos | gov | Government
lb | Lebanon | int | International
lc | Saint Lucia | mil | US Military
li | Liechtenstein | nato | Nato
lk | Sri Lanka | net | Network
lr | Liberia | org | Non-Profit Organization
If you're lucky enough to get a user whose hostname contains one of the geographical top-level domains, you can easily match the extension against the preceding table and determine which country he or she is from. However, the vast majority of users on the Internet are likely going to be accessing your site from a .com, .org, .edu, or .net domain. These domains are administered by InterNIC and can be given to organizations and institutions all over the world. Thus, the domain name alone doesn't provide us with their geographical location.
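As a sketch of the first step, the hypothetical function below pulls the first-level domain out of a hostname so it can be matched against the preceding table; raw IP addresses, which need separate handling, are rejected.
// TopLevelDomain -- hypothetical helper that returns a pointer to the
// first-level domain within a hostname (e.g. "com" in www.anadas.com),
// or NULL for dotless names and numeric IP addresses.
#include <string.h>
#include <ctype.h>

const char * TopLevelDomain( const char * hostname )
{
    const char * dot = strrchr( hostname, '.' );       // find the last period
    if( ! dot ) return NULL;                           // no domain structure
    if( isdigit( (unsigned char) dot[1] ) ) return NULL; // numeric IP address
    return dot + 1;
}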
This is where the InterNIC database itself comes in. Whenever an organization is assigned a domain name by InterNIC, a record of various information about that organization is kept on InterNIC's own computer system. InterNIC is kind enough to allow the public access to this information, and it can be queried quickly and easily. The InterNIC whois database can be accessed with the following command:
whois -h rs.internic.net [domain name]
where [domain name] is the name of the domain you want further information on. Remember that in order to be able to find any information in InterNIC's database on a domain, that domain must have been directly administered by InterNIC. Thus, trying to access information on a .ca domain (which is administered by the CA domain registration committee in Canada) is quite futile. Here is an example of the output from a whois query on the domain name anadas.com:
Anadas Software Development (ANADAS-DOM)
38 Grasmere Crescent
London, Ontario N6G 4N8
CANADA
Domain Name: ANADAS.COM
Administrative Contact, Billing Contact:
Ghosemajumder, Shuman (SG331) shuman@ANADAS.COM
(519) 858-0021
Technical Contact, Zone Contact:
Dice, Richard (RD78) rdice@ANADAS.COM
(519) 858-0021
Record last updated on 10-Jun-96.
Record created on 15-Jul-95.
Domain servers in listed order:
NS.ANADAS.COM 199.45.70.4
NS.UUNET.CA 142.77.1.1
AUTH01.NS.UU.NET 198.6.1.81
Note that originally all we had was the hostname (anadas.com), yet now we have the company's country of origin, their province, and even their street address! In addition, we have contact names and even phone numbers! Of course, there's no guarantee that the individual user at the given address is going to be one of the InterNIC registration contacts; in fact, for most organizations, the odds are quite against it. But we do know the country associated with this organization, so we can record it as an access from Canada.
In many cases, the information on a particular hostname may be difficult to find on InterNIC's whois server because the domain is administered by a parent organization. Or perhaps you might have a numerical IP address that is sent as the hostname field. In these instances, you must do a whois lookup on the IP address itself, another query format supported by InterNIC's whois server.
In the case of a domain that is administered by a parent organization, it's useful to use nslookup to determine the IP address of the actual machine. The format for calling nslookup is
nslookup [hostname]
In this case, doing a lookup on www.anadas.com yields the following output:
Name: www.anadas.com
Address: 199.45.70.165
The IP address will always have four numbers separated by three periods. The fourth number identifies an individual machine within the organization's network, so it can be ignored for the purposes of this kind of lookup. So we then do a whois query on 199.45.70, which yields the same information as before (or the information for the controlling organization we're looking for). Note that if this information is not available, we can strip off the next number and do a lookup on 199.45, which will give an even more generalized answer.
The information returned by InterNIC is in a relatively standardized format that is easily machine-parsable to allow you to create programs that automatically log additional information based on the hostname or IP address.
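Since whois is an ordinary command-line program, one way to automate such lookups from C is to capture its output through a pipe and feed each line to your own parser. The sketch below shows the mechanics only; the function name and the bare printf standing in for real parsing are assumptions.
// WhoisQuery -- hypothetical sketch that captures the output of a
// "whois -h rs.internic.net" query for machine parsing.
#include <stdio.h>

int WhoisQuery( const char * domain )
{
    char command[512];
    char line[256];
    FILE * pipe;
    // A real program should validate "domain" before building a shell
    // command from it.
    sprintf( command, "whois -h rs.internic.net %s", domain );
    pipe = popen( command, "r" );     // run whois and read its stdout
    if( ! pipe ) return 0;            // couldn't launch the command
    while( fgets( line, sizeof(line), pipe ) )
    {
        printf( "%s", line );         // hand each line to a parser here
    }
    pclose( pipe );
    return 1;
}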
Tracking users' geographical locations by using the IP address or hostname as the basis for an InterNIC whois query works in most cases, but certainly not in all. Consider the case of an Internet Service Provider (ISP) based in Houston, which may have points of presence in New York and Los Angeles. The New York users would still have an IP address registered to the company in Houston, but recording their visit as a visit from a person in Houston would be quite erroneous. An example of this, on a much bigger scale, is the case of major on-line services like CompuServe and America Online. These services now provide access to the Internet, but it's all done through proxy servers connected to their centralized networks. This means that users all over North America would be reported as connecting from the headquarters of the on-line service they were using rather than from where they were really connecting!
A work-around is to attempt to identify the major on-line services and organizations and build in contingency routines for users from those sites. But in the end, there are no totally definite methods of determining the geographic location of a user when given only an ambiguous IP address or hostname.
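A contingency check of that kind can be as simple as comparing the end of the hostname against a list of domains you know belong to centralized services. The sketch below is hypothetical, and the two domains in its table are examples rather than an exhaustive list:
// IsOnlineService -- hypothetical check for hostnames belonging to
// centralized on-line services, whose registered address says nothing
// about where the individual user really is.
#include <string.h>

int IsOnlineService( const char * hostname )
{
    static const char * services[] = { "aol.com", "compuserve.com", 0 };
    size_t hlen = strlen( hostname );
    int i;
    for( i = 0; services[i]; i++ )
    {
        size_t slen = strlen( services[i] );
        if( hlen >= slen && strcmp( hostname + hlen - slen, services[i] ) == 0 )
            return 1;    // flag this visit for special handling
    }
    return 0;
}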
Until now, we've been discussing methods of determining information about users prior to their visiting your Web site. Details such as their browser type, geographic location, and e-mail address exist before they ever visit your Web site. However, it's often very useful to be able to determine information about users after they've visited your Web site for the first time.
This is an excellent application for cookies. When a user initially visits your site, a cookie is assigned to their browser, and that cookie is sent back to your Web server on each subsequent connection to your site. Thus, you can track how many "repeat visitors" your site gets, as well as how those repeat visitors use the content on your site.
Listing 21.9 shows an example of a program that tracks users' visits through the use of cookies. Its output is depicted in Figure 21.4.
Figure 21.4: Sample screen shot of the output from the cookie-based counter.
Listing 21.9. Source code listing for the cookie counter script.
// set-cookie.cpp -- SET COOKIE PROGRAM
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
//
// This program uses cookies to track the number of times a specific user
// has visited the script.
//
// By Shuman Ghosemajumder, Anadas Software Development
//
// GENERAL ALGORITHM
//
// 1. Check whether or not a cookie was passed.
// 2. If one was, increment the counter. If not, create a blank cookie.
// 3. Re-send the new cookie, blank or otherwise, to the browser.
// 4. Display the relevant output to stdout
//
// Notes: This program uses META HTTP-EQUIV rather than an actual HTTP
// directive to ensure maximum compatibility. Certain servers seem
// to have problems with cookies, but this should work across most
// platforms.
// INCLUDES ***********************************************************
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
// FUNCTION PROTOTYPES ************************************************
int main(int argc, char *argv[], char *env[]);
void SafeGetEnv( char * env_name, char * * ptr, char * null_string );
// FUNCTIONS **********************************************************
int main(int argc, char *argv[], char *env[])
{
char * cookie;
char empty_string[1];
char * p;
int val=0;
empty_string[0] = '\0';
SafeGetEnv( "HTTP_COOKIE", &cookie, empty_string );
printf("Content-type: text/html\n\n");
printf("<HTML><HEAD>");
printf("<META HTTP-EQUIV=\"Set-Cookie\" ");
p = strstr( cookie, "COUNT=" );
if( ! p )
printf("Content=\"COUNT=0; expires=01-Jan-99 GMT; path=/cgiunleashed/Âtrackuser; domain=.anadas.com\">\n");
else
{ p += strlen("COUNT=");
char * ps;
ps = strchr( p, ';');
if( ps ) *ps = '\0'; // COUNT may be the last cookie field, with no trailing ';'
val = atoi( p );
val++;
printf("Content=\"COUNT=%d; expires=01-Jan-99 GMT; path=/cgiunleashed/Âtrackuser; domain=.anadas.com\">\n", val);
}
printf("<TITLE>Cookie Test</TITLE></HEAD>\n");
printf("<BODY>\n");
printf("<H1>Cookie Test!</H1><HR><P>\n");
if( val > 0 )
{ printf("<H3>You have been here %d times!</H3>\n", val );
}
else
{ printf("<H3>You have now been assigned a cookie!</H3>\n");
}
printf("</BODY></HTML>\n");
return(0); // exit gracefully
}
void SafeGetEnv( char * env_name, char * * ptr, char * null_string )
{
// Normally a NULL pointer is returned if a certain environment variable
// doesn't exist and you try to retrieve it. This function sets the value
// of the pointer to point at a NULL string instead.
char * tmp;
tmp = getenv( env_name );
if( ! tmp ) *ptr = null_string;
else *ptr = tmp;
}
We've discussed several general methods of tracking information about any visitor to our Web site. But what about specific users? Unless a site is itself extremely general-purpose (such as a search engine or a comprehensive Internet directory like Yahoo!), its market is usually very specifically targeted. This means that you already know certain things about the majority of your users, which can give you an advantage in tracking additional information about them.
For example, if you were creating a site for doctors and other health care professionals, you could use a database of all the major hospitals in North America to determine which hostnames and IP addresses correspond to which health care centers.
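The fragment below sketches that idea: it matches a visitor's hostname against a small hand-built table of institutional domains. Every domain and institution named in the table is invented for illustration; a real table would be compiled from your own research into your target market.
// hostmap.cpp -- map visitor hostnames to known institutions.
// Every domain and institution below is invented for illustration;
// build a real table from your own market research.
#include <stdio.h>
#include <string.h>
struct host_entry {
const char *domain;       // domain suffix to match
const char *institution;  // name to record in the access log
};
static struct host_entry known_hosts[] = {
{ ".stjoes-example.org", "St. Joseph's Hospital (example)" },
{ ".generalmed-example.ca", "General Medical Centre (example)" },
{ 0, 0 }
};
// Returns the institution name for a hostname, or NULL if unknown.
const char *lookup_institution(const char *hostname)
{
int i;
size_t hlen = strlen(hostname);
for (i = 0; known_hosts[i].domain; i++)
{
size_t dlen = strlen(known_hosts[i].domain);
if (hlen >= dlen &&
strcmp(hostname + hlen - dlen, known_hosts[i].domain) == 0)
return known_hosts[i].institution;
}
return 0;
}
int main(void)
{
const char *host = "ward7.stjoes-example.org";
const char *who = lookup_institution(host);
printf("%s -> %s\n", host, who ? who : "(unknown)");
return 0;
}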
Earlier in the chapter, I stated that you couldn't get a general user's e-mail address on any consistent basis. While this is true, when you have a highly targeted Web site that generates hits from a limited audience, there is the possibility of determining a user's e-mail address, but only if you have the name of the machine where their actual login takes place, and only if that machine has a publicly accessible finger daemon configured and running.
If you think this sounds like a very specific set of circumstances, you're right. Fortunately, the vast majority of ISPs and even most standard servers are set up in this manner. The format for the finger command in this case is
finger @hostname
Keep in mind that this hostname is not necessarily the hostname the user is accessing your site from. In the case of dial-up users, the hostname they appear to connect from refers to a specific SLIP or PPP port, whereas what you're looking for is the server that holds the catalog of all the SLIP or PPP connections. If the user is accessing your site from a terminal on the reported hostname itself, you may have better luck. If you do manage to determine the hostname of the server you're looking for, the output will be something like this:
[dialup.anadas.com]
USER       TTY   FROM    LOGIN@    IDLE   WHAT
tsuki      00    borg    11:55AM   54     -su (tcsh)
rxm43      p0    pm66     9:43AM    0     -tcsh (tcsh)
ayondey    p1    alice   11:36AM   30     -su (tcsh)
challaday  p2    tc248    1:07PM   59     -tcsh (tcsh)
damian     p3    lorne    2:17PM   19     /bin/sh /usr/local/bin/mm (mm)
shuman     p4    sky      1:35PM    1     netscape &
rsilver    p5    pm81     2:34PM    0     w
Notice that we're given a complete list of the users currently logged in to the system in question. We can then guess which of these users was our visitor by looking at the WHAT field to see who was running a Web browser at the time of our lookup. In this case, user shuman was running Netscape Navigator, so he is most likely the one who was accessing our site.
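If you wanted to automate that last step, a fragment along these lines could run the finger command and scan the WHAT column for the names of common browsers. Treat it as a sketch only: the browser-name list is a simplification, finger output formats vary from system to system, and many users will be running browsers this check won't catch.
// finger-scan.cpp -- scan finger output for users running a Web browser.
// The browser-name list below is a simplification; real WHAT fields vary
// by system, and many browsers will not be caught by this check.
#include <stdio.h>
#include <string.h>
static const char *browsers[] = { "netscape", "Mosaic", "lynx", 0 };
int main(int argc, char *argv[])
{
char command[256];
char line[512];
FILE *fp;
int i;
if (argc < 2 || strlen(argv[1]) > 200)
{
fprintf(stderr, "usage: finger-scan hostname\n");
return 1;
}
// Build and run a command such as "finger @dialup.anadas.com".
sprintf(command, "finger @%s", argv[1]);
fp = popen(command, "r");
if (!fp)
{
perror("popen");
return 1;
}
while (fgets(line, sizeof(line), fp))
{
// Report any line whose WHAT field mentions a known browser.
for (i = 0; browsers[i]; i++)
{
if (strstr(line, browsers[i]))
printf("possible visitor: %s", line);
}
}
pclose(fp);
return 0;
}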
Caution |
This example provides a great deal of information about the user who has accessed your site, but it will work only under the right, "lucky" circumstances. Nonetheless, acquiring e-mail addresses and then sending junk e-mail (or any other kind of unsolicited e-mail) is considered a grievous breach of etiquette and is a practice that should never be adopted. |
This chapter has revealed some very powerful techniques by which you can determine a great deal of information about the visitors to your site. However, as the saying goes, "With great power comes great responsibility," and this topic is no exception. Privacy is one of the most important issues people must address when using the Internet. As Web developers, we must always strive never to compromise the privacy of our audience, for the benefit of the industry as a whole. People use the Internet exactly as much as they trust it, and no more. A single case of one user's privacy being compromised can immeasurably reduce the trust of all users.
Some excellent on-line resources on these topics include the following:
http://www.yahoo.com/Government/Law/Privacy/
http://www.anu.edu.au/people/Roger.Clarke/DV/
http://www.uiuc.edu/~ejk/WWW-privacy.html
You can access all of the code listings in this chapter, with accompanying executables, by visiting
http://www.anadas.com/cgiunleashed/trackuser/
The site is shown in Figure 21.5.
Figure 21.5: Screen shot of the Web site which contains the listings for this chapter.
The methods presented in this chapter will allow you to track just about every piece of information that is available about the users who access your Web site. Only you can determine which bits of data are most useful to you, and those are the ones you'll most likely want to concentrate on tracking. Note that summarizing raw data is the key to creating truly useful demographic reports: while there are only a finite number of types of raw data, there are many more ways to summarize that data into cumulative categories, emphasizing the interrelationships within the data over the bare facts themselves. In other words, this is only the beginning. Good luck!