Chapter 41

How to Keep Them Coming Back for More


Many people liken a Web site to a brochure or a "billboard on the Information Superhighway." Those folks may be missing one of the fundamental principles of what works on the Internet: content. Internet users are, for the most part, an intelligent, curious, upscale audience. When they want to know about a subject, they want to know about a subject. For the site to be effective, it must be rich in content, and the content must stay current.

This chapter describes a set of processes which, if applied monthly, keep the site current and effective. Those processes include log analysis (to find out how visitors are using the site), content updates (to keep the site fresh), and revalidation (to keep the site usable).

Remember, "Content Is King"

To illustrate this principle, consider a site whose owner recently asked for help from the members of the HTML Writers Guild. His lament was, in essence, "I built it, and they didn't come."

A Site That Doesn't Work

On examination, his site proved to be an advertisement for what can only be called "cheap jewelry." He invited visitors to buy gold jewelry at deeply discounted prices. His design didn't anticipate that users can change their font size; at larger font sizes, the tables showing just how deep the discounts are became unreadable. The site was also a bit garish, with a yellow-on-purple color scheme and blinking text to catch the eye. But the main problem wasn't the execution; it was the premise. Why would anyone send several hundred dollars to a total stranger who claims to sell "cheap gold"?

And How to Improve It

Plenty of people are selling jewelry over the Web, of course, and this site could have been effective. A better approach might have been to start out explaining how the gold business works: where the gold comes from, and why it costs what it does. Then show the visitor how and why some gold jewelry can be sold at deep discounts and still be high quality (by cutting out the middleman). Finally, show the visitor a few quality pieces that are for sale.

Content Is King

If a site is rich in content, visitors will come to it to learn about the subject-whether it's real estate, jewelry, or peanuts. The Virginia Diner site is a good example, at http://www.infi.net/vadiner/. The Virginia Diner is a small restaurant in a small town in rural Virginia. By using the Internet, the restaurant does a booming business in gourmet peanuts. Its site (shown in Figs. 41.1 through 41.4) is rich in content about the history and uses of the humble peanut. And, by the way, if reading about these peanuts has got you curious or started your mouth watering, the diner will sell you some (as in Figure 41.4).

Figure 41.1: The Virginia Diner Welcome Page leads the visitor rapidly into the site content.

As soon as the visitor comes to the welcome page, he or she is drawn deeper in by promises of sales, catalogs, and content. Many sites try to tell their whole story on the first page. The Virginia Diner has made its first page into a "links" page, which pulls the visitor into the site quickly. If a visitor is not ready to buy right away, perhaps they would like to order a catalog. If they need a bit more time to become comfortable with the material, they can visit the content pages, such as the one showing how to roast peanuts (shown in Figure 41.2) or the "Interesting Peanut Facts" shown in Figure 41.3.

Figure 41.2: Part of the Virginia Diner's rich content is a page about how to use their product.

Figure 41.3: The Virginia Diner provides interesting facts about a subject most people consider trivial.

Tip
Some site owners worry that by putting up ten, twenty, or more pages of content, they're putting up too much material. They argue that "the visitor will never want to wade through all that material." One lesson of the Web is that many visitors do want to read all of the material-that's the nature of the Internet audience. Many others will read at least part of the material. Remember to keep the internal hyperlinks current so a user can go directly to the pages that interest them.
Use the logs and the page access counts to find pages that are seldom accessed, or that are consistently accessed for only a few seconds before the visitor moves on. Improve these pages, give more visibility to their links, or delete them.

If a visitor to the site is not ready to buy, one good strategy is to keep them around until they are ready. The Virginia Diner site provides lots of content. If, after reviewing the content, the visitor is still not ready to order peanuts online, they can order a paper catalog using the page shown in Figure 41.4, so they can have the full product line available offline whenever they are ready to buy. For a low-cost, impulse purchase like peanuts, promoting the paper catalog was a master stroke!

Figure 41.4: Visitors can order the Virginia Diner's Gourmet Peanut catalog, so they can order peanuts whenever the mood strikes them.

Keeping the Content Up-to-Date

After the site is up, make sure that the "golden" version is safe in a configuration control system such as SCCS or RCS, introduced in Chapter 1, "How to Make a Good Site Look Great." Check out a read-only copy and print two copies of every page. One copy goes in a hardcover binder on the shelf; the other goes to the client. Assign the client a "maintenance day"; for the sake of illustration, say that it's the first Tuesday of every month. Ask the client to update his or her content every month, whether it's a new fact for a content page or a new featured product of the month. Something should change at least once a month.

Hits Versus Visits

While the client is away reviewing his or her site and thinking about what to change, begin to track the performance of the site. Chapter 42, "Processing Logs and Analyzing Site Use," looks at some off-the-Net log analysis tools. Before getting into those, let's look at the log itself.

Visiting the Log

Figure 41.5 shows a typical access log. Something like this file is stored on nearly every NCSA or Apache server, in the logs directory.

Figure 41.5: The common log format captures every request, or hit, which comes to the site.

What Are You Counting?

Look at the log in Figure 41.5 in detail. Notice that it shows every time a visitor accessed the site. Most of the entries are GET requests, but a few may be POST. The point is that they record every access. When a user pulls down a page with, say, five graphics, you see the request for the page followed by requests for all five graphics. Each line in the log constitutes a hit.

Examining the Log

At one time it was common to run a command such as this one:

wc -l access_log

The number that came back, the hit count, was inevitably large. Some Webmasters would tout this number as a mark of their success: "My site had more than 15,000 hits last month." The problem with this approach, of course, is that each page counts for several hits. A page with five graphics generates six hits: one for the page and one for each graphic. So two sites might have the same number of visitors but vastly different hit rates, depending on how their pages are designed.
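To see the difference for yourself, a rough Perl sketch like the following contrasts raw hits with page requests. It assumes graphics can be recognized by their file extensions; adjust the pattern for your own site.

#!/usr/bin/perl
# hitcount.pl-contrasts raw hits with page requests (a rough sketch).
# Assumes graphics can be recognized by their file extensions.
$hits = 0;
$pages = 0;
while (<>)
{
   $hits++;
   next unless /"(GET|POST) (\S+)/;    # pull the URL out of the request field
   $pages++ unless $2 =~ /\.(gif|jpe?g)$/i;
}
print "Total hits:    $hits\n";
print "Page requests: $pages\n";

Run it as hitcount.pl access_log; the gap between the two numbers shows how much the hit count overstates actual page traffic.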

Counters

One improvement over using the hit rate was to put a counter such as the one shown in Figure 41.6 on the home page.

Figure 41.6: A counter is only meaningful if you know when it was last reset to zero.

The counter is a CGI script that increments a stored count every time someone downloads the page. The counter is an improvement over the hit count because it counts page downloads rather than individual file requests, so the numbers are directly comparable from one site to the next. The question now becomes, what does the counter tell you, and is this figure what you wanted to measure?
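The mechanics are simple. Here is a minimal sketch of such a counter-not the one shown in Figure 41.6-that keeps its tally in a flat file. The file path is hypothetical, and the file must be created (with an initial 0) before the first run.

#!/usr/bin/perl
# counter.cgi-a minimal page-counter sketch (illustrative only).
# Create the count file once, for example:
#    echo 0 > /usr/local/etc/httpd/counts/home.count
$countfile = "/usr/local/etc/httpd/counts/home.count";   # hypothetical path

open(COUNT, "+<$countfile") || die "Cannot open $countfile: $!";
flock(COUNT, 2);                     # exclusive lock so two hits don't collide
$count = <COUNT>;
chomp($count) if defined $count;
$count = ($count || 0) + 1;
seek(COUNT, 0, 0);
truncate(COUNT, 0);
print COUNT "$count\n";
close(COUNT);                        # releases the lock

print "Content-type: text/plain\n\n";
print "$count\n";

A real counter usually goes on to render the digits as an inline graphic; the principle, one increment per page download, is the same.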

By their nature, counters are often displayed on the page. If visitors see a counter whose number is low, they may conclude that the site is unpopular and may not explore further. If the number is high, however, that by itself doesn't indicate an effective site. A site with an interesting programming technique may make the "Cool Site of the Day" list somewhere and become momentarily inundated with visits. If people are coming to see the "cool hack," however, they aren't qualified buyers, and many of them will never be back.

Tip
Remember that counters not only tell you how many visits you've had, they also tell your visitors. If your intent is to "brag" about high counts, by all means use a counter. If your intent is to understand how visitors are using your site so that you can make it more effective, use log analysis rather than a counter.

Using an Online Auditing Service

Online auditing agencies such as I-Audit provide a third-party mechanism for reporting access to selected pages in the site. The online auditing agencies suffer from the same problem as counters: the thing they measure, the number of visitors per unit of time, is only one of the things you care about, and not the most important one. By making it so easy to get that one number, they may suck you into thinking that your site's success is determined by the number of visitors. That's not the fault of the auditors; that's your fault, if you allow it to happen.

Figure 41.7 shows a site hooked up for I-Audit, one of the better online auditors. The graphic is downloaded from the I-Audit site, along with an account number. The folks at I-Audit are monitoring their log; every day or so they update their statistics to show the number of visits your site has had. The resulting report is shown in Figure 41.8.

Figure 41.7: On a site hooked up to I-Audit, the I-Audit log is updated when the I-Audit graphic is downloaded.

Figure 41.8: The I-Audit Report shows how often a page on the site was requested in a given time period.

I-Audit handles a tremendous volume of data. Not surprisingly, its statistics often run a few days behind. Furthermore, its algorithms for detecting a visit are conservative. Many Webmasters who compare their logs with their figures from I-Audit believe I-Audit doesn't count all the visitors it should.

To hook a site up to I-Audit, visit http://www.iaudit.com/ and sign up for an account ID. I-Audit will give you some sample code to paste into the footer of your page.

The Common Log Format

To form your own opinion of I-Audit or other third-party auditors, you need to be able to read your access log. Most Web servers keep the log in the format defined by NCSA and Apache servers, known as the common log format.

Note
The common log format is defined by the following syntax:
host rfc931 authuser date-time request status bytes
where host is the name of the host that sent the request. rfc931 is the username, if both the client and the server are running RFC 931 identity checking; most of the time identity checking is off, and this field holds a dash. authuser contains the username if the page was protected and the user had to supply a valid username and password (Chapter 17, "How to Keep Portions of the Site Private," shows how to set up user authorization). date-time is the time at the server when the request came in, enclosed in square brackets. request contains the request itself, in quotes. status holds the return code from the server (return codes are described in detail in Chapter 4, "Designing Faster Sites"). Finally, bytes gives the number of bytes transferred, not counting the headers.

Chapter 42, "Processing Logs and Analyzing Site Use," talks about automated log analysis tools. For now, let's look at what you can learn from the log by hand.

A typical log line might contain

xyz.com - - [05/Feb/1996:16:47:25 -0500] "GET /gsh/General/4.IndexOfPages.shtml 
HTTP/1.0" 200 4384

in one long line. Let's look at each field in the following sections.

host

The first field, xyz.com, is the host from which the request was received. In most cases, this information is as close to the actual user as you can get from the log.

RFC931

If the server has IdentityCheck turned off (the default), this field holds a dash. If IdentityCheck is on, the server asks the system making the request to identify the user; the protocol for this exchange is RFC 931. Most systems don't run an RFC 931 daemon, so the conversation ends there and the field gets a dash. Only if IdentityCheck is on and the distant host runs an RFC 931-compliant daemon does the user name appear in this field. Don't hold your breath.

authuser

If the directory in which the file resides was protected by access.conf or .htaccess, this field will contain the name of the authorized user. Otherwise, it's a dash.

date-time

This field holds the date and time of the request. The time zone field (that is, -0500) shows the difference between local time at the server and Coordinated Universal Time (UTC), also known as Greenwich Mean Time (GMT). For example, Eastern Standard Time is five hours behind GMT.

request

Recall from Chapter 4, "Designing Faster Sites," the details of an HTTP request. The text of that request appears in this field.

status

Likewise, Chapter 4 showed the various status codes with which the server might reply. This field shows which status code was returned. A status of 200 means success; codes in the 300s indicate redirections (such as 304, Not Modified), and codes in the 400s and 500s indicate client and server errors.

bytes

Finally, the number of bytes returned (not including the headers) is logged. If the request succeeded, this field holds the size of the returned entity. If the request failed, this field holds the size of the error message.
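With the format in hand, a few lines of Perl can break a log apart into these fields. The following sketch is not a production parser; it simply applies the pattern described above and, as an example, tallies the status codes so you can see at a glance how many requests failed.

#!/usr/bin/perl
# parselog.pl-splits each common-log line into its seven fields and
# tallies the status codes (a minimal sketch; unparsable lines are skipped).
while (<>)
{
    next unless
        /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)/;
    ($host, $rfc931, $authuser, $when, $request, $status, $bytes) =
        ($1, $2, $3, $4, $5, $6, $7);
    $seen{$status}++;
}
foreach $status (sort keys %seen)
{
    printf "%s %8d\n", $status, $seen{$status};
}

Run as parselog.pl access_log, it prints one line per status code. The other fields are captured here only to show where each one lives; the scripts later in this chapter pick out just the fields they need.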

Analyzing the Log

As soon as the site is up and announced, the log will begin to fill with data, and reading it without automated assistance will become a chore. The two handiest tools for manual log analysis are grep and cut.

The UNIX grep command extracts selected lines from a file. For example, suppose that your server includes two sites, nikka and gsh. The URLs are set up so that the site name appears in each request. To see just the gsh data, you would type

grep "/gsh/" access_log | cut -d' ' -f1,7 | more

By piping the output of the grep through the more command, you get better control of how the data is displayed.

The UNIX cut command selects fields from a record. The -d switch sets the field separator character, in this case a space; the -f switch selects which field or fields are listed. To see the host name and the requested file for the gsh site, type

grep "/gsh/" access_log | cut -d' ' -f1,7 | more

Now you can begin to ask yourself what kind of information you want from the logs; clearly, you can ask for more than hit rates. First, filter out requests for graphics and look only at requests for pages. Suppose that all your graphics for the site are stored in a directory named graphics. The line

grep "/gsh/" access_log | cut -d' ' -f1,7 | grep -v "/graphics/" | more

says to limit the output to those lines that don't (-v) mention the directory graphics. Next, subtract out those hits that came from you when you were testing. Suppose that your development host is named foo.com. You can write

grep "/gsh/" access_log | cut -d' ' -f1,7 | grep -v "/graphics/" |
grep -v "^foo.com" | more

Begin to examine the data manually; later, you'll automate this task.

This command line is becoming a bit unwieldy. Here's a process to begin to focus this filter. Pick one host that visits frequently. Write a grep-based filter that looks just at that host:

grep "/gsh/" access_log | grep "^bar.com" | cut -d' ' -f1,4,7 |
grep -v "/graphics/" | ./analyzer.pl

Now build the Perl script analyzer.pl, as shown in Listing 41.1.


Listing 41.1  analyzer.pl-Computes the Dwell Time of Each Page for a Given Host

#!/usr/bin/perl
require "timelocal.pl";
# how many minutes of dwell time changes the visit?
$threshold = 60;
$oldTime = 0;
$oldURL = "";
while (<STDIN>)
{
 chop;
 /([\w.-]+) \[([\w]+\/[\w]+\/[\w]+:[\w]+:[\w]+:[\w]+) (.+)/;
 $host = $1;
 $url  = $3;
 $2 =~ /(\d\d)\/(\w\w\w)\/(\d\d\d\d):(\d\d):(\d\d):(\d\d)/;
 $mday = $1;
 $mon  = &month($2);
 $year = $3 - 1900;
 $hour = $4;
 $min  = $5;
 $sec  = $6;
 $time = &timelocal($sec, $min, $hour, $mday, $mon, $year);
 $diffTime = $time - $oldTime;
 $oldTime = $time;
 if ($diffTime <= 60 * $threshold)
 {
    printf "%4d: %s\n", $diffTime, $oldURL;
 }
 else
 {
    print "------------------------------------\n";
 }
$oldURL = $url;
}
exit;
sub month
{
  local($mon) = @_;
  if ($mon eq "Jan")
  {
    0;
  } elsif ($mon eq "Feb")
  {
    1;
  } elsif ($mon eq "Mar")
  {
    2;
  } elsif ($mon eq "Apr")
  {
    3;
  } elsif ($mon eq "May")
  {
    4;
  } elsif ($mon eq "Jun")
  {
    5;
  } elsif ($mon eq "Jul")
  {
    6;
  } elsif ($mon eq "Aug")
  {
    7;
  } elsif ($mon eq "Sep")
  {
    8;
  } elsif ($mon eq "Oct")
  {
    9;
  } elsif ($mon eq "Nov")
  {
    10;
  } elsif ($mon eq "Dec")
  {
    11;
  }
  }

To understand this program, take it a section at a time:

# how many minutes of dwell time changes the visit?
$threshold = 60;
$oldTime = 0;
$oldURL = "";

You'll see these variables again; keep them in mind.

Recall that analyzer.pl is designed to look at a log that has already been filtered down to one host, pages only, and just three fields. It reads a line from STDIN and parses out the host name and URL, like so:

while (<STDIN>)
{
 chop;
# Look for a pattern that consists of some characters (the host name) 
# followed by a space, followed by something in square brackets with 
# three colons in it (the date-time), followed by some more characters
# (the URL). 
 /([\w.-]+) \[([\w]+\/[\w]+\/[\w]+:[\w]+:[\w]+:[\w]+) (.+)/;
 $host = $1;
 $url  = $3;
Next, the script parses out the components of the date and time.
# Take apart the date-time field. The first two numbers are the day of
# the month. Then there is a slash. The next three characters are the 
# name of the month. After another slash, there are four numbers for
# the year, a colon, and three pairs of numbers separated by colons.
# These figures give the hours, minutes, and seconds, respectively.
# We ignore the offset from GMT. We don't need it to compute dwell time.
 $2 =~ /(\d\d)\/(\w\w\w)\/(\d\d\d\d):(\d\d):(\d\d):(\d\d)/;
 $mday = $1;
 $mon  = &month($2);
 $year = $3 - 1900;
 $hour = $4;
 $min  = $5;
 $sec  = $6;

The only tricky parts are to remember that Perl's time functions count months from zero (January is 0) and expect the year as an offset from 1900 (1996 becomes 96); the day of the month is used as is. You write the function &month to translate between the names of the months and the month numbers:

sub month
{
  local($mon) = @_;
  if ($mon eq "Jan")
  {
    0;
  } elsif ($mon eq "Feb")
  {
    1;
  } elsif ($mon eq "Mar")
  {
    2;
  } elsif ($mon eq "Apr")
  {
    3;
  } elsif ($mon eq "May")
  {
    4;
  } elsif ($mon eq "Jun")
  {
    5;
  } elsif ($mon eq "Jul")
  {
    6;
  } elsif ($mon eq "Aug")
  {
    7;
  } elsif ($mon eq "Sep")
  {
    8;
  } elsif ($mon eq "Oct")
  {
    9;
  } elsif ($mon eq "Nov")
  {
    10;
  } elsif ($mon eq "Dec")
  {
    11;
  }
}

Now the real work begins, shown in the following code. Convert the time of each access into a number that's hard to look at but easy to compute with (the number of seconds since the UNIX epoch), and find out how long it has been since the previous access. If that gap is no more than $threshold minutes, guess that it was part of the same visit (remember that HTTP is a stateless protocol), and report the dwell time.

 $time = &timelocal($sec, $min, $hour, $mday, $mon, $year);
 $diffTime = $time - $oldTime;
 $oldTime = $time;
 if ($diffTime <= 60 * $threshold)
 {
    printf "%4d: %s\n", $diffTime, $oldURL;
 }
 else
 {
    print "-------------------------------------\n";
 }
$oldURL = $url;
}
exit;

The amount of time between page changes is called dwell time. It's one measure of how long a user looks at the page. You can't guarantee what the user is doing during those minutes, but you could run this program for all the hosts and develop a little database in an associative array holding the mean dwell time for each page and the variance. Note that you carry the previous line's URL around in $oldURL so that when you print, the dwell time is associated with the page the user was reading, rather than the page he or she changed to.
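Here is one way that little database might look. The sketch below reads the "seconds: URL" lines that analyzer.pl prints (skipping the separator lines) and accumulates the mean and variance of dwell time for each page in associative arrays.

#!/usr/bin/perl
# dwellstats.pl-mean and variance of dwell time per page (a sketch).
# Feed it the output of analyzer.pl, possibly concatenated over many hosts.
while (<>)
{
    next unless /^\s*(\d+): (\S+)/;    # skip the "----" visit separators
    ($secs, $url) = ($1, $2);
    $n{$url}++;
    $sum{$url}   += $secs;
    $sumsq{$url} += $secs * $secs;
}
foreach $url (sort keys %n)
{
    $mean = $sum{$url} / $n{$url};
    $var  = $sumsq{$url} / $n{$url} - $mean * $mean;
    printf "%7.1f %9.1f %4d  %s\n", $mean, $var, $n{$url}, $url;
}

The columns are mean dwell time in seconds, variance, number of observations, and page. A large variance is itself a clue: it often means some visitors walked away from the keyboard while that page was up.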

Typical output from a filter such as analyzer.pl is as follows:

-------------------------------------
   0: /gsh/
   8: /gsh/General/2.Credits.shtml
   3: /gsh/General/3.Help.shtml
   2: /gsh/General/4.IndexOfPages.shtml
   2: /gsh/General/3.Help.shtml
   0: /gsh/General/2.Credits.shtml
   2: /gsh/welcome.html
  10: /gsh/General/2.Credits.shtml
   1: /gsh/General/3.Help.shtml
   1: /gsh/General/4.IndexOfPages.shtml
   1: /gsh/General/5.SpecialOffers.shtml
   1: /gsh/General/6.MailingList.shtml
   4: /gsh/General/5.SpecialOffers.shtml
   3: /gsh/General/4.IndexOfPages.shtml
   9: /gsh/Buyers/6~1.Warranty.shtml
   3: /gsh/listings/1.listings.shtml
  10: /gsh/listings/6650EthanAllenLane.shtml
  34: /gsh/Homeowners/3.WhyGSH.html
 958: /gsh/listings/6650EthanAllenLane.shtml
  11: /gsh/listings/1573AdamsDrive.shtml
   4: /gsh/listings/1573AdamsDrive.shtml
   4: /gsh/listings/1200CambridgeCourt.shtml
   5: /gsh/listings/1200CambridgeCourt.shtml
.
.
.
-------------------------------------
-------------------------------------

For a different kind of understanding of the dwell time, sort the analyzer output. In UNIX, wrap the filter pipeline and analyzer.pl in a short shell script (called analyze.sh here) and simply say

./analyze.sh | sort -r | more
to get
1886: /gsh/Buyers/6~1.Warranty.shtml
-------------------------------------
-------------------------------------
-------------------------------------
 958: /gsh/listings/6650EthanAllenLane.shtml
 617: /gsh/Buyers/6~1.Warranty.shtml
 542: /gsh/General/4.IndexOfPages.shtml
 255: /gsh/listings/432ButterflyDrive.shtml
 119: /gsh/
 116: /gsh/listings/1.listings.shtml
 103: /gsh/listings/1573AdamsDrive.shtml
  91: /gsh/welcome.html
  67: /gsh/listings/1210WestWayCT.shtml
  65: /gsh/Buyers/1.Buyers.shtml
  43: /gsh/General/listings/6650EthanAllenLane.shtml
  34: /gsh/Homeowners/3.WhyGSH.html
  28: /gsh/listings/1.listings.shtml
  15: /gsh/listings/6650EthanAllenLane.shtml
  15: /gsh/listings/1.listings.shtml
  15: /gsh/General/7.ThankYou.shtml
  12: /gsh/welcome
  11: /gsh/listings/1573AdamsDrive.shtml
  10: /gsh/listings/6650EthanAllenLane.shtml
.
.
.

Notice that this user spent a lot of time looping through the listings. The high time on the Warranty page may reflect an actual interest in the warranty, or it may signal that the user got up and left the computer for about half an hour. It's difficult to say with just one visitor.

Another useful tool would look at which pairs of pages occur in sequence: every time a user on page A goes next to page B, you increment the A-B link counter. After looking at many hosts and accesses, patterns of use emerge. By coupling this information with dwell time, you can start giving reports such as, "A typical user pulls into the GSH site, looks at the welcome page for about 20 seconds, and then goes to the listings pages. They explore the listings, spending an average of two minutes per property."
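A sketch of such a transition counter follows. Like analyzer.pl, it assumes the log has already been filtered to a single host and reduced to host, date-time, and URL; run it once per host and merge the counts. (A fuller version would also apply the visit threshold so that transitions across separate visits aren't counted.)

#!/usr/bin/perl
# linkpairs.pl-counts page-to-page transitions for one host (a sketch).
# Input: the same filtered stream analyzer.pl reads (host [date-time url).
$oldURL = "";
while (<>)
{
    next unless /(\S+) \[(\S+) (.+)/;
    $url = $3;
    $pair{"$oldURL -> $url"}++ if $oldURL ne "";
    $oldURL = $url;
}
foreach $key (sort { $pair{$b} <=> $pair{$a} } keys %pair)
{
    printf "%5d  %s\n", $pair{$key}, $key;
}

The most heavily traveled links float to the top of the report; links you expected to be popular but that barely appear are candidates for the kind of rework described below.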

Automated tools can be built on the framework of analyzer.pl, but there is much to be said for spending the first maintenance day or two (say, one day a month for the first two months) going through the data manually with simple filters such as analyzer.pl itself.

Use the results of this analysis to evaluate the site and give recommendations back to the client. Is the link count low for a particular pair? Maybe the link is buried in an obscure place on the page, or the link is phrased in a way that isn't appealing. Do users often blow right through some of the pages, dwelling for just a few seconds? Maybe the page doesn't meet their expectations, or has an unappealing look. Look at the page again. Consider bringing in one or two people from the Red Team (initially described in Chapter 1, "How to Make a Good Site Look Great") to reevaluate its effectiveness.

Look at what's working, too. Look at those visitors who end up placing an order or requesting additional information. What pages did they see? What patterns emerge? How can you give them more of the kind of pages that visitors seem to be looking for?

Finally, run an analysis looking for interrupted transfers. Build a script that knows about the number of graphics on each page (or learns it for itself by examining the log). Then record the number of graphics fetched after each page. If the number of graphics fetched is smaller than the number of graphics on the page, something happened during the transfer. Perhaps the user got tired of watching a large graphic download and stopped the transfer, or even exited the site. Again, don't draw conclusions from just a few visits, but if the patterns persist, think through the design of the site and see how it can be improved.
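Here is a sketch of that idea. The table of expected graphics counts is hypothetical; fill it in for your own pages, or extend the script to learn the counts from complete transfers in the log. Feed it the single-host log with the graphics requests left in (that is, without the grep -v "/graphics/" filter), reduced to host, date-time, and URL as before.

#!/usr/bin/perl
# interrupt.pl-flags pages whose graphics were not all fetched (a sketch).
# The %expected table below is hypothetical; adjust it for your site.
%expected = (
    "/gsh/welcome.html",              4,
    "/gsh/listings/1.listings.shtml", 6,
);
$page = "";
$got  = 0;
while (<>)
{
    ($url) = /(\S+)$/;                 # the URL is the last field on the line
    if ($url =~ m|/graphics/|)
    {
        $got++;                        # a graphic fetched for the current page
    }
    else
    {
        &check($page, $got) if $page ne "";
        $page = $url;
        $got  = 0;
    }
}
&check($page, $got) if $page ne "";
sub check
{
    local($page, $got) = @_;
    return unless defined $expected{$page};
    print "Possible interrupted transfer: $page ($got of $expected{$page} graphics)\n"
        if $got < $expected{$page};
}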

Monthly Tasks

While you've been evaluating the site, the client has been thinking about fresh ideas for content. Share the results of the preceding analysis with the client, and give them further recommendations for enhancing the site.

Updating the Site

After the client submits new content, check out each affected page (making sure you get write-access) from the Configuration Control System and make the changes. Run the page through any local page checkers, such as WebLint. If any of the CGI or SSI has changed, run a regression test on those functions.
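A regression test need not be elaborate. The following sketch, which assumes the LWP library is installed, fetches a list of pages and CGI URLs from the development server (foo.com, as in the earlier examples) and compares each against a saved "golden" copy. The URL list and the golden directory are illustrative, and dynamic elements such as dates must be filtered out before the comparison is meaningful.

#!/usr/bin/perl
# regress.pl-compares live CGI/SSI output against saved golden copies.
# A sketch only: the URLs and the golden/ directory are illustrative.
use LWP::Simple;

$golden = "golden";                    # directory of known-good output
@urls = (
    "http://foo.com/gsh/General/5.SpecialOffers.shtml",
    "http://foo.com/cgi-bin/mailinglist.pl",      # hypothetical CGI
);
foreach $url (@urls)
{
    ($name = $url) =~ s|.*/||;         # last path component names the golden file
    $new = get($url);
    if (!defined $new)
    {
        print "FAIL (no response): $url\n";
        next;
    }
    if (open(OLD, "$golden/$name"))
    {
        $old = join("", <OLD>);        # slurp the whole golden file
        close(OLD);
        print(($new eq $old ? "pass" : "DIFFERS"), ": $url\n");
    }
    else
    {
        print "no golden copy yet: $url\n";
    }
}

Save the golden copies the first time the output passes a manual review; after that, a "DIFFERS" line means either an intended change (update the golden copy) or a regression (fix the code).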

Finally, put the changed site back on the live server.

Revalidating the Site

Once you're on the live server, recheck every page with Doctor HTML, and run any components of the regression test that can't be run on the development machine. In particular, rerunning Doctor will check links. Even if you haven't changed the page, there's no guarantee that external links haven't gone stale.
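Doctor HTML does the thorough job; between runs, a quick spot-check of the off-site links can be done with a short script, assuming the LWP library is available. The list file here is something you would maintain by hand (or generate from the pages); its name is illustrative.

#!/usr/bin/perl
# checklinks.pl-spot-checks external links for staleness (a sketch).
# Reads one URL per line from a hand-maintained list file.
use LWP::Simple;

$listfile = "external-links.txt";      # illustrative file name
open(LIST, $listfile) || die "Cannot open $listfile: $!\n";
while ($url = <LIST>)
{
    chomp($url);
    next if $url eq "";
    if (head($url))
    {
        print "ok     $url\n";
    }
    else
    {
        print "STALE? $url\n";
    }
}
close(LIST);

A failed HEAD request doesn't always mean the link is dead (the distant server may simply be down at the moment), so treat "STALE?" lines as items to recheck by hand.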

Then print two new copies of the site: one goes in the binder on the shelf, the other goes to the client. Now you're ready for maintenance again next month.