Many people liken a Web site to a brochure or a "billboard on the Information Superhighway." Those folks may be missing one of the fundamental principles of what works on the Internet: content. Internet users are, for the most part, an intelligent, curious, upscale audience. When they want to know about a subject, they want to know about a subject. For the site to be effective, it must be rich in content, and the content must stay current.
This chapter describes a set of processes which, if applied monthly, keep the site current and effective. Those processes include log analysis-to find out how visitors are using the site, content update-to keep the site fresh, and revalidation-to keep the site usable.
To illustrate this principle, consider a site whose owner recently asked for help from the members of the HTML Writers Guild. His lament was, in essence, "I built it, and they didn't come."
On examination, his site proved to be an advertisement for what can only be called "cheap jewelry." He invited visitors to buy gold jewelry at deeply discounted prices. His design didn't anticipate that users can change their font size-at larger font sizes, the tables showing just how deep the discounts go become unreadable. The site was a bit garish-a yellow-on-purple color scheme with blinking text to catch the eye. But the main problem wasn't the execution-it was the premise. Why would anyone hand several hundred dollars to a total stranger who claims to sell "cheap gold"?
Plenty of people are selling jewelry over the Web, of course, and this site could have been effective. A better approach might have been to start out explaining how the gold business works-where it comes from, and why it costs what it does. Then show the visitor how and why some gold jewelry can be sold at deep discounts and still be high quality-by cutting out the middleman. Finally, show the visitor a few quality pieces that are for sale.
If a site is rich in content, visitors will come to it to learn about the subject-whether it's real estate, jewelry, or peanuts. The Virginia Diner site is a good example, at http://www.infi.net/vadiner/. The Virginia Diner is a small restaurant in a small town in rural Virginia. By using the Internet, the restaurant does a booming business in gourmet peanuts. Its site (shown in Figs. 41.1 through 41.4) is rich in content about the history and uses of the humble peanut. And, by the way, if reading about these peanuts has made you curious or started your mouth watering, the diner will sell you some (as in Fig. 41.4).
Figure 41.1: The Virginia Diner Welcome Page leads the visitor rapidly into the site content.
As soon as the visitor comes to the welcome page, he or she is lured away by promises of sales, catalogs, and content. Many sites try to tell their whole story on the first page. Virginia Diner has made its first page into a "links" page, which draws the visitor deeper into the site quickly. If a visitor is not ready to buy right away, perhaps they would like to order a catalog. If they need a bit more time to become comfortable with the material, they can visit the content pages, such as the one showing how to roast peanuts (shown in Fig. 41.2) or the "Interesting Peanut Facts" shown in Figure 41.3.
Figure 41.2: Part of the Virginia Diner's rich content is a page about how to use their product.
Tip
Some site owners worry that by putting up ten, twenty, or more pages of content, they're putting up too much material. They argue that "the visitor will never want to wade through all that material." One lesson of the Web is that many visitors do want to read all of the material-that's the nature of the Internet audience. Many others will read at least part of it. Remember to keep the internal hyperlinks current so a user can go directly to the pages that interest them. Use the logs and the page access counts to find pages that are seldom accessed, or that are consistently accessed for only a few seconds before the visitor moves on. Improve these pages, give more visibility to their links, or delete them.
If a visitor to the site is not ready to buy, one good strategy is to keep them around until they are ready. The Virginia Diner site provides lots of content. If, after reviewing the content, the visitor is still not ready to order peanuts online, they can order a paper catalog using the page shown in Figure 41.4, so they can have the full product line available offline whenever they are ready to buy. For a low-cost, impulse purchase like peanuts, promoting the paper catalog was a master stroke!
After the site is up, make sure that the "golden" version is safe in the Configuration Control System such as SCCS or RCS, introduced in Chapter 1, "How to Make a Good Site Look Great." Check out a read-only copy and print off two copies of every page. One copy goes in a hardcover binder on the shelf. The other goes to the client. Assign the client a "maintenance day"-for the sake of illustration, say that it's the first Tuesday of every month. Ask the client to update his or her content every month-whether it's a new fact for a content page or a new featured product of the month. Something should change at least once a month.
While the client is away reviewing his or her site and thinking about what to change, begin to track the performance of the site. Chapter 42, "Processing Logs and Analyzing Site Use," looks at some off-the-Net log analysis tools. Before getting into those, let's look at the log itself.
Figure 41.5 shows a typical access log. Something like this file is stored on nearly every NCSA or Apache server, in the logs directory.
Figure 41.5: The common log format captures every request, or hit, which comes to the site.
Look at the log in Figure 41.5 in detail. Notice that it shows every time a visitor accessed the site. Most of the entries are GET, but a few may be POST. The point is that they show every access. When a user pulls down a page with, say, five graphics, you see him request the page and then all five graphics. Each line in the log constitutes a hit.
At one time it was common to run a command such as this one:
wc -l access_log
The number that would come back, the hit count, was inevitably a large number. Some Webmasters would tout this number as a mark of their success-"My site had more than 15,000 hits last month." The problem with this approach, of course, is that each page counts for several hits. A page with five graphics generates six hits: one for the page, and one for each graphic. So two sites might have the same number of visitors but vastly different hit rates, depending on how their pages are designed.
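To see the difference for yourself, count page requests separately from raw hits. The short Perl sketch below is not one of the chapter's tools; it assumes, as the examples later in this chapter do, that all of the site's graphics live under a directory named graphics:

#!/usr/bin/perl
# countpages.pl -- compare raw hits to page requests
# Usage: ./countpages.pl access_log
$hits = 0;
$pages = 0;
while (<>) {
    $hits++;
    next if m{/graphics/};    # requests for graphics are hits, not pages
    $pages++;
}
print "Total hits:    $hits\n";
print "Page requests: $pages\n";

The gap between the two numbers is entirely an artifact of page design, which is exactly why the raw hit count makes a poor measure of success.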
One improvement over using the hit rate was to put a counter such as the one shown in Figure 41.6 on the home page.
Figure 41.6: A counter is only meaningful if you know when it was last reset to zero.
The counter is a CGI script that increments a stored count every time someone downloads the page. The counter is an improvement over the hit count because its numbers are directly comparable from one site to the next. The question now becomes, what does the counter tell you, and is this figure what you wanted to measure?
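Mechanically, there isn't much to a counter. The following Perl sketch shows the idea; the file name counter.dat is an assumption, and a real counter would return an inline graphic and be more defensive about concurrent requests:

#!/usr/bin/perl
# counter.cgi -- bare-bones illustration of a page counter
# Assumes the Web server can read and write counter.dat in this directory.
$file = "counter.dat";
open(COUNT, "+<$file") || open(COUNT, "+>$file") || die "cannot open $file: $!";
flock(COUNT, 2);                  # exclusive lock while the count is updated
$count = <COUNT>;
$count = 0 unless $count;
$count++;
seek(COUNT, 0, 0);
print COUNT "$count\n";
close(COUNT);

print "Content-type: text/html\n\n";
print "You are visitor number $count.\n";

Unlike the raw hit count, each visitor who loads the page adds exactly one to this number, no matter how many graphics the page carries.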
By their nature, counters are often displayed on the page. If visitors see a counter whose number is low, they may conclude that the site is unpopular and may not explore further. If the number is high, however, that by itself doesn't indicate an effective site. A site with an interesting programming technique may make the "Cool Site of the Day" list somewhere and become momentarily inundated with visits. If people are coming to see the "cool hack," however, they aren't qualified buyers, and many of them will never be back.
Tip
Remember that counters not only tell you how many visits you've had, they also tell your visitors. If your intent is to "brag" about high counts, by all means use a counter. If your intent is to understand how visitors are using your site so that you can make it more effective, use log analysis rather than a counter.
Online auditing agencies such as I-Audit provide a third-party mechanism for reporting access to selected pages in the site. The online auditing agencies suffer from the same problem as counters. The thing they measure, number of visitors per unit of time, is only one of the things you care about-and not the most important thing. By making it so easy to get that one number, they may suck you into thinking that your site's success is determined by the number of visitors. That's not the fault of the auditors. That's your fault, if you allow it to happen.
Figure 41.7 shows a site hooked up for I-Audit, one of the better online auditors. The graphic is downloaded from the I-Audit site, along with an account number. The folks at I-Audit are monitoring their log; every day or so they update their statistics to show the number of visits your site has had. The resulting report is shown in Figure 41.8.
I-Audit handles a tremendous volume of data. Not surprisingly, its statistics often run a few days behind. Furthermore, its algorithms for detecting a visit are conservative. Many Webmasters who compare their logs with their figures from I-Audit believe I-Audit doesn't count all the visitors it should.
To hook a site up to I-Audit, visit http://www.iaudit.com/ and sign up for an account ID. I-Audit will give you some sample code to paste into the footer of your page.
To form your own opinion of I-Audit or other third-party auditors, you need to be able to read your access log. Most Web servers keep the log in the format defined by NCSA and Apache servers, known as the common log format.
Note
The common log format is defined by the following syntax:

host rfc931 authuser date-time request status bytes

host is the name of the host that sent the request. rfc931 is the username, if both the client and the server are using RFC 931 identity checking; most of the time identity checking is off, and the rfc931 field is filled with a dash. authuser contains the username if the page was protected and the user had to supply a valid username. Chapter 17, "How to Keep Portions of the Site Private," shows how to set up user authorization. date-time is the time (at the server) when the request came in; it is enclosed in square brackets. request contains the request itself, in quotes. The return code from the server is given in the status field. (Return codes are described in detail in Chapter 4, "Designing Faster Sites.") The number of bytes transferred is given in bytes; this number does not include the header.
Chapter 42, "Processing Logs and Analyzing Site Use," talks about automated log analysis tools. For now, let's look at what you can learn from the log by hand.
A typical log line might contain
xyz.com - - [05/Feb/1996:16:47:25 -0500] "GET /gsh/General/4.IndexOfPages.shtml HTTP/1.0" 200 4384
in one long line. Let's look at each field in the following sections.
The first field, xyz.com, is the host from which the request was received. In most cases, this information is as close to the actual user as you can get from the log.
If the server has IdentityCheck turned off (the default), this field has a dash. If IdentityCheck is on, the server asks the system making the request for the identity of the user. The protocol for this discussion is RFC 931. Most systems don't have an RFC 931 daemon running, so the conversation ends there and the field gets a dash. If IdentityCheck is on, and if the distant host has an RFC 931-compliant daemon running, the user name is put into the field. Don't hold your breath.
If the directory in which the file resides was protected by access.conf or .htaccess, this field will contain the name of the authorized user. Otherwise, it's a dash.
This field holds the date and time of the request. The time zone field (that is, -0500) shows the difference between local time and Coordinated Universal Time (UTC), also known as Greenwich Mean Time (GMT). For example, Eastern Standard Time is five hours behind GMT.
Recall from Chapter 4, "Designing Faster Sites," the details of an HTTP request. The text of that request appears in this field.
Likewise, Chapter 4 showed the various status codes with which the server might reply. This field shows which status code was returned. A status of 200 means success; codes in the 300 range indicate redirections or cached responses (such as 304, Not Modified), and codes in the 400 and 500 ranges indicate errors.
Finally, the number of bytes returned (not including the headers) is logged. If the request succeeded, this field holds the size of the returned entity. If the request failed, this field holds the size of the error message.
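If you want to pull these fields apart programmatically rather than by eye, a single regular expression does the job. The following sketch is for experimentation, not one of the chapter's tools; it assumes the common log format exactly as described in the Note earlier:

#!/usr/bin/perl
# splitlog.pl -- print the fields of each common log format line
# Usage: ./splitlog.pl access_log
while (<>) {
    next unless /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)/;
    ($host, $rfc931, $authuser, $datetime, $request, $status, $bytes) =
        ($1, $2, $3, $4, $5, $6, $7);
    print "$host  [$datetime]  \"$request\"  status=$status  bytes=$bytes\n";
}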
As soon as the site is up and announced, the log will begin to fill with data, and reading it without automated assistance will become a chore. The two handiest tools for manual log analysis are grep and cut.
The UNIX grep command extracts selected lines from a file. For example, suppose that your server includes two sites, nikka and gsh. The URLs are set up so that the site name appears in each request. To see just the gsh data, you would type
grep "/gsh/" access_log | cut -d' ' -f1,7 | more
By piping the output of the grep through the more command, you get better control of how the data is displayed.
The UNIX cut command selects fields from a record. The -d switch sets the field separator character, in this case a space; the -f switch selects which field or fields are listed. To see the host name and the requested file for the gsh site, type
grep "/gsh/" access_log | cut -d' ' -f1,7 | more
Now you can begin to ask yourself, what kind of information do I want from the logs? Clearly, you can ask for more than just hit rates. First, let's filter out requests for graphics and look only at GETs for pages. Suppose that all your graphics for the site are stored in a directory named graphics. The line
grep gsh access_log | cut -d' ' -f1,7 | grep -v graphics | more
says to limit the output to those lines that don't (-v) mention the directory graphics. Next, subtract out those hits that came from you when you were testing. Suppose that your development host is named foo.com. You can write
grep "/gsh/" access_log | cut -d' ' -f1,7 | grep -v "/graphics/" | grep -v "^foo.com" | more
Begin to examine the data manually; later, you'll automate this task.
The grep pipeline is becoming a bit unwieldy. Here's one way to focus the filter further. Pick one host that visits frequently. Write a grep-based filter that looks just at that host:
grep "/gsh/" access_log | grep "^bar.com" | cut -d' ' -f1,4,7 | grep -v "/graphics/" | ./analyzer.pl
Now build the Perl script analyzer.pl, as shown in Listing 41.1.
Listing 41.1 analyzer.pl-Computes the Dwell Time of Each Page for a Given Host
#!/usr/bin/perl
require "timelocal.pl";

# how many minutes of dwell time changes the visit?
$threshold = 60;
$oldTime = 0;
$oldURL = "";

while (<STDIN>)
{
    chop;
    /^(\S+) \[(\w+\/\w+\/\w+:\w+:\w+:\w+) (.+)/;
    $host = $1;
    $url = $3;
    $2 =~ /(\d\d)\/(\w\w\w)\/(\d\d\d\d):(\d\d):(\d\d):(\d\d)/;
    $mday = $1;
    $mon = &month($2);
    $year = $3 - 1900;
    $hour = $4;
    $min = $5;
    $sec = $6;
    $time = &timelocal($sec, $min, $hour, $mday, $mon, $year);
    $diffTime = $time - $oldTime;
    $oldTime = $time;
    if ($diffTime <= 60 * $threshold)
    {
        printf "%4d: %s\n", $diffTime, $oldURL;
    }
    else
    {
        print "-------------------------------------\n";
    }
    $oldURL = $url;
}
exit;

sub month
{
    local($mon) = @_;
    if    ($mon eq "Jan") { 0; }
    elsif ($mon eq "Feb") { 1; }
    elsif ($mon eq "Mar") { 2; }
    elsif ($mon eq "Apr") { 3; }
    elsif ($mon eq "May") { 4; }
    elsif ($mon eq "Jun") { 5; }
    elsif ($mon eq "Jul") { 6; }
    elsif ($mon eq "Aug") { 7; }
    elsif ($mon eq "Sep") { 8; }
    elsif ($mon eq "Oct") { 9; }
    elsif ($mon eq "Nov") { 10; }
    elsif ($mon eq "Dec") { 11; }
}
To understand this program, take it a section at a time:
# how many minutes of dwell time changes the visit?
$threshold = 60;
$oldTime = 0;
$oldURL = "";
You'll see these variables again; keep them in mind.
Recall that analyzer.pl is designed to look at a log that has already been filtered down to one host, pages only, and just three fields. It reads a line from STDIN and parses out the host name and URL, like so:
while (<STDIN>)
{
    chop;
    # Look for a pattern that consists of some characters (the host name)
    # followed by a space, followed by something in square brackets with
    # three colons in it (the date-time), followed by some more characters
    # (the URL).
    /^(\S+) \[(\w+\/\w+\/\w+:\w+:\w+:\w+) (.+)/;
    $host = $1;
    $url = $3;

Next, the script parses out the components of the date and time:

    # Take apart the date-time field. The first two numbers are the day of
    # the month. Then there is a slash. The next three characters are the
    # name of the month. After another slash, there are four numbers for
    # the year, a colon, and three pairs of numbers separated by colons.
    # These figures give the hours, minutes, and seconds, respectively.
    # We ignore the offset from GMT; we don't need it to compute dwell time.
    $2 =~ /(\d\d)\/(\w\w\w)\/(\d\d\d\d):(\d\d):(\d\d):(\d\d)/;
    $mday = $1;
    $mon = &month($2);
    $year = $3 - 1900;
    $hour = $4;
    $min = $5;
    $sec = $6;
The only tricky parts are remembering that Perl's time functions count months from zero (January is month 0) and expect the year without the century. The &month function translates the names of the months into month numbers:
sub month
{
    local($mon) = @_;
    if    ($mon eq "Jan") { 0; }
    elsif ($mon eq "Feb") { 1; }
    elsif ($mon eq "Mar") { 2; }
    elsif ($mon eq "Apr") { 3; }
    elsif ($mon eq "May") { 4; }
    elsif ($mon eq "Jun") { 5; }
    elsif ($mon eq "Jul") { 6; }
    elsif ($mon eq "Aug") { 7; }
    elsif ($mon eq "Sep") { 8; }
    elsif ($mon eq "Oct") { 9; }
    elsif ($mon eq "Nov") { 10; }
    elsif ($mon eq "Dec") { 11; }
}
Now the real work begins, shown in the following code. Look up the time of each access (using a number hard to look at but easy to compute with-the number of seconds after the UNIX epoch), and find out how long it has been since the last access. If that number is below the $threshold number of minutes, guess that it was part of the same visit (remember that HTTP is a stateless protocol), and report this time.
    $time = &timelocal($sec, $min, $hour, $mday, $mon, $year);
    $diffTime = $time - $oldTime;
    $oldTime = $time;
    if ($diffTime <= 60 * $threshold)
    {
        printf "%4d: %s\n", $diffTime, $oldURL;
    }
    else
    {
        print "-------------------------------------\n";
    }
    $oldURL = $url;
}
exit;
The amount of time between page changes is called dwell time. It's one measure of how long a user looks at the page. You can't guarantee what the user is doing during those minutes, but you could run this program for all the hosts and develop a little database in an associative array holding the mean dwell time for each page and the variance. Note that you carry the previous line's URL around in $oldURL so that when you print, the dwell time is associated with the page the user was reading, rather than the page he or she changed to.
Typical output from a filter such as analyzer.pl is as follows:
-------------------------------------
   0: /gsh/
   8: /gsh/General/2.Credits.shtml
   3: /gsh/General/3.Help.shtml
   2: /gsh/General/4.IndexOfPages.shtml
   2: /gsh/General/3.Help.shtml
   0: /gsh/General/2.Credits.shtml
   2: /gsh/welcome.html
  10: /gsh/General/2.Credits.shtml
   1: /gsh/General/3.Help.shtml
   1: /gsh/General/4.IndexOfPages.shtml
   1: /gsh/General/5.SpecialOffers.shtml
   1: /gsh/General/6.MailingList.shtml
   4: /gsh/General/5.SpecialOffers.shtml
   3: /gsh/General/4.IndexOfPages.shtml
   9: /gsh/Buyers/6~1.Warranty.shtml
   3: /gsh/listings/1.listings.shtml
  10: /gsh/listings/6650EthanAllenLane.shtml
  34: /gsh/Homeowners/3.WhyGSH.html
 958: /gsh/listings/6650EthanAllenLane.shtml
  11: /gsh/listings/1573AdamsDrive.shtml
   4: /gsh/listings/1573AdamsDrive.shtml
   4: /gsh/listings/1200CambridgeCourt.shtml
   5: /gsh/listings/1200CambridgeCourt.shtml
. . .
-------------------------------------
-------------------------------------
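The little database of mean dwell times and variances mentioned above can be sketched in a few lines. The following script is not part of analyzer.pl; it reads output in exactly the format just shown (separator lines are skipped) and accumulates the figures per page in associative arrays:

#!/usr/bin/perl
# dwellstats.pl -- mean and variance of dwell time per page
# Reads analyzer.pl output ("   8: /gsh/welcome.html"); separator lines are skipped.
while (<>) {
    next unless /^\s*(\d+): (\S+)/;
    ($dwell, $url) = ($1, $2);
    $count{$url}++;
    $sum{$url}   += $dwell;
    $sumsq{$url} += $dwell * $dwell;
}
print "  mean  variance    n page\n";
foreach $url (sort keys %count) {
    $n = $count{$url};
    $mean = $sum{$url} / $n;
    $var  = $n > 1 ? ($sumsq{$url} - $n * $mean * $mean) / ($n - 1) : 0;
    printf "%6.1f %9.1f %4d %s\n", $mean, $var, $n, $url;
}

With enough visits in the input, pages with a mean dwell time of only a second or two stand out immediately.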
For a different kind of understanding of the dwell time, sort the analyzer output. In UNIX (assuming the filter pipeline above has been saved in a script called analyze.sh), simply say
./analyze.sh | sort -r | more

to get

1886: /gsh/Buyers/6~1.Warranty.shtml
-------------------------------------
-------------------------------------
-------------------------------------
 958: /gsh/listings/6650EthanAllenLane.shtml
 617: /gsh/Buyers/6~1.Warranty.shtml
 542: /gsh/General/4.IndexOfPages.shtml
 255: /gsh/listings/432ButterflyDrive.shtml
 119: /gsh/
 116: /gsh/listings/1.listings.shtml
 103: /gsh/listings/1573AdamsDrive.shtml
  91: /gsh/welcome.html
  67: /gsh/listings/1210WestWayCT.shtml
  65: /gsh/Buyers/1.Buyers.shtml
  43: /gsh/General/listings/6650EthanAllenLane.shtml
  34: /gsh/Homeowners/3.WhyGSH.html
  28: /gsh/listings/1.listings.shtml
  15: /gsh/listings/6650EthanAllenLane.shtml
  15: /gsh/listings/1.listings.shtml
  15: /gsh/General/7.ThankYou.shtml
  12: /gsh/welcome
  11: /gsh/listings/1573AdamsDrive.shtml
  10: /gsh/listings/6650EthanAllenLane.shtml
. . .
Notice that this user spent a lot of time looping through the listings. The high time on the Warranty page may reflect an actual interest in the Warranty, or it may signal that the user got up and left the computer for about half an hour. It's difficult to say with just one visitor.
Another useful tool would look at which pairs of pages occur together in sequence. Every time a user on page A goes next to page B, you increment the A-B link counter. After looking at many hosts and accesses, patterns of use emerge. By coupling this information with dwell time, you can start giving reports such as, "A typical user pulls into the GSH site, looks at the welcome page for about 20 seconds, and then goes to the listings pages. They explore the listings, spending an average of two minutes per property."
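A sketch of such a pair counter follows. It assumes its input has already been filtered, as earlier in the chapter, to a single host's page requests in time order, with just the URL on each line:

#!/usr/bin/perl
# pairs.pl -- count how often page A is followed by page B
# Input: one URL per line, in the order a single host requested them.
$previous = "";
while (<>) {
    chomp;
    $url = $_;
    $pair{"$previous -> $url"}++ if $previous ne "";
    $previous = $url;
}
foreach $key (sort { $pair{$b} <=> $pair{$a} } keys %pair) {
    printf "%5d %s\n", $pair{$key}, $key;
}

Run it once per host and merge the results (or add a host column) to see the site-wide patterns the report describes.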
Automated tools can be built on the framework of analyzer.pl, but there is much to be said for spending the first maintenance day or two (such as one day a month for two months) manually going through the data with simple filters such as analyzer.pl itself.
Use the results of this analysis to evaluate the site and give recommendations back to the client. Is the link count low for a particular pair? Maybe the link is buried in an obscure place on the page, or the link is phrased in a way that isn't appealing. Do users often blow right through some of the pages, dwelling for just a few seconds? Maybe the page doesn't meet their expectations, or has an unappealing look. Look at the page again. Consider bringing in one or two people from the Red Team (initially described in Chapter 1, "How to Make a Good Site Look Great") to reevaluate its effectiveness.
Look at what's working, too. Look at those visitors who end up placing an order or requesting additional information. What pages did they see? What patterns emerge? How can you give them more of the kind of pages that visitors seem to be looking for?
Finally, run an analysis looking for interrupted transfers. Build a script that knows about the number of graphics on each page (or learns it for itself by examining the log). Then record the number of graphics fetched after each page. If the number of graphics fetched is smaller than the number of graphics on the page, something happened during the transfer. Perhaps the user got tired of watching a large graphic download and stopped the transfer, or even exited the site. Again, don't draw conclusions from just a few visits, but if the patterns persist, think through the design of the site and see how it can be improved.
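One rough way to structure such a script is sketched below. It doesn't learn the graphics counts from the log or the HTML; it assumes a hand-built table of how many graphics each page references (the entries shown are hypothetical), and it assumes the log has already been filtered to a single host, as before:

#!/usr/bin/perl
# interrupted.pl -- flag pages whose graphics were not all fetched
# Hypothetical table: page URL => number of graphics that page references.
%graphics_on = (
    "/gsh/welcome.html"              => 5,
    "/gsh/listings/1.listings.shtml" => 8,
);
$page = "";
$fetched = 0;
while (<>) {
    next unless /"GET (\S+)/;
    $url = $1;
    if ($url =~ m{/graphics/}) {
        $fetched++;                   # a graphic belonging to the current page
        next;
    }
    &report();                        # a new page request: check the previous page
    $page = $url;
    $fetched = 0;
}
&report();

sub report {
    return unless $page ne "" && defined($graphics_on{$page});
    if ($fetched < $graphics_on{$page}) {
        print "incomplete: $page ($fetched of $graphics_on{$page} graphics fetched)\n";
    }
}

Feed it one host's slice of the raw log (the output of the grep "^bar.com" filter shown earlier) and treat the results as hints rather than proof; a browser with image loading turned off looks just like an interrupted transfer.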
While you've been evaluating the site, the client has been thinking about fresh ideas for content. Share the results of the preceding analysis with the client, and give them further recommendations for enhancing the site.
After the client submits new content, check out each affected page (making sure you get write-access) from the Configuration Control System and make the changes. Run the page through any local page checkers, such as WebLint. If any of the CGI or SSI has changed, run a regression test on those functions.
Finally, put the changed site back on the live server.
Once you're on the live server, recheck every page with Doctor HTML, and run any components of the regression test that can't be run on the development machine. In particular, rerunning Doctor will check links. Even if you haven't changed the page, there's no guarantee that external links haven't gone stale.
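If you want a quick spot check of external links between full Doctor HTML runs, a short Perl script (assuming the LWP library is installed on your system) can do it; feed it one URL per line:

#!/usr/bin/perl
# checklinks.pl -- spot-check a list of URLs, one per line
use LWP::Simple;
while (<>) {
    chomp;
    next unless /^http/;
    if (head($_)) {                   # issue a HEAD request; true means the URL answered
        print "ok     $_\n";
    } else {
        print "STALE  $_\n";
    }
}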
Now print two new copies of the site. One goes in the binder on the shelf, the other goes to the client. Now you're ready for maintenance again next month.