Chapter 3 Standard HTML

If Everyone's Using Navigator, Why Should I Worry About Standards?
Validating and Checking HTML
Validator-Like Tools
Interactive Pages in Standard HTML
Cookies
Developing Client-Server Applications
What's Next for HTML?
In the SDK…

A running joke in the software development field is that software engineering is the only branch of engineering in which adding a new wing to a building is considered "maintenance." By some estimates, software maintenance changes that are made to the product after it is released for use account for almost 90 percent of the lifetime cost of the product.

Software maintenance occurs for several reasons. Some maintenance actions occur to fix latent defect bugs. Sometimes maintenance must be performed to keep software up-to-date with new standards, or with changes in other components of the system. Of course, sometimes a product is changed to add a new feature that is requested by the users.

Regardless of why it occurs, changing a product that is already in the field is expensive. Thorough testing can reduce the number of defects, eliminating some maintenance costs. Netscape ONE applications incorporate not only dynamic content (such as any program might have), but also static content (HTML). Validating HTML to make sure it meets the standard makes it less likely that a change in some browser will force the developer to recode the page. Reducing these costs allows the developer to offer site development at lower cost, and the site owner can spend more on content maintenance, which builds traffic.

If Everyone's Using Navigator, Why Should I Worry About Standards?

In early 1995, the fledgling Web marketplace was dominated by Mosaic. By the end of 1995, Netscape Navigator had acquired over 70 percent of the market, and many people believed that number inevitably would move to 100 percent. By mid-1996, Microsoft had entered the market. Their product, Microsoft Internet Explorer (MSIE), is largely a Navigator clone and has seized back about 30 percent of the market from Navigator.

Each of the graphical Web browsers (Navigator, MSIE, as well as Mosaic and others) interprets HTML into images and text on the screen of a Windows, Macintosh, or other desktop computer. This chapter concentrates on the portions of HTML that are common to all Web browsers. Chapter 4, "Netscape Enhancements to HTML," describes tags and attributes that are supported by Navigator and its clones, which are not (yet) part of standard HTML.

Navigator Is Based on Standard HTML

Netscape Communications Corporation has repeatedly announced its commitment to open standards, including HTML. While its products support a superset of the open standards, Netscape has participated in the standardization process; Netscape has presented its enhancements to the Web standards community. In many cases, these enhancements have been adopted into the standard; HTML 3.2 includes several concepts that were first introduced in early versions of Navigator.

Document Type Definitions and Why You Care About Them The Hypertext Markup Language, or HTML, is not a programming language or a desktop publishing language. It is a language for describing the structure of a document. Using HTML, users can identify headlines, paragraphs, and major divisions of a work.

HTML is the result of many hours of work by members of various working groups of the Internet Engineering Task Force (IETF), with support from the World Wide Web Consortium (W3C). Participation in these working groups is open to anyone who wishes to volunteer. Any output of the working groups is submitted to international standards organizations as a proposed standard. Once enough time has passed for public comment, the proposed standard becomes a draft, and eventually might be published as a standard. HTML Level 2 has been approved by the Internet Engineering Steering Group (IESG) to be released as Proposed Standard RFC 1866. (As if the open review process weren't clear enough, RFC in proposed standard names stands for Request For Comments.)

The developers of HTML used the principles of a meta-language, the Standard Generalized Markup Language (SGML). SGML may be thought of as a toolkit for markup languages. One feature of SGML is the capability to identify within the document which of many languages and variants was used to build the document.

Each SGML language has a formal description designed to be read by computer. These descriptions are called Document Type Definitions (DTDs). An HTML document can declare for which level of HTML it was written by using a DOCTYPE tag as its first line. For example, an HTML 3.0 document starts with the following:

<!DOCTYPE HTML PUBLIC     "-//IETF//DTD HTML 3.0//EN">_

The DOCTYPE tag is read by validators and other software. It's available for use by browsers and SGML-aware editors, although it's not generally used by those kinds of software. If the DOCTYPE tag is missing, the software reading the document assumes that the document is HTML 2.0.

DOCTYPE tags are used to cue document readers about what type of markup language is being used. Table 3.1 lists the most common DOCTYPE lines and their corresponding HTML levels.

Table 3.1 DOCTYPE Tags

DOCTYPE	Level
`<!DOCTYPE HTML PUBLIC`	`"-//IETF//DTD HTML 2.0//EN"> 2.0`
`<!DOCTYPE HTML PUBLIC`	`"-//IETF//DTD HTML 3.0//EN"> 3.0`
`<!DOCTYPE HTML PUBLIC`	`"-//Netscape Comm. Corp.//DTD HTML//EN">Netscape`

Bad HTML Breaks Browsers (Even Navigator)

The best kind of maintenance is the kind that improves the site-by adding new content and features that attract new visitors and encourage people to come back again and again. This kind of maintenance usually takes a lower priority compared to the tasks of defect removal and keeping the site up-to-date with the browsers. One key to building an effective site is to keep the maintenance costs low so plenty of resources are available to improve the site and, consequently, build traffic.

On the Web, severe software defects are rare. One reason for this is that HTML is not a programming language, so many opportunities a programmer might have to introduce defects are eliminated. Another reason is that browsers are forgiving by design. If you write bad C++ and feed it to a C++ compiler, chances are high that the compiler will issue a warning or even an error. If you write bad HTML, on the other hand, a browser will try its best to put something meaningful on-screen. This behavior is commonly referred to as the Internet robustness principle: "Be liberal about what you accept, and conservative about what you produce."

The Internet robustness principle can be a good thing. If you write poor HTML and don't want your clients to know, this principle can hide many of your errors. In general, though, living at the mercy of a browser's error-handling routines is bad for the following reasons:

Not everyone is using the same browser HTML that survives in your browser might break in another.
You want your code to last As time goes on, browsers must become stricter. Code that works in Level 1 might break in Level 2.
Browser authors don't always do error handling correctly You might make an error that confuses the browser, with unpredictable results.
Browsers are not the only pieces of software that read the HTML on your site Increasingly, robots are trying to understand your HTML. Robots are online programs that explore the Web looking for new sites. When they visit a site, they typically categorize each page (and sometimes the site as a whole) and include their findings in online databases. See Webcrawler (http://www.webcrawler.com/) for a good example of what a robot can find.
The standards change Next year, features will be supported in HTML that have scarcely been thought of today. As you add new features to your site, problems that browsers ignored before might start to confuse the browsers.

If you could write each page once, and leave it alone forever, then maybe you could take the time to perfect each line of HTML. If your site is being actively used, however, then it is being changed-or should be.

The most effective Web sites are those that invite two-way communication with the visitor. Remember the principle content is king. Web visitors crave information from your site. One way to draw them back to the site is to offer new, fresh information regularly. If a client posts new information every few weeks, people will return to the site. If the client posts new information daily, people will stampede back to the site. The expert Webmaster must deal with all the new content, while still ensuring that each page is valid, high-quality HTML.

Validating and Checking HTML

This section shows how to use various tools to ensure that your HTML is as perfect as possible when the site is initially developed. Use these same tools regularly to make sure your maintenance activities haven't "broken" the page. Some of these tools also check external links to make sure that pages referenced by your site have not moved or "gone dark."

Strictly speaking, "validation" refers to ensuring that the HTML code complies with approved standards. More generally, validator-like tools are available to check for consistency and good practice as well as compliance with the standards.

How to Validate Your Web Pages

The fastest and easiest way to validate a Web site is to submit each page to an online program known as a "validator." This section shows how the first validator, known as HALsoft, works. Although there are other validators that are better for most Webmasters, understanding HALsoft gives you an appreciation of the newer validators such as Gerald Oskoboiny's Kinder Gentler Validator.

HALSoft and the WebTech Validator As the original Web validator, the WebTech validator is the standard by which other validators are judged. Unfortunately, the output of the WebTech program is not always clear. It reports errors in terms of the SGML standard-not a particularly useful reference for most Web designers.

ON THE WEB

http://www.webtechs.com/html-val-svc/ The HALsoft validator was the first formal validator widely available on the Web. In January 1996, the HALsoft validator moved to WebTech and is now available at this site.

Listing 3.1 gives an example of a piece of flawed HTML and the corresponding error messages that were returned from the WebTech validator.

Listing 3.1 -An Example of Invalid HTML

<!DOCTYPE HTML PUBLIC     "-//IETF//DTD HTML 3.0//EN">
<HEAD>
<TITLE>Test</TITLE>
<BODY BACKGROUND="Graphics/white.gif>
<H1>This is header one</H1>
<P>
This document is about nothing at all.
<P>
But the HTML is not much good!
</BODY>
</HTML>

produces the following:

Errors
sgmls: SGML error at -, line 4 at "B":
      Possible attributes treated as data because none were defined

The Netscape attribute (BACKGROUND) is flagged by the validator as an unrecognizable attribute. The missing closing tag for the HEAD doesn't help much, either, but it's not an error (because the standard states that the HEAD is implicitly closed by the beginning of the BODY). Even though it's not a violation of the standard, it's certainly poor practice-this kind of problem will be flagged by Weblint, described later in this chapter.

The WebTech validator gives you the option of validating against any of several standards, including:

HTML Level 2
HTML Level 3
HTML Level 3.2
Mozilla (HTML with Netscape extensions)
Microsoft Internet Explorer

HTML Level 2 is "plain vanilla" HTML. There were once HTML Level 0 and Level 1 standards, but the current base for all popular browsers is HTML Level 2 (also known as RFC 1866).

Each level of HTML tries to maintain backward compatibility with its predecessors, but using older features is rarely wise. The HTML working groups regularly deprecate features of previous levels. The notation Strict on a language level says that deprecated features are not allowed. Validators allow you to specify "strict" checking, generally with a check box.

HTML Level 3 represents a bit of a problem. Shortly after HTML Level 2 stabilized, developers put together a list of good ideas that didn't make it into Level 2. This list became known as HTML+. The HTML Working Group used HTML+ as the starting point for developing HTML Level 3. A written description and a DTD were prepared for HTML Level 3, but it quickly became apparent that there were more good ideas than there was time or volunteers to implement them. In March 1995, the HTML Level 3 draft was allowed to expire and the components of HTML Level 3 were divided among several working groups. Some of these groups, like the one on tables, released recommendations quickly. The tables portion of the standard has been adopted by several popular browsers. Other groups, such as the one on style sheets, have been slower to release a stable recommendation. A version of the Cascading Style Sheet level 1 standard (CSS1) has been adopted by Netscape in Navigator 4.0, and by Microsoft in Internet Explorer 3.0.

ON THE WEB

http://www.w3.org/pub/WWW/TR/ Visit this site for the latest information on HTML recommendations and working drafts (including CSS1). Also see http://www.w3.org/pub/WWW/MarkUp/Wilbur/ for information on HTML 3.2, the World Wide Web Consortium's new specification for HTML. HTML 3.2 contains such features as tables, applets, text flow around images, superscripts, and subscripts.

The DTD for Netscape is even more troublesome. Netscape Communications has not released a DTD for its extension to HTML. The patient people at HALsoft reverse-engineered a DTD for validation purposes, but as new browser versions are released, there's no guarantee that the DTDs will be updated.

Gerald Oskoboiny's Kinder, Gentler Validator During the brightest days of the HALsoft validator's reign, the two most commonly heard cries among Web developers were "We have to validate" and "Can anybody tell me what this error code means?"

Gerald Oskoboiny, at the University of Alberta, was a champion of HTML Level 3 validation and was acutely aware that the HALsoft validator did not make validation a pleasant experience. He developed his Kinder, Gentler Validator (KGV) to meet the validation needs of the developer community while also providing more intelligible error messages.

The KGV is available at

http://ugweb.cs.ualberta.ca/~gerald/validate/

To run it, just enter the URL of the page to be validated. KGV examines the page and displays any lines that have failed, with convenient arrows pointing to the approximate point of failure. The error codes are in real English, not SGML-ese.

Figure 3.1 is an example of KGV's treatment of the same code that was validated above by the WebTech validator:

Figure 3.1: The Kinder, Gentler Validator gives more informative error messages than the WebTech validator.

Notice that each message contains an explanation link. The additional information in these explanations is useful.

Given the fact that KGV uses the same underlying validation engine as WebTech's program, there's no reason not to use KGV as your primary validation tool.

Six Common Problems that Keep Sites from Validating

There are many reasons that pages won't validate, and you can do something to resolve each of them. The following sections cover the problems in detail.

Netscapeisms Netscape Communications Corporation has elected to introduce new, HTML-like tags and attributes to enhance the appearance of pages when viewed through their browser. The strategy was a good one-in February 1996, BrowserWatch reported that over 90 percent of the visitors to their site used some form of Netscape. (Even after Microsoft offered their competing product, MSIE, for free, Netscape still maintained more than 70 percent of market share.)

There is much to be said for enhancing a site with Netscape tags, but unless the site is validated against the Netscape DTD (which has its own set of problems), the Netscape tags will cause the site to fail validation.

Table 3.2 is a list of some popular Netscape-specific tags. Later in this chapter, a strategy for dealing with these tags is described. A section in Chapter 4, "Netscape Enhancements to HTML," describes how to get the best of both worlds-putting up pages that take advantage of Netscape, while displaying acceptable quality to other browsers that follow the standard more closely.

Table 3.2 Common Netscape Tags and Attributes that Might Be Mistaken for Standard HTML

Tag	Attribute
`<BODY>`	`BGCOLOR TEXT LINK ALINK VLINK`
Multiple <`BODY>` tags `<CENTER>`Table caption with embedded headers (for example, `<TABLE><CAPTION><H2>...</H2></CAPTION>...) <TABLE WIDTH=400> <UL TYPE=Square> <HR SIZE=3 NOSHADE WIDTH=75% ALIGN=Center> <FONT...> <BLINK> <NOBR><FRAME>,<FRAMESET>,<NOFRAME> <EMBED>`	No longer supported by Netscape

Using Quotation Marks A generic HTML tag consists of three parts:

<TAG ATTRIBUTE=value>

You might have no attribute, one attribute, or more than one attribute.

The value of the attribute must be enclosed in quotation marks if the text of the attribute contains any characters except A through Z, a through z, 0 through 9, or a few others such as the period. When in doubt, quote. Thus, format a hypertext link something like this:

<A HREF="http://www.whitehouse.gov">

It is an error to leave off the quotation marks because a forward slash is not permitted unless it is within quotation marks.

It is also a common mistake to forget the final quotation mark:

<A HREF="http://www.whitehouse.gov>

The syntax in this example was accepted by Navigator 1.1, but in Navigator 2.0 and later versions, the text after the link doesn't display. Therefore, a developer who doesn't validate-and who instead checks the code with a browser-would have seen no problem in 1995 when putting up this code and checking it with the then-current Netscape 1.1. By 1996, though, when Netscape 2.0 began shipping, that same developer's pages would break.

Keeping Tags Balanced Most HTML tags come in pairs. For every <H1> there should be an </H1>. For every <EM> there must be an </EM>. It's easy to forget the trailing tag, and even easier to forget the slash in the trailing tag, leaving something like the following:

_<EM>This text is emphasized.<EM>

Occasionally, one also sees mismatched headers like the following:

_<H1>This is the headline.</H2>

Validators catch these problems.

Typos Spelling checkers catch many typographical errors, but desktop spelling checkers don't know about HTML tags, so it's difficult to use them on Web pages. It's possible to save a page as text and then check it.

ON THE WEB

http://www.eece.ksu.edu/~spectre/WebSter/spell.html Use the tool at this site to spell check the copy online.

What can be done, however, about spelling errors inside the HTML itself? Here's an example:

<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINKS="#0000FF" ALINKS="#FF0000" VLINKS="#FF00FF">

The human eye does a pretty good job of reading right over the errors. This tag is wrong-the LINK, ALINK, and VLINK attributes are typed incorrectly. A good browser just ignores anything it doesn't understand (in accordance with the Internet Robustness Principle), so the browser acts as though it sees the following:

<BODY BGCOLOR="#FFFFFF" TEXT="#000000">

Validators report incorrect tags such as these so that the developer can correct them.

Incorrect Nesting Every tag has a permitted context. The structure of an HTML document is shown here:

<HTML>
_<HEAD>
__Various head tags, such as TITLE, BASE, and META
_</HEAD>
_<BODY>
__Various body tags, such as <H1>...</H1>,
______and paragraphs <P>...</P>
_</BODY>
</HTML>

While most developers don't make the mistake of putting paragraphs in the header, some inadvertently do something like the following example.

Suppose a developer writes these three lines on a page:

<P><STRONG>Here is a key point.</STRONG>
<P>This text explains the key point.
<P><EM>Here is another point</EM>

These lines are valid HTML. As the site is developed, the author decides to change the emphasized paragraphs to headings. The developer's intent is that the strongly emphasized paragraph will become an H1; the emphasized paragraph will become an H2. Here is the result:

<H1>Here is a key point.
<P>This text explains the key point.
<H2>Here is another point.</H1>
</H2>

Even the best browser would become confused by this code, but fortunately, a validator catches this error so the developer can clarify the intent.

Forgotten Tags Developers frequently omit "unnecessary" tags. For example, the following code is legal HTML 2.0:

<P>Here is a paragraph.
<P>Here is another.
<P>And here is a third.

Under the now-obsolete HTML 1.0, <P> was a paragraph separator. It was an unpaired tag that typically was interpreted by browsers as a request for a bit of white space. Many pages still are written this way:

Here is a paragraph.<P>
Here is another.<P>
And here is a third.<P>

But starting with HTML 2.0, <P> became a paired tag, with strict usage calling for the formatting shown here:

<P>
Here is a paragraph.
</P>
<P>
Here is another.
</P>
<P>
And here is a third.
</P>

While the new style calls for a bit more typing, and is not required, it serves to mark clearly where paragraphs begin and end. This style helps some coders and serves to clarify things for browsers. Thus, it often is useful to write pages by using strict HTML and to validate them with strict DTDs.

What About Netscape-Specific Tags?

Validation is intended to give some assurance that the code will display correctly in any browser. By definition, browser-specific extensions will display correctly only in one browser. Netscape draws the most attention, of course, because that browser has such a large market share. Netscape Communications has announced that when HTML 3.0 is standardized, Netscape will support the standard. Indeed, many of the tags and attributes in HTML 3.2 originally appeared in Navigator.

Note

Many other browsers, such as Microsoft's Internet Explorer, currently support some or all of the Netscape extensions.

Thus, you may decide it's reasonable to validate against HTML Level 2 Strict, then add enough HTML Level 3 features to give your page the desired appearance. The resulting page should validate successfully against the HTML Level 3 standard.

Finally, if the client wants a particular effect (such as a change in font size) that can be accomplished only by using Netscape, you have to use the Netscape tags and do three things:

Validate against the HTML 3.2 standard. Any failures should be attributable to the Netscape-specific tags and attributes.
Validate against the Netscape DTD, such as it is. If the page fails validation, make sure that it's because the page uses a new feature that's not yet in the DTD. Test the page against multiple versions of Netscape, even if it passes validation. Test the page with other popular browsers to see how they handle the Netscape tags. Most browsers ignore tags that they don't understand, but that doesn't mean the result will look good.
Make sure that whoever is paying for the work understands how the page will look in browsers other than Netscape. Both the client and developer should be satisfied with the results.

If the desired page (as enhanced for Netscape) doesn't look acceptable in other browsers, don't just mark the page "Enhanced for Netscape!" For many reasons, at least five percent of the market does not use Navigator or a Navigator clone such as MSIE. Many of these users use browsers supplied by commercial online services (such as NetCruiser, from NetCom). These users are often among the least knowledgeable when it comes to understanding why Web pages have a certain appearance.

Various estimates place the size of the Web audience at around 30,000,000 people. Putting "Enhanced for Netscape!" on a site turns away over one million potential customers. A better solution is to redesign the page so that it takes advantage of Netscape-specific features, but still looks good in other browsers. Failing that, you might need to prepare more than one version of the page and use META REFRESH or another technique to serve up browser-specific versions of the page. This is a lot of extra work, but it is better than turning away five percent of the potential customers, or having them see shoddy work.

Tip

One of the fastest ways to separate Netscape Navigator and its clones from less sophisticated browsers is to include a line like

<META HTTP-EQUIV='REFRESH' CONTENT='0; /some/url.html'>

in the <HEAD> section. Navigator and other high-end browsers recognize the META REFRESH sequence and immediately call up the page named in the <META> tag. Other browsers ignore the <META> tag, so they display the contents of the original page.

The good news is that most pages can be made to validate under HTML 3.2 and then can be enhanced for Netscape without detracting from their appearance in other browsers. Chapter 4, "Netscape Enhancements to HTML," discusses techniques for preparing such pages.

Validator-Like Tools

WebTech and KGV are formal validators-they report places where a document does not conform to the DTD. A document can be valid HTML, though, and still be poor HTML.

What They Don't Teach You in Validator School

Part of what validators don't catch is content-related. Content problems are caught by copywriters, graphic artists, and human evaluators, as well as review by the client and developer. There are some other problems that can be caught by software, even though they are perfectly legal HTML.

Lack of ALT Tags Here's an example of code that passes validation, but is nonetheless broken:

<IMG SRC="Graphics/someGraphic.gif" HEIGHT=50 WIDTH=100>

The problem here is a missing ALT tag. When users visit this site with Lynx, or with a graphical browser with image loading turned off, they see a browser-specific placeholder. In Navigator, they see a "broken graphic" icon. In Lynx, they see [IMAGE].

By adding the ALT attribute, browsers that cannot display the graphic instead display the ALT text.

<IMG SRC="Graphics/someGraphic.gif" ALT="[Some Graphic]"
HEIGHT=50 WIDTH=100>

Out-of-Sequence Headings It's not an error to skip heading levels, but it's a poor idea. Some search engines look for <H1>, then <H2>, and so on to prepare an outline of the document. Yet, the code shown here is perfectly valid:

<H2>This is not the top level heading</H2>
<P>
Here is some text that is not the top-level heading.
</P>
<H1>This text should be the top level heading, but it is
buried inside the document</H1>
<P>
Here is some more text.
</P>

Some designers skip levels, going from H1 to H3. This technique is a bad idea, too. First, the reason people do this is often to get a specific visual effect, but no two browsers render headers in quite the same way, so this technique is not reliable for that purpose. Second, automated tools (like some robots) that attempt to build meaningful outlines may become confused by missing levels.

There are several software tools available online that can help locate problems like these, including Doctor HTML and Weblint.

Checking Your Pages with Doctor HTML

Once you have ensured that your pages validate against each of the DTDs you have selected, it's time to give your site a more rigorous workout.

One of the best online checkout tools is Doctor HTML, located at http://imagiware.com/RxHTML.cgi. Written by Thomas Tongue and Imagiware, Doctor HTML can run eight different tests on a page. The following list explains the tests in detail.

Document Structure This test looks at pairs of opening and closing tags. It highlights unpaired tags in a table by tag type. The Document Structure Test does not look at forms or tables-those are handled separately. The results of the Document Structure Test are shown in Figure 3.2.

Figure 3.2: Doctor HTML's Document Structure Report provides a quick look at possible tag mismatches.

Table Structure This test looks for matching pairs of table tags, and for stray table tags that appear outside any valid table.
Form Structure The Form Structure Test looks at the syntax of INPUT tags inside forms. The current version of Doctor HTML ignores the SELECT and TEXTAREA elements.
Image Analysis One of the most useful tests is the Image Analysis Test performed against IMG tags. Doctor HTML loads every image on the page, measures its size, determines its dimensions, and gives an estimate of the time it will take to download the image over a 14.4 kbps modem. The program also reports the dimensions (HEIGHT and WIDTH) and the number of colors in each graphic-the factors that determine overall size and download time. Figure 3.3 shows Doctor HTML's Image Analysis Test.
Image Syntax If ALT, HEIGHT, or WIDTH attributes are missing in an IMG tag, the Image Syntax Test notes the problem.
Spelling Check Unlike a spelling checker on the development machine (which sees the HTML tags), Doctor HTML checks the words that you (and your site's visitors) will see on-screen.
Hyperlink Analysis Another useful test, Hyperlink Analysis exercises all links that leave the page. The results of this test are shown in Figure 3.4. Links that take more than ten seconds to return are reported as "timed out." Links that lead to a server error are listed as "failed."

Figure 3.3: Doctor HTML's Image Analysis Test tells the developer which graphics contribute most to download time, and how to fix them.

Figure 3.4: Doctor HTML's Hyperlink Analysis Test shows which links are suspect.

Caution

This test has a difficult time with on-page named anchors such as <A HREF="#more">.

Sometimes a link returns an unusually small message, such as This site has moved. Doctor HTML shows the size of the returned page so that such small messages can be tested manually.

Command Hierarchy Doctor HTML shows an outline of the document based on the HTML tags. This command hierarchy is used to determine whether the document has an unusual structure, such as a missing HEAD or out-of-order headers.
Summary The Doctor HTML summary appears at the bottom and is the default page following the request for testing. Figure 3.5 shows a typical summary report. If anything happens so that the Doctor cannot return all the data, the summary does not appear. From the summary, you can link directly to the relevant portions of the report.

Figure 3.5: Doctor HTML's Summary Report contains a wealth of information about the page.

Checking Your Pages with Weblint

Another online tool is the Perl script Weblint, written by Neil Bowers of Khoral Research. Weblint is distinctive in that it's available online at

http://www.unipress.com/web-lint/

and also can be copied from the Internet to a developer's local machine. The gzipped tar file of Weblint is available from

ftp://ftp.khoral.com/pub/weblint/weblint-1.014.tar.gz

A ZIPped version is available at

ftp://ftp.khoral.com/pub/weblint/weblint.zip

The Weblint home page is

http://www.khoral.com/staff/neilb/weblint.html

Tip

KGV (described earlier) offers an integrated Weblint with a particularly rigorous mode called the "pedantic option." You'll find it worthwhile to use this service.

What Is a Lint? The original C compilers on UNIX let programmers get away with many poor practices. The language developers decided not to try to enforce good style in the compilers. Instead, compiler vendors wrote a lint, a program designed to "pick bits of fluff" from the program under inspection.

Weblint Warning Messages Weblint is capable of performing 24 separate checks of an HTML document. The following list is adapted from the README file of Weblint 1.014, by Neil Bowers.

Weblint can check the document for the following:

Basic structure
Unknown elements and element attributes
Context checks (where a tag must appear within a certain element)
Overlapped elements
A TITLE in the HEAD element
An ALT attribute in each IMG tag
Illegally nested elements
Mismatched tags (for example, <H1>...</H2>)
Unclosed elements (for example, <HEAD>...)
Multiple occurrences of elements that should only appear once
Presence of obsolete elements
Odd number of quotation marks in tag
Proper order of headings
Potentially unclosed tags (for example, <EM>...)
Markup embedded in comments (because this can confuse some browsers)
Use of here as anchor text
Use of tags where attributes are expected
Existence of local anchor targets
Case of tags
A <LINK REV=MADE HREF="mailto:...> in HEAD element
HTML 3 elements such as TABLE, MATH, and FIG
Leading and trailing whitespace in certain container elements (for example, <A ...>)
Optional support for the Java APPLET and PARAM elements
Optional support for Netscape tags

When Weblint is run from the command line, the following combination of checks gives a document the most thorough workout:

_weblint -pedantic -e upper-case, bad-link, require-doctype [filename]

The -pedantic switch turns on all warnings except case, bad-link, and require-doctype.

Note

The documentation says that -pedantic turns on all warnings except case, but that's incorrect.

The -e upper-case switch enables a warning about tags that aren't completely in uppercase. While there's nothing wrong with using lowercase, it's useful to be consistent. If you know that every occurrence of the BODY tag is <BODY> and never <body>, <Body>, or <BoDy>, then you can build automated tools that look at your document without worrying about tags that are in non-standard format.

The -e ..., bad-link switch enables a warning about missing links in the local directory. Consider the following example:

<A HREF="http://www.whitehouse.gov/">The White House</A>
<A HREF="theBrownHouse.html">The Brown House</A>
<A HREF="#myHouse">My House</A>

If you write this, Weblint (with the bad-link warning enabled) checks for the existence of the local file theBrownHouse.html. Links that begin with http:, news:, or mailto: are not checked. Neither are named anchors such as #myHouse.

The -e ..., require-doctype switch enables a warning about a missing <!DOCTYPE ...> tag.

Notice that the -x netscape switch is not included. Leave it off to show exactly which lines hold Netscape-specific tags. Never consider a page done until you're satisfied that you've eliminated as much Netscape-specific code as possible, and that you (and your client) can live with the rest. See Chapter 4, "Netscape Enhancements to HTML," for more specific recommendations.

If we use the Weblint settings in this section, and the sample code we tested earlier in the chapter with the WebTech validator and KGV, Weblint gives us these warning messages:

       line 2: <HEAD> must immediately follow <HTML>
       line 2: outer tags should be <HTML> .. </HTML>.
       line 4: odd number of quotes in element <BODY BACKGROUND=
			   "Graphics/white.gif>.
       line 4: <BODY> must immediately follow </HEAD>
       line 4: <BODY> cannot appear in the HEAD element.
       line 5: <H1> cannot appear in the HEAD element.
       line 6: <P> cannot appear in the HEAD element.
       line 8: <P> cannot appear in the HEAD element.
       line 11: unmatched </HTML> (no matching <HTML> seen).
       line 0: no closing </HEAD> seen for <HEAD> on line 2.
HTML source listing:
      1.<!-- select doctype above... -->
      2.<HEAD>
      3.<TITLE>Test</TITLE>
      4.<BODY BACKGROUND="Graphics/white.gif>
      5.<H1>This is header one</H1>
      6.<P>
      7.This document is about nothing at all.
      8.<P>
      9.But the HTML is not much good!
    10.</BODY>

Because Weblint is a Perl script and is available for download, you should pull it down onto the development machine. Here is an efficient process for delivering high-quality validated pages using a remote server:

Check out all pages from your Configuration Control System and test them against Weblint on the local development machine. Use the -pedantic and -e upper-case, bad-link, require-doctype switches.
Once all the pages in a site are clean according to Weblint, make a final pass at the directory level:
weblint -pedantic -e upper-case, bad-link, require-doctype -x netscape [site-directory-name]Weblint runs recursively through the directory. This check ensures that all sub-directories have a file named index.html (so that no one can browse the directories from outside the site) and serves as a double-check that all files have been linted.

Note

For this step, the -x netscape option is turned on. This option allows Weblint to read Netscape-specific tags without issuing a warning.
Copy the files from the development machine to the online server.
Test each page of the site online with KGV and the integrated Weblint. Make sure that each page is error-free. Figure 3.6 shows the online version of Weblint in action.
Test each page of the site with Doctor HTML. Doctor HTML evaluates a different set of criteria for each page and can show things that neither Weblint nor KGV has caught. Change pages as required so that they pass inspection by Doctor HTML. Return to step 1 or step 2 as required after making the changes.
Once all pages in a site pass all three checks (local Weblint, KGV with integrated Weblint, and Doctor HTML), check them back into the Configuration Control System. Annotate them with the fact that they have fully passed these tests.

Figure 3.6: Weblint is aggressive and picky-just what you want in a lint.

The HTML Source Listing With some online tools, such as KGV, any problematic source line is printed by the tool. With others, such as Weblint, it isn't. The forms interface for Weblint, available through

http://www.ccs.org/validate/

turns on the source listing by default. It's best if you leave it at that setting.

Checking Your Pages with LiveWire

One of the advantages of using LiveWire is that the Site Manager includes integrated tests of every page on the site. These tests are a subset of the tests run by KGV, Weblint, and Doctor HTML, so you can't use Site Manager to replace these tests. In fact, Site Manager's checks are confined to various checks of link integrity-a subset of the tests made by Doctor HTML. But, because the Site Manager runs quickly and locally, it's a nice supplement.

Webmasters who have developed sites before LiveWire know that the single most difficult task a Webmaster faces is keeping the links working. As files are moved, copied, and renamed, one link or another inevitably ends up with the wrong URL. Site Manager offers three tools to help the Webmaster deal with this problem: automated link reorganization, a link validity checker, and the capability to automatically correct case mismatches.

Internal Link Reorganization Suppose you have two pages that are under management with Site Manager: bar.html and baz.html. baz.html contains a link to bar.html, and vice versa. You decide to change the names of the files to something more meaningful, like first.html and last.html. In a conventional development environment, you change the file names and then painstakingly go through the files and change the links. In Site Manager, you have an easier way.

Make sure that bar.html and baz.html are under management. (Look for the red triangle on the file icon in the left pane in Site Manager.) Select the icon for bar.html and choose File, Rename to change the name to first.html. Do the same thing with baz.html, changing its name to last.html. Now examine both files with Netscape Navigator. The links are updated to reflect the new names.

Caution

Site Manager keeps the links up-to-date, but does not change your content. If you have a link that points to toc.html and the link text says "Table of Contents," changing toc.html to index.html changes the link, but the text still reads "Table of Contents."

Site Manager should be the focal point of your development process. Use Site Manager to add, delete, and modify your pages.

Caution

If you change a file outside Site Manager while Site Manager is running, the links are not updated. Try to avoid making changes outside Site Manager.

Checking the Links-Internal Links First Even with the help of Site Manager, some links inevitably break. You can check quickly for these links by using Site Manager's Check Links menu items. Site Manager defines internal links as those within the site. External links go outside the site, even if they link to other sites on the same server.

On most sites, it's a good idea to start the links check by checking internal links. Not only do you have more control over these links, but these are also the links that are most likely to be broken during the early development effort. Internal link verification is also faster than external link verification because internal links link to pages on your hard drive-but external links may have to be exercised over the Internet.

Select the site's development directory in the left pane. Have Site Manager test the internal links by choosing Site, Check Internal Links. Then open the Site Links tab in the right pane. Turn off external links, if necessary, to reduce the clutter. Resize the panes and columns of the Site Links tab so that you can see the information presented. Figure 3.7 shows the resul-ting window.

Figure 3.7: Use the Site Links tab to see the invalid internal links.

To fix the invalid links, select one of the links and choose Edit, Modify Links. In the dialog box that appears, change the link to one that is valid. As you correct broken links, the link disappears from the Site Links tab.

The field at the bottom of the Site Links tab shows which pages contain the invalid link that is selected in the top pane. You can see this information to make more specific changes on a page-by-page basis if the problem is more complex than just a typographic error.

Repairing Links that Have Mismatched Case The most common reason for invalid internal links is mismatched case. One person builds a page and calls it toc.html. The next person includes a link to TOC.html. The link is invalid because TOC.html doesn't exist.

Before LiveWire, Webmasters spent a lot of time on problems like this one. Now, Site Manager can fix these problems quickly. Choose Site, Repair Case Sense Problems-Site Manager puts up a list of links that would work, if only they were the right case. Figure 3.8 shows such a list.

Figure 3.8: Site Manager can quickly repair all links broken because of a case mismatch.

To see which page Site Manager thinks is the proper destination for the link, select the link with the case problem. Site Manager's choice is shown in the field at the bottom of the dialog box. If you want, you can edit that choice. When you're satisfied that Site Manager will do the right thing, click Fix Links.

Checking External Links After all of the internal links are working, you're ready to move on to external links. Checking external links often takes longer because the connection over the network is slower than the hard drive, but usually there are fewer external links than internal ones.

Choose Site, Check External Links to start the checking process. When it completes, open the Site Links tab in the right pane and check the External Links check box. Use the pop-up menu to restrict the view to Invalid links only if you need to reduce the clutter. Figure 3.9 shows a typical list of external links.

Figure 3.9: External links should be checked at least once a month on most sites to be sure that other Webmasters have not moved their pages or "gone dark".

If a more sophisticated fix is needed, use Modify Links from the Edit menu to fix links or edit the page with the invalid link.

Links that Cannot Be Checked You may notice that the status of some links says "Unchecked" or "Never Checked." Unchecked links should be rare-they represent links that have been added since you last selected Site, Check External Links. "Never Checked" links are links to non-Web servers, such as mailto links. The only way to verify these links is to send mail-which Site Manager leaves up to you.

Interactive Pages in Standard HTML

HTML forms are most programmers' first introduction to server-side scripts such as CGI or server-side JavaScript. Matt's Script Archive, online at http://www.worldwidemart.com/scripts/, contains formmail.pl, a CGI script that reads the contents of a form and sends it on to a designated recipient. As the complexity of the form grows, some Webmasters want to split it so that each page of the form depends upon the answers to the page before it. In order to build a multipart form, the concept of "state" must be added to HTTP.

The Hypertext Transfer Protocol, HTTP, the protocol of the Web, is stateless-that is, the server sees each request as a stand-alone transaction. When a user submits page 2 of a multipart form, the server and CGI scripts have no built-in mechanism for associating this user's page 2 with his or her page 1. These state-preserving mechanisms have to be grafted onto HTTP by using any of several techniques, including modified URLs, hidden fields, and Netscape "cookies."

When you write CGI scripts, you must choose one of these mechanisms and laboriously add it. Fortunately, when you use LiveWire (Netscape's application development environment), the Application Manager writes this section for you.

The nature of HTTP is that each request stands alone. If the server receives a series of requests for related files from the same server that's…well, all fine and good, but as far as the server is concerned, it has no reason to suspect that these requests may be from the same user. If you're trying to build, say, a shopping cart application, little things like keeping the right user with the right shopping cart become important. And HTTP provides no support for this task. None.

Prior to Netscape's introduction of LiveWire, all the work of state preservation was up to the CGI programmer.

This section shows how HTTP works and why it's impossible to remember state with HTTP alone. This section also shows some mechanisms that can be grafted onto HTTP to meet the need for state preservation.

An HTTP Review

Anyone who has entered an URL has wondered about the letters "http" and why they're omnipresent on the Web. HTTP is a series of handshakes that are exchanged between a browser, like Netscape Navigator, and the server.

You can find many different servers. CERN, the research center in Switzerland that did the original development of the Web, has one. So does the National Center for Supercomputer Applications (NCSA), the organization that did much of the early work on the graphical portions of the Web. Of course, Netscape sells two second-generation Web servers; one for entry-level use and one for high-volume sites and the Internet. The one thing all Web servers have in common is that they speak HTTP.

The definitive description of HTTP is found at

http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v10-spec-03.html

That document contains a detailed memo from the HTTP Working Group of the Internet Engineering Task Force. The current version, HTTP/1.0, is the standard for how all communication is accomplished over the Web.

Communication on the Internet takes place by using a set of protocols named TCP/IP, which stands for Transmission Control Protocol/Internet Protocol. Think of TCP/IP as being similar to the telephone system and HTTP as a conversation that two people have over the phone.

The Request When a user enters an URL, such as http://www.xyz.com/index.html, TCP/IP on the user's machine talks to the network name servers to find out the IP address of the xyz.com server. TCP/IP then opens a conversation with the machine named www at that domain. TCP defines a set of ports-each of which can provide some service-on a server. By default, the http server (commonly named httpd) is listening on port 80.

The client software (a browser like Netscape Navigator) starts the conversation. To get the file named index.html from www.xyz.com, the browser says the following to the designated port on the designated server:

Get /index.html/1.0

Note

After this line, the browser sends optional headers, followed by a second <CRLF> that causes the server to process the request.

Formally, index.html is an instance of a uniform resource identifier (URI). A uniform resource locator (URL) is a type of URI.

Note

Web specifications include provisions for identifiers to specify a particular document, regardless of where that document is located. Other provisions can enable a browser to recognize that two documents are different versions of the same original-differing in language, perhaps, or in format (for example, one may be plain text, and another might be in Adobe Portable Document Format, PDF). For now, most servers and browsers know about only one type of URI: the URL.

The GET method asks the server to return whatever information is indicated by the URI. If the URI represents a file (for example, index.html), then the contents of the file are returned. If the URI represents a process (such as formmail.cgi), then the server runs the process and sends the output.

Note

This explanation is a bit simplified, since the server has to be configured to run CGI scripts. Because this book concentrates on server-side JavaScript rather than CGI, this section does not describe how to configure a server for CGI.

Most commonly, the URI is expressed in terms relative to the document root of the server. For example, the server can be configured to serve pages starting at

/usr/local/etc/httpd/htdocs

If the user wants a file, for instance, whose full path is

/usr/local/etc/httpd/htdocs/hypertext/WWW/TheProject.html

the client sends the following instruction:

GET /hypertext/WWW/TheProject.html http/1.0

The http/1.0 at the end of the line indicates to the server which version of HTTP the client is able to accept. As the HTTP standard evolves, this field is used to provide backwards compatibility to older browsers.

The Response When the server gets a request, it generates a response. The response a client wants usually looks something like this:

HTTP/1.0 200 OK
Date: Mon, 19 Feb 1996 17:24:19 GMT
Server: Apache/1.0.2
Content-type: text/html
Content-length: 5244
Last-modified: Tue, 06 Feb 1996 19:23:01 GMT
<!DOCTYPE HTML PUBLIC "-//IETF/DTD HTML 3.0//EN">
<HTML>
<HEAD>
.
.
.
</BODY>
</HTML>

The first line is called the status line. It contains three elements, separated by spaces:

The HTTP version
The status code
The reason phrase

When the server is able to find and return an entity associated with the requested URI, the server returns status code 200, which has the reason phrase OK.

The first digit of the status code (the code returned by the Web server in the status line) defines the class of response. Table 3.3 lists the five classes.

Table 3.3 HTTP Response Status Code Classes

Code	Class	Meaning
`1xx`	Informational	These codes are not used, but are reserved for future use.
`2xx`	Success	The request was successfully received, understood, and accepted.
`3xx`	Redirection	Further action must be taken in order to complete the request.
`4xx`	Client error	The request contained bad syntax or could not be fulfilled through no fault of the server.
`5xx`	Server error	The server failed to fulfill an apparently valid request.

Table 3.4 shows the individual values of all status codes presently in use and a typical reason phrase for each code. Reason phrases are associated with status codes to provide a human-readable explanation of the status. These phrases are given as examples in the standard-each site or server can replace these phrases with local equivalents.

Table 3.4 Status Codes and Reason Phrases

Status Code	Reason Phrase
`200`	`OK`
`201`	`Created`
`202`	`Accepted`
`203`	`Partial Information`
`204`	`No Content`
`301`	`Moved Permanently`
`302`	`Moved Temporarily`
`303`	`Method`
`304`	`Not Modified`
`400`	`Bad Request`
`401`	`Unauthorized`
`402`	`Payment Required`
`403`	`Forbidden`
`404`	`Not Found`
`500`	`Internal Server Error`
`501`	`Not Implemented`
`502`	`Server Temporarily Overloaded (Bad Gateway)`
`503`	`Server Unavailable (Gateway Timeout)`

The most common responses are 200, 204, 302, 401, 404, and 500. These and other status codes are discussed more fully in the document located at

http://www.w3.org/hypertext/WWWProtocols/HTTP/HTRESP.html

Status code 200 was described earlier in this section. It means that the request has succeeded and data is coming.

Code 204 means that the document has been found, but it is completely empty. This code is returned if the developer associated an empty file with an URL, perhaps as a placeholder. The most common browser response when code 204 is returned is to leave the current data on-screen and put up an alert dialog box that says Document contains no data or something to that effect.

When a document has been moved, a code 3xx is returned. Code 302 is most commonly used when the URI is a CGI script that outputs something like the following:

_Location: http://www.xyz.com/newPage.html

Typically, this line is followed by two line feeds. A server-side JavaScript programmer initially sends a 302 response when he or she uses the redirect() function.

Most browsers recognize code 302 and look in the Location: line to see which URL to retrieve; they then issue a GET to the new location. Chapter 6, "LiveWire and Server-Side JavaScript," contains details about outputting Location: using redirect().

Status code 401 is seen when the user accesses a protected directory. The response includes a WWW-Authenticate header field with a challenge. Typically, a browser interprets a code 401 by giving the user an opportunity to enter a user name and password.

Status code 402 has some tantalizing possibilities. So far, it has not been implemented in any common browsers or servers. Chapter 18, "Learning More About Netscape ONE Technology," describes Netscape's plans to offer an online digital wallet that enables the user to pay a site owner.

When working on new CGI scripts, the developer frequently sees code 500. The most common explanation of code 500 is that the script has a syntax error or is producing a malformed header. LiveWire applications are much less likely to generate status code 500 because the header is generated by LiveWire itself, and not by the user's application.

Other Requests The preceding examples involve GET, the most common request. A client can also send requests involving HEAD, POST, and "conditional GET."

Note

The HTTP standard also provides for a PUT method. Although PUT is not commonly used on the Web, Netscape uses it to implement the "One-button Publishing" feature of Netscape Navigator Gold.

With one-button publishing, a person using Navigator Gold can send a Web page to the server with a single click of the mouse, without resorting to complex FTP software.

The HEAD request is just like the GET request, except no data is returned. HEAD can be used by special programs called proxy servers to test URIs, either to see whether an updated version is available or to ensure that the URI is available at all. Proxy servers are special server configurations that collect Web pages from standard servers, as though they were a Web client, and serve it back to Web clients, as though they were a conventional server.

See "Livewire and Server-Side JavaScript," p. 157 See "Learning More About Netscape ONE Technology," p. 455

POST is like GET in reverse; POST is used to send data to the server. Developers use POST most frequently when writing CGI scripts and applications to handle form output.

Note

As the LiveWire application developer, you don't see any difference between GET and POST, so it may be difficult for you to choose which method to use. The rule of thumb is this-some platforms put a limit on the number of characters that can be passed in environment variables, which is the method by which GET is implemented. STDIN-the mechanism used by POST-is not subject to such a limit. Unless you know that the number of characters is small, always use POST.

Typically, a POST request brings a code 200 or code 204 response.

Requests Through Proxy Servers Some online services, like America Online, set up machines to be proxy servers. A proxy server sits between the client and the real server. When the client sends a GET request to, say, www.xyz.com, the proxy server checks to see whether it has the requested data stored locally. This local storage is called a cache.

If the requested data is available in the cache, the proxy server determines whether to return the cached data or the version that's on the real server. This decision usually is made on the basis of time-if the proxy server has a recent copy of the data, it can be more efficient to return the cached copy.

To find out whether the data on the real server has been updated, the proxy server can send a conditional GET, like this:

GET index.html http/1.0
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
<CRLF>

If the request would not normally succeed, the response is the same as though the request were a GET. The request is processed as a GET if the date is invalid (including a date that's in the future). The request also is processed as a GET if the data has been modified since the specified date.

If the data has not been modified since the requested date, the server returns status code 304 (Not Modified).

If the proxy server sends a conditional GET, it either gets back data, or it doesn't. If it gets data, it updates the cache copy. If it gets code 304, it sends the cached copy to the user. If it gets any other code, it passes that code back to the client.

Header Fields If-Modified-Since is an example of a header field. Here are the four types of header fields:

General headers
Request headers
Response headers
Entity headers

General headers may be used on a request or on the data. Data can flow both ways. On a GET request, data comes from the server to the client. On a POST request, data goes to the server from the client. In either case, the data is known as the entity.

Here are the three general headers defined in the standard:

Date
MIME-Version
Pragma

By convention, the server should send its current date with the response. By the standard, only one Date header is allowed.

Although HTTP does not conform to the MIME standard, it is useful to report content types by using MIME notation. To avoid confusion, the server may send the MIME Version that it uses. MIME Version 1.0 is the default.

Optional behavior can be described in Pragma directives. HTTP/1.0 defines the nocache directive on request messages to tell proxy servers to ignore their cached copy and GET the entity from the server.

Request header fields are sent by the browser software. Here are the valid request header fields:

Authorization
From
If-Modified-Since
Referer
User-Agent

Referer can be used by LiveWire applications to determine the preceding link. For example, if an application developer announces a client's site to a major search engine, he or she can keep track of the Referer variable to see how often users follow that link to get to the client's site.

User-Agent is sent by the browser to report which software and version the user is running. This field ultimately appears in the request.agentproperty and can be used to return pages with browser-specific code.

Response header fields appear in server responses and can be used by the browser software. Here are the valid response header fields:

Location
Server
WWW-Authenticate

Location is the same "Location" mentioned earlier in this chapter, in the section entitled "The Response." Most browsers expect to see a Location field in a response with a 3xx code, and interpret it by requesting the entity at the new location.

Server gives the name and version number of the server software.

WWW-Authenticate is included in responses with status code 401. The syntax is

_WWW-Authenticate: 1#challenge_

The browser reads the challenge(s)-there must be at least one-and asks the user to respond. Most popular browsers handle this process with a dialog box that prompts the user for a user name and password. Figure 3.10 shows the Netscape FastTrack server and Netscape Navigator challenging a user for authentication information.

Figure 3.10: The server demands a valid user name and password before it allows access to this application.

Entity header fields contain information about the data. Recall that the data is called the entity; information about the contents of the entity body, or metainformation, is sent in entity header fields. Much of this information can be supplied in an HTML document by using the <META> tag. The earlier section of this chapter, "Validating and Checking HTML," shows one use of the <META> tag.

Here are the entity header fields:

Allow
Content-Encoding
Content-Length
Content-Type
Expires
Last-Modified

In addition, new field types can be added to an entity without extending the protocol. It's up to the author to determine what software (if any) will recognize the new type. Client software ignores entity headers that it doesn't recognize.

Note

Netscape uses this mechanism to implement the HTTP-EQUIV field in the <META> tag.

The Expires header is used as another mechanism to keep caches up-to-date. For example, an HTML document might contain the following line:

<META http-equiv="Expires" Contents="Thu, 01 Dec 1994 16:00:00 GMT">

This line means that a proxy server should discard the document at the indicated time and should not send out data after that time without retrieving a fresh copy from the server.

Note

The exact format of the date is specified by the standard, and the date must always be in Greenwich Mean Time (GMT).

Examples of Why Maintaining a State Is Useful

Nothing in HTTP associates the sender of one request with any other request, past or future. But suppose you want to implement a multipart form, like the one shown in Figures 3.11 and 3.12.

Figure 3.11: The user fills in the first page of the Mortgage Advisor form.

Figure 3.12: The script attached to the first page generates Mortgage Options, which are listed on the second page.

Here's another example. The first page, shown in Figure 3.13, is part of a shopping application-the user places items into a shopping cart.

Figure 3.13: The shopper adds items to the shopping cart.

Later, when the shopper reviews the order, the items displayed must match the ones the shopper has been putting in the cart. Figure 3.14 shows the order, presented for review.

Figure 3.14: The shopper reviews the order-state was preserved to be sure the shopper got the same order they started with.

After the user is finished shopping, the user checks out, as shown in Figure 13.15. The system must ensure that the shopping cart follows the user to the checkout page so that the order fulfillment portion of the site knows what to tell the site owner to ship.

Figure 3.15: Finally, the shopper checksout-and expects their same order to be processed and shipped to them.

Clearly, these and other applications will work only if you can find a way to graft state preservation onto the stateless HTTP.

A Range of Methods

Fundamentally, state information is retained in just two places: the client or the server. This section describes the mechanisms available to the LiveWire application developer.

Application Manager enables the user to choose from five techniques for state preservation. These choices appear in the right pane of Application Manager in the field entitled Client Object Maintenance, shown in Figure 3.16.

Figure 3.16: Application Manager affords the developer five techniques for preserving client state.

The client-based choices are:

client-cookie
client-url

The server-based choices are:

server-ip
server-cookie
server-URL

The remainder of this section describes how these options work and identifies when they are appropriate choices.

Client URL To see how Client URL state preservation works, go to Application Manager and select the World application. Follow the Modify link; when the right pane shows the modify frame, change the Client Object Maintenance type to client-url. Figure 3.17 shows this change in progress.

Figure 3.17: Change the Client Object Maintenance type through Application Manager.

Now, choose http://your.server.domain/world/ to run the application. Enter your name in the field and press Enter-and watch the URL at the top of the Window. You sent the browser to http://your.server.domain/world/-the default page is actually hello.html-but the browser you are at http://your.server.domain/world hello.html?NETSCAPE_LIVEWIRE.oldname=null. Run the application again, and the URL changes to http://your.server.domain/world/hello.html?NETSCAPE_LIVEWIRE.oldname=yourName. What is going on here?

Open your editor to the source file for hello.html. (Don't just do a View, Document Source. You need to see what's between the <SERVER> tags.)

The operative line is the one that reads

client.oldname = request.newname;

When you run the application the first time, you have not yet submitted the form, so the properties of the request object that come from the form are null. The assignment statement stuffs that null into a property that the programmer defined on the client object: oldname. When the application finishes with the request, it has to store the property oldname somewhere so that it can reconstruct the client object on the next request.

Where will it store client's properties? Where you told it to-in the URL.

When the programmer writes the source code, he or she shouldn't have to worry about which mechanism you are going to choose for state preservation. So, the programmer tells the application to submit the form to hello.html (in the ACTION attribute of the FORM tag). How did hello.html transform into /hello.html?NETSCAPE_LIVEWIRE.oldname=yourName?

If you have a UNIX machine, you can find out by going to the command line. (The procedure is a bit different on a Windows NT server, but the process is the same.) Change directories to the location of the hello.html source file. Now enter

lwcomp -d -c hello.html | more

For now, it's enough to know that the -dcompiler switch causes the compiler to show the code it produced from the input file. The -c switch tells the compiler not to produce a new output file-just check the code. The pipe to more is useful because the line you're interested in is near the top of the file, and you don't want the output to scroll off the top of the screen.

Look at the lines that write the <FORM...> tag:

write<"\n\n<h3> Enter your name... </h3>\n<form method=\"post\" action=\"");
writeURL("hello.html");
write("\">\n<input type=...

writeURL() is a special server-side JavaScript function that knows enough to look at the current state preservation mechanism before it runs. If the state preservation mechanism is set to client-url, as it is now, writeURL() appends the client information to the URL.

To see this, go back to the browser window that is running World and do a View, Document Source. Here you see the line that LiveWire compiler actually produced in response to the code that contained the writeURL("hello.html"):

<form method="post" action="hello.html?NETSCAPE_LIVEWIRE.oldname="null">

When the server gets the request, it is in the form

POST /hello.html?NETSCAPE_LIVEWIRE.oldname="null" HTTP/1.0

followed by the contents of the field on the form. The server runs the LiveWire application that is associated with the hello.html page-LiveWire pulls off the oldname parameter and attaches it to the client object.

Tip

In most cases, the LiveWire compiler can figure out where the URLs are and substitute writeURL() for write(). If you know that your application is building an URL dynamically, be sure to use writeURL().

You can check to see which function the compiler is using by looking at the compiler's output with the -d switch.

Always use writeURL() for dynamic URLs, even if you plan to use a state-preserving mechanism that doesn't rely on URLs. That way, a Webmaster can safely change Client Object Maintenance to client-url or server-url without having to worry about whether the application will run correctly.

The principal advantage of the client-url method is that it works with all browsers. Most Web sites receive the largest percentage of visits from Netscape browsers, but still find that 10 to 30 percent of their visitors use non-Netscape browsers.

The principal disadvantage of this approach is that, as the number of parameters grows, the URL can become quite long. Some applications have five, ten, or more parameters attached to their client object. Using the client to remember these parameters can take up a fair amount of bandwidth.

Caution

Note that all of the encoding for the URL must be in place before the content is sent back to the client machine. After the page is returned to the client, no more opportunity exists to add or change properties on the client object.

Try to finish the setup of the client object before you begin to write to the client. As a minimum, recall that output to the client is written to a 64K buffer-after you've written 64K, the buffer is sent. (The buffer is sent before that time if your application calls flush().) If you use client-URL encoding, you must finish setting the properties in the client object before the buffer is sent.

Caution

Another drawback of both client- and server-urls is that the client must stay in the application to preserve the information in the URL. If a user is in the middle of the application, and then goes out to another site, his or her URL information can be lost.

Server url Go back to Application Manager and change Client Object Maintenance to server-url. Now go back to the World application and repeat the process of entering your name in the field.

This time, the URL says something like

http://your.server.domain/world/hello.html?NETSCAPE_LIVEWIRE_ID=002822754597

The URL doesn't hold the contents of the client object. Instead, it holds an ID number that points to a bit of shared memory on the server. The client maintains the pointer just like it maintained the data itself when you used client-url. When the user submits the form, LiveWire strips off the ID and uses it to look up the client properties and set up the client object.

The server-url mechanism for preserving state offers the same advantage as client-url-it works with any browser, not just Netscape Navigator. And it consumes far less bandwidth because only the ID number is passed back and forth to the server.

Server IP Another approach that works with all browsers-though not with all service providers-is the server-ip mechanism.

If you rerun the experiment from the last two sections with Client Object Maintenance set to server-ip, you won't see anything unusual appearing in the URL. Instead, the server keeps track of which user is which by examining the client's Internet Protocol (IP) address. This approach works well as long as each client has a unique fixed IP address. On the Internet, though, that assumption often breaks down.

Note

Are We Running out of IP Addresses?

On the Internet, many people connect through service providers that dynamically allocate IP addresses. IP addresses are a 32-bit number, usually written as four 8-bit numbers in dotted-decimal form, like this: 207.20.8.1. An eight-bit number can express 256 different values, so there are only 2564 unique IP addresses, which works out to a theoretical maximum of 4,294,967,296. This number is much higher than the practical limit- some numbers are reserved for special purposes, and the numbers are broken into five classes, depending on how many computers a company is placing on the Net.

By some estimates, nearly 10 million computers are on the Internet today on a full-time basis. However, the rate of growth is fast enough and the practical limit low enough that valid concern exists about running out of IP addresses. All of the huge class A addresses have been allocated, and most of the class Bs are in use, so many large companies are making do with multiple class C addresses.

One stop-gap measure is for an ISP to dynamically allocate its assigned addresses to users as they connect. Suppose an ISP has around 2,000 subscribers, but at any given moment only about 200 of them are online. Instead of tying up 2,000 IP addresses, the ISP may request a block of 255-called a Class C address-and give each user an IP address when they connect. As long as the number of users connected is never more than 255, they can service all of their subscribers.

Some systems use CIDR (Classless Interdomain Routing) or DHCP (Dynamic Host Configuration Protocol) to help with this problem. Others are holding out for a next generation of IP, called Ipng. When IPng becomes a reality, perhaps all machines will have a unique address. Until that day, the server-ip mechanism is reliable only in a controlled environment, such as an intranet.

Intranets often have most of their machines permanently online. A large company may have a single Class B license, with over 65,000 unique addresses. For most intranets, server-ip can offer all of the advantages of the URL-based methods, yet consumes no extra bandwidth at all.

Of course, many intranets are large enough for applications to be accessed through in-house proxy servers. This design can break the server-ip method as well because each request comes to the application from the IP address of the proxy server.

In short, feel free to use server-ip if you can, but be aware of the restrictions.

Cookies

The remaining methods are based on the Netscape cookie-a browser-specific mechanism introduced by Netscape and now used by about a dozen browsers. This section describes how cookies work in general and shows how they are used by LiveWire.

Setting Up a Cookie

To start using a cookie, a server application must ask the user's browser to set up a cookie. The server sends a header like this:

Set-Cookie: NAME=VALUE; expires=DATE; path=PATH; domain=DOMAIN_NAME; secure

If the server application is a CGI script, the programmer has to manage each of these fields directly. If the application is a LiveWire application, the installer just has to set Client Object Maintenance to one of the cookie mechanisms. Nevertheless, understanding each field enables you to know what LiveWire can do for you.

Tip

The latest specification for Netscape cookies is available online at http://www.netscape.com/newsref/std/cookie_spec.html.

NAME The application sets the name to something meaningful-LiveWire always uses NETSCAPE_LIVEWIRE.propName=propValue; to avoid conflicts with other applications such as CGI scripts. In a multipage survey for the XYZ company, NAME may be set to PRODUCT=BaffleBlaster. NAME is the only required field in Set-Cookie.

expires After a server asks the browser to set up a cookie, that cookie remains on the user's system until the cookie expires. When the user visits the site again, the browser presents its cookie, and the application can read the information stored in it. For some applications, a cookie may be useful for an indefinite period. For others, the cookie has a definite lifetime. In the example of the survey, the cookie is not useful after the survey ends. Using the standard HTTP date notation in Greenwich Mean Time (GMT), an application can force the cookie to expire by sending an expiration date, as shown throughout this chapter. Here is an example:

Set-Cookie: NAME=XYZSurvey12; expires=Mon, 03-Jun-96 00:00:00 GMT;

After the expiration date is reached, the cookie is no longer stored or given out. If no expiration date is given, the cookie expires when the user exits the browser. LiveWire applications set up a default expiration of the client object of ten minutes but leave the expires field empty.

Unexpired cookies are deleted from the client's disk if certain internal limits are hit. For example, Navigator has a limit of 300 cookies, with no more than 20 cookies per path and domain. The maximum size of one cookie is 4K.

domain Each cookie has a domain for which it is valid. When an application asks a browser to set up or send its cookie, the browser compares the URL of the server with the domain attributes of its cookies. The browser looks for a tail match. That is, if the cookie domain is xyz.com, the domain matches www.xyz.com, or pluto.xyz.com, or mercury.xyz.com. If the domain is one of the seven special top-level domains, the browser expects at least two periods in the matching domain. If the domain is not one of the special seven, there must be at least three periods. The seven special domains are COM, EDU, NET, ORG, GOV, MIL, and INT. Thus, www.xyz.com matches xyz.com, but atl.ga.us does not match ga.us.

If no domain is specified, the browser uses the name of the server as the default domain name. LiveWire does not set the domain.

Order is important in Set-Cookie. If you set up your own cookies (as CGI scripts do), do not put the domain before the name, or the browser becomes confused.

path If the server domain's tail matches a cookie's domain attribute, the browser performs a path match. The purpose of path-matching is to allow multiple cookies per server. For example, a user visiting www.xyz.com may take a survey at http://www.xyz.com/survey/ and get a cookie named XYZSurvey12. That user may also report a tech support problem at http://www.xyz.com/techSupport/ and get a cookie called XYZTechSupport. Each of these cookies should set the path so that the appropriate cookie is retrieved later.

Tip

Note that, because of a defect in Netscape Navigator 1.1 and earlier, cookies that have an expires attribute must have their path explicitly set to "/" in order for the cookie to be saved correctly. As the old versions of Netscape disappear from the Internet, this fact will become less significant.

Paths match from the top down. A cookie with path /techSupport matches a request on the same domain from /techSupport/wordProcessingProducts/.

By default, the path attribute is set to the path of the URL that responded with the Set-Cookie request. For example, when you access the World application at http://your.server.com/world/, LiveWire sets the path to /world.

secure A cookie is marked secure by putting the word secure at the end of the request. A secure cookie is sent only to a server offering HTTPS (HTTP over SSL).

By default, cookies are sent in the clear over nonsecure channels. The current version of LiveWire does not use the secure field.

Making Cookies Visible Cookies are stored in a disk file on the client computer. For example, Netscape Navigator stores cookies on a Windows computer in a file called cookies.txt. On a UNIX machine, Navigator uses the file name cookies. On the Mac, the file is called MagicCookie. These files are simple ASCII text-you can examine them with a text editor. Don't change them, though, or you can confuse the applications that wrote them.

Most LiveWire application cookies don't make it into the cookies file, however, because they are set to expire when the user quits the browser. To see LiveWire application cookies, you have to pretend to be the browser.

Note

As an application programmer, you can set the client to expire a given number of seconds after the last request. The default is 600 seconds, or 10 minutes.

If an application calls the client.expiration() method and uses one of the cookie-based Client state maintenance mechanisms, the browser saves the cookie to the hard drive.

An Example with Cookies

Set Client Object Maintenance for the World application to client-cookie. Use telnet to log into your server.

By default, Web servers listen to port 80. If the URL for the World application is http://your.server.domain/world/,connect to port 80. If the URL looks like http://your.server.domain:somePort/world/, connect to the indicated port.

After you are connected, send

HEAD /world/hello.html http/1.0

Press the Enter key twice after typing that line-once to end the line, and once to tell the server that you've sent all the header lines (in this case, none). Note that the HEAD method is used because you don't want to see all the lines of the page-just the header.

Your server responds with something like this:

HTTP/1.0 200 OK
Server: Netscape-FastTrack/2.0a
Date: Sat, 29 Jun 1996 10:52:32 GMT
Set-cookie: NETSCAPE_LIVEWIRE.number=0; path=/world
Set-cookie: NETSCAPE_LIVEWIRE.oldname=null; path=/world
Content-type: text/html

You'll recognize many of these header lines from the earlier discussion of HTTP. The Set-cookie lines tell the browser to remember this information and send it back with subsequent requests.

Caution

Like the URL mechanism, cookies must be sent to the client before the page contents. Try to set up the client object's properties before sending content. You must set up the client's properties before the buffer flushes at 64K, or you'll miss your chance to put the properties in the cookie.

An Example with Short Cookies

Like client-URLs, client-cookies can start consuming a fair amount of bandwidth. More importantly, large numbers of client properties can overflow the browser cookie table. Recall that Navigator has a limit of 300 cookies, with no more than 20 cookies per path and domain. The maximum size of one cookie is 4K. If your application requires more client properties, consider switching to short cookies, called server-cookies in Application Manager.

Using the short cookie technique, the cookie contains an ID number that identifies where on the server the data may be found. For example, a shopping cart script might use a cookie on the user's machine to store the fact that this shopper is working on order 142. When the shopper connects to the server, the server looks up its record of order 142 to find out what items are on the order.

Change Client Object Maintenance for the World application once more-to server cookies. Now go back to telnet and again make a HEAD request to World. This time you see:

HTTP/1.0 200 OK
Server: Netscape-FastTrack/2.0a
Date: Sat, 29 Jun 1996 19:03:44 GMT
Set-cookie: NETSCAPE_LIVEWIRE_ID=002823513984; path=/world; 
expires Sat, 29 Jun 1996 19:13:44 GMT
Content-type: text/html

Note first that the short cookie is an ID number-it works the same way the ID did in the server URL. Note, too, the expiration date. A user can get this cookie, exit the browser, and then restart the browser, and still join his or her old client object.

On intranets, where all of the browsers are cookie-aware, server-cookies have a lot of advantages. Because they write only one cookie (with the ID) to the browser, the network overhead is negligible, and the browser's cookie table is unlikely to overflow.

On the Internet, however, Webmasters still have to deal with the possibility of getting visits from browsers that don't know about cookies. These sites may be best served by a mechanism like server-URL that is browser independent.

Tip

After you select a state-preserving mechanism that you will use most of the time, go back to Application Manager, choose the Config link, and select that mechanism as the default.

Developing Client-Server Applications

In general, servers are more powerful machines than the desktop clients. Often, you can take advantage of this fact by distributing processing between the two machines and piggybacking information for the client onto the client's state-preservation mechanism.

Here's an example of how that works: Suppose you are using client cookies as your state preservation mechanism. Then the client-side JavaScript object document.cookie contains cookies whose names begin with NETSCAPE_LIVEWIRE. Listing 3.2 shows a function that gives the client access to the cookies. Listing 3.3 shows a function to set or change the cookie.

Listing 3.2 -Read a Netscape Cookie from the Client Script

function getCookie(Name) {
  var search = "NETSAPE_LIVEWIRE." + Name + "="
  var RetStr = ""
  var offset = 0
  var end    = 0
  if (document.cookie.length > 0) {
    offset = document.cookie.indexOf(search)
    if (offset != -1) {
        offset += search.length
        end = document.cookie.indexOf(";", offset)
        if (end == -1)
          end = document.cookie.length
    RetStr = decode(document.cookie.substring(offset, end))
    }
  }
  return (RetStr);
}

Listing 3.3 -Set a Netscape Cookie from the Client Script

function setCookie(Name, Value, Expire) {
  document.cookie = "NETSCAPE_LIVEWIRE." + Name + "="
  + encode(Value)
  + ((Expire == null) ? "" : ("; expires = Expire.toGMTString()))
}

Listing 3.4 shows how to use these two functions.

Listing 3.4 -Use a Netscape Cookie from the Client Script

var Kill = new Date()
Kill.setDate(Kill.getDate() + 7)
var value = getCookie("answer")
if (value == "")
  setCookie("answer", "42", Kill)
else
  document.write("The answer is ", value)

What's Next for HTML?

For several months, the move from HTML 2.0 to HTML 3.0 languished. The draft version of the HTML 3.0 standard was allowed to expire while working groups debated various aspects of the standard. Finally, the World Wide Web Consortium has announced a new specification (HTML 3.2) developed in cooperation with the leading browser vendors (including Netscape, Microsoft, and Sun).

What's New in HTML 3.2?

HTML 3.2 includes several new features that have been part of the day-to-day HTML world for several months, including tables, applets, and the <!DOCTYPE...> tag.

ON THE WEB

http://www.w3.org/pub/WWW/MarkUp/Wilbur/ This site contains an overview of the new HTML 3.2 specification, including links to the "Features at a Glance" page and the working draft of the specification.

ON THE WEB

http://www.w3.org/pub/WWW/TR/WD-html32.html Go directly to the latest draft of the HTML 3.2 specification. The HTML 3.2 standard is a "work in progress," but has advanced sufficiently so that many browser vendors are using it to guide their development efforts.

Hot Topics for the Next Generation

HTML 3.2 is by no means the last word in HTML standardization. Indeed, it is only a working draft, and discussions continue about exactly which features will appear in the final version. Look for new developments in scripting, forms, frames, and "meta-math"-a proposal for an extensible notation for math that can be processed by symbolic math systems.

ON THE WEB

http://www.w3.org/pub/WWW/TR/ This site contains links to technical reports and publications of the World Wide Web Consortium. Use this site to follow the working drafts of various proposed HTML features.

Participating in the Standardization Process

Membership in the World Wide Web Consortium's HTML working group is open to all interested parties. You can join the HTML Working Group by sending a subscription request to html-wg-request@w3.org. A subscription request should contain the word subscribe in the Subject field of the message. (If you want to subscribe under a different address, put that address in the Reply-To field of the message.) You can also get help about the list, or information about the archives, by putting help or archive help in the Subject field instead of subscribe.

ON THE WEB

http://www.w3.org/pub/WWW/MarkUp/HTML-WG/ This site contains information about the HTML Working Group of the Internet Engineering Task Force. Visit here to learn more about participating in the ongoing maintenance of the HTML standard.

ON THE WEB

http://www.w3.org/pub/WWW/MarkUp/Activity/ Visit this site to see the World Wide Web Consortium's statement of direction concerning HTML. The index page here includes links to practical advice on which HTML tags can be considered "standard."

In the SDK…

Netscape has publicly stated its commitment to support standard HTML in Navigator. Netscape participates in the standardization process; many of the tags and attributes in HTML 3.2 appeared first as a Netscape extension.

To learn more about Netscape's view of standard HTML, and about HTML in general, download the HTML Reference from the Netscape ONE SDK site.

ON THE WEB

http://developer.netscape.com/library/documentation/htmlguid/index.htm Reach this online guide through the Netscape ONE SDK site. It describes a range of HTML features, with emphasis on the elements added by Netscape.

Chapter 3

Standard HTML

CONTENTS

Navigator Is Based on Standard HTML

Bad HTML Breaks Browsers (Even Navigator)

How to Validate Your Web Pages

Six Common Problems that Keep Sites from Validating

What About Netscape-Specific Tags?

What They Don't Teach You in Validator School

Checking Your Pages with Doctor HTML

Checking Your Pages with Weblint

Checking Your Pages with LiveWire

An HTTP Review

Examples of Why Maintaining a State Is Useful

A Range of Methods

Setting Up a Cookie

An Example with Cookies

An Example with Short Cookies

What's New in HTML 3.2?

Hot Topics for the Next Generation

Participating in the Standardization Process