Linux Unleashed, Third Edition: HTML Programming Basics


Maintaining HTML

Your job doesn’t end once you have written a Web document and made it available to the world. Unless your document is a simple text file, it will have embedded links to other documents or Web servers. These links must be verified at regular intervals. The integrity of your Web pages should also be checked periodically, to ensure that the flow of the document from your home page is correct.

There are several utilities available to help you check links and also to scan the Web for other sites or documents you may want to provide a hyperlink to. These utilities go by a number of names, such as robot, spider, or wanderer. They are all programs that move across the Web automatically, creating a list of Web links that you can access. (Spiders are similar to the Archie and Veronica tools for the Internet, although neither of these covers the Web.)

Although they are often thought of as utilities for users only (to get a list of sites to try), spiders and their kin are useful for document authors, too, because they show potentially useful and interesting links. One of the best-known spiders is the World Wide Web Worm, or WWWW. WWWW enables you to search for keywords or build a Boolean search, and it offers several search types covering titles, documents, and more (including a search of all known HTML pages).

Note:
A copy of World Wide Web Worm can be obtained from http://www.cs.colorado.edu/home/mcbryan/WWWW.html. WebCrawler is available from http://www.biotech.washington.edu/WebCrawler/WebCrawler.html.

Another useful spider is WebCrawler, which works much like WWWW except that it can scan entire documents for matches of any keywords. It displays the results in an ordered list, from closest match to least likely match.

A common problem with HTML documents as they age is that their links may point to files or servers that no longer exist (because either the locations or the documents have changed). It is therefore good practice to validate the hyperlinks in a document on a regular basis. A popular hyperlink analyzer is HTML_ANALYZER. It examines each hyperlink and the contents of the hyperlink to ensure that they are consistent. HTML_ANALYZER works by examining a document for all links and then creating a text file that lists them. It then uses that list to compare the actual link content to what it should be.
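The first step described above, scanning a document for all links and collecting them in a list, can be sketched with Python's standard html.parser module. This is only an illustration of the idea, not HTML_ANALYZER's own code; the names LinkCollector, extract_links, and SAMPLE_PAGE are hypothetical, and the sample page is invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the <a> element being read, if any
        self._text = []       # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the text is easy to compare.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html_text):
    """Return every (href, anchor text) pair found in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<h1>My Home Page</h1>
<p>See the <a href="docs/intro.html">introduction</a> or
visit <a href="http://www.example.com/">an example site</a>.</p>
</body></html>
"""

if __name__ == "__main__":
    # Print one link per line, as a link list file might store them.
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text, sep="\t")
```

Writing the pairs to a text file, one per line, would give a link list that later checks can read back without parsing the HTML again.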

HTML_ANALYZER actually performs three tests: it validates the availability of the documents pointed to by hyperlinks (called validation); it looks for hyperlink contents that occur in the database but are not themselves hyperlinks (called completeness); and it looks for a one-to-one relation between hyperlinks and the contents of the hyperlink (called consistency). Any deviations are listed for the user.
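Two of these tests can be roughed out in a few lines of Python. The sketch below rests on stated assumptions: links are given as (href, anchor text) pairs, availability is decided by a caller-supplied function (here called is_available, a hypothetical name), and "one-to-one" is read as meaning that each anchor text should lead to exactly one target and each target should have exactly one anchor text. HTML_ANALYZER's real rules may differ.

```python
from collections import defaultdict


def find_unavailable(links, is_available):
    """Validation: return the hrefs for which is_available(href) is false."""
    return sorted({href for href, _ in links if not is_available(href)})


def find_inconsistent(links):
    """Consistency: report pairings that are not one-to-one.

    Returns two dicts: anchor texts that lead to more than one href,
    and hrefs that are reached through more than one anchor text.
    """
    hrefs_by_text = defaultdict(set)
    texts_by_href = defaultdict(set)
    for href, text in links:
        hrefs_by_text[text].add(href)
        texts_by_href[href].add(text)
    ambiguous_texts = {t: sorted(h) for t, h in hrefs_by_text.items() if len(h) > 1}
    ambiguous_hrefs = {h: sorted(t) for h, t in texts_by_href.items() if len(t) > 1}
    return ambiguous_texts, ambiguous_hrefs


if __name__ == "__main__":
    # Hypothetical link list and set of documents known to exist.
    links = [
        ("a.html", "Intro"),
        ("b.html", "Intro"),
        ("c.html", "Contact"),
        ("c.html", "Mail us"),
    ]
    existing = {"a.html", "c.html"}
    print(find_unavailable(links, lambda href: href in existing))
    print(find_inconsistent(links))
```

Passing the availability check in as a function keeps the sketch testable offline; a real check might test whether a local file exists or request the URL over the network.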

HTML_ANALYZER users should be familiar with HTML, their operating system, and the use of command-line-driven analyzers. The tool must be compiled using the make utility prior to execution. Several directories must be created before running HTML_ANALYZER, and when it runs, it creates several temporary files that are not cleaned up. HTML_ANALYZER is therefore not a good utility for a novice.

HTML Programming Basics

HyperText Markup Language (HTML) is quite an easy language to learn and work with, and as new versions have been introduced over the last few years it has become quite powerful, too. We can’t hope to teach you HTML in a single chapter in this book, but we can give you an overview of the language and how to use the basics to produce a simple Web page or two.

If you’ve seen a Web page before, you have seen the results of HTML. HTML is the language used to describe the content and structure of a Web page. The server transfers the HTML instructions to your browser, which converts those lines of HTML into the text, images, and layouts you see on the page. A Web browser is usually used to access HTML code, but there are other tools that can do the same. There is a wide variety of browsers out there, starting with the granddaddy of them all, NCSA’s Mosaic. Netscape’s Navigator is the most widely used browser right now, although Microsoft is slowly making inroads with its Explorer. The browser you use doesn’t matter, as they mostly do the same job: displaying the HTML code they receive from the server. A browser is almost always acting as a client, requesting information from the server.

HTML is based on another language called SGML (Standard Generalized Markup Language), which is used to describe the structure of a document and allow for better migration from one documenting tool to another. HTML does not describe how a page will look; it’s not a page description language like PostScript. Instead, HTML describes the structure of a document. It indicates which text is a heading, which is the body of the document, and where pictures should go. However, it does not give explicit instructions on how the page will look; that’s up to the browser.
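To make the distinction concrete, the sketch below feeds a small, hypothetical page to Python's standard html.parser module and lists the role that each tag assigns to its content. The page, the ROLES table, and the names OutlineBuilder and build_outline are invented for illustration. Note that nothing in the markup selects a font, a size, or a position on the screen.

```python
from html.parser import HTMLParser

# Hypothetical page: the tags name roles (title, heading, paragraph, image).
PAGE = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome</h1>
<p>This is the body text.</p>
<img src="logo.gif" alt="Company logo">
</body>
</html>
"""

# The structural role each tag of interest stands for.
ROLES = {
    "title": "document title",
    "h1": "top-level heading",
    "p": "paragraph",
    "img": "image",
}


class OutlineBuilder(HTMLParser):
    """Record (role, content) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._role = None   # role of the element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # An image has no text content; record its source file instead.
            self.outline.append((ROLES[tag], dict(attrs).get("src", "")))
        elif tag in ROLES:
            self._role = ROLES[tag]

    def handle_data(self, data):
        if self._role and data.strip():
            self.outline.append((self._role, data.strip()))

    def handle_endtag(self, tag):
        if tag in ROLES:
            self._role = None


def build_outline(html_text):
    """Return the structural outline of html_text as (role, content) pairs."""
    builder = OutlineBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.outline


if __name__ == "__main__":
    for role, content in build_outline(PAGE):
        print(f"{role}: {content}")
```

How the heading, the paragraph, and the image are actually drawn is decided by whichever browser receives the page.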

Why use HTML? Primarily because it is a small language, and so can transfer instructions over a network quickly. HTML does have limitations because of its size, but newer versions of the language are expanding the capabilities a little. The other major advantage of HTML is one most people don’t think about: It is device independent. It doesn’t matter what machine you run; a Web browser will take the same HTML code and translate it for the platform. The browser is the part that is device dependent. That means you can use HTML to write a Web page and not care what machine is used to read it.
