Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98
Part III XML
CHAPTER 11 Introduction to XML
by Eric Ladd
- In this chapter
- Why XML?
- XML Overview
- Linking with XML
- Using Style Sheets with XML
- The XSL Requirements Summary
- Applications
- XML Software
- References
Why XML?
HTML is a fairly simple language, simple enough to have made Web publishing accessible to many people. Its rules are also straightforward enough that scores of programmers have written HTML editing tools that enable content publishers to prepare a document without knowing any HTML. This has opened up the Web to even more people. Indeed, HTML's ease of use is probably one of the biggest reasons for the explosive growth of the Web.
HTML, however, is not without its problems. For one thing, it is too restrictive. You probably know that HTML is an application of the Standard Generalized Markup Language (SGML), restricted to a certain set of rules. After you start using SGML in a specific way, you sacrifice much of the flexibility you get by using less constrained SGML. This means that you are less likely to be able to describe more complex documents with HTML or with any restricted form of SGML.
Another issue with HTML is that, as it evolved, its tags became more focused on describing how content should be presented than on what the content is. You read in Chapter 9, "Style Sheets," how the idea of a style sheet helps to separate the nature of content from its presentation. Style sheets are a step in the right direction, but they have only recently been adopted, and it will take some time before HTML becomes completely free of tags that describe presentation.
What's the solution to the problem with HTML then? The answer is the eXtensible Markup Language (XML). Because XML stays focused on content description, you don't run the risk of having any XML markup specifying how to present the content. Additionally, XML is highly extensible, meaning that it is flexible enough to handle a simple document, such as a home page, or a huge document, such as "War and Peace." This chapter introduces you to XML, shows you how to use it, and explains why it is so important to the future of Web publishing.
To better understand why XML is the next wave in Web content markup, it is helpful to consider the alternatives and to see how they fall short of meeting the anticipated needs of both content providers and consumers. Discounting XML for the moment, only two options exist for marking up Web content: HTML and SGML.
Problems with HTML
The remarks in the previous paragraphs hint at some of the weaknesses inherent in HTML. The first of these is that many HTML tags are geared toward describing how content should look on a browser screen instead of saying what the content is (as a document description language should). Consider the many text formatting tags in HTML:
- <B> for boldface
- <I> for italic
- <TT> for fixed-width characters
- <FONT> for changing typeface, type size, and color
- <CENTER> for centering text, images, and other page elements
Each of these tags modifies a presentation-related property of an object on the page, but they give absolutely no indication of the meaning of the object. An indexing program, for example, would have no sense of the significance of the following markup:
<B>Warning! Pressing Ctrl+Alt+Del will restart your machine!</B>
A situation such as the one above is why someone tried to introduce a <NOTE> tag into HTML to handle admonishments. An indexer would have a much easier time understanding something like this:
<NOTE CLASS=WARNING>Warning! Pressing Ctrl+Alt+Del will restart
your machine!</NOTE>
In this case, it is clear from the markup that the text between <NOTE> and </NOTE> is a message encouraging caution.
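To make the indexer's point of view concrete, here is a minimal sketch in Python. It is only an illustration: the names NoteIndexer and index_notes are hypothetical, the choice of Python's standard html.parser module is an assumption made for this example, and <NOTE> is the proposed tag discussed above rather than part of HTML 4.0. The sketch collects the text and CLASS value of each <NOTE> element and ignores text that is merely marked up with <B>.

```python
from html.parser import HTMLParser


class NoteIndexer(HTMLParser):
    """Hypothetical indexer that records the text of every <NOTE> element."""

    def __init__(self):
        super().__init__()
        self.in_note = False   # True while inside a <NOTE> element
        self.note_class = ""   # lowercased CLASS value of the current note
        self.buffer = []       # text pieces of the current note
        self.notes = []        # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lowercase.
        if tag == "note":
            self.in_note = True
            self.note_class = (dict(attrs).get("class") or "").lower()
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "note" and self.in_note:
            # Collapse line breaks and repeated spaces into single spaces.
            text = " ".join("".join(self.buffer).split())
            self.notes.append((self.note_class, text))
            self.in_note = False

    def handle_data(self, data):
        if self.in_note:
            self.buffer.append(data)


def index_notes(html_text):
    """Return a list of (class, text) pairs, one per <NOTE> element."""
    indexer = NoteIndexer()
    indexer.feed(html_text)
    indexer.close()
    return indexer.notes


sample = """<B>Warning! Pressing Ctrl+Alt+Del will restart your machine!</B>
<NOTE CLASS=WARNING>Warning! Pressing Ctrl+Alt+Del will restart
your machine!</NOTE>"""

print(index_notes(sample))
```

Only the <NOTE> version of the message ends up in the index, labeled as a warning. The boldface version carries the same words but gives the program nothing to act on.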
To HTML's credit, it does have some tags that indicate the meaning of the text they mark up. The following tags, for example, all convey some sense of meaning:
- <ADDRESS> for email and postal addresses
- <BLOCKQUOTE> for indented, quoted text
- <CITE> for citations
- <DFN> for the defining instance of a term
- <EM> for text to be emphasized
- <KBD> for keyboard input
- <Q> for quoted text
- <STRONG> for strong emphasis
You could easily generate a glossary of key terms from a document marked up with the <DFN> tag, for example. All the program would have to do is strip out all the words found between <DFN> and </DFN> tags and form a list from them. This simple example demonstrates what kind of automation is possible when you have marked text with tags that signify meaning.
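The glossary idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tool described in this book: GlossaryParser and build_glossary are hypothetical names, the sample text is invented for the example, and Python's standard html.parser module stands in for whatever program you might actually use.

```python
from html.parser import HTMLParser


class GlossaryParser(HTMLParser):
    """Hypothetical parser that collects the text inside <DFN> elements."""

    def __init__(self):
        super().__init__()
        self.in_dfn = False   # True while inside a <DFN> element
        self.buffer = []      # text pieces of the term being read
        self.terms = []       # finished terms, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase, so <DFN> arrives as "dfn".
        if tag == "dfn":
            self.in_dfn = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "dfn" and self.in_dfn:
            # Collapse line breaks and repeated spaces into single spaces.
            term = " ".join("".join(self.buffer).split())
            if term:
                self.terms.append(term)
            self.in_dfn = False

    def handle_data(self, data):
        if self.in_dfn:
            self.buffer.append(data)


def build_glossary(html_text):
    """Return the distinct <DFN> terms, sorted without regard to case."""
    parser = GlossaryParser()
    parser.feed(html_text)
    parser.close()
    return sorted(set(parser.terms), key=str.lower)


sample = """<P>A <DFN>style sheet</DFN> separates content from presentation.
<P><DFN>SGML</DFN> is the parent language of HTML, and a
<DFN>style sheet</DFN> can be applied to either.</P>"""

print(build_glossary(sample))  # ['SGML', 'style sheet']
```

The program never has to guess which words matter. The <DFN> markup has already identified them, so the duplicate term is dropped and the rest is simple list handling.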
NOTE: A number of proposed tags indicate meaning, although many did not find their way into the HTML 4.0 recommendation. These include
- <AU> for an author's name
- <PERSON> for a person's name
With XML's star on the rise, it is unclear whether these tags will ever become part of the HTML standard.
HTML is not the best at describing what content means, but the problems don't stop there. Another issue is that HTML is not flexible enough to properly mark up the wide variety of documents that people want to publish electronically. The only pieces of a document that HTML can describe are a <HEAD> and a <BODY>. But what about document constructs, such as abstracts, chapters, and bibliographies? Currently, no HTML tags can accommodate these kinds of document divisions.
In response to the idea that HTML is not flexible enough, you may be thinking, "Hey, if HTML doesn't do what someone wants it to do, another tag will be introduced soon enough." That is actually another problem with HTML. Browser software companies have introduced scores of new, proprietary tags in an effort to lure users to their products. The World Wide Web Consortium (W3C) has, over time, incorporated many of these tags into the HTML standard, but some HTML documents still use tags that won't be rendered properly on all browsers.
If the issues raised so far are not enough, here is one more browser-related problem: Browsers are too forgiving of bad HTML code. Consider the following HTML:
<HEAD>
<META KEYWORD=bad HTML document>
<BODY BGCOLOR=005MFF>
<H1>An Imperfect HTML Document</H2>
<UL>
<LI>Most browsers will render this document in a readable way.
<LI>An HTML validator would be required to catch the syntax errors.
</UL>