Platinum Edition Using HTML 4, XML, and Java 1.2: Introduction to XML

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

CHAPTER 11
Introduction to XML

by Eric Ladd

In this chapter

Why XML?

XML Overview

Linking with XML

Using Style Sheets with XML

The XSL Requirements Summary

Applications

XML Software

References

Why XML?

HTML is a fairly simple language—simple enough to have made Web publishing accessible to many people. Its rules are also straightforward enough that scores of programmers have written HTML editing tools that enable content publishers to prepare a document without knowing any HTML. This has opened up the Web to even more people. Indeed, HTML’s ease of use is probably one of the biggest reasons for the explosive growth of the Web.

HTML, however, is not without its problems. For one thing, it is too restrictive. You probably know that HTML is an application of the Standard Generalized Markup Language (SGML), restricted to a certain set of rules. After you start using SGML in a specific way, you sacrifice much of the flexibility you get by using “less constrained” SGML. This means that you are less likely to be able to describe more complex documents with HTML or with any restricted form of SGML.

Another issue with HTML is that, as it evolved, its tags became more focused on describing how content should be presented than on what the content was. You read in Chapter 9, “Style Sheets,” how the idea of a style sheet helps to separate the nature of content from its presentation. Style sheets are a step in the right direction, but they have only recently been adopted, and it will take some time before HTML becomes completely free of tags that describe presentation.

What’s the solution to the problem with HTML then? The answer is the eXtensible Markup Language (XML). Because XML stays focused on content description, you don’t run the risk of having any XML markup specifying how to present the content. Additionally, XML is highly extensible, meaning that it is flexible enough to handle a simple document, such as a home page, or a huge document, such as War and Peace. This chapter introduces you to XML, shows you how to use it, and explains why it is so important to the future of Web publishing.

To better understand why XML is the next wave in Web content markup, it is helpful to consider the alternatives and to see how they fall short of meeting the anticipated needs of both content providers and consumers. Discounting XML for the moment, only two options exist for marking up Web content: HTML and SGML.

Problems with HTML

The remarks in the previous paragraphs hint at some of the weaknesses inherent in HTML. The first of these is that many HTML tags are geared toward describing how content should look on a browser screen instead of saying what the content is (as a document description language should). Consider the many text formatting tags in HTML:

• <B> for boldface

• <I> for italic

• <TT> for fixed-width characters

• <FONT> for changing typeface, type size, and color

• <CENTER> for centering text, images, and other page elements

Each of these tags modifies a presentation-related property of an object on the page, but they give absolutely no indication of the meaning of the object. An indexing program, for example, would have no sense of the significance of the following markup:

<B>Warning! Pressing Ctrl+Alt+Del will restart your machine!</B>

Situations like this one are why a <NOTE> tag for admonishments was once proposed for HTML. An indexer would have a much easier time understanding something like this:

<NOTE CLASS="WARNING">Warning! Pressing Ctrl+Alt+Del will restart
your machine!</NOTE>

In this case, it is clear from the markup that the text between <NOTE> and </NOTE> is a message encouraging caution.
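To see what that difference means for an indexing program, consider the following minimal sketch. It uses Python's standard html.parser module to look for admonishments in a piece of markup. The class name AdmonitionIndexer and the function index_admonitions are hypothetical names chosen for this illustration, and <NOTE> is treated here simply as a proposed tag, not as part of standard HTML.

```python
from html.parser import HTMLParser


class AdmonitionIndexer(HTMLParser):
    """Hypothetical indexer that records the text of <NOTE> elements."""

    def __init__(self):
        super().__init__()
        self.admonitions = []   # finished (class, text) pairs
        self._current = None    # [class, text pieces] while a <NOTE> is open

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == "note":
            kind = dict(attrs).get("class") or "note"
            self._current = [kind.lower(), []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "note" and self._current is not None:
            kind, pieces = self._current
            # Collapse line breaks and runs of spaces into single spaces.
            text = " ".join("".join(pieces).split())
            self.admonitions.append((kind, text))
            self._current = None


def index_admonitions(markup):
    """Return a list of (class, text) pairs for every <NOTE> found."""
    indexer = AdmonitionIndexer()
    indexer.feed(markup)
    indexer.close()
    return indexer.admonitions


bold = "<B>Warning! Pressing Ctrl+Alt+Del will restart your machine!</B>"
note = ('<NOTE CLASS="WARNING">Warning! Pressing Ctrl+Alt+Del will restart\n'
        'your machine!</NOTE>')

print(index_admonitions(bold))   # nothing to index: <B> carries no meaning
print(index_admonitions(note))   # one admonishment, labeled as a warning
```

The boldface version gives the indexer nothing to work with, whereas the <NOTE> version yields both the message and its category.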

To HTML’s credit, it does have some tags that indicate the meaning of the text they mark up. The following tags, for example, all convey some sense of meaning:

• <ADDRESS> for email and postal addresses

• <BLOCKQUOTE> for indented, quoted text

• <CITE> for citations

• <DFN> for the defining instance of a term

• <EM> for text to be emphasized

• <KBD> for keyboard input

• <Q> for quoted text

• <STRONG> for strong emphasis

You could easily generate a glossary of key terms from a document marked up with the <DFN> tag, for example. All the program would have to do is extract the text found between each pair of <DFN> and </DFN> tags and form a list from it. This simple example demonstrates what kind of automation is possible when you have marked text with tags that signify meaning.
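The glossary idea can be sketched in a few lines. The following example is only an illustration, written in Python with the standard html.parser module. The names GlossaryCollector and build_glossary, and the sample markup, are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class GlossaryCollector(HTMLParser):
    """Collects the text found inside <DFN>...</DFN> elements."""

    def __init__(self):
        super().__init__()
        self.terms = []        # finished terms, in document order
        self._buffer = None    # text pieces of the <DFN> currently open

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <DFN> arrives as "dfn".
        if tag == "dfn":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "dfn" and self._buffer is not None:
            term = " ".join("".join(self._buffer).split())
            if term:
                self.terms.append(term)
            self._buffer = None


def build_glossary(html_text):
    """Return the de-duplicated terms marked with <DFN>, sorted A to Z."""
    collector = GlossaryCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(set(collector.terms), key=str.lower)


sample = """
<P>A <DFN>style sheet</DFN> separates presentation from content.</P>
<P><DFN>SGML</DFN> is the parent language of HTML.</P>
"""

print(build_glossary(sample))
```

Nothing comparable is possible for text marked only with presentation tags, because the program would have no way to tell a defined term from any other bold or italic phrase.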

NOTE: A number of proposed tags indicate meaning, although many did not find their way into the HTML 4.0 recommendation. These include
• <AU> for an author’s name

• <PERSON> for a person’s name

With XML’s star on the rise, it is unclear whether these tags will ever become part of the HTML standard.

HTML is not the best at describing what content means, but the problems don’t stop there. Another issue is that HTML is not flexible enough to properly mark up the wide variety of documents that people want to publish electronically. The only pieces of a document that HTML can describe are a <HEAD> and a <BODY>. But what about document constructs, such as abstracts, chapters, and bibliographies? Currently, no HTML tags can accommodate these kinds of document divisions.
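By way of contrast, a markup language that lets authors define their own elements can name such divisions directly. The sketch below is an assumption-laden illustration, not a real document type: the element names report, abstract, chapter, bibliography, and entry are invented for this example, and Python's standard xml.etree.ElementTree module is used only to show that a program can find each division by name.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; they do not come from HTML or any published DTD.
DOCUMENT = """
<report>
  <abstract>A one-paragraph summary of the report.</abstract>
  <chapter title="Background">Chapter text goes here.</chapter>
  <chapter title="Findings">Chapter text goes here.</chapter>
  <bibliography>
    <entry>Placeholder reference entry.</entry>
  </bibliography>
</report>
"""


def list_divisions(xml_text):
    """Return (element name, title) pairs for the top-level divisions."""
    root = ET.fromstring(xml_text.strip())
    return [(child.tag, child.get("title", "")) for child in root]


for tag, title in list_divisions(DOCUMENT):
    print(tag, title)
```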

In response to the idea that HTML is not flexible enough, you may be thinking, “Hey, if HTML doesn’t do what someone wants it to do, another tag will be introduced soon enough.” That is actually another problem with HTML. Browser software companies have introduced scores of new, proprietary tags in an effort to lure users to their products. The World Wide Web Consortium (W3C) has, over time, incorporated many of these tags into the HTML standard, but some HTML documents still use proprietary tags that will not render properly in every browser.

If the issues raised so far are not enough, here is one more browser-related problem: Browsers are too forgiving of bad HTML code. Consider the following HTML:

<HEAD>
<META KEYWORD="bad HTML document>
<BODY BGCOLOR="005MFF">
<H1>An Imperfect HTML Document</H2>
<UL>
<LI>Most browsers will render this document in a readable way.
<LI>An HTML validator would be required to catch the syntax errors.
</UL>
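The gap between a forgiving reader and a strict one can be illustrated with a short sketch. It is not an HTML validator. As a stand-in, it compares Python's lenient html.parser module with the strict XML parser in xml.etree.ElementTree, using only the mismatched heading from the listing above. The names TagLogger, lenient_events, and strict_check are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The mismatched heading from the listing: opened as H1, closed as H2.
fragment = "<H1>An Imperfect HTML Document</H2>"


class TagLogger(HTMLParser):
    """Records start and end tags without checking that they match."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(markup):
    """Parse forgivingly, the way a browser might, and report what was seen."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def strict_check(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(lenient_events(fragment))   # both tags are reported, no complaint
print(strict_check(fragment))     # the strict parser rejects the mismatch
```

The lenient parser reports the tags it saw and moves on, while the strict parser refuses the fragment outright.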
