Platinum Edition Using HTML 4, XML, and Java 1.2: Creating XML Documents

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

Avoiding the Pitfalls

You’ve seen some of the problems that entity references can create when their contents are de-referenced. At worst, they can make a complete mess of your XML code. These problems can be avoided, though. One of the best ways is to escape any markup characters in the replacement text so that they survive both rounds of expansion, doubling the escape when you use a character reference, like this:

<!ENTITY safe "Harry &#38;#38; Fred &amp; Joe">

Now when the XML processor sees the entity reference &safe; in the XML document

<text>The job was left to &safe; to finish.</text>

the expansion will still leave you with valid code. We’ll look at what happens, step by step, as the XML processor de-references the entity reference:

1. The XML processor sees the entity reference &safe; and looks for the replacement text.

2. When it read the declaration, the XML processor expanded the character reference &#38; once and left the entity reference &amp; untouched, so the replacement text it finds is Harry &#38; Fred &amp; Joe.

3. The XML processor inserts the replacement text and the resulting XML code is

<text>The job was left to Harry &#38; Fred &amp; Joe to finish.</text>

4. The XML processor then parses the resulting code, sees the references &#38; and &amp;, and de-references each one to an ampersand that is treated as character data rather than markup, to give

<text>The job was left to Harry & Fred & Joe to finish.</text>

As you can see from the example, you can escape the markup by using either the entity reference form (in the example, &amp;) or the character reference form (&#38;) of the predefined entity. Only the character reference form has to be doubled in the declaration (&#38;#38;), because character references in an entity declaration are expanded as soon as the declaration is read, whereas entity references are left alone until the entity is used.
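
If you want to check this expansion for yourself, the following is a minimal sketch that assumes Python and its standard xml.etree.ElementTree module (any conforming XML processor should behave the same way). The DOCTYPE wrapper is added only so that the snippet is self-contained. The declaration writes the character reference form doubled (&#38;#38;) and the entity reference form once (&amp;).

```python
import xml.etree.ElementTree as ET

# Illustrative document: the <text> element and the entity name "safe"
# come from the chapter's example; the DOCTYPE wrapper is an assumption
# added so the parser has somewhere to find the declaration.
DOC = """<!DOCTYPE text [
  <!ENTITY safe "Harry &#38;#38; Fred &amp; Joe">
]>
<text>The job was left to &safe; to finish.</text>"""

root = ET.fromstring(DOC)

# The parser has already de-referenced &safe; and the escaped ampersands,
# so the element's text holds plain "&" characters.
expanded = root.text
print(expanded)
```

The printed string contains two ordinary ampersands, which shows that both escaped forms reached the document as character data and not as the start of a new reference.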

Synchronous Structures

Other than the problems that I have described, one very important restriction exists on using markup in entities. In the last chapter, you learned that the logical and physical structures in the XML document must be synchronous. At the time, the restriction might not have made much sense because it is difficult to imagine a case in which the two structures are not synchronous. Markup inside an entity is exactly such a case. The logical structure is composed of the elements in the XML document and in the replacement text. The physical structure is composed of the document entity (the root entity of the XML document containing the entity reference) and the internal entity (which is the replacement text). The two are discrete physical entities as far as XML is concerned, even though in this case they are actually in the same file.

For the two structures to be synchronous, any element that is inside the replacement text must start and finish inside the replacement text (in other words, inside the entity).

The following would be allowed:

<!ENTITY safe "&#60;emph&#62;Harry&#60;/emph&#62; and Joe">
<text>The job was left to &safe; to finish.</text>

because the de-referenced entity reference would yield this:

<text>The job was left to <emph>Harry</emph> and Joe to finish.</text>

The following, however, could create a lot of problems:

<!ENTITY unsafe "&#60;emph&#62;Harry and Joe">
<text>The job was left to &unsafe;</emph> to finish.</text>

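because the emph element opens inside the entity but closes outside it. An XML processor must report an error here, even though the markup that results from de-referencing the entity reference would look perfectly legal: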

<text>The job was left to <emph>Harry and Joe</emph> to finish.</text>
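
The difference between the two cases can be tried out with a short sketch, again assuming Python's standard xml.etree.ElementTree module. The helper name is_well_formed is made up for this illustration, and the entity values are written with the single character references &#60; and &#62; so that the replacement text contains real emph tags.

```python
import xml.etree.ElementTree as ET

# The emph element starts and ends inside the entity: synchronous.
SAFE_DOC = """<!DOCTYPE text [
  <!ENTITY safe "&#60;emph&#62;Harry&#60;/emph&#62; and Joe">
]>
<text>The job was left to &safe; to finish.</text>"""

# The emph element starts inside the entity but ends in the document
# entity: the logical and physical structures are out of step.
UNSAFE_DOC = """<!DOCTYPE text [
  <!ENTITY unsafe "&#60;emph&#62;Harry and Joe">
]>
<text>The job was left to &unsafe;</emph> to finish.</text>"""


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


safe_root = ET.fromstring(SAFE_DOC)
emph = safe_root.find("emph")   # the element created by the entity

print(is_well_formed(SAFE_DOC))
print(is_well_formed(UNSAFE_DOC))
```

In the first document the parser builds a real emph element from the replacement text. In the second it stops with a parse error as soon as the entity ends with emph still open.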

Although we are still talking about internal entities, which are completely within our control, the restriction is really pretty logical. The same de-referencing mechanism applies for external entities as well as internal entities, and bearing in mind that the intention is that XML can be used easily on the Web (one of the design goals), we have absolutely no control over what is contained in external entities. XML’s developers could have made a distinction between internal and external entities, but that would go against two more of XML’s basic design goals—simplicity and clarity.

Where to Declare Entities

You have learned what an internal entity reference looks like, and you’ve seen some of the benefits and drawbacks of using entity references. Before we move on to something else, you still need to learn where to put the entity declarations.

Entity declarations are allowed only in the DTD that accompanies the XML document. Declaring element structures, attributes, and entities is in fact the main reason for having a DTD at all. You will learn all about DTDs in detail in the next chapter; for now, all you need to know is illustrated by the following:

<?xml version="1.0"?>
<!DOCTYPE home.page [
   <!ENTITY shortcut "This is the replacement text.">
]>
<home.page>
   …
</home.page>

The second line in this listing is a document type declaration. This is the line that will later be used to make the association between the XML document and the DTD that describes its structure. The declaration takes the form:

<!DOCTYPE name external.pointer [ internal.subset ]>

where the external.pointer points to a separate file that contains the so-called external subset of the DTD. Don’t worry too much about this for now; the trick is that you can leave this out and concentrate on the so-called internal subset of the DTD. The declaration you will need then looks like this:

<!DOCTYPE name [ internal.subset ]>

In this internal subset you can declare as many elements, attributes, and entities as you like, without having an external DTD at all.
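
To see an internal subset at work without any external DTD, here is a minimal sketch that assumes Python's standard xml.etree.ElementTree module. The document follows the listing above, except that the body of home.page is filled with a single reference to the shortcut entity, which is an assumption made purely for the demonstration.

```python
import xml.etree.ElementTree as ET

# Same shape as the chapter's listing; the element content (&shortcut;)
# is a placeholder chosen for this example.
DOC = """<?xml version="1.0"?>
<!DOCTYPE home.page [
   <!ENTITY shortcut "This is the replacement text.">
]>
<home.page>&shortcut;</home.page>"""

root = ET.fromstring(DOC)

print(root.tag)    # name of the root element
print(root.text)   # content after the entity has been de-referenced
```

The parser picks up the declaration from the internal subset and replaces the reference, with no separate DTD file involved.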

As you will discover later, you can perform all sorts of other tricks with the internal DTD subset. Entity and attribute declarations that you put in the internal subset take precedence over declarations of the same name in an external subset, because the internal subset is read first and the first declaration wins. This means, for example, that you can declare a default set of global values for a whole suite of XML documents and then override the global values in an individual XML document whenever you want.
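
The override behavior rests on a rule in the XML specification: when the same entity is declared more than once, the first declaration is binding, and the internal subset is read before the external one. The sketch below imitates this under stated assumptions. Python's standard-library parser does not fetch external subsets, so both declarations are placed in the internal subset, with the first standing in for the document's own declaration and the second for the shared default. The entity name site and both values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two declarations of the same (hypothetical) entity. The first one plays
# the role of the internal subset; the second plays the role of a default
# that would normally arrive later from an external subset.
DOC = """<!DOCTYPE home.page [
  <!ENTITY site "Document-specific value">
  <!ENTITY site "Default value">
]>
<home.page>&site;</home.page>"""

root = ET.fromstring(DOC)

# The earlier declaration is the one the parser uses.
print(root.text)
```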

Before we leave the subject of DTDs altogether, there is one last thing about the document type declaration that you should get into the habit of doing now, even if it doesn’t make much sense at this point. Although you aren’t using an external DTD yet, if and when you do, the name that you give to the document type must be the same as the name of the root element in the XML document. This is shown in the preceding listing, where the document type name (home.page) is the same as the root element name. Only a validating processor enforces this rule, so a mismatch will often go unnoticed while you work without an external DTD, but matching the names is still a good practice.

CDATA Sections

You have learned how to escape markup characters by using the predefined entities and character references. It doesn’t take much imagination to realize that replacing every markup character in a piece of text could be a long and tedious process. In addition, cases may occur (such as when you are sending the XML code on for further processing by a different application) when you really want to keep all those characters exactly as they are.

The way to do this is to use a CDATA (character data) section, like this:

<![CDATA[This is the text < 5 lines > that I want
         the &!%# XML processor to leave alone!]]>
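
As a quick check of what a CDATA section buys you, here is a minimal sketch assuming Python's standard xml.etree.ElementTree module. The wrapping element name note and the helper parses are made up for the example. The same text is parsed twice, once inside a CDATA section and once as ordinary element content.

```python
import xml.etree.ElementTree as ET

# The text from the CDATA example above, markup characters and all.
RAW = ("This is the text < 5 lines > that I want\n"
       "         the &!%# XML processor to leave alone!")

# <note> is a hypothetical wrapper element so the document has a root.
WITH_CDATA = "<note><![CDATA[" + RAW + "]]></note>"
WITHOUT_CDATA = "<note>" + RAW + "</note>"


def parses(document: str) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


root = ET.fromstring(WITH_CDATA)
print(root.text == RAW)        # the text comes through unchanged
print(parses(WITHOUT_CDATA))   # the bare "<" and "&" are rejected
```

Inside the CDATA section the parser hands the characters back exactly as written, while the unprotected version is refused because of the bare < and & characters.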
