Platinum Edition Using HTML 4, XML, and Java 1.2: XML Characters, Notations, and Entities

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

CHAPTER 15
XML Characters, Notations, and Entities

by Simon North and Jim O'Donnell

In this chapter

Character Data and Character Sets

Character Sets

Entity Encoding

Entities and Entity Sets

Notations

Entities

Binary Entities

System Identifiers

Public Identifier Resolution

Parameter Entities

Entity Resolution

Getting the Most Out of Entities

Character Data and Character Sets

At its most basic level, an XML document consists of a sequence of characters. For many years, computers generally used 7 bits to store each letter or other character in digital form. This representation was standardized (as ISO/IEC 646) in the now-familiar ASCII scheme. XML allows you to use far more than the standard ASCII set of characters. The legal characters, those that can appear in an XML document, are the ones with the following hexadecimal values:

• 09 (the tab character)

• 0D (the carriage return character)

• 0A (the line feed character)

• 20 to D7FF, E000 to FFFD, and 10000 to 10FFFF (the legal text and graphics characters of Unicode and ISO 10646)

Unicode and ISO 10646 are standardized character sets, which will be discussed in the next section.
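
The ranges listed above translate directly into a range check. The following is a minimal Python sketch, not part of any XML library; the names LEGAL_RANGES, is_legal_xml_char, and first_illegal_char are made up for this illustration.

```python
# Code-point ranges that the text lists as legal in an XML document.
LEGAL_RANGES = [
    (0x09, 0x09),         # tab
    (0x0A, 0x0A),         # line feed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # text and graphics characters
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_legal_xml_char(code_point: int) -> bool:
    """Return True if code_point falls inside one of the legal ranges."""
    return any(low <= code_point <= high for low, high in LEGAL_RANGES)


def first_illegal_char(text: str) -> int:
    """Return the index of the first illegal character, or -1 if there is none."""
    for index, ch in enumerate(text):
        if not is_legal_xml_char(ord(ch)):
            return index
    return -1
```

A check like this could be run over text before it is written into a document; control characters such as 00 or 0B are reported as illegal, while tabs and line ends pass.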

Character Sets

With 7 bits, only 128 different bit patterns are possible, so 7-bit ASCII can represent only 128 characters. These 128 characters are known as the standard ASCII character set and have been the basis of computing for many years. As computers became more advanced, and as they became more of an international phenomenon, extra characters were needed to cover things such as the accented characters used so much in European languages. The eighth bit was therefore put to use to give 8-bit character sets, doubling the number of possible characters to 256; these sets are standardized as ISO 8859. In fact, many ISO 8859 variants exist, each tailored to a specific group of languages; the version we probably meet most often is ISO 8859-1, which is the character set used for HTML and understood by Web browsers. This character set adds accented characters and various other symbols, such as the degree sign and the micro sign. The first 128 characters of ISO 8859-1 are the same as ISO 646; therefore, it is backward compatible.
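
The backward compatibility between the 7-bit and 8-bit sets is easy to confirm by decoding the same bytes both ways. This short Python sketch uses only the standard "ascii" and "iso-8859-1" codecs; the helper name fits_in_ascii is invented for the example.

```python
# Every 7-bit value decodes to the same character in ASCII and in ISO 8859-1.
seven_bit = bytes(range(128))
same_lower_half = seven_bit.decode("ascii") == seven_bit.decode("iso-8859-1")

# Values 128 to 255 exist only in the 8-bit set; E9 is the letter e with an acute accent.
e_acute = bytes([0xE9]).decode("iso-8859-1")


def fits_in_ascii(data: bytes) -> bool:
    """Return True if every byte has its eighth bit clear."""
    return all(byte < 0x80 for byte in data)
```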

Eight bits are fine for most Western languages but are nowhere near enough for many Asian languages. To allow for languages such as Arabic, Chinese, Urdu, and so on, first Unicode (with 16-bit encoding) and then ISO 10646 took the next logical step: ISO 10646 supports the use of up to 32-bit patterns to represent characters, which allows more than 2 billion characters to be represented. ISO 10646 provides a standard definition for the characters found in many European and Asian languages. Unicode is used in Microsoft Windows NT.

ISO 10646 defines a number of encoding schemes, named according to the number of bytes they need. UCS-2 uses 16 bits (2 bytes) and is identical to Unicode, and UCS-4 uses a full 32 bits (4 bytes). ISO 10646 is more flexible than this, however. Using mapping schemes (each called a UTF, for UCS Transformation Format), it allows a variable number of bits to be used. There’s little point in using so many bits if you’re only sending the basic 128 ASCII characters, so a UTF lets you claim extra bits only as you need them.

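XML processors must support two of these formats: UTF-8, which uses from one to six 8-bit bytes per character (8 to 48 bits) as originally defined, and UTF-16, which uses one 16-bit unit for most characters and a pair of units (32 bits) for the rest, a mapping that gives more than a million characters.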
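
To see how the two formats claim extra bits only when they are needed, you can compare the encoded length of a few characters. The sketch below uses Python's built-in codecs; the function name encoded_lengths and the sample characters are chosen only for illustration.

```python
def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = {
    "ASCII letter A": "\u0041",
    "e with acute accent": "\u00e9",
    "Chinese ideograph": "\u4e2d",
    "character above FFFF": "\U00010348",  # needs a pair of 16-bit units in UTF-16
}

for name, ch in samples.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Plain ASCII text is half the size in UTF-8 that it is in UTF-16, whereas the ideograph is smaller in UTF-16, which is the trade-off behind choosing an encoding per entity.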

Entity Encoding

Each text entity in XML may use a different encoding for its characters. Therefore, you can declare separate text entities to hold sections of an XML document that contain, for example, Chinese or Arabic characters, and assign the 16-bit UCS-2 encoding to these entities. The rest of the document can then use a more efficient 8-bit encoding.

By default, the ISO 10646 UTF-8 encoding is assumed. If the text entity uses some other encoding, you must declare what that encoding is at the beginning of the entity:

<?xml encoding="Encoding.Name"?>

Here, Encoding.Name is a character set name that begins with a Latin letter and consists only of Latin letters (A to Z and a to z), digits, full stops, hyphens, and underscores. Every XML processor has to recognize UTF-8 and UTF-16, and many recognize a number of other character sets as well. A more complete list can be found with the XML specification at http://www.w3.org/XML/.
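
The naming rule can be written as a regular expression. This Python sketch follows the description above together with the XML 1.0 grammar's requirement that the name start with a letter; the names ENCODING_NAME and is_valid_encoding_name are hypothetical. It checks only the form of the name, not whether any processor supports that encoding.

```python
import re

# A letter first, then any mix of letters, digits, full stops, underscores, and hyphens.
ENCODING_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_encoding_name(name: str) -> bool:
    """Return True if name has the form allowed in an encoding declaration."""
    return ENCODING_NAME.fullmatch(name) is not None
```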

Examples of encoding declarations are

 <?xml version="1.0" encoding='UTF-16'?>
 <?xml version="1.0" encoding="EUC-JP" standalone="yes"?>
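
As a quick check that a parser honors the declared encoding, the following Python sketch feeds raw bytes to the standard library's xml.etree.ElementTree module. ISO-8859-1 is used here only because it makes the byte values easy to follow; the sample document is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The declaration names the encoding that the bytes below actually use.
document = '<?xml version="1.0" encoding="ISO-8859-1"?><para>caf\u00e9</para>'
raw_bytes = document.encode("iso-8859-1")  # the accented e becomes the single byte E9

# The parser reads the declaration and decodes the byte E9 accordingly.
root = ET.fromstring(raw_bytes)
text = root.text
```

If the declaration were left out, the same bytes would be treated as UTF-8 by default, and the lone byte E9 would not be valid.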

The default (UTF-8) encoding is detected by the first four bytes of an XML text entity having the hexadecimal values 3C, 3F, 78, and 6D (the characters <?xm), which are the first characters of the encoding declaration. If no declaration exists, or if none of the other encoding schemes can be made to fit, the entity is assumed to be in UTF-8.
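
The detection step can be sketched as a lookup on the first four bytes. The Python function below is a simplified illustration rather than a complete detector: the 3C 3F 78 6D pattern is the bytes of the lowercase characters <?xm, the other byte patterns are drawn from the autodetection appendix of the XML 1.0 recommendation, and the function name and the returned labels are made up for this example.

```python
def sniff_encoding(first_four: bytes) -> str:
    """Guess an encoding family from the first four bytes of a text entity."""
    if first_four == b"\x00\x00\x00\x3c":
        return "UCS-4 (big-endian)"
    if first_four == b"\x3c\x00\x00\x00":
        return "UCS-4 (little-endian)"
    if first_four[:2] == b"\xfe\xff":
        return "UTF-16 (big-endian, byte order mark)"
    if first_four[:2] == b"\xff\xfe":
        return "UTF-16 (little-endian, byte order mark)"
    if first_four == b"\x00\x3c\x00\x3f":
        return "UTF-16 (big-endian, no byte order mark)"
    if first_four == b"\x3c\x00\x3f\x00":
        return "UTF-16 (little-endian, no byte order mark)"
    if first_four == b"\x3c\x3f\x78\x6d":
        # "<?xm": an ASCII-compatible encoding; the declaration gives the exact name.
        return "ASCII-compatible (read the encoding declaration)"
    # No recognizable pattern: fall back to the default.
    return "UTF-8 (default)"
```

A real processor would go on to read the encoding declaration itself once it knows how wide each character is.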

Entities and Entity Sets

Switching to a different encoding isn’t the only way to represent characters that you cannot enter directly in the encoding you are using. Don’t forget that you can always reference any character by quoting its ISO 10646 character number in a character reference (such as &#38; for the ampersand).
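
A character reference is simply the character's number wrapped in &# and a semicolon, so building one takes a single line. In this Python sketch, char_reference is a hypothetical helper, and the standard xml.etree.ElementTree parser is used to show the reference being turned back into a character.

```python
import xml.etree.ElementTree as ET


def char_reference(ch: str, hexadecimal: bool = False) -> str:
    """Return the numeric character reference for a single character."""
    number = ord(ch)
    return f"&#x{number:X};" if hexadecimal else f"&#{number};"


# A parser replaces the reference with the character it stands for.
parsed = ET.fromstring("<para>AT&#38;T</para>").text
```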

You can also declare an entity that represents the character you need, such as this declaration of the degree sign (°) taken from the ISO 8859-1 character set:

<!ENTITY deg "&#176;">

You can then reference this entity in an XML document wherever you need it:

<para>The temperature today in the south will be 82 &deg;C.</para>
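
To watch the entity being resolved, you can place the declaration in a document type declaration at the top of a small document and parse it. The Python sketch below assumes the standard xml.etree.ElementTree module, which expands entities declared this way; wrapping the declaration in <!DOCTYPE para [ ... ]> is an assumption made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The entity declaration sits inside the document type declaration.
document = """<!DOCTYPE para [
  <!ENTITY deg "&#176;">
]>
<para>The temperature today in the south will be 82 &deg;C.</para>"""

para = ET.fromstring(document)
text = para.text  # the entity reference has been replaced by the degree sign
```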

Not all computer systems and transfer media can handle the advanced character sets that you have learned about. The 7-bit ASCII character set is still the lowest common denominator. Therefore, these kinds of character entity declarations have been around since the early days of SGML and have been collected into so-called entity sets.

These entity sets are included as part of the SGML standard (ISO 8879) and go under the somewhat cryptic names of ISOlat1 (Latin alphabet, accented characters), ISOnum (numeric and special characters), ISOcyr1 (Cyrillic characters used in Russian), and so on. They are really an SGML facility and cannot be used as they are in XML (XML does not support SGML's SDATA, or system data, entities). However, XML versions of the most important of these entity sets are being made publicly available, as you can see from the XML version of the ISOdia (diacritical marks) entity set shown in the following: