HTML 4.0 Sourcebook
(Publisher: John Wiley & Sons, Inc.)
Author(s): Ian S. Graham
ISBN: 0471257249
Publication Date: 04/01/98


Character and Entity References

Even when using Latin-1, it is often difficult (due to keyboard limitations) to type non-ASCII characters. In other cases an author may wish to use a character that is not represented in the character set being used to type the document (for example, a Greek letter cannot be typed directly if a document is being written using the ISO Latin-1 character encoding). For these reasons, the HTML language has mechanisms for representing any character using special sequences of ASCII characters. These mechanisms are called character references, which reference characters using decimal numbers, and entity references, which reference them using symbolic names. For example, the character reference for the character é is &#233; (the semicolon is necessary and terminates the reference), while the entity reference for this same character is &eacute;.
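As a small illustration of the two mechanisms, the sketch below uses Python's standard html module to turn both kinds of reference back into the character they stand for. The helper name decode_references is an assumption made for this example; it is not part of HTML or of the book.

```python
import html


def decode_references(text: str) -> str:
    """Replace character and entity references with the characters they name."""
    # html.unescape understands both numeric (&#233;) and named (&eacute;) forms.
    return html.unescape(text)


# Both spellings of the reference produce the same character.
print(decode_references("caf&#233;"))
print(decode_references("caf&eacute;"))
```

Both calls print the same word, because the numeric and the symbolic reference point at the same character.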

These references are useful with computers such as Macintoshes or PCs running DOS. These operating systems do not use ISO Latin-1 for their internal representation of characters (Microsoft Windows does use ISO Latin-1) and instead use proprietary mappings between binary codes and characters. Fortunately, these proprietary systems differ from ISO Latin-1 only for the 128 non-ASCII characters, so restricting yourself to ASCII ensures a valid HTML document, while character and entity references let you include characters from the full ISO Latin-1 character set.
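The difference between these mappings can be seen by encoding the same character under each of them. The sketch below is only an illustration: it assumes Python's "mac_roman" and "cp437" codecs as stand-ins for the Macintosh and DOS character sets, and the helper name byte_values is invented for the example.

```python
# Codec names are Python's; mac_roman and cp437 are assumed here to represent
# the Macintosh and DOS mappings discussed in the text.
CODECS = ("latin-1", "mac_roman", "cp437")


def byte_values(text: str) -> dict:
    """Return the bytes that each codec produces for the same text."""
    return {codec: text.encode(codec) for codec in CODECS}


# A non-ASCII character is stored as a different byte on each system ...
print(byte_values("é"))

# ... but its character reference is pure ASCII, so it is the same everywhere.
print(byte_values("&#233;"))
```

The first dictionary holds three different byte values, while the second holds three identical byte strings, which is why a document restricted to ASCII plus references survives a move between these systems.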

Character References and ISO 10646

For character references to be useful, there must be a universal list that relates a reference of the form &#233; to the character é, independent of the actual encoding used to write a document. This list, defined as part of the HTML specification, is known as the document character set.

The HTML specification defines the document character set to be the 16-bit set known as the Universal Character Set (UCS) portion of ISO 10646 (this is formally equivalent to the Unicode 2.0 character set). This set defines up to 65,536 characters (2^16 = 65,536, although not all the possible positions in this set are assigned characters), encompassing the symbols used by most of the world's languages. In an HTML document, character references refer to the position of the character in the UCS character set. Thus, the reference &#233; refers to the 233rd character in UCS (the character é), while the reference &#948; refers to the 948th character (the Greek lowercase letter δ). Appendix A on the companion Web site lists some of the common characters and the corresponding character references.
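Because a character reference is just the character's position written in decimal, it can be computed directly. The following is a minimal sketch that relies on Python's ord, which returns the Unicode code position of a character; the function name char_reference is an assumption for this example.

```python
def char_reference(ch: str) -> str:
    """Build the decimal character reference for a single character."""
    # ord() gives the character's code position in the document character set.
    return f"&#{ord(ch)};"


print(char_reference("é"))  # position 233
print(char_reference("δ"))  # position 948

# A 16-bit set has room for this many positions.
print(2 ** 16)
```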

Entity References

The HTML specification also defines a collection of entity references: symbolic ASCII-character names that can also indirectly reference characters from the document character set. An example entity reference is &delta;, which references the Greek lowercase letter δ. Entity references are often easier to use than character references, since the entity names are easier to remember than the actual code positions. For example, you can probably guess that the entity reference &Delta; corresponds to the uppercase Greek letter Δ (delta), but would have trouble determining the correct decimal code (it's &#916;). The tables in Appendix A on the companion Web site list entity references defined as part of HTML 4. Note, however, that many of these names are not understood by current browsers.
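The relationship between entity names and code positions can be explored with the lookup tables in Python's html.entities module. This is only a sketch: the function name entity_reference is hypothetical, and the module's tables are assumed to be an adequate stand-in for the lists in Appendix A.

```python
from html.entities import codepoint2name, name2codepoint


def entity_reference(ch: str) -> str:
    """Return a named entity reference if one is known, else a numeric one."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        return f"&{name};"
    # No symbolic name in the table: fall back to the decimal character reference.
    return f"&#{ord(ch)};"


# Names are case-sensitive: Delta and delta are different characters.
print(name2codepoint["Delta"], name2codepoint["delta"])

print(entity_reference("Δ"))
print(entity_reference("é"))
```

Falling back to the numeric form is one cautious choice for characters without a name, and it may also be the safer output where a browser does not recognize a given entity name.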

Special Characters in an HTML Document

Certain ASCII character codes are treated as special in an HTML document. For example, the ampersand character (&) indicates the start of an entity or character reference, the left and right angle brackets (< and >) denote the markup tags, and the double quotation mark (") marks the beginning and end of strings within the markup tags. Since an HTML parser interprets these characters as special commands or directives, you cannot use the characters themselves to type in an ampersand, greater-than or less-than sign, or a double quotation mark. If you want these characters to appear as regular text, you must include them as character or entity references. The character and entity references for these special characters are given in Table 6.1.

When a browser interprets an HTML document, it looks for the special character strings and interprets them accordingly. Thus, when it encounters the string

<H1> Heading string </H1>

it interprets the strings inside each pair of angle brackets (H1 and /H1) as markup tags and renders the text lying between these bracket pairs and their tags (Heading string) as a heading. However, when the browser sees the string

&lt;H1&gt; Heading string &lt;/H1&gt;

it interprets &lt; and &gt; as entity references, and displays the characters

<H1> Heading string </H1>

as a string of regular text.


Table 6.1 Special Characters in HTML

Character                   Character Reference   Entity Reference
Left angle bracket (<)      &#60;                 &lt;
Right angle bracket (>)     &#62;                 &gt;
Ampersand sign (&)          &#38;                 &amp;
Double quotation sign (")   &#34;                 &quot;
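Replacing the four special characters by hand is error-prone, so text that should appear literally is usually escaped by a program. The sketch below uses html.escape from Python's standard library, which substitutes the entity references from Table 6.1; the wrapper name escape_text is an assumption for this example.

```python
import html


def escape_text(text: str) -> str:
    """Replace the special characters with their entity references."""
    # quote=True also converts the double quotation mark to &quot;.
    return html.escape(text, quote=True)


print(escape_text("<H1> Heading string </H1>"))
print(escape_text('AT&T says "hello"'))
```

The first call produces the escaped heading string shown earlier, which a browser would display as regular text rather than render as a heading.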

Comments in HTML Documents

In HTML documents, comments are surrounded by the special character strings <!-- and --> . The text between <!-- and --> is a comment and should not be displayed by a browser. There can be spaces between the -- and the > that ends a comment, but the string <!-- that starts a comment declaration must have no spaces between the characters. The following is an example of a simple comment:

<!-- This is a comment --  >

Comments can span multiple lines, but cannot nest or overlap. You should also be careful when using comments to hide HTML markup that would otherwise be displayed, as some older browsers will mistakenly use the greater-than sign (>) of a regular HTML markup tag to prematurely end the comment. Also, some older browsers do not properly handle multi-line comments and only hide the first line.

Here are some examples of comments:

<!-- This is a comment --
  -- This is a second comment within the same comment declaration -- >

<!-- This is also a comment
     This comment spans more than one line. Some old browsers improperly
     interpret comments that span multiple lines.
  -- >
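To see how a parser separates comments from displayed text, the sketch below feeds a short fragment to Python's html.parser.HTMLParser. The class and function names are invented for this example, and the fragment uses the plain --> ending, since parsers may differ in how they treat whitespace before the closing >.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect comment bodies and displayed text separately."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        # Called with the text found between <!-- and -->.
        self.comments.append(data)

    def handle_data(self, data):
        # Called with ordinary text that would be displayed.
        self.text.append(data)


def split_comments(markup: str):
    """Return (list of comment bodies, displayed text) for a fragment."""
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments, "".join(parser.text)


comments, shown = split_comments("Visible<!-- hidden\nacross lines -->text")
print(comments)
print(shown)
```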


NOTE: Full Details of SGML Comments

Formally, a comment consists of a comment declaration (consisting of the start string <! and the end string >) that, in turn, can contain any number of comments. Each comment is a text string surrounded by the strings -- and -- (two adjacent hyphens). Thus the string -- this is a comment -- is a single comment when inside a comment declaration. However, there must be no whitespace between the starting string of the comment declaration (<!) and the start of the first comment, so that all comments must begin with the string <!-- . (Pathologically, you can have empty comments of the form <! > .) Whitespace is allowed after every comment, so that the string -- marking the end of the last comment inside a comment declaration can be separated by whitespace from the > character marking the end of the declaration, for example:

<!-- This is also a comment -- >.
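The formal rules in this note can be approximated with a regular expression. The following is a rough sketch rather than a full SGML parser: the pattern and the name is_comment_declaration are assumptions made for illustration, and the empty declaration is matched only in the form <!>.

```python
import re

COMMENT_DECLARATION = re.compile(
    r"<!"                         # start of the comment declaration
    r"(?:--(?:(?!--).)*--\s*)*"   # any number of comments, each followed by optional whitespace
    r">",                         # end of the comment declaration
    re.DOTALL,                    # let a comment span multiple lines
)


def is_comment_declaration(text: str) -> bool:
    """Check whether text is one complete comment declaration."""
    return COMMENT_DECLARATION.fullmatch(text) is not None


print(is_comment_declaration("<!-- This is also a comment -- >"))
print(is_comment_declaration("<!-- first --\n  -- second -- >"))
print(is_comment_declaration("<! -- space after the start string -->"))
print(is_comment_declaration("<!-- stray -- hyphens -->"))
```

The first two strings are accepted. The third is rejected because of the whitespace after <!, and the fourth because the inner -- ends the comment early and leaves text outside any comment.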


