HTML 4.0 Sourcebook
(Publisher: John Wiley & Sons, Inc.)
Author(s): Ian S. Graham
ISBN: 0471257249
Publication Date: 04/01/98

There is a character set that does support the world’s languages, and it goes by the romantic name ISO 10646. The most important part of ISO 10646 is contained within a 16-bit (65,536-character) subset known as the Basic Multilingual Plane, or BMP. As noted in Appendix A on the companion Web site, this subset is equivalent to the Unicode 2.0 character set. The HTML internationalization effort chose this character set as the document character set for HTML. The phrase “document character set” means that the rules for processing HTML documents are composed using this character set and that a document must in some way be convertible into this character set to be properly processed. It further means that numeric character references in HTML (for example, &#1432;) refer by default to the character at this “position” in the UCS/Unicode character set.

However, this does not mean that a document must be encoded using this character set (although that would be the preferred choice). Documents can still be written in any desired character set, known in this role as a character encoding (reflecting the fact that a digital document is simply an encoding of characters as binary data). Whatever the encoding, any entity references must be defined as part of HTML, and numeric character references must refer to the Unicode character at the indicated numerical position. This is discussed in more detail in Appendix A on the companion Web site.
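The distinction between the encoding and the document character set can be sketched in a few lines. The example below uses Python and its standard html module, which is a choice made for this illustration and not something the book relies on; the sample bytes are also invented. It decodes an ISO Latin-1 byte string and then resolves the numeric character references, which are interpreted as Unicode positions no matter which encoding carried them.

```python
import html

# A fragment encoded as ISO Latin-1: the byte 0xE9 is "é" in that encoding,
# while &#233; and &#1488; are numeric character references written in ASCII.
raw = b"caf\xe9 / caf&#233; / &#1488;"

# Step 1: undo the character encoding (bytes -> Unicode text).
text = raw.decode("iso-8859-1")

# Step 2: resolve character references against the document character set.
# The numbers are Unicode positions, whatever encoding carried the bytes.
resolved = html.unescape(text)

print(resolved)
```

Note that &#1488; (a Hebrew letter) has no representation in ISO Latin-1 at all, yet it can still appear in a Latin-1 encoded document because the reference is resolved against Unicode.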

Communicating Character Encoding

When an HTML document is sent by a server to a browser, the server should indicate the character set encoding for the document being sent. The mechanism for this is to send, with the header that precedes the document, a MIME content–type header of the form

Content-type: text/html; charset=charset_name

where charset_name is the name of the character set in which the document is encoded.
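As a minimal sketch of how a client might pull the charset parameter out of such a header, the following Python fragment uses the standard library's email.message.Message class to parse the header value. The helper name parse_content_type is hypothetical, and real browsers use their own parsers.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-type value into (media_type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=ISO-8859-1"))
print(parse_content_type("text/html"))
```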

Unfortunately, most browsers do not understand charset strings and assume that the string is part of the MIME type declaration. They consequently think the document is of some unknown type and cannot display it. To circumvent this problem, HTML 4 supports a special META element for indicating a document’s character encoding. An example is:

<META HTTP-EQUIV="Content-type"
      CONTENT="text/html; charset=charset_name">

A browser can search for this header to determine the charset encoding of the document. Of course, this is a bit of a chicken and egg problem, as the browser must first assume a character set (typically ISO Latin–1) in order to start reading the text. This will work provided the document, up to the aforementioned META element, is encoded in bytes that correspond to ASCII characters, since then the text can be understood assuming the standard default encoding. This mechanism is implemented on Internet Explorer 4 and Netscape Navigator 4.

The general algorithm used by a browser to determine the character encoding is as follows:

  Use the character set specified in the header sent by the server.
  If there is no charset value sent by the server, check for META element content specifying a character set.
  If there is no detectable META element, use heuristic algorithms to determine the character set (i.e., guess).
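The three steps above can be sketched roughly as follows. This is an illustration in Python, not the code of any actual browser: the function names, the regular expression, the 1024-byte scan window, and the trivial guessing heuristic are all assumptions made for the example.

```python
import re

# Looks for charset=... inside a META element. The bytes up to the META
# element are assumed to be ASCII-compatible, as the text requires.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def guess_charset(body):
    """Placeholder heuristic (hypothetical): pure ASCII, else ISO Latin-1."""
    try:
        body.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        return "iso-8859-1"


def determine_charset(header_charset, body):
    # 1. A charset sent by the server takes precedence.
    if header_charset:
        return header_charset.lower()
    # 2. Otherwise look for a META element near the top of the document.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Otherwise fall back to guessing.
    return guess_charset(body)


page = b'<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=ISO-8859-5">'
print(determine_charset(None, page))
```

A real heuristic in step 3 would examine byte patterns far more carefully; the placeholder here only shows where such a guess fits in the sequence.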

Issues Related to Language and Character Set

Once you have an internationalized document, you confront several important formatting problems not encountered with simple European text. For example, several languages read from right to left and not left to right—and in a truly international document, many languages may appear in the same document, even in the same paragraph! Thus internationalized HTML needs a way of specifying both the language of a particular piece of text and the desired directionality for the layout of the characters.

These features are enabled through the attributes LANG and DIR. LANG specifies the language for a block of text, while DIR specifies the text directionality. Browsers that understand these attributes use LANG to control text formatting and DIR to change the direction of the text layout. There may also be a need to locally override the directionality of a block of text, and this fine control is provided by the BDO or Bi–Directional Override element.

Last, the internationalization efforts introduced a new element, Q, for use with short inline quotations. The purpose is similar to BLOCKQUOTE, except that text inside Q will be inline with the regular text flow and will be surrounded by quotation symbols appropriate to the language. Q, discussed in Chapter 6, is not yet widely supported.

Added Entity References

To support certain special relationships between adjacent characters, HTML 4 adds four new entity references corresponding to four special Unicode characters. These are:


Entity Reference   Character Reference (UCS)   Description
&zwnj;             &#8204;                     zero width non-joiner
&zwj;              &#8205;                     zero width joiner
&lrm;              &#8206;                     left-to-right mark
&rlm;              &#8207;                     right-to-left mark

The first two entities control cursive joining behavior between adjacent letters: a &zwj; after a letter means that the cursive joining form of the letter should be used, regardless of the following letter, while a &zwnj; after a letter means that the cursive non-joining form should be used, again regardless of the letter that follows. The second two entity references specify text directionality in situations where the directionality is not obvious, for example for a double quotation mark between an Arabic (right-to-left) and a Latin (left-to-right) character. In this case, it is unclear whether the mark should belong with the Arabic or the Latin letter. If the double quote is instead surrounded by characters of the same directionality (one of these characters being either &lrm; or &rlm;), the deadlock is broken, and the quotation mark knows which way to point.
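To check which Unicode characters these four entity references stand for, one can resolve them programmatically. The sketch below uses Python's standard html and unicodedata modules (an assumption of this illustration, not a tool the book uses) to print each entity's code position, its Unicode name, and its bidirectional class.

```python
import html
import unicodedata

ENTITIES = ("&zwnj;", "&zwj;", "&lrm;", "&rlm;")

for entity in ENTITIES:
    char = html.unescape(entity)  # resolve the entity to one Unicode character
    print(
        entity,
        ord(char),                         # decimal position, as in &#NNNN;
        unicodedata.name(char),            # official Unicode character name
        unicodedata.bidirectional(char),   # bidirectional class of the character
    )
```

The bidirectional class reported for &lrm; and &rlm; is what lets these invisible characters act as a strongly directional neighbor for an otherwise ambiguous quotation mark.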

The details of internationalized text are complex, and the preceding discussion has only touched on some of the main issues. For detailed information, you are referred to the HTML 4 specifications and the references quoted therein.

Fonts and Font Embedding

One problem with producing quality typography on the Web is the lack of fonts—an author may design a page using dozens of elegant typefaces, but the user viewing the document only has access to the fonts on his or her machine—and this is usually a small number of rather uninteresting fonts! Ideally, it would be nice if a document could reference font information, so that fonts too could be delivered via the Web and loaded into a document when needed.

Both Microsoft and Netscape recognized this need and, as you might expect, have developed two different and entirely incompatible technologies for accomplishing the job. Netscape, in conjunction with Bitstream, has implemented a technology called TrueDoc, which lets authors create a Web page using any fonts they desire (for example, PostScript or TrueType). When the document is saved, the software saves it along with TrueDoc-format descriptions of the fonts. These descriptions are stored in compressed, encrypted files known as Portable Font Resources, or PFR files. The HTML document references these PFR files using style sheet rules or HTML LINK elements. For example, the CSS font reference might be:

@fontdef  { url('http://www.site.com/dir/fontfile.pfr') }

while the HTML font reference might be:

<LINK REL="fontdef" SRC="http://www.site.com/dir/fontfile.pfr">
