Ch 52 -- Java Internationalization

Java 1.1 Unleashed

- 52 -
Java Internationalization

by Glenn Vanderburg

IN THIS CHAPTER

About Internationalization
An Overview of Java's Internationalization Features
Locales
Resource Bundles
Manipulating Data
Input and Output
Internationalizing Graphical Interfaces

As the world grows smaller--because of the continued success of the Internet as well as other developments--it is becoming more important to write software for the world rather than for just one language or culture. Writing internationalized applications has been an extremely challenging task, partly because the problems were not well understood, and partly because good tools and facilities to help with the task have been hard to come by.

Java is intended to be the programming language for the Internet; for it to succeed in that role, it has to provide some of those tools for programmers. It should be easy to write internationalized programs in Java.

Java 1.1 is a big step ahead of other programming languages. Java now has unprecedented support for internationalization: a full array of APIs and facilities to assist programmers in developing for a global market. In this chapter, you learn about the problems of internationalization and how to use Java's internationalization features to solve those problems in your programs.
About Internationalization

Internationalization is a large topic--and a complex one. There are facets to internationalization that have never occurred to most programmers. In fact, the size of the topic is one of the reasons that good internationalization facilities are so rare: Very few programmers or groups of programmers have the understanding necessary to design and build tools and libraries to support the task.

Internationalization is the process of writing a program (or modifying an existing program) so that it is ready to support other languages and cultures. Note that internationalization does not include actually translating messages and other tasks specific to supporting a particular language; that job is called localization. A more concise definition of internationalization is this: preparing a program to be localized.

"But," you ask, "if that's all internationalization is, what's the big deal? It seems that localization is the hard part." Localization certainly is difficult, but internationalization has its own challenges. The typical program, before being internationalized, has messages, prompts, and textual labels scattered throughout the code. That's bad enough, but nonlocalized data isn't the worst problem. Most programs contain hidden assumptions about the meaning and proper format of various constructs, such as numbers, currency values, times, dates, sort order, and capitalization rules.

Programs like that are hard to localize because finding all the locale-dependent data is difficult--and finding the hidden assumptions about language and culture is even more difficult. Furthermore, because all the locale dependencies are intertwined with the application code, localizing such programs frequently involves creating entirely new versions for each new locale. Those versions must be supported and maintained in parallel, which is--to say the least--a support nightmare.

It is far better to internationalize programs first, before localizing them. Internationalized programs can usually be switched easily between any of the supported locales, whether at run time or when the program is shipped to customers. An internationalized program has locale-sensitive data grouped together in just a few parts of the program or in external data files. Additionally, code that depends on locale-specific assumptions is replaced with calls to a few special routines that encapsulate those assumptions.

Thus, internationalization depends on two types of techniques: program structuring techniques and disciplined use of library facilities. In addition, an internationalized program uses three kinds of data:

Internal data. Messages, labels, icons, and other "resources" that can be considered part of the program itself.

System data. Information obtained from the runtime platform, such as date and time information or system properties. (System data is usually a tiny portion of the information manipulated by a program, but it is no less important for purposes of internationalization.)

User data. Text, images, or other information supplied at run time at the direction of the user of the program (this kind of data is also called "documents").

Internationalization isn't an absolute quality. Some programs are more thoroughly internationalized than others. Many programs can support multiple European languages, but are prohibitively difficult to localize for a right-to-left or vertical writing system. Some internationalized programs can be switched between locales when they are run; others require recompilation. Finally, some programs require a particularly strong form of internationalization so that they can actually support multiple languages and locales simultaneously. As an example of such a program, consider a word processor to be used by a linguist. That program would have to allow the use of multiple languages and alphabets in a single document. Such programs are called multilingual (or, in some cases, multilocale) programs. Although this extreme variety of internationalization has its own name, it is still a form of internationalization.
An Overview of Java's Internationalization Features

Most of Java's internationalization facilities fall into the category of library routines that encapsulate rules and assumptions about dealing with different types of locale-sensitive data. There are classes for representing and comparing times, dates, numbers, and currency values. Other classes assist with sorting and classifying text and characters. Additionally, Java provides facilities for formatting and parsing those objects for display, either alone or embedded in other localized text (such as GUI labels or error messages).

Java also provides some help with the program structure issues that are an important part of internationalization. The mechanism for managing internal data related to localization is the ResourceBundle class. This class is versatile enough to support many different ways of storing and representing the data, and the Java 1.1 library contains two simple implementations that are very useful for internationalization.

Underneath nearly every part of Java's internationalization support is the Unicode character set. With the release of Java 1.1, Unicode support in the Java language and its libraries is extremely thorough. Having Unicode as the native character set for strings and characters in the language simplifies the rest of the internationalization task immensely.

All these Java facilities are designed to support dynamic localization, even to the point of supporting multilingual programs. There is no global state that affects the entire program, no "system locale" that must be changed. Instances of the Locale class represent particular locales. All the locale-sensitive methods in the Java libraries can take a Locale object as an argument so that locales can be easily changed from one operation to the next. There are "default locales" that make certain operations easier in programs that use only one locale at a time, but they are implemented so that they do not impact full multilingual support when it is needed.
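As a minimal sketch of this "locale as an argument" style, the following fragment formats the same number for two locales in consecutive calls without touching any global setting. It uses java.text.NumberFormat; the class and method names (PerCallLocaleDemo, format) are made up for illustration.

```java
import java.text.NumberFormat;
import java.util.Locale;

class PerCallLocaleDemo {
    // Formats a value for an explicitly supplied locale.
    // No program-wide "current locale" is read or changed.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));      // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY)); // 1.234,5
    }
}
```

Because the locale travels with each call, a multilingual program can mix both formats in the same window or document.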

In describing the goals for the Java internationalization facilities, JavaSoft's design specification document contains this statement:

Java programs should be internationalized by default. This implies that it should be easier than not to write internationalized Java code.

That goal was perhaps a little unrealistic, and it certainly wasn't realized in every respect. It still takes discipline and a little extra work to make proper use of the library facilities and to structure things properly. But it's notable that little or none of the extra work is particularly cumbersome; in some respects, it actually is easier to do it the right way. Certainly, internationalization is easier in Java 1.1 than it has ever been before.
Locales

Locale objects are tokens that represent locales. In Java, locales are defined as geographic, political, or cultural regions. For practical purposes, a locale is usually the combination of a language and a set of cultural rules and assumptions (regardless of the geographical designation). In most respects, although they are represented as objects rather than strings, Locale objects function simply as names: they don't really have any behavior or semantics apart from identifying locales. In fact, there is a standard form for representing a locale name as a string. It is relatively easy to convert between the string representation and a Locale object and back again.

To construct a Locale, you must supply a language name because language is the single-most important characteristic of a locale. You can also specify a country name, which helps to distinguish between different cultural conventions used by speakers of the same language. For example, English speakers in the United States do many things differently than those in the United Kingdom; Brazilians have some customs that would be unfamiliar to people in Portugal, even though Portuguese is the dominant language in both countries. Here are some examples of how to construct Locale objects for various situations:

    new Locale("en", "")    // Default English/American locale
    new Locale("en", "GB")  // English locale for Great Britain
    new Locale("fr", "CH")  // French locale for Switzerland

The names used for the languages and countries are the two-letter international standard language and country codes. The API documentation for the Locale class contains pointers to Web pages with complete lists. Confusingly, language codes are always two lowercase letters, but country codes are two uppercase letters. The Locale constructors aren't picky about case, but they do convert their parameters to the appropriate case when they are called. When you call the toString() method, the result is the two codes concatenated with an intervening underscore, or just the language code if no country is specified.
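The round trip between a Locale and its string form can be sketched as follows. The parse() helper is hypothetical (it is not part of java.util) and assumes well-formed input; it also uses String.split(), which is available only in JDK releases later than 1.1.

```java
import java.util.Locale;

class LocaleNameDemo {
    // Hypothetical helper: rebuilds a Locale from the underscore-separated
    // form produced by toString(), such as "fr_CH" or "fr".
    static Locale parse(String name) {
        String[] parts = name.split("_", 3);
        String language = parts[0];
        String country = parts.length > 1 ? parts[1] : "";
        String variant = parts.length > 2 ? parts[2] : "";
        return new Locale(language, country, variant);
    }

    public static void main(String[] args) {
        System.out.println(new Locale("en", "GB")); // en_GB
        System.out.println(new Locale("FR", "ch")); // fr_CH (case is normalized)
        System.out.println(new Locale("fr", ""));   // fr
        System.out.println(parse("fr_CH").equals(new Locale("fr", "CH"))); // true
    }
}
```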

There is also a constructor that takes three parameters--all strings--to provide a way to specify additional variant locales, including cultural or linguistic enclaves within a single country and language group. For example, among the locales supported by Java 1.1 are two locales for the country of Norway. Norwegian is the language for both locales, but there is a minor difference: they use different names for some of the days of the week. The default Norwegian locale, Bokmål, has a string representation no_NO_B, but because it is the default, it is the locale you get if you omit the variant token or specify only the language token. The other Norwegian locale, Nynorsk, must be specified explicitly; its string representation is no_NO_NY. Here are some examples:

    new Locale("no", "")            // Bokmål, no_NO_B
    new Locale("no", "NO")          // Bokmål, no_NO_B
    new Locale("no", "NO", "B")     // Bokmål, no_NO_B
    new Locale("no", "NO", "NY")    // Nynorsk, no_NO_NY

Although a Locale is just a form of name, if it is to actually be useful, it must refer to something that really exists. Many internationalization support classes use the locale name to find resources or rules of various kinds for a particular locale. Java 1.1 includes support for the locales listed in Table 52.1.
Table 52.1. Locales supported in Java 1.1.

Minimal Name  Full Name  Language              Country

ar            ar_EG      Arabic                Egypt
be            be_BY      Belorussian           Belarus
bg            bg_BG      Bulgarian             Bulgaria
ca            ca_ES      Catalan               Spain
cs            cs_CZ      Czech                 Czech Republic
da            da_DK      Danish                Denmark
de            de_DE      German                Germany
de_AT         de_AT      German                Austria
de_CH         de_CH      German                Switzerland
el            el_GR      Greek                 Greece
en_CA         en_CA      English               Canada
en_GB         en_GB      English               United Kingdom
en_IE         en_IE      English               Ireland
(none)        en_US      English               United States
es            es_ES      Spanish               Spain
et            et_EE      Estonian              Estonia
fi            fi_FI      Finnish               Finland
fr            fr_FR      French                France
fr_BE         fr_BE      French                Belgium
fr_CA         fr_CA      French                Canada
fr_CH         fr_CH      French                Switzerland
hr            hr_HR      Croatian              Croatia
hu            hu_HU      Hungarian             Hungary
is            is_IS      Icelandic             Iceland
it            it_IT      Italian               Italy
it_CH         it_CH      Italian               Switzerland
iw            iw_IL      Hebrew                Israel
ja            ja_JP      Japanese              Japan
ko            ko_KR      Korean                Korea
lt            lt_LT      Lithuanian            Lithuania
lv            lv_LV      Latvian               Latvia
mk            mk_MK      Macedonian            Macedonia
nl            nl_NL      Dutch                 Netherlands
nl_BE         nl_BE      Dutch                 Belgium
no            no_NO_B    Norwegian (Bokmål)    Norway
no_NO_NY      no_NO_NY   Norwegian (Nynorsk)   Norway
pl            pl_PL      Polish                Poland
pt            pt_PT      Portuguese            Portugal
ro            ro_RO      Romanian              Romania
ru            ru_RU      Russian               Russia
sh            sh_SP      Serbian (Latin)       Serbia
sk            sk_SK      Slovak                Slovakia
sl            sl_SI      Slovene               Slovenia
sq            sq_AL      Albanian              Albania
sr            sr_SP      Serbian (Cyrillic)    Serbia
sv            sv_SE      Swedish               Sweden
tr            tr_TR      Turkish               Turkey
uk            uk_UA      Ukrainian             Ukraine
zh            zh_CN      Chinese               China
zh_TW         zh_TW      Chinese               Taiwan

The Locale class is found in the java.util package.
Resource Bundles

Resource bundles can be used to solve the structuring problem associated with internationalizing internal data. Using resource bundles, locale-sensitive data can be grouped together in one place--or just a few places--so that it can be localized easily.

Resource bundles implement special support for locales. In particular, the java.util.ResourceBundle class provides methods that search for bundles associated with a particular locale. The search process tries several variations, starting with the specified locale, falling back to the default locale, and finally settling for a nonlocalized bundle. The search does not fail unless there simply isn't a resource bundle by the requested name at all. That's convenient, and it is usually what you want: If the program hasn't been localized for a particular locale, it is probably better to continue running without localization than simply to fail.

As an example of how the search process works, assume that your program makes use of a resource bundle called MessageBundle. Also assume that the default locale is fr_CA, and the bundle is requested for the no_NO_NY locale (a situation that might occur if the user is a Norwegian living in Quebec). Now suppose that the application makes the following call:

ResourceBundle.getBundle("MessageBundle", new Locale("no", "NO", "NY"))

In response, the search tests for the existence of these bundles, in the following order:

1. MessageBundle_no_NO_NY

2. MessageBundle_no_NO

3. MessageBundle_no

4. MessageBundle_fr_CA

5. MessageBundle_fr

6. MessageBundle

In other words, the search starts with the specified locale and looks for progressively less-specific alternatives; if none are found, the search starts over again with the default locale; finally, if none of the bundles associated with the default locale is found, the default bundle is tried. If no bundle is found with any of those names, a MissingResourceException is thrown.
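The search order just described can be modeled in a few lines. This is only an illustration of the naming sequence, not the code ResourceBundle actually runs; the class BundleSearchOrder and its methods are invented for this sketch, and it does not bother to remove duplicates when the requested and default locales overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BundleSearchOrder {
    // Adds base_lang_COUNTRY_VARIANT, base_lang_COUNTRY, and base_lang,
    // most specific first, skipping forms whose last component is empty.
    private static void addCandidates(List<String> out, String base, Locale loc) {
        String lang = loc.getLanguage();
        String country = loc.getCountry();
        String variant = loc.getVariant();
        if (!variant.isEmpty()) {
            out.add(base + "_" + lang + "_" + country + "_" + variant);
        }
        if (!country.isEmpty()) {
            out.add(base + "_" + lang + "_" + country);
        }
        if (!lang.isEmpty()) {
            out.add(base + "_" + lang);
        }
    }

    // Requested locale first, then the fallback (default) locale,
    // then the nonlocalized base bundle.
    static List<String> candidates(String base, Locale requested, Locale fallback) {
        List<String> names = new ArrayList<>();
        addCandidates(names, base, requested);
        addCandidates(names, base, fallback);
        names.add(base);
        return names;
    }

    public static void main(String[] args) {
        Locale requested = new Locale("no", "NO", "NY");
        Locale fallback = new Locale("fr", "CA");
        for (String name : candidates("MessageBundle", requested, fallback)) {
            System.out.println(name);
        }
    }
}
```

Run with the locales from the example, main() prints the six names in the same order as the numbered list.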

Resource bundles can be used for any type of data, and the ResourceBundle class can be extended to support any kind of storage mechanism. That's a useful flexibility for internationalization because any kind of data used by a program may have to be localized, including messages, prompts, labels, icons, sound files, and images.
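To make the "any type of data" point concrete, here is a small sketch built on java.util.ListResourceBundle, which stores its resources as key/value pairs of arbitrary objects. The bundle class, keys, and values are hypothetical. A real program would normally obtain the bundle through ResourceBundle.getBundle(); the sketch instantiates it directly only so that it is self-contained.

```java
import java.util.ListResourceBundle;

class InlineBundleDemo {
    // A hypothetical bundle holding one String and one non-String resource.
    static class Messages extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "greeting", "Hello" },
                { "maxRetries", Integer.valueOf(3) },
            };
        }
    }

    public static void main(String[] args) {
        Messages bundle = new Messages();
        String greeting = bundle.getString("greeting");   // String lookup
        Object retries = bundle.getObject("maxRetries");  // any Object
        System.out.println(greeting + " / " + retries);
    }
}
```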
Manipulating Data

Programmers have developed many tricks for quickly dealing with various kinds of data. But it's surprising how often those tricks incorporate cultural or language assumptions. For example, how does one sort a list of words in a language that uses accents and diacritical marks? The problem usually comes as a surprise to English-speaking programmers who are unaccustomed to using accented letters. An even bigger surprise is the revelation that there is no one answer to the question! Different languages and cultures have different rules about how to alphabetize a, á, and â, for example. There are many such issues to be considered for text, dates and times, currency values, and other kinds of data.

Fortunately, the Java libraries provide assistance for handling dates, times, and textual information in a locale-independent way. The next two sections explain the relevant facilities.
Dates and Times

The world doesn't have as many calendar systems as it does languages, but it has enough to be troublesome for the programmer writing international applications. In addition to the standard Gregorian calendar used by most of the world, Chinese, Hebrew, and Islamic calendars are in wide use.

In Java 1.1, dates are represented by instances of the java.util.Date class. Times are represented that way, too; after all, a time is nothing more than a very precise date.

To represent any date, some calendar system must be chosen. The Date class implements one particular calendar system, in which times are represented as the number of milliseconds since the first instant of January 1, 1970, GMT (Greenwich mean time). The number can be negative, to indicate times before that demarcation point. That may not seem like a particularly useful calendar to you; in fact, it is not meant for human consumption. The purpose of this calendar system is to be a simple, uniform calendar for representing times within the Java library.

As such, Date objects are really useful only for comparing with other Date objects. The class provides comparison methods so that it is easy to tell how one Date relates to another. More complicated operations, such as learning about months, years, and days of the week, are the job of the Calendar classes.
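A short sketch of those comparison methods follows. The instants are chosen arbitrarily around the 1970 reference point, and the helper millisBetween() is invented for the example.

```java
import java.util.Date;

class DateCompareDemo {
    // Difference between two instants in milliseconds (b minus a).
    static long millisBetween(Date a, Date b) {
        return b.getTime() - a.getTime();
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);           // first instant of January 1, 1970, GMT
        Date oneSecondLater = new Date(1000L);
        Date oneSecondBefore = new Date(-1000L); // negative values precede 1970

        System.out.println(epoch.before(oneSecondLater));   // true
        System.out.println(epoch.after(oneSecondBefore));    // true
        System.out.println(epoch.equals(new Date(0L)));      // true
        System.out.println(millisBetween(oneSecondBefore, oneSecondLater)); // 2000
    }
}
```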

Calendar provides a generalized view of different calendar systems; particular calendar systems are supported by subclasses of Calendar. The primary purpose of Calendar objects is to convert between Date objects and integer values for year, month, day of month, day of week, hour, minute, second, and so on. (The documentation for the various calendar classes calls these values fields.) You can change the time a Calendar object represents by providing a new Date object, or by changing the integer values that represent the portions of the calendrical date specification. For example, given a Date object d1, you can calculate another date one week later by doing the following:

    Calendar cal = Calendar.getInstance();        // acquire a localized Calendar object
    cal.setTime(d1);                              // set the Calendar's time value
    cal.set(Calendar.WEEK_OF_MONTH,               // set the "week-of-month" field to
            cal.get(Calendar.WEEK_OF_MONTH) + 1); // the current value plus one
    Date d2 = cal.getTime();                      // now retrieve a new Date object

There are some interesting details in this code fragment, so I'll discuss the code step by step. First, note that I didn't just create a new instance of Calendar with a constructor. Instead, I used the "factory method" getInstance() to get a new Calendar object. The getInstance() method creates an object appropriate for the default locale. There is also a version of the method that takes a Locale object as a parameter and creates an object appropriate for the specified locale. This is a pattern common to most of the internationalization classes discussed in the rest of this chapter.

After creating the Calendar object, I had to set the time it represented, using my d1 object. That's a bit cumbersome; hopefully, some future version of Calendar will allow a Date object to be specified as a parameter to the getInstance() method for automatic initialization.

Next, I advanced the week by 1. First I queried the current week of the month, added 1, and reset the same field. The Calendar class can represent dates in terms of the following different fields, denoted by named constants:

Field Name            Description

AM_PM                 Before or after noon
DATE                  The day of the month
DAY_OF_MONTH          The day of the month (synonym for DATE)
DAY_OF_WEEK           The day of the week
DAY_OF_WEEK_IN_MONTH  Occurrence of this day of the week in this month (for example, Tuesday the 10th is the second Tuesday of the month)
DAY_OF_YEAR           The day of the year
DST_OFFSET            Offset from UTC for daylight saving time in this time zone
ERA                   The era in which this date occurs (for example, A.D. or B.C.)
HOUR                  Hour in 12-hour clock
HOUR_OF_DAY           Hour in 24-hour clock
MILLISECOND           Milliseconds within the second
MINUTE                The minute of the hour
MONTH                 The month of the year
SECOND                The second within the minute
WEEK_OF_MONTH         The week of the month
WEEK_OF_YEAR          The week of the year
YEAR                  The year within the era
ZONE_OFFSET           Offset from UTC in this time zone

Any of these fields can be retrieved from a Calendar object with the get() method or be set with the set() method. All the fields are stored as integers; to translate to textual representations, use the DateFormat class described in "Formatted Output and Input," later in this chapter.
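As a small illustration of reading fields with get(), the sketch below converts the Date reference point (time value 0) into calendar fields. It pins the calendar to GMT so that the result does not depend on the machine's time zone; the class and helper names are invented for the example.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class CalendarFieldsDemo {
    // Returns a Gregorian calendar set to the given instant, interpreted in GMT.
    static Calendar gmtCalendarFor(Date instant) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.setTime(instant);
        return cal;
    }

    public static void main(String[] args) {
        Calendar cal = gmtCalendarFor(new Date(0L));
        System.out.println(cal.get(Calendar.YEAR));          // 1970
        System.out.println(cal.get(Calendar.MONTH));         // 0, that is, Calendar.JANUARY
        System.out.println(cal.get(Calendar.DAY_OF_MONTH));  // 1
        System.out.println(cal.get(Calendar.DAY_OF_WEEK) == Calendar.THURSDAY); // true
    }
}
```

Note that MONTH is zero-based, which is why comparing against the named constants such as Calendar.JANUARY is safer than comparing against literal numbers.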

After changing the week of the month, I was able to retrieve the new Date object, which represents the moment exactly one week after the original Date. Notice that there are a couple of funny things going on in this step.

The first question you might ask is this: What if the WEEK_OF_MONTH field was already set to 5? By incrementing it, we set the new date to be in the sixth week of the month. But that doesn't make any sense because months don't have six weeks.

The answer is that Calendar objects, by default, are very permissive about the way dates are specified. If you specify a date as being in the seventh week of the twelfth month, the calendar assumes that you mean the second week of the first month of the following year. If you specify a time as 25:00, the resulting time is 1:00 A.M. of the following day. This behavior can be turned off by calling setLenient(false).
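The difference between the two modes can be sketched with the 25:00 case. The date (January 31, 1997) is arbitrary, the calendar is pinned to GMT for repeatability, and the class and helper names are invented for this example.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class LenientCalendarDemo {
    // Builds a calendar for January 31, 1997 at the given hour, in GMT.
    static Calendar januaryThirtyFirstAt(int hourOfDay, boolean lenient) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
        cal.clear();
        cal.setLenient(lenient);
        cal.set(1997, Calendar.JANUARY, 31, hourOfDay, 0, 0);
        return cal;
    }

    // True if the calendar refuses to turn its fields into a time value.
    static boolean isRejected(Calendar cal) {
        try {
            cal.getTime();
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Calendar permissive = januaryThirtyFirstAt(25, true);
        // Lenient mode rolls 25:00 on January 31 over to 1:00 on February 1.
        System.out.println(permissive.get(Calendar.MONTH) == Calendar.FEBRUARY); // true
        System.out.println(permissive.get(Calendar.DAY_OF_MONTH));  // 1
        System.out.println(permissive.get(Calendar.HOUR_OF_DAY));   // 1

        // Strict mode reports the out-of-range hour when the time is computed.
        System.out.println(isRejected(januaryThirtyFirstAt(25, false))); // true
    }
}
```

The set() call itself does not complain in strict mode; the error surfaces only when the calendar recomputes its time value.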

Another question may have occurred to you: How does the calendar make sense of the fields after the WEEK_OF_MONTH field is incremented? After all, the calendar also knows about fields representing the day of the year and the day of the month. It seems as though changing just one of the fields would cause a conflict with some of the other fields.

The answer is that Calendar keeps track of which fields are explicitly set and gives them precedence over fields that have been inferred from the time value. Because we set only the WEEK_OF_MONTH field, that value has precedence over other fields that contain contradictory information. In fact, Calendar has rules for choosing between contradictory fields even if they have all been explicitly set, but in our example, those rules aren't required.

Java 1.1 includes one specialized calendar implementation, GregorianCalendar, which implements the standard calendar system used by most western countries.
Time Zones

Calendar can localize its operations based on locales, and it can also understand time zones. Instances of java.util.TimeZone represent time zones and incorporate knowledge about time zones around the world, including offsets from GMT and rules about daylight saving time. Other classes in the Java library in addition to the Calendar class make use of TimeZone, including DateFormat (discussed in "Formatted Output and Input," later in this chapter).

There are several ways to create TimeZone objects. To get the current time zone where the computer is running, use the getDefault() method. If you want an object for a particular zone, call getTimeZone() with a string containing that zone's ID (for example, United States Central Standard Time uses an ID of CST). Because TimeZone itself is an abstract class, you can also construct an instance of its concrete subclass, SimpleTimeZone, and give it an explicit offset from GMT, either in the constructor or later with the setRawOffset() method.

Most of the time, you don't have to query or manipulate TimeZone objects directly; you can just create the appropriate instances and pass them, as needed, to other objects (such as Calendar) which use the time zone information appropriately. Therefore, you can think of TimeZone objects as similar to Locale objects: they primarily serve as names or tokens.
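The "pass it along as a token" usage might look like the following sketch. It builds one zone by ID and one with a fixed offset using SimpleTimeZone (a concrete TimeZone subclass in java.util), then hands each to a calendar. The zone ID "Example/SixBehind" is made up, and the helper name is invented for the example.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.SimpleTimeZone;
import java.util.TimeZone;

class TimeZoneDemo {
    // Hour of day, in the given zone, at the instant Date(0) (midnight GMT, January 1, 1970).
    static int hourOfDayAtEpoch(TimeZone zone) {
        Calendar cal = new GregorianCalendar(zone);
        cal.setTime(new Date(0L));
        return cal.get(Calendar.HOUR_OF_DAY);
    }

    public static void main(String[] args) {
        TimeZone gmt = TimeZone.getTimeZone("GMT");
        // Fixed offset of six hours behind GMT, with no daylight saving rules.
        TimeZone sixBehind = new SimpleTimeZone(-6 * 60 * 60 * 1000, "Example/SixBehind");

        System.out.println(hourOfDayAtEpoch(gmt));        // 0
        System.out.println(hourOfDayAtEpoch(sixBehind));  // 18 (the previous evening)
    }
}
```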
Text Processing

Textual data presents an entirely different class of problems. As I write this chapter, I am (not surprisingly) using a word processing program. It is instructive to consider the various ways in which such an application uses and manipulates text, and what kinds of problems might be involved where internationalization is concerned.

For one thing, the interface for most word processors makes copious use of lists of names: styles, fonts, cross-reference tags, and so on. Such lists are ordered alphabetically. Sorting localized text is an internationalization problem, not merely because of different alphabets, but also because of different rules used across cultures.

When I use the arrow keys to move around in my document, the cursor moves one letter or line at a time. But by holding down various combinations of modifier keys, I can move (or, for that matter, delete or select text) by other units: words, sentences, or paragraphs. Furthermore, my word processor adjusts line breaks as I type. But Unicode complicates the task of recognizing boundaries between textual units, including acceptable places for line breaking. In addition, Unicode incorporates the concept of combining characters: two 16-bit Unicode characters (usually a base character and an accent mark) that combine to produce the appearance of one character. So even moving letter by letter is related to internationalization!

When I search for text in my document, I can choose whether the search should be sensitive to the difference between uppercase and lowercase letters. The program can also change the capitalization of text automatically, whether at my direction or in the process of formatting headings or entries in the table of contents. The difference between uppercase and lowercase is easy to deal with in ASCII, but the issue is much more complicated in an international character set like Unicode.

Java 1.1 provides features for dealing with all these issues. They are divided into three categories: collation, text boundaries, and character classification.
Collation

The term collation refers to the act of comparison. It can also refer to sorting, but the Java library uses it in the first sense. Java 1.1 does not provide sorting routines, but it does provide the collation facilities--the comparison facilities--required to implement sort routines on text objects.

The primary collation class is called java.text.Collator. As with many of the Java internationalization classes, Collator objects are not created directly using the constructor; instead, they are created using the static Collator.getInstance() and Collator.getInstance(Locale) methods, which create localized versions of Collator.

The basic collation functionality of Collator lies in the compare(String, String) method. As with similar facilities in other languages such as C, this method returns an integer value: zero if the strings are equal, less than or greater than zero if the first string is, respectively, less than or greater than the second. If you are interested only in whether the strings are equal, the convenience method equals(String, String) returns a boolean value indicating equality.

For one-time comparisons, calls to compare() and equals() are sufficient. If certain strings are to be compared multiple times (as might happen when sorting a list of strings), it is better to use CollationKey objects. The getCollationKey(String) method returns an instance of CollationKey that represents the given string for purposes of comparison within the Collator object's locale. In general, CollationKey objects can be compared more efficiently than String objects can be. Given two CollationKey objects called k1 and k2, you compare them in this way:

k1.compareTo(k2)

You can also retrieve the original source string from a CollationKey, using the getSourceString() method, so that you don't have to keep track of which keys represent which strings.
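One way to put these pieces together is to sort an array of strings through their collation keys, as in the sketch below. It leans on java.util.Arrays.sort(), which comes from JDK releases later than 1.1 (a hand-written sort that calls k1.compareTo(k2) would serve equally well); the class and method names are invented for the example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeySortDemo {
    // Returns a new array holding the words in the collation order of the given locale.
    static String[] sort(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // Compute each key once; the sort then compares keys, not strings.
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }
        Arrays.sort(keys); // CollationKey objects order themselves via compareTo()

        // Recover the original strings from the sorted keys.
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] sorted = sort(new String[] { "banana", "Cherry", "apple" }, Locale.US);
        System.out.println(Arrays.toString(sorted)); // [apple, banana, Cherry]
    }
}
```

A plain String comparison would place "Cherry" first because uppercase letters precede lowercase ones in Unicode order; the collator ranks the base letter ahead of the case difference.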

The way Collator objects decide on the relationship between two strings can be tuned to follow desired conventions. The strength of a comparison refers to how literal it is. For example, an extremely strong comparison considers a and A to be different characters; a weaker comparison might consider a, A, ä, and Å to be all the same. Four strength levels are provided. The precise meaning of the strength values depends on the locale, but here are some common definitions:

PRIMARY: Two characters are considered to be the same if their base letters are the same, regardless of accents, other diacritical marks, or case distinctions. This is the weakest kind of collation.

SECONDARY: Two characters are considered to be the same if the base letter and any diacritical marks are the same, regardless of case distinctions.

TERTIARY: Two characters are considered to be the same only if they have the same base letter, the same diacritical marks, and the same case.

IDENTICAL: Two characters are considered to be the same only if they have the same bit pattern. This is the strongest form of collation.

It may seem that there is no difference between TERTIARY and IDENTICAL, but there is a difference. Unicode may provide more than one way to specify the same letter--for example, the character ë can be specified either as a single 16-bit character or as a combination of e and a combining dieresis mark. The two versions are considered equal under a TERTIARY comparison but are different under the rules for IDENTICAL comparisons.

The strength a Collator object uses for comparisons can be set with the setStrength() method; the strength can be queried using getStrength().
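The effect of the strength setting can be seen in a brief sketch. It assumes the rules that the JDK supplies for the US English locale, and the helper sameAt() is invented for the example.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthDemo {
    // Reports whether two strings compare as equal at the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.equals(a, b);
    }

    public static void main(String[] args) {
        // Case differences matter only at TERTIARY strength and above.
        System.out.println(sameAt(Collator.PRIMARY, "a", "A"));    // true
        System.out.println(sameAt(Collator.SECONDARY, "a", "A"));  // true
        System.out.println(sameAt(Collator.TERTIARY, "a", "A"));   // false

        // Accent differences matter at SECONDARY strength and above.
        System.out.println(sameAt(Collator.PRIMARY, "e", "\u00e9"));   // true
        System.out.println(sameAt(Collator.SECONDARY, "e", "\u00e9")); // false
    }
}
```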
Text Divisions and Boundaries

The BreakIterator class provides a locale-independent way to find the boundaries between certain kinds of textual elements, such as words and sentences. BreakIterator is even helpful for finding boundaries between printable characters; again, this is because two 16-bit Unicode characters can combine to produce one printed glyph. (It wouldn't be a good idea to try to put the insertion cursor between a letter and its accent!)

Once again, you don't create instances of BreakIterator with a constructor. There are actually several different static factory methods for creating BreakIterator objects, depending on what kind of textual element you want to learn about:

getCharacterInstance() returns a BreakIterator that can find boundaries between characters

getLineInstance() returns a BreakIterator that can find valid places for line breaks

getSentenceInstance() returns a BreakIterator that can find sentence boundaries

getWordInstance() returns a BreakIterator that can find word boundaries

Each of these methods comes in two varieties: one with a Locale parameter, and one with no parameter (this form uses the default locale).

Once you have a BreakIterator object, how do you use it? First, you must inform the object about the text you want to examine, using the setText(String) method. Then you can move through the text looking at boundaries using the first(), last(), next(), and previous() methods. Each of these methods moves the current position of the BreakIterator to the requested boundary and returns the integer position of that boundary within the text. The next() and previous() methods return a special value, BreakIterator.DONE, if there are no more boundaries in the requested direction. Additionally, the next() method has a variant that takes an integer parameter for moving ahead by multiple boundaries in one step.

Iterating through text with a BreakIterator is a little more complicated than traversing a data structure using an Enumeration object. The reason is that the information you get back from a BreakIterator is usually not the complete element you are interested in; instead, it is the start or end point of that element within the text. You must save that value and then find the other boundary before you can do useful things. This is a case in which it makes sense to declare and use two loop variables in a for statement. For example, here is a code fragment that loops through all the words in a String called textBuffer, printing them as it goes:

BreakIterator words = BreakIterator.getWordInstance();
words.setText(textBuffer);
for (int start = words.first(), end = words.next();
     end != BreakIterator.DONE;
     start = end, end = words.next()) {
    System.out.println(textBuffer.substring(start, end));
}

Another method, current(), returns the index of the current boundary without changing or moving the current point. The following(int) method returns the first boundary after the specified position in the text.
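
As a rough sketch of the locale-specific factory methods together with following(int) and current(), the class below splits text into sentences and looks up the next word boundary after an offset. The class and method names (SentenceBoundaryDemo, sentences, wordBoundaryAfter) are made up for illustration, and the code is written for a current JDK, so it uses generics that Java 1.1 did not have.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaryDemo {

    /** Splits text into sentences using a BreakIterator for the given locale. */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        for (int start = it.first(), end = it.next();
                end != BreakIterator.DONE;
                start = end, end = it.next()) {
            // Each sentence runs from one boundary to the next; trim trailing spaces.
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    /** Returns the first word boundary after offset and checks that current() agrees. */
    static int wordBoundaryAfter(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int boundary = it.following(offset);
        // following() moves the iterator, so current() reports the same index.
        if (boundary != it.current()) {
            throw new IllegalStateException("unexpected iterator position");
        }
        return boundary;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there. How are you?", Locale.US));
        System.out.println(wordBoundaryAfter("Hello world", 2));
    }
}
```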

What do you do if you need some of the BreakIterator facilities but can't afford to store your application's text in a String object? If efficiency or ease of access dictates that you use a text data structure (such as a trie--a specialized text data structure) instead of a string, you should define a CharacterIterator class for that data structure. CharacterIterator is an interface, similar in some ways to the java.util.Enumeration interface. It provides an abstract interface for iterating over the elements in a data structure. Classes that implement CharacterIterator don't have to contain all the data; they just have to know how to read it sequentially from the data structure and how to keep track of the current position. CharacterIterator differs from Enumeration in that it is specialized for character data and allows bidirectional scanning.

BreakIterator actually has two setText() methods: One takes a String parameter, and the other takes a CharacterIterator parameter. Once you build a CharacterIterator for your data structure, you can use the BreakIterator facilities to analyze the text stored there.
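
The following sketch shows one way such an iterator might look when the text lives in a plain char array rather than a String. CharArrayIterator and CharArrayWordDemo are hypothetical names, the char array merely stands in for a richer data structure, and the code assumes a current JDK.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** A minimal CharacterIterator over a char array (illustrative only). */
final class CharArrayIterator implements CharacterIterator {
    private final char[] chars;
    private int pos;

    CharArrayIterator(char[] chars) { this.chars = chars; }

    public char first() { pos = 0; return current(); }
    public char last() { pos = chars.length == 0 ? 0 : chars.length - 1; return current(); }
    // DONE signals that the position is at the end of the text.
    public char current() { return pos < chars.length ? chars[pos] : DONE; }
    public char next() { if (pos < chars.length) pos++; return current(); }
    public char previous() { if (pos == 0) return DONE; pos--; return current(); }
    public char setIndex(int position) {
        if (position < 0 || position > chars.length) {
            throw new IllegalArgumentException("Invalid index: " + position);
        }
        pos = position;
        return current();
    }
    public int getBeginIndex() { return 0; }
    public int getEndIndex() { return chars.length; }
    public int getIndex() { return pos; }
    public Object clone() {
        CharArrayIterator copy = new CharArrayIterator(chars);
        copy.pos = pos;
        return copy;
    }
}

class CharArrayWordDemo {
    /** Collects every word boundary reported for text held in a char array. */
    static List<Integer> wordBoundaries(char[] text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(new CharArrayIterator(text));
        List<Integer> boundaries = new ArrayList<>();
        for (int b = words.first(); b != BreakIterator.DONE; b = words.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        System.out.println(wordBoundaries("Hello world".toCharArray()));
    }
}
```
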
Character Classification

Even with all the useful text-handling facilities already described, occasionally a program has to work at the level of individual characters. Operations that were trivial using the ASCII character set (such as converting between uppercase and lowercase) are much more complicated with Unicode.

The java.lang.Character class contains several methods for testing character values and for converting them. You can test the case of a character or learn whether it is a whitespace character, for example, and you can convert between uppercase and lowercase.
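
A small sketch of these tests and conversions, using a few non-ASCII characters written as Unicode escapes. The class name CharacterClassDemo and the describe() helper are invented for this illustration.

```java
class CharacterClassDemo {

    /** Classifies a character using the tests in java.lang.Character. */
    static String describe(char c) {
        if (Character.isUpperCase(c)) return "uppercase";
        if (Character.isLowerCase(c)) return "lowercase";
        if (Character.isDigit(c)) return "digit";
        if (Character.isWhitespace(c)) return "whitespace";
        return "other";
    }

    public static void main(String[] args) {
        char eAcute = '\u00C9';      // LATIN CAPITAL LETTER E WITH ACUTE
        char omega = '\u03A9';       // GREEK CAPITAL LETTER OMEGA
        char arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE

        System.out.println(describe(eAcute));
        System.out.println(describe(arabicThree));
        // Case conversion works beyond ASCII as well.
        System.out.println(Character.toLowerCase(omega) == '\u03C9');
        // digit() gives the numeric value of a digit in any script.
        System.out.println(Character.digit(arabicThree, 10));
    }
}
```
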
Input and Output

Manipulating data is all very well and good, but at some point, it comes down to reading that data from somewhere and writing (or displaying) it. Java provides facilities to help with several aspects of internationalized I/O, including Unicode streams and formatting and parsing classes.
Unicode Streams

In Java 1.1, the java.io package contains several I/O stream classes that read and write Unicode data. These classes are called readers and writers, and extend the Reader and Writer classes.

Because the String and char data types in Java represent Unicode data, Unicode I/O streams are the natural and preferred mechanism for text input and output. And it doesn't take any extra work to use them instead of the byte-oriented streams. I won't go into much detail about Unicode streams in this chapter; if you want more details about Unicode I/O streams, see Chapter 12, "The I/O Package."
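
As an illustration only, the sketch below writes a string through an OutputStreamWriter and reads it back through an InputStreamReader, using in-memory byte arrays so that it is self-contained. UnicodeStreamDemo, encode(), and decode() are hypothetical names, and the try-with-resources syntax assumes a JDK much newer than 1.1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class UnicodeStreamDemo {

    /** Converts Unicode text to bytes in the named encoding by way of a Writer. */
    static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, encoding)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Converts bytes in the named encoding back to Unicode text by way of a Reader. */
    static String decode(byte[] data, String encoding) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), encoding)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String word = "na\u00EFve"; // five characters, one of them outside ASCII
        byte[] utf8 = encode(word, "UTF-8");
        System.out.println(utf8.length + " bytes for " + word.length() + " chars");
        System.out.println(decode(utf8, "UTF-8").equals(word));
    }
}
```
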
Formatted Output and Input

The hard part about internationalized input and output is formatting and parsing. Data that is intrinsically textual can be read, manipulated, and written again rather easily, but what about data that is represented as text but must be manipulated in some other form? How do you convert back and forth between the values and their textual representations in a locale-independent way?

That's the job of the Format classes. The java.text package contains the abstract Format class, which describes a generic facility for formatting and parsing textual data representations. Several subclasses of Format perform internationalized handling of data types such as Date and numeric values, plus messages for users.

The interface defined by the Format class consists of four methods: two for formatted output, and two for parsing formatted input. The intent is that strings generated by Format and its subclasses can also be reparsed by those same classes to generate an equivalent object or set of objects.

The primary formatting method is format(Object), which returns a formatted string. The other formatting method is more complicated: format(Object, StringBuffer, FieldPosition) returns the StringBuffer object that is passed as the second parameter, after appending the formatted value to it. Subclasses of Format use the FieldPosition object to communicate information about the formatting process. When the method is called, the FieldPosition parameter contains an integer, the field identifier, which the caller uses to express interest in a particular portion of the formatted string. When the method returns, the FieldPosition object has been updated to contain the beginning and ending positions of that field within the StringBuffer. Such information might be useful for choosing appropriate sizes for GUI elements, and this variation of the format() method is useful for building a formatted string in pieces.
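
Here is a minimal sketch of the three-parameter form, assuming a NumberFormat for the US locale as the concrete Format. It appends a number to an existing buffer and uses a FieldPosition to locate the fraction digits. FieldPositionDemo and fractionPart() are names made up for the example.

```java
import java.text.FieldPosition;
import java.text.Format;
import java.text.NumberFormat;
import java.util.Locale;

class FieldPositionDemo {

    /** Appends the formatted value to buffer and returns the text of its fraction field. */
    static String fractionPart(double value, StringBuffer buffer) {
        Format format = NumberFormat.getInstance(Locale.US);
        // Ask the formatter to report where the fraction digits end up.
        FieldPosition field = new FieldPosition(NumberFormat.FRACTION_FIELD);
        format.format(Double.valueOf(value), buffer, field);
        return buffer.substring(field.getBeginIndex(), field.getEndIndex());
    }

    public static void main(String[] args) {
        StringBuffer buffer = new StringBuffer("Total: ");
        String fraction = fractionPart(1234.56, buffer);
        System.out.println(buffer);
        System.out.println("fraction digits: " + fraction);
    }
}
```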

The two parsing methods parallel the formatting methods. parseObject(String) returns an Object created from the information in the string. parseObject(String, ParsePosition) also returns an Object representing the parsed value, but the ParsePosition parameter is used to control the current position when parsing a string piece by piece. Instances of ParsePosition contain an integer representing a position within a string. When the method is called, the ParsePosition parameter indicates where in the string the method should begin parsing; when the method returns, the parameter has been updated to point to the first character following the parsed value. Thus, it is ready to pass to the next parseObject() call to parse the next piece of the string.
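
Parsing piece by piece might look like the sketch below, which reads a separator-delimited list of numbers with a single ParsePosition. It assumes a US-locale NumberFormat; ParsePositionDemo and parseAll() are hypothetical names, and getErrorIndex() was added to ParsePosition after Java 1.1.

```java
import java.text.Format;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ParsePositionDemo {

    /** Parses numbers separated by a single separator character. */
    static List<Number> parseAll(String text, char separator) {
        Format format = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        List<Number> values = new ArrayList<>();
        while (pos.getIndex() < text.length()) {
            // parseObject() starts at pos and advances it past the parsed value.
            Object value = format.parseObject(text, pos);
            if (value == null) {
                throw new IllegalArgumentException(
                        "Unparseable text at index " + pos.getErrorIndex());
            }
            values.add((Number) value);
            // Skip the separator by hand; the number format does not consume it.
            if (pos.getIndex() < text.length()
                    && text.charAt(pos.getIndex()) == separator) {
                pos.setIndex(pos.getIndex() + 1);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parseAll("10;20.5;30", ';'));
    }
}
```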

Once again, Format is an abstract class and doesn't provide implementations of all these methods. It does provide default implementations of the simple, single-parameter versions, which work by calling the more general methods so that you can build a working Format object simply by supplying implementations for only two methods.
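
To make the "only two methods" point concrete, here is a sketch of a tiny Format subclass for Boolean values. YesNoFormat is an invented class, not part of the Java library; it supplies only the two general methods and inherits the single-parameter versions from Format.

```java
import java.text.FieldPosition;
import java.text.Format;
import java.text.ParseException;
import java.text.ParsePosition;

/** Formats and parses Boolean values as a pair of localized words (illustrative). */
class YesNoFormat extends Format {
    private static final long serialVersionUID = 1L;

    private final String yes;
    private final String no;

    YesNoFormat(String yes, String no) {
        this.yes = yes;
        this.no = no;
    }

    @Override
    public StringBuffer format(Object obj, StringBuffer toAppendTo, FieldPosition pos) {
        if (!(obj instanceof Boolean)) {
            throw new IllegalArgumentException("Cannot format " + obj + " as a Boolean");
        }
        return toAppendTo.append(((Boolean) obj).booleanValue() ? yes : no);
    }

    @Override
    public Object parseObject(String source, ParsePosition pos) {
        int start = pos.getIndex();
        if (source.startsWith(yes, start)) {
            pos.setIndex(start + yes.length());
            return Boolean.TRUE;
        }
        if (source.startsWith(no, start)) {
            pos.setIndex(start + no.length());
            return Boolean.FALSE;
        }
        // Leave the index alone and record where parsing failed.
        pos.setErrorIndex(start);
        return null;
    }

    public static void main(String[] args) throws ParseException {
        Format french = new YesNoFormat("oui", "non");
        // Both calls use the inherited single-parameter methods.
        System.out.println(french.format(Boolean.TRUE));
        System.out.println(french.parseObject("non"));
    }
}
```
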
Formatting and Parsing Individual Objects

The Java 1.1 library provides specialized subclasses of Format for handling two kinds of locale-sensitive data: dates and numbers. DateFormat and NumberFormat each provide the following facilities:

Static getInstance() methods that create instances specialized for a particular locale or for the default locale

Specialized getInstance() methods, such as NumberFormat.getCurrencyInstance(), for creating instances specialized for certain interpretations of the data

Static constants that identify various parts of a formatted value, for use as the field identifiers in FieldPosition objects (for example, DateFormat.MONTH_FIELD)

Methods for setting various style properties for the formatted values (for example, NumberFormat.setMaximumFractionDigits())

Various specialized versions of format() that take parameters of the appropriate type instead of Object

Specialized parse() methods that return the appropriate type instead of Object (these methods can't be versions of parseObject() because methods can't be overloaded on the return type)

DateFormat can format and parse Date objects as dates, or times, or both. NumberFormat can interpret numbers--that is, values of any of the Java numeric types--as ordinary numbers, currency values, or percentages. When parsing numbers using the parse() method, numbers are returned as instances of java.lang.Number.
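
The sketch below exercises several of these facilities with explicit locales: ordinary, currency, and percent number formats, number parsing, and a date that is formatted and then parsed back. LocaleFormatDemo and its method names are assumptions made for the example. Exact date and currency text can vary between JDK releases, so the date is only round-tripped rather than compared with a fixed string.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class LocaleFormatDemo {

    static String number(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    static String currency(double value, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(value);
    }

    static String percent(double value, Locale locale) {
        return NumberFormat.getPercentInstance(locale).format(value);
    }

    /** Parses a number written in the conventions of the given locale. */
    static Number parseNumber(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Formats a date in the LONG style and parses the result back. */
    static Date reparse(Date date, Locale locale) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.LONG, locale);
        try {
            return format.parse(format.format(date));
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(number(1234567.891, Locale.US));
        System.out.println(number(1234567.891, Locale.GERMANY));
        System.out.println(currency(1234.5, Locale.US));
        System.out.println(percent(0.25, Locale.US));
        System.out.println(parseNumber("1.234,5", Locale.GERMANY));
        Date midnight = new GregorianCalendar(2020, Calendar.JANUARY, 15).getTime();
        System.out.println(reparse(midnight, Locale.US).equals(midnight));
    }
}
```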

NOTE: NumberFormat can even handle the two arbitrary-precision numeric types, java.math.BigInteger and java.math.BigDecimal. However, an instance of one of these types is formatted by first calling its longValue() method and then formatting the result as a long. Therefore, the resulting output isn't accurate unless the actual value is within the range that can be represented as a long. Additionally, NumberFormat does not ever return a BigInteger or BigDecimal object when parsing.

You may be surprised to learn that, even though DateFormat and NumberFormat provide all those extra facilities on top of the interface defined by Format, they don't actually implement the parsing and formatting functions! DateFormat and NumberFormat are themselves abstract classes. You must create instances using getInstance() or one of its variants.

When you call getInstance() on one of these classes, you get a preconfigured instance of DecimalFormat (for numbers) or SimpleDateFormat (for dates). These classes can be configured with rules and patterns to format and parse values according to the conventions of a wide variety of locales. The configuration rules are documented in the API documentation, and it's possible to use these classes directly to handle specialized formatting and parsing needs. Unless you really need something special, though, it's best to just call one of the getInstance() methods on DateFormat or NumberFormat and use the object that is returned.
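
For the special cases where a fixed layout really is needed, using the concrete classes directly might look like this sketch. The patterns "#,##0.00" and "yyyy-MM-dd" are just examples, and PatternFormatDemo is a hypothetical class name.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class PatternFormatDemo {

    /** Always shows two fraction digits and groups the integer part by thousands. */
    static String fixedDecimals(double value) {
        DecimalFormat format =
                new DecimalFormat("#,##0.00", new DecimalFormatSymbols(Locale.US));
        return format.format(value);
    }

    /** Formats a date as year-month-day, whatever the default locale is. */
    static String isoDate(Date date) {
        return new SimpleDateFormat("yyyy-MM-dd", Locale.US).format(date);
    }

    public static void main(String[] args) {
        System.out.println(fixedDecimals(1234.5));
        System.out.println(isoDate(new GregorianCalendar(2020, Calendar.JANUARY, 15).getTime()));
    }
}
```

Because the pattern is hard-coded, output produced this way no longer adapts to the user's locale, which is the reason to prefer the factory methods for ordinary display.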

Why? If DecimalFormat and SimpleDateFormat are so configurable, why aren't they the standard objects? Why the extra level of inheritance, with abstract classes that don't actually provide an implementation of the core functionality? The answer is that, as configurable as they are, DecimalFormat and SimpleDateFormat may not be flexible enough to handle all the locales in the world. Some locales may require entirely new formatting classes to be written if they are to be supported correctly. If you asked for a format object for such a locale, you would actually get an instance of one of the new classes. Keeping the basic interface definition and factory methods in a separate class, independent of the actual functionality, makes it easier to support such atypical locales without having to modify any of the classes in the core library.
Formatting and Parsing Textual Messages

Being able to format and parse dates, times, and numbers is nice, but such items rarely occur in isolation; usually they are embedded in other text. The MessageFormat class is designed for formatting and parsing textual strings that may incorporate other data items. Error messages are a prime example of what MessageFormat is good for, but the facility can be used to prepare any text meant for human consumption, including GUI elements and printed reports.

In many respects, MessageFormat is similar to C's printf() function. In fact, there is a static method, MessageFormat.format(String, Object[]), which is very similar indeed. One parameter is a format string that functions as a pattern, with embedded format specifiers indicating how the other parameters should be processed and substituted into the pattern string. Unlike printf(), however, the format() method doesn't actually print the formatted message; it just returns it as a String. Because Java doesn't permit methods to take a variable number of parameters, the additional items are passed as an Object array. Also, the syntax for format specifiers is different; among other things, they incorporate the number of the data item to which they refer. This is because localization of text often involves changing not only the words, but the structure of sentences. Thus, the item that occurs first in the English version of a sentence may have to come last in the German version. (Format patterns are usually taken from a localized resource bundle rather than being included directly in the code.)

Format specifiers within patterns are surrounded by curly braces. Here is a simple example:

MessageFormat.format("No entry for {0} in the database", new Object[] {name});

The format specifier {0} indicates that the first element of the array (with index 0) should be substituted into the string at that point. The element should be a String.

What if one of your data items is not a String? One solution is to format it separately (for example, with NumberFormat or DateFormat) and use the resulting String. There's no need to do that, however, because you can include the fact that the element is a number or date in the format specifier. This example includes a number, a string, and a date:

// numAppt is an int, pName is a String, and apptDate is a Date
MessageFormat.format("{0,number} appointments for {1} on {2,date,medium}",
    new Object[] {new Integer(numAppt), pName, apptDate});

There are four possibilities for the type selector after the comma: date, time, number, and choice. The date selector results in a call to DateFormat.getDateInstance() to do the formatting; the time selector results in a call to DateFormat.getTimeInstance(). The number selector by itself uses NumberFormat.getInstance(), but style options can be specified to modify that behavior. The choice selector is explained in the next section.

If the type selector is followed by another comma, then whatever follows (up to the curly brace that ends the format specifier) is a style option. The format specifier for the date in the preceding example includes the style option medium, which is one of the styles for dates and times. The others are short, long, and full, and they correspond to the SHORT, MEDIUM, LONG, and FULL constants that DateFormat provides for selecting styles with the getInstance() methods. Additionally, the style for a date or time format can be a valid configuration pattern understood by the SimpleDateFormat class.

The valid styles for number formats are currency, percent, and integer; these styles modify the way the number is interpreted appropriately. The style can also be a configuration pattern accepted by the DecimalFormat class.
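
A short sketch of the number styles and of a DecimalFormat pattern used as a style. It pins the locale to Locale.US so the results are predictable; the MessageFormat(String, Locale) constructor it relies on was added after Java 1.1, and MessageStyleDemo is a made-up class name.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageStyleDemo {

    /** Uses the integer and percent styles in one message. */
    static String progress(int done, int total, String target) {
        MessageFormat msg = new MessageFormat(
                "{0,number,integer} of {1,number,integer} files copied "
                + "({2,number,percent}) to {3}", Locale.US);
        double fraction = total == 0 ? 0.0 : (double) done / total;
        return msg.format(new Object[] {done, total, fraction, target});
    }

    /** Uses a DecimalFormat pattern as the style option. */
    static String rounded(double value) {
        MessageFormat msg = new MessageFormat("Value: {0,number,#.##}", Locale.US);
        return msg.format(new Object[] {value});
    }

    public static void main(String[] args) {
        System.out.println(progress(3, 12, "backup"));
        System.out.println(rounded(3.14159));
    }
}
```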

So far, I've discussed only the static format(String, Object[]) method. But instances of MessageFormat are useful, too. In fact, the static method is implemented in terms of a throwaway instance:

public static String format(String pattern, Object[] arguments) {
    MessageFormat temp = new MessageFormat(pattern);
    return temp.format(arguments);
}

Why might you want to create an instance of MessageFormat? There are several possible reasons. For one thing, it would be a good thing to do if you had to print the same message multiple times with different data values. Because parsing and processing the format pattern is reasonably expensive, it would be a good idea to do so only once if the pattern is going to be reused many times. Another reason is that the static version of the method uses the default locale, whereas instances can be created for specific locales. Therefore, you should never use the static method in a multilingual program. You would also create an instance of MessageFormat if you were using it to parse text rather than to format it (more about that topic later).
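
The sketch below illustrates two of these reasons at once: the pattern is parsed a single time and reused for several values, and the instance is tied to an explicit locale. ReusableMessageDemo and balances() are hypothetical names, and the two-argument constructor assumes a JDK newer than 1.1.

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ReusableMessageDemo {

    /** Formats every amount with one MessageFormat created for the given locale. */
    static List<String> balances(Locale locale, double... amounts) {
        // The pattern is parsed once here, not once per message.
        MessageFormat msg = new MessageFormat("Balance: {0,number}", locale);
        List<String> lines = new ArrayList<>();
        for (double amount : amounts) {
            lines.add(msg.format(new Object[] {amount}));
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(balances(Locale.US, 1234.5, 99));
        System.out.println(balances(Locale.GERMANY, 1234.5, 99));
    }
}
```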

The final reason why MessageFormat instances are useful is that they give us more flexibility. With an instance, you can build a pattern programmatically. For example, the preceding example used three different objects; but you could have written it this way:

MessageFormat msg = new MessageFormat("{0} appointments for {1} on {2}");
msg.setFormat(0, NumberFormat.getInstance());
// format 1, for the string, is already set.
msg.setFormat(2, DateFormat.getDateInstance(DateFormat.MEDIUM));
msg.format(new Object[] {new Integer(numAppt), pName, apptDate});

That seems like an awful lot of trouble to go through when you could just encode the information in the format string. But it can be useful in some complicated situations--and it is particularly useful if you have to include some data that is not a number, date, time, or string. You can extend the Format class to handle locale-independent formatting of any data type you want, but you can't extend the MessageFormat format specifier syntax to support your new Format classes. You can, however, make use of the new classes by explicitly including them with the setFormat() method, as just shown.

I mentioned earlier that the format specifiers include the ordinal number of the array element to which they refer, so that they can be reordered if necessary during the localization process. It's important to note that the numbers used to identify the specifiers in the setFormat() call are different from the numbers actually included in the format specifiers. The numbers used by the setFormat() method always refer to the specifiers in the order they occur in the pattern string, starting with zero. That's a problem for internationalization, because when the order of the specifiers changes during localization, the numbers used to set the format objects for the message have to change, too. I hope this problem is resolved in a future Java release.
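
The difference between the two numbering schemes can be sketched as follows. Later releases of the class library (1.4 onward) added MessageFormat.setFormatByArgumentIndex(), which numbers formats by argument rather than by position in the pattern; the example uses it alongside setFormat() for comparison. SetFormatIndexDemo and its method names are invented for the illustration.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Locale;

class SetFormatIndexDemo {

    /** Sets a percent format on the SECOND specifier as it appears in the pattern. */
    static String byPatternOrder(String pattern, Object... args) {
        MessageFormat msg = new MessageFormat(pattern, Locale.US);
        msg.setFormat(1, NumberFormat.getPercentInstance(Locale.US));
        return msg.format(args);
    }

    /** Sets a percent format on whichever specifier refers to argument 1. */
    static String byArgumentIndex(String pattern, Object... args) {
        MessageFormat msg = new MessageFormat(pattern, Locale.US);
        msg.setFormatByArgumentIndex(1, NumberFormat.getPercentInstance(Locale.US));
        return msg.format(args);
    }

    public static void main(String[] args) {
        // Original word order: {1} is also the second specifier, so both agree.
        System.out.println(byPatternOrder("{0} scored {1}", "Ann", 0.5));
        System.out.println(byArgumentIndex("{0} scored {1}", "Ann", 0.5));
        // Reordered pattern: only the argument-based call still targets the score.
        System.out.println(byArgumentIndex("Score of {1} for {0}", "Ann", 0.5));
    }
}
```

With the reordered pattern, byPatternOrder() would attach the percent format to the specifier for the name instead, and formatting would fail because a String cannot be formatted as a number.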

As you can with the other subclasses of Format, you can use MessageFormat for parsing as well as formatting. The parse() methods return arrays of Object, and the patterns are the same as for formatting. The intent is that a message formatted with a pattern can be parsed with the same pattern. Although this logic may fail in some situations (including some uses of ChoiceFormat, described in the next section), it works in general.
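
As a minimal sketch of parsing, the class below recovers the data items from a message that matches a known pattern. The pattern and the names MessageParseDemo and parseAppointments() are assumptions made for the example.

```java
import java.text.MessageFormat;
import java.text.ParseException;
import java.util.Locale;

class MessageParseDemo {

    /** Returns the count and the name found in the message, in argument order. */
    static Object[] parseAppointments(String text) {
        MessageFormat msg = new MessageFormat(
                "{0,number,integer} appointments for {1}", Locale.US);
        try {
            return msg.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Object[] items = parseAppointments("3 appointments for Dr. Lee");
        // Element 0 is a Number; element 1 is the remaining text as a String.
        System.out.println(items[0] + " / " + items[1]);
    }
}
```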

There are two other things to know about MessageFormat. There is currently an arbitrary limit of 10 format specifiers for a single pattern. There's no need for a limit at all, and a comment in the source code indicates that this limit may someday be removed. (On the other hand, if you are using format patterns with 83 specifiers, you should probably consider a different strategy.)

TIP: If you have to include a { or } character in a format pattern, you must enclose it within the string in single quotation marks. If you have to include a single quote character, double it. This quoting mechanism is, in my opinion, a little strange and inconsistent with the backslash-quoting mechanism used for the first level of quoting in Java strings. However, there are problems with using that mechanism in this case as well, so it is difficult to say whether another option may have been better. Just remember to be on the lookout for occasions when any of those three characters must appear in a string formatted by MessageFormat.
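
The quoting rules in the tip can be seen in a short sketch like this one, where literal braces surround a substituted value and an apostrophe is doubled. MessageQuotingDemo and hint() are hypothetical names.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageQuotingDemo {

    /** Produces a message that contains literal braces and an apostrophe. */
    static String hint(String name) {
        MessageFormat msg = new MessageFormat(
                "Write '{'{0}'}' to insert a field; it''s replaced at run time.",
                Locale.US);
        return msg.format(new Object[] {name});
    }

    public static void main(String[] args) {
        System.out.println(hint("user"));
    }
}
```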

The ChoiceFormat Class

Occasionally, some part of the text surrounding a value has to change depending on the value itself. There are many examples of this, but the classic example involves singular and plural forms of words. There was a time when computer users accepted messages like There were 1 match(es) found, but today, people expect better.

Java 1.1 provides the ChoiceFormat class for solving this problem. Assuming that the variable numMatches contains the number of matches found, here's one simple way to format the preceding message:

MessageFormat msg = new MessageFormat("There {0} found.");
double[] limits = {0, 1, 2};
String[] choices = {"were no matches", "was 1 match", "were {0,number} matches"};
ChoiceFormat choice = new ChoiceFormat(limits, choices);
msg.setFormat(0, choice);
msg.format(new Object[] {new Integer(numMatches)});

The ChoiceFormat object is created with two arrays: an array of limits and an array of choices. The two arrays should be the same size. The limits array consists of double values, sorted in ascending order. The choices array consists of String values. ChoiceFormat is invoked to format a number. It compares that number to the numbers in the limits array, and formats the number using the element of the choices array that corresponds to the chosen limit. If there are N entries in each array (numbered here from 1 to N, although the Java arrays themselves are indexed from 0), and we call the number to be formatted x, the matching element is chosen this way:

1 if x < limit[1]

i if limit[i] <= x < limit[i+1]

N if limit[N] <= x

The selected choice is substituted into the pattern and processed recursively. Note that the third choice in the preceding example--used when numMatches is 2 or greater--has a format specifier embedded within it. That specifier is used to format the number into the message if the third choice is selected.
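
The selection rule can be checked with a standalone ChoiceFormat, as in this sketch. The limits and choice strings are examples only, and ChoiceSelectionDemo is an invented class name.

```java
import java.text.ChoiceFormat;

class ChoiceSelectionDemo {
    private static final double[] LIMITS = {0, 1, 2};
    private static final String[] CHOICES = {"no files", "one file", "many files"};

    /** Returns the choice string selected for the given count. */
    static String describe(double count) {
        return new ChoiceFormat(LIMITS, CHOICES).format(count);
    }

    public static void main(String[] args) {
        // Values below the first limit fall back to the first choice;
        // values at or above the last limit use the last choice.
        double[] samples = {-3, 0, 1, 1.5, 2, 100};
        for (double sample : samples) {
            System.out.println(sample + " -> " + describe(sample));
        }
    }
}
```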

That's a bit cumbersome, but fortunately there's an easier way. The ChoiceFormat class understands patterns of its own, and you can embed those patterns in the format specifiers for MessageFormat. For example, the previous message can also be specified like this:

MessageFormat.format("There {0,choice,0#were no matches|1#was 1 match|"
    + "2#were {0,number} matches} found.",
    new Object[] {new Integer(numMatches)});

ChoiceFormat can also be used to parse strings like this one.
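
For reference, a runnable version of the embedded choice pattern might look like the sketch below. It fixes the locale to Locale.US so that the number inside the plural form is formatted predictably; MatchMessageDemo and describe() are hypothetical names.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MatchMessageDemo {
    private static final String PATTERN =
            "There {0,choice,0#were no matches|1#was 1 match|"
            + "2#were {0,number} matches} found.";

    /** Builds the match-count message for the given number of matches. */
    static String describe(int numMatches) {
        MessageFormat msg = new MessageFormat(PATTERN, Locale.US);
        return msg.format(new Object[] {numMatches});
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(1));
        System.out.println(describe(5));
    }
}
```
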
Internationalizing Graphical Interfaces

Java 1.1 provides a new feature for the AWT package that is designed to help with internationalizing graphical interfaces. In Java 1.0, buttons and several other kinds of components displayed their names as a user-visible label. In some cases, the name was useful for determining which button was clicked and was hard-coded into the application, making the label text difficult to localize. Java 1.1 components have an explicit label, distinct from the name, that is used for the user-visible display if it has been set. This separation of values allows the label to be localized while enabling you to use the same component name in all locales.
Summary

In this chapter, you learned about internationalization and all the Java 1.1 features that support this task. Internationalization is primarily a problem of managing program structure so that locale-sensitive data and operations are easy to find, and of using appropriate facilities to manipulate certain kinds of data.

Java's resource bundles help with the structuring issues, and a wide variety of other facilities are included for manipulating, formatting, and parsing locale-sensitive data, including dates, times, numbers, and text, in an internationalized manner.

©Copyright, Macmillan Computer Publishing. All rights reserved.
