Ch 15 -- The Text Package

Java 1.1 Unleashed

-15-
The Text Package

by Eric Burke

IN THIS CHAPTER

Formats
Collators
Iterators

This chapter describes the classes in the java.text package, new to the Java Development Kit 1.1. The text package contains a number of classes that allow the programmer to create internationalized and localized programs that do not contain location-dependent code. In my experience, localization is one of the most overlooked aspects of released software. This problem is compounded by the global characteristics of the World Wide Web. Any Web application--based on Java or not--has to be understood by many users who speak many different languages.

When used properly, the classes contained in java.text allow the developer to cleanly and correctly implement a localized program to be used the world over. Table 15.1 shows the classes and interfaces that are part of the java.text package.
Table 15.1. Classes and interfaces available in the text package.

Class Description

BreakIterator Finds boundary locations in text

CharacterIterator Interface for bidirectional text iteration

ChoiceFormat Attaches a format to a range of numbers

CollationElementIterator Walks through each character of an international string

CollationKey Compares strings that are part of a Collator class

Collator Abstract class that provides Unicode text-comparison services

DateFormat Abstract base class for date and time formatting

DateFormatSymbols Encapsulates date and time formatting functionality for changes across languages and countries

DecimalFormat Formats decimal numbers

DecimalFormatSymbols Represents symbols such as decimal separators and grouping separators required by DecimalFormat when formatting numbers

FieldPosition Aligns columns of formatted text

Format Abstract base class for all formats

MessageFormat Formats localizable concatenated messages

NumberFormat Abstract base class for all number formats

ParsePosition Records the parsing position for formatted strings

SimpleDateFormat Formats and parses dates in a localized way

StringCharacterIterator Implements the CharacterIterator interface for strings

RuleBasedCollator Simple Collator implementation

ParseException Exception thrown when an error occurs while parsing

Formats

Formats provide a way for the programmer to easily handle the formatting of text, numbers, dates, times, and so on, in a localized way. For example, in the United States, the number 4.00 written as a monetary amount is $4.00, but in Germany, it is DM4,00. By using classes derived from java.text.Format, you are spared the details of how a particular locale writes its numbers or strings for any representation such as money, time, and so on. Instead, you can concentrate on writing the application at hand. The JDK 1.1 provides classes you can use to format numbers, dates and times, and text messages. Formats rely heavily on the use of java.util.Locale, so let's take a look at that before continuing. Table 15.2 lists the important methods supported by the Format class.
Table 15.2. Methods in the Format class.

Method Description

format() Formats the given object

parseObject() Parses the given string using the current format
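To make the two methods in Table 15.2 concrete, here is a minimal sketch that drives a formatter purely through the abstract Format type. The class name FormatBaseSketch and the helper names render() and read() are illustrative assumptions, not part of the JDK; the sketch is written against a current JDK rather than JDK 1.1.

```java
import java.text.Format;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper class; only Format, NumberFormat, and Locale come from the JDK.
class FormatBaseSketch {
    // Formats any supported object through the abstract Format API.
    static String render(Format fmt, Object value) {
        return fmt.format(value);
    }

    // Parses text back into an object; returns null if the text cannot be parsed.
    static Object read(Format fmt, String text) {
        try {
            return fmt.parseObject(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Format us = NumberFormat.getInstance(Locale.US);
        Format de = NumberFormat.getInstance(Locale.GERMANY);
        Double value = Double.valueOf(1234.5);

        // The same call produces locale-specific text.
        System.out.println(render(us, value));
        System.out.println(render(de, value));

        // parseObject() reverses format() for the same locale.
        System.out.println(read(us, "1,234.5"));
    }
}
```

Because the helpers accept any Format, the same two calls work unchanged with the date and message formatters described later in this chapter.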

The NumberFormat Class

The java.text.NumberFormat class is the abstract base class for all classes that support number formatting and parsing. Code that uses a NumberFormat-derived class can be written to be completely independent of the current locale with respect to number conventions (that is, decimal sign, percent sign, separator for thousands, and so on).

The NumberFormat class provides a few static methods that return an appropriate number, currency, or percent format for a specific locale:

NumberFormat defFormat = NumberFormat.getInstance();
NumberFormat defCurrFmt = NumberFormat.getCurrencyInstance();
NumberFormat defPctFmt = NumberFormat.getPercentInstance();
NumberFormat frFormat = NumberFormat.getInstance( Locale.FRENCH );
NumberFormat usCurrFmt = NumberFormat.getCurrencyInstance( Locale.US );
NumberFormat usPctFmt = NumberFormat.getPercentInstance( Locale.US );

The first three methods in this list return the NumberFormat objects for the default Locale for numbers, currencies, and percents, respectively.

The second three methods also return NumberFormat objects for numbers, currencies, and percents; however, these methods also accept a Locale object (in these examples, Locale.FRENCH and Locale.US) as a parameter and use that as the Locale whose NumberFormat should be returned.

Once you retrieve a NumberFormat object, you can use it to generate a properly formatted number. The following line creates the string for 4.00 monetary units in the default locale:

String moneyString = NumberFormat.getCurrencyInstance().format( 4.00 );

You can also use the NumberFormat class to parse a string that you know is a representation of a number in the current locale. For example, the following statement parses the string "$4.00" and finds the value 4.00 to store in the variable myFour:

Number myFour = NumberFormat.getCurrencyInstance(Locale.US).parse( "$4.00" );
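The two one-line statements above omit error handling: parse() declares a checked ParseException. The following sketch shows one way to wrap formatting and parsing for Locale.US. The class and method names are hypothetical, and returning NaN on failure is just a choice made for this example.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical wrapper around the currency formatter for Locale.US.
class CurrencyRoundTrip {
    // Formats an amount as U.S. currency text.
    static String formatUs(double amount) {
        return NumberFormat.getCurrencyInstance(Locale.US).format(amount);
    }

    // Parses U.S. currency text; returns NaN when the text is not a currency amount.
    static double parseUs(String text) {
        try {
            Number n = NumberFormat.getCurrencyInstance(Locale.US).parse(text);
            return n.doubleValue();
        } catch (ParseException e) {
            return Double.NaN;
        }
    }

    public static void main(String[] args) {
        String money = formatUs(4.00);
        System.out.println(money);            // prints "$4.00"
        System.out.println(parseUs(money));   // prints "4.0"
        System.out.println(parseUs("four dollars"));
    }
}
```

Note that parse() returns a Number, so the caller decides whether to read it with doubleValue(), longValue(), and so on.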

Finally, you can use the NumberFormat class in conjunction with the FieldPosition class to provide a simple way of aligning numbers on different fields, such as the decimal sign, the percent sign, and so on. Table 15.3 lists the important methods in the NumberFormat class.
Table 15.3. Methods in the NumberFormat class.

Method Description

format() Formats the given object into a string; overrides the Format class

parseObject() Parses a string and creates an Object object

parse() Parses a string and returns a Number object

isParseIntegerOnly() Returns true if this parser stops when it hits a decimal point

setParseIntegerOnly() Specifies whether the parser should read past the decimal point

getInstance() Gets the default NumberFormat object for a locale

getNumberInstance() Gets a general-purpose formatter for a locale

getCurrencyInstance() Gets a currency formatter for a locale

getPercentInstance() Gets a percent formatter for a locale

getAvailableLocales() Returns all locales supported by the NumberFormat class

isGroupingUsed() Returns true if grouping is used; example: The number 12345 would be 12,345 with grouping turned on

setGroupingUsed() Specifies whether grouping should be used

get/setMaximumIntegerDigits() Gets or sets the maximum number of integer digits to be used

get/setMinimumIntegerDigits() Gets or sets the minimum number of integer digits to be used

get/setMaximumFractionDigits() Gets or sets the maximum number of fraction digits to be used

get/setMinimumFractionDigits() Gets or sets the minimum number of fraction digits to be used
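As a quick illustration of the grouping, fraction-digit, and integer-only settings in Table 15.3, consider the following sketch. The class and method names are made up for this example, and the sentinel value returned on a parse failure is an arbitrary choice.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo of a few NumberFormat settings for Locale.US.
class NumberFormatSettings {
    // Formats a value with exactly two fraction digits, with or without grouping.
    static String fixedTwoDecimals(double value, boolean grouping) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        nf.setGroupingUsed(grouping);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    // Parses only the integer part of the text, stopping at the decimal point.
    static long integerPart(String text) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        nf.setParseIntegerOnly(true);
        try {
            return nf.parse(text).longValue();
        } catch (ParseException e) {
            return Long.MIN_VALUE; // sentinel chosen for this sketch only
        }
    }

    public static void main(String[] args) {
        System.out.println(fixedTwoDecimals(12345.678, true));   // prints "12,345.68"
        System.out.println(fixedTwoDecimals(12345.678, false));  // prints "12345.68"
        System.out.println(integerPart("12.75"));                // prints "12"
    }
}
```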

Listing 15.1 shows how to use the NumberFormat class. You can find this code on the CD-ROM that accompanies this book.
Listing 15.1. NumberFormatExample.java: A sample NumberFormat program.

import java.text.*;
import java.util.*;

public class NumberFormatExample {
    public static void main( String args[] ) {
        try {
            double averages[] = { 0.456, 0.78, 0.3, 1.0, .25, .345 };
            // 20 spaces of padding, used to right-align the output
            String spaces = "                    ";
            System.out.println( "Available Locales for NumberFormat" );
            Locale availLocs[] = NumberFormat.getAvailableLocales();
            for( int i=0; i<availLocs.length; i++ ) {
                System.out.println( "\t" + availLocs[i].getDisplayName() );
                DecimalFormat fmt = (DecimalFormat) NumberFormat.getInstance( availLocs[i] );
                // replace the fraction part of the locale's pattern with three fixed digits
                String pattern = fmt.toPattern();
                int len = pattern.length();
                String newPattern = pattern.substring(0,len-4) + ".000";
                fmt = new DecimalFormat( newPattern, new DecimalFormatSymbols(availLocs[i]) );
                // track where the fraction field ends so the columns line up
                FieldPosition status = new FieldPosition( NumberFormat.FRACTION_FIELD );
                for( int j=0; j<averages.length; j++ ) {
                    StringBuffer sb = new StringBuffer();
                    fmt.format( averages[j], sb, status );
                    System.out.println( spaces.substring(0, 20-status.getEndIndex()) + sb.toString() );
                }
                System.out.println("");
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output from this program looks like this:

Available Locales for NumberFormat
        Belorussian (Belarus)
               0,456
               0,780
               0,300
               1,000
               0,250
               0,345

        Bulgarian (Bulgaria)
               0,456
               0,780
               0,300
               1,000
               0,250
               0,345

As you can see, the same code block produces different output strings depending on the locale.
The DateFormat Class

The java.text.DateFormat class is an abstract base class for all classes that parse and format dates and times in a localized manner. Like the NumberFormat class, the DateFormat class also provides a number of static functions that retrieve default formats for dates and times:

DateFormat fmt;
fmt = DateFormat.getDateInstance();
fmt = DateFormat.getDateInstance( DateFormat.SHORT );
fmt = DateFormat.getDateInstance( DateFormat.LONG, Locale.FRENCH );
fmt = DateFormat.getDateTimeInstance();
fmt = DateFormat.getDateTimeInstance( DateFormat.LONG, DateFormat.SHORT );
fmt = DateFormat.getDateTimeInstance( DateFormat.LONG, DateFormat.SHORT, Locale.US );
fmt = DateFormat.getTimeInstance();
fmt = DateFormat.getTimeInstance( DateFormat.SHORT );
fmt = DateFormat.getTimeInstance( DateFormat.LONG, Locale.FRENCH );

The getDateInstance() method returns the default date format for the default locale. The second version of this method, getDateInstance(DateFormat.SHORT), returns the date format for the given style in the default locale. The style attribute can be SHORT, MEDIUM, LONG, FULL, or DEFAULT. The third form of this method, getDateInstance(DateFormat.LONG, Locale.FRENCH), returns the date format for the given style in the given locale.

The getDateTimeInstance() method returns the default date and time format for the default locale. The getDateTimeInstance(DateFormat.LONG, DateFormat.SHORT) version returns the date and time format for the given date and time formatting styles in the default locale. The next version of this method, getDateTimeInstance(DateFormat.LONG, DateFormat.SHORT, Locale.US), returns the date and time format for the given date and time formatting styles in the given locale.

The getTimeInstance() method returns the default time format for the default locale. The second version of this method, getTimeInstance(DateFormat.SHORT), returns the time format with the given style for the default locale. The style attribute for the time methods can have the same values as the style attribute for the date methods: SHORT, MEDIUM, LONG, FULL, or DEFAULT. The third version of this method, getTimeInstance(DateFormat.LONG, Locale.FRENCH), returns the time format with the given style in the given locale.

Here are some examples of the DateFormat attributes:

Attribute Example

SHORT 4/2/97 (completely numeric)

MEDIUM Apr 2, 1997

LONG April 2, 1997

FULL Wednesday, April 2, 1997 AD

The object returned by these functions is usually of type java.text.SimpleDateFormat, which provides a concrete implementation of the abstract DateFormat class. Once you have retrieved a DateFormat object, you can use it to properly format a date or parse a string to find a date.

These field constants are also available in the DateFormat class: AM_PM_FIELD, DATE_FIELD, and DAY_OF_WEEK_FIELD, among others. These field constants are used with the FieldPosition class to help properly align strings formatted with a DateFormat object. For example, if the formatted date is Friday, April 4, 1997, and the FieldPosition object is using DAY_OF_WEEK_FIELD for alignment, the DateFormat object determines that the day of the week occupies string[0] through string[5]. After formatting, the FieldPosition object's getBeginIndex() method returns 0 and its getEndIndex() method returns 6, the index just past the last character of the field.
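The field lookup just described can be checked with a short sketch. The class DayOfWeekField and its helper methods are hypothetical names chosen for this illustration; the pattern string and Locale.US are assumptions made so that the output is predictable.

```java
import java.text.DateFormat;
import java.text.FieldPosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical demo: locate the day-of-week field inside a formatted date.
class DayOfWeekField {
    // Appends the formatted date to 'out' and returns {beginIndex, endIndex} of the field.
    static int[] dayOfWeekSpan(Date date, StringBuffer out) {
        SimpleDateFormat fmt = new SimpleDateFormat("EEEE, MMMM d, yyyy", Locale.US);
        FieldPosition pos = new FieldPosition(DateFormat.DAY_OF_WEEK_FIELD);
        fmt.format(date, out, pos);
        return new int[] { pos.getBeginIndex(), pos.getEndIndex() };
    }

    // Midnight on April 4, 1997, in the default time zone.
    static Date april4th1997() {
        return new GregorianCalendar(1997, Calendar.APRIL, 4).getTime();
    }

    public static void main(String[] args) {
        StringBuffer text = new StringBuffer();
        int[] span = dayOfWeekSpan(april4th1997(), text);
        System.out.println(text + " -> begin=" + span[0] + ", end=" + span[1]);
    }
}
```

The end index is exclusive, so text.substring(span[0], span[1]) yields exactly the day name.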

Table 15.4 lists the important methods in the DateFormat class.
Table 15.4. Methods in the DateFormat class.

Method Description

format() Formats the given object into a string; overrides class Format

parseObject() Parses a string and creates an Object object

parse() Parses a string and returns a Date object

getTimeInstance() Gets a time formatter for a locale

getDateInstance() Gets a date formatter for a locale

getDateTimeInstance() Gets a date and time formatter for a locale

getInstance() Gets a default date and time formatter that uses the SHORT style for both the date and the time

getAvailableLocales() Gets all the locales supported by DateFormat

get/setCalendar() Gets or sets the Calendar to be used by the DateFormat object

get/setNumberFormat() Gets or sets the NumberFormat to be used by the DateFormat object

get/setTimeZone() Gets or sets the TimeZone object for the calendar of this DateFormat object

is/setLenient() Gets or sets whether this object uses lenient parsing; if lenient parsing is used, the parser will use heuristics to interpret the input; if lenient parsing is off, the parser will use strict parsing rules
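The difference between lenient and strict parsing is easiest to see with an impossible date. The sketch below is illustrative only; the class name, the helper parses(), and the MM/dd/yyyy pattern are assumptions made for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical demo of setLenient() on a SimpleDateFormat.
class LenientParsing {
    // Returns true if the text parses as a date under the chosen leniency.
    static boolean parses(String text, boolean lenient) {
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy", Locale.US);
        fmt.setLenient(lenient);
        try {
            fmt.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // February 30 does not exist: lenient parsing rolls it forward,
        // strict parsing rejects it.
        System.out.println(parses("02/30/1997", true));   // prints "true"
        System.out.println(parses("02/30/1997", false));  // prints "false"
    }
}
```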

Listing 15.2 shows some examples of the use of the DateFormat class. You can find this code on the accompanying CD-ROM.
Listing 15.2. DateFormatExample.java: A sample DateFormat program.

import java.text.*;
import java.util.*;

public class DateFormatExample {
    public static void main( String args[] ) {
        try {
            System.out.println( "Available Locales for DateFormat" );
            Locale availLocs[] = DateFormat.getAvailableLocales();
            for( int i=0; i<availLocs.length; i++ ) {
                System.out.println( "\t" + availLocs[i].getDisplayName() );
            }
            SimpleDateFormat fmt = new SimpleDateFormat( "'It is now' H:mm 'on' EEEE',' MMMM d',' yyyy" );
            FieldPosition status = new FieldPosition( DateFormat.DAY_OF_WEEK_FIELD );
            // format today's date
            Date today = new Date();
            StringBuffer sbToday = new StringBuffer();
            fmt.format( today, sbToday, status );
            int todayOffset = status.getEndIndex();
            // format tomorrow's date
            Date tomorrow = new Date( today.getTime() + 86400000 );
            StringBuffer sbTmw = new StringBuffer();
            fmt.format( tomorrow, sbTmw, status );
            int tmwOffset = status.getEndIndex();
            // format tomorrow+1
            Date tp1 = new Date( tomorrow.getTime() + 86400000 );
            StringBuffer sbTp1 = new StringBuffer();
            fmt.format( tp1, sbTp1, status );
            int tp1Offset = status.getEndIndex();
            // align all dates in column 40 of the screen using the DAY_OF_WEEK
            // (the padding string holds 40 spaces)
            String spaces = "                                        ";
            System.out.println("Dates");
            System.out.print( spaces.substring(0, 40-todayOffset) );
            System.out.println( sbToday.toString() );
            System.out.print( spaces.substring(0, 40-tmwOffset) );
            System.out.println( sbTmw.toString() );
            System.out.print( spaces.substring(0, 40-tp1Offset) );
            System.out.println( sbTp1.toString() );
            // parse a date from a string (reverse-formatting)
            String dateStr = "It is now 16:26 on Tuesday, February 4, 1997";
            Date date = fmt.parse( dateStr );
            System.out.println("Parsing");
            System.out.println( "\t" + date.toString() );
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output of this program looks something like the following. The actual output from your system may vary depending on which locales are installed.

Available Locales for DateFormat
        Arabic (Egypt)
        Belorussian (Belarus)
        Bulgarian (Bulgaria)
        Catalan (Spain)
        Czech (Czech Republic)
        Danish (Denmark)
        German (Germany)
        <more locales here>
Dates
               It is now 14:37 on Monday, April 7, 1997
              It is now 14:37 on Tuesday, April 8, 1997
            It is now 14:37 on Wednesday, April 9, 1997
Parsing
        Tue Feb 04 16:26:00 PST 1997

As you can see, the dates are aligned on the day of the week, and the string we made up is parsed into a java.util.Date object, just as we expected!
The ChoiceFormat Class

The java.text.ChoiceFormat class is a formatter that allows you to attach a pattern to a range of numbers (of type double). Its most common use is in conjunction with the MessageFormat class to handle cases with plurals (for example, "zero objects," "one object," and "many objects"), although it is certainly not limited to such use.

A ChoiceFormat object is specified with a list of ascending numbers (doubles) that determine the limits to be used. A number X falls into a given interval between list[j] and list[j+1] if and only if list[j] <= X < list[j+1]. If X < list[0], then list[0] is used. Similarly, if X > list[N-1] (where there are N items in the list), list[N-1] is used.

For example, if the list is {1.0, 2.0, 3.0}, then the following are true:

0.5 maps to list[0] because 0.5 is less than 1.0 (list[0])
1.5 also maps to list[0] because 1.0 <= 1.5 < 2.0
2.5 maps to list[1] because 2.0 <= 2.5 < 3.0
3.5 maps to list[2] because 3.0 <= 3.5

Along with the list of numbers that determine the limits is a list of objects. The list of objects has the same number of items as the list of limits and contains the items to be used as the formats for the corresponding limits. Although it sounds confusing, it is really very simple. Table 15.5 lists the important methods in the ChoiceFormat class.
Table 15.5. Methods in the ChoiceFormat class.

Method Description

applyPattern() Sets the pattern for a ChoiceFormat object

toPattern() Gets the pattern for a ChoiceFormat object

setChoices() Sets the limits and the corresponding formats for a ChoiceFormat object

getLimits() Gets the limits for a ChoiceFormat object

getFormats() Gets the formats for a ChoiceFormat object

format() Formats an object into a string

parse() Parses a string and creates a Number object

nextDouble() Finds the smallest double that is greater than a given value

previousDouble() Finds the largest double that is less than a given value

Listing 15.3 shows a simple example. You can also find this example on the CD-ROM that accompanies this book.
Listing 15.3. SimpleChoiceFormatExample.java: A sample ChoiceFormat program.

import java.text.*;
import java.util.*;

public class SimpleChoiceFormatExample {
    public static void main( String args[] ) {
        try {
            double[] limits = { 1, 4, 7, 10 };
            String[] seasons = { "Winter", "Spring", "Summer", "Autumn" };
            ChoiceFormat fmt = new ChoiceFormat( limits, seasons );
            for (int i = 1; i <= 12; ++i) {
                System.out.println( "Month number " + i + " falls in " + fmt.format(i) );
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This program prints the following output:

Month number 1 falls in Winter
Month number 2 falls in Winter
Month number 3 falls in Winter
Month number 4 falls in Spring
Month number 5 falls in Spring
Month number 6 falls in Spring
Month number 7 falls in Summer
Month number 8 falls in Summer
Month number 9 falls in Summer
Month number 10 falls in Autumn
Month number 11 falls in Autumn
Month number 12 falls in Autumn

By letting the formatter do the work, you save the effort of doing many comparisons to determine the season in which a particular month falls. The next section makes it clearer why the ChoiceFormat class is useful.
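The interval rules given earlier for the list {1.0, 2.0, 3.0}, including the cases below the first limit and above the last one, can be confirmed with a small sketch. The class name, the pick() helper, and the labels are hypothetical; the sketch also uses previousDouble() to show that each upper bound is exclusive.

```java
import java.text.ChoiceFormat;

// Hypothetical demo of how ChoiceFormat maps numbers onto intervals.
class ChoiceIntervals {
    static final double[] LIMITS = { 1.0, 2.0, 3.0 };
    static final String[] LABELS = { "first", "second", "third" };

    // Returns the label whose interval contains x.
    static String pick(double x) {
        return new ChoiceFormat(LIMITS, LABELS).format(x);
    }

    public static void main(String[] args) {
        double[] samples = { 0.5, 1.5, 2.0, 2.5, 3.5 };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + pick(samples[i]));
        }
        // The largest double below 2.0 still belongs to the first interval.
        System.out.println(pick(ChoiceFormat.previousDouble(2.0)));
    }
}
```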
The MessageFormat Class

The java.text.MessageFormat class provides a simple way to get concatenated messages in a language-neutral (localized) way. A MessageFormat object has a specified pattern and, optionally, a list of Format objects associated with it. A MessageFormat object's specification is of the following form:

MessageFormat fmt = new MessageFormat( "The incoming fax from {0} has a total of {1} pages.");

The string passed in is called the pattern. The pattern is used by the MessageFormat object when formatting and is subject to the following set of rules:

1. The syntax {N} (0 <= N <= 9) indicates that the Nth argument in the list of arguments passed to format() should be formatted using format N (which is specified by calling setFormat() ). An example pattern for this rule is "The person's name is {0}".

2. The optional syntax {N, <elementType>} indicates that the Nth argument should be formatted with the Nth format, subject to the constraints set by <elementType>. The valid element types are time, date, number, and choice. If an element type is provided, the formatter assumes that the argument is of the type indicated and throws an exception if it is not. If no element type is provided (as in rule 1 just listed), it is assumed to be a string. A sample pattern is "It is now {0,time} on {1,date}, and I am {3,number} years old." The formatter assumes that argument 0 represents a time, argument 1 represents a date, and argument 3 represents a number.

3. The elementType can have a style. Valid styles for dates and times are short, medium, long, and full; they are written in lowercase inside the pattern, as in {1,date,long}. Valid styles for numbers are currency, percent, and integer.
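The three rules above can be combined in one pattern. The following is a sketch under stated assumptions: the class TypedPlaceholders and its describe() method are invented for illustration, the pattern text is made up, and the MessageFormat(String, Locale) constructor comes from JDK releases later than 1.1 (with JDK 1.1 you would call setLocale() and then applyPattern() instead).

```java
import java.text.MessageFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical demo of element types and styles in a MessageFormat pattern.
class TypedPlaceholders {
    // {0} is formatted as a long date, {1} as plain text, {2} as a percentage.
    static String describe(String user, Date when, double fraction) {
        MessageFormat fmt = new MessageFormat(
            "On {0,date,long}, {1} used {2,number,percent} of the quota.", Locale.US);
        Object[] args = { when, user, Double.valueOf(fraction) };
        return fmt.format(args);
    }

    public static void main(String[] args) {
        Date when = new GregorianCalendar(1997, Calendar.APRIL, 2).getTime();
        System.out.println(describe("Joe", when, 0.25));
    }
}
```

The argument order in the array, not the order of appearance in the pattern, decides which value fills each placeholder.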

Table 15.6 lists the important methods found in the MessageFormat class.
Table 15.6. Methods in the MessageFormat class.

Method Description

get/setLocale() Gets or sets the locale for the MessageFormat object

applyPattern() Sets the pattern for the object

toPattern() Gets the pattern for the object

setFormats() Sets all the formats to be used by the object

setFormat() Sets an individual format to be used by the object

getFormats() Gets all the formats for the object

format() Formats an object and returns a string

parse() Parses a string and returns an array of objects

parseObject() Parses a string and returns the next object

Listing 15.4 shows a simple example of how to use MessageFormat objects. You can find the code on the CD-ROM that accompanies this book.
Listing 15.4. MessageFormatExample.java: A sample MessageFormat program.

import java.text.*;
import java.util.*;

public class MessageFormatExample {
    public static void main( String args[] ) {
        try {
            MessageFormat fmt = new MessageFormat( "The fax from {1} has {0} pages." );
            Object fmtArgs[] = { new Long(5), "Joe Schmo" };
            System.out.println( fmt.toPattern() + "; " + fmt.format(fmtArgs) );
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output of this short code snippet is as follows:

The fax from {1} has {0} pages.; The fax from Joe Schmo has 5 pages.

Because we did not specify any formats for the particular arguments, the MessageFormat object simply substituted fmtArgs[0] for {0} and fmtArgs[1] for {1}. Listing 15.5 uses a ChoiceFormat object in conjunction with a MessageFormat object to create a formatted string. You can find this file on the CD-ROM that accompanies this book.
Listing 15.5. MessageFormatExample2.java: A more complex MessageFormat program.

import java.text.*;

public class MessageFormatExample2 {
    public static void main( String args[] ) {
        try {
            // the limits to use for ChoiceFormat
            double[] limits = { 0, 1, 2};
            // strings for 0, 1, and >1 pages
            String[] pages = { "no pages", "one page", "{1,number} pages" };
            // a ChoiceFormat object based on the given limits
            ChoiceFormat chFmt = new ChoiceFormat( limits, pages );
            // senders of faxes
            String[] senders = { "Joe", "Fred", "Mary" };
            // formats for the arguments in the MessageFormat
            // (not used below; setFormat() installs chFmt directly)
            Format[] testFormats = { null, chFmt };
            MessageFormat messFmt = new MessageFormat( "The fax from {0} has {1}." );
            messFmt.setFormat( 1, chFmt );
            for (int i = 0; i < 3; ++i) {
                // an array of Objects to pass to the MessageFormat
                Object[] testArgs = { senders[i], new Long(i) };
                // format the arguments and print out the resulting string
                System.out.println( messFmt.toPattern() + " -> " + messFmt.format(testArgs) );
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output of this code is as follows:

The fax from {0} has {1,choice,0.0#no pages|1.0#one page|2.0#{1,number} pages}. -> The fax from Joe has no pages.
The fax from {0} has {1,choice,0.0#no pages|1.0#one page|2.0#{1,number} pages}. -> The fax from Fred has one page.
The fax from {0} has {1,choice,0.0#no pages|1.0#one page|2.0#{1,number} pages}. -> The fax from Mary has 2 pages.
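The pattern printed by toPattern() suggests a shortcut: the choice can be written directly into the pattern string, so no separate ChoiceFormat object is needed. The sketch below assumes this inline form; the class FaxMessages and its describe() method are hypothetical, and the MessageFormat(String, Locale) constructor is taken from JDK releases later than 1.1.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo: a choice element embedded directly in the pattern.
class FaxMessages {
    // "1<" means "greater than 1", so 2 and above use the last alternative.
    static final String PATTERN =
        "The fax from {0} has {1,choice,0#no pages|1#one page|1<{1,number,integer} pages}.";

    static String describe(String sender, long pages) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        return fmt.format(new Object[] { sender, Long.valueOf(pages) });
    }

    public static void main(String[] args) {
        System.out.println(describe("Joe", 0));
        System.out.println(describe("Fred", 1));
        System.out.println(describe("Mary", 5));
    }
}
```

Keeping the whole sentence, plural handling included, in one pattern string makes it easier to hand the text to a translator as a single resource.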
Collators

The java.text.Collator class is an abstract class that provides a common interface for the language-sensitive comparison of strings, text searches, and alphabetical sorting. The Collator class hides from the developer the nuances of any individual locale so that you can use the same code in any local setting. Table 15.7 lists the important methods used in the Collator class.
Table 15.7. Methods in the Collator class.

Method Description

getInstance() Gets a Collator for a locale

compare() Compares two strings according to the rules for a Collator

getCollationKey() Transforms a string into a set of bits that can be compared (bitwise) to other CollationKeys from the same Collator

get/setStrength() Gets or sets the strength of the Collator; legal values are PRIMARY, SECONDARY, and TERTIARY

get/setDecomposition() Gets or sets the decomposition mode for the Collator; legal values are NO_DECOMPOSITION, CANONICAL_DECOMPOSITION, and FULL_DECOMPOSITION

getAvailableLocales() Gets a list of locales supported by the Collator class

Basic Collation

Languages throughout the world differ with respect to both the characters they use and the way they treat those characters when comparing and sorting them. There are four areas that apply to correct string comparison and sorting: ordering characters, grouping characters, expanding characters, and ignoring characters. The following sections explain these areas in more detail.

NOTE: Java uses the Unicode representation of strings instead of using multibyte representation.

Ordering

There are three types of orderings: primary, secondary, and tertiary. When you compare two strings, you first do so by comparing characters at the same positions in each string. The first difference in this primary ordering determines the order of the strings, regardless of the remaining characters. For example, "deed" is less than "definition". The first primary difference is in the third character, where "e" is less than "f". In languages such as English and German, the primary ordering is in base letters. For example, "a" is different from "b" but is not different from "A". Remember that Java uses the Unicode representation of characters, which is actually a superset of the ASCII character set. Punctuation such as spaces and quotation marks precedes numbers, which precede letters in the ordering.

If the primary ordering shows that the strings are identical, the comparison then moves to secondary ordering. In English, the secondary ordering is in the case of the characters. Thus, "apple" is less than "Apple". In this example, the primary ordering of the characters is the same, but the first characters have a secondary difference. In languages such as Czech and French, the secondary ordering is in accents.

Finally, if all secondary orderings are also identical, the comparison moves to tertiary ordering. For example, in Czech, the secondary ordering is based on accent marks ("e" is less than "è") and the tertiary ordering is based on case.
Groups of Characters

Some languages stipulate that a certain sequence of characters should be treated as a single character. For example, in Spanish, "c" is less than "ch" which is less than "d", because "ch" is treated as a base character in itself and is placed in the ordering between the characters "c" and "d".

Note that the language normally determines when a set of characters should be treated as a single character (as in the Spanish "ch" example), but the programmer can also insert his or her own grouped characters, as we will see later in this chapter with the RuleBasedCollator example.
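As a preview of how a programmer-defined group behaves, here is a minimal sketch that uses the RuleBasedCollator class described later in this chapter. The class name and the tiny rule string are assumptions for illustration only; a real Spanish collator needs far more rules. In the rule string, the sequence "ch" is declared as one unit that sorts between "c" and "d".

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo: treating "ch" as a single character for sorting.
class SpanishStyleGrouping {
    // Illustrative rules only: c, then the group ch, then d, then z.
    static final String RULES = "< c < ch < d < z";

    static int compare(String a, String b) {
        try {
            return new RuleBasedCollator(RULES).compare(a, b);
        } catch (ParseException e) {
            throw new IllegalStateException("rules failed to parse: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Plain code-point order puts "ch" before "cz" ...
        System.out.println("ch".compareTo("cz") < 0);
        // ... but with "ch" grouped, "cz" sorts before "ch".
        System.out.println(compare("cz", "ch") < 0);
        System.out.println(compare("ch", "d") < 0);
    }
}
```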
Expanding Characters

Some languages stipulate that a single character be treated as a sequence of characters. For example, in German, "s" is less than "ß" which is less than "t". In this case, the character "ß" is treated as though it is the characters "ss" for ordering purposes.
Ignorable Characters

Most languages have certain characters that can be ignored when you compare and sort strings. That is, some characters are not significant unless there are no differences in the remainder of the string. In English, one such character is the dash (-). For example, "foobar" is less than "foo-bar" which is less than "foobars".

For any given collation operation, you can specify a strength (PRIMARY, SECONDARY, or TERTIARY). The strength of a collation is the highest level at which comparisons are made; differences in levels beyond the specified strength are ignored. For example, if you set the strength of a collation to SECONDARY, any characters that have tertiary differences are reported as being equal. The strength of a collator can be set using Collator.setStrength().
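The effect of the strength setting can be observed with a few comparisons. This is a sketch, not a definitive reference: the class and method names are invented, and which level a particular difference (case, accent) falls on is decided by the rules that ship with the JDK for the chosen locale, so the example only prints what it finds.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo of Collator.setStrength().
class StrengthDemo {
    // Compares two strings with a Locale.US collator at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        int[] strengths = { Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY };
        String[] names = { "PRIMARY", "SECONDARY", "TERTIARY" };
        for (int i = 0; i < strengths.length; i++) {
            System.out.println(names[i] + ": apple vs Apple = "
                + compareAt(strengths[i], "apple", "Apple"));
        }
    }
}
```

At PRIMARY strength the two spellings compare as equal; at TERTIARY strength they do not.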
Comparing Strings Using the CollationKey Class

It is simple to compare two strings using the Collator.compare() method. However, the comparison algorithm used by Collator.compare() is very complex. If you are sorting long lists of strings, the operation may be very slow because compare() repetitively compares the same strings. As an alternative, you can use the java.text.CollationKey class, which is a key representing a given string in a collation. You can generate CollationKey objects for all your strings and cache them for use in all comparisons instead of using the strings themselves. Because they are bit-ordered, CollationKey objects allow you to do bitwise comparisons; in addition, once the keys are generated, comparisons are faster than direct comparisons of the two strings.

Table 15.8 lists the important methods in the CollationKey class.
Table 15.8. Methods in the CollationKey class.

Method Description

compareTo() Compares the CollationKey to another CollationKey object from the same Collator

getSourceString() Returns a reference to the actual string which maps to the CollationKey under the given Collator

toByteArray() Converts the CollationKey to a sequence of bits; used for bitwise comparison of keys

TIP: Here is a good rule of thumb when deciding whether to use direct comparison or comparisons using CollationKeys: If you compare the strings more than once, use CollationKeys; if you compare the strings only once, use the direct comparison. Also note that you cannot compare CollationKey objects from different Collator objects.
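The rule of thumb above translates into a simple pattern: build each key once, sort the keys, and then read the strings back. The sketch below assumes a current JDK, where CollationKey implements Comparable and can be passed straight to Arrays.sort(); the class name KeySort is hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo: sorting strings through cached CollationKeys.
class KeySort {
    static String[] sort(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // Generate each key exactly once, all from the same Collator.
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }

        // Sorting compares the precomputed keys, not the strings.
        Arrays.sort(keys);

        // Recover the original strings in sorted order.
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] sorted = sort(new String[] { "cherry", "banana", "apple" }, Locale.US);
        System.out.println(Arrays.asList(sorted));  // prints "[apple, banana, cherry]"
    }
}
```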

Decomposition Modes

Decomposing characters is another way of saying preparing characters for sorting. Decomposing characters involves the four attributes discussed earlier: ordering, grouping, expanding, and ignoring characters. Decomposing characters is the process of actually applying a language's rules to a given set of characters.

When you are dealing with Unicode characters, there are three decomposition modes to consider:

No decomposition. Accented characters are not sorted correctly. You should use this mode only if you can guarantee that the source text has absolutely no accented characters. This mode is not recommended.

Canonical decomposition. Characters that are canonical variants under Unicode 2.0 (such as accents) are decomposed when collated. This is the default mode, and must be used if accents and other canonical variants are to be correctly collated. Canonical variants are letters such as "e" and "è", which differ only in that one has an accent mark.

Full decomposition. Both canonical variants and compatibility variants are decomposed. A compatibility variant is a character that has a special format to be sorted with its normalized form, such as half-width and full-width ASCII and Katakana characters.

NOTE: Although NO_DECOMPOSITION is the fastest decomposition mode, it is not correct if accented characters appear. For correctness, it is recommended that you use either CANONICAL_DECOMPOSITION or FULL_DECOMPOSITION. To set the decomposition mode for a collator, use the setDecomposition() method.
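One way to see what decomposition does is to compare two spellings of the same accented letter: a single precomposed character and a base letter followed by a combining accent. The sketch below is illustrative; the class and constant names are hypothetical, and the result under NO_DECOMPOSITION is printed rather than asserted because it depends on the collation rules in the JDK you run.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo of Collator.setDecomposition().
class DecompositionDemo {
    // The letter e with an acute accent, as one precomposed character.
    static final String PRECOMPOSED = "\u00E9";
    // The same letter spelled as a plain e plus a combining acute accent.
    static final String COMBINING = "e\u0301";

    // Compares the two spellings under the given decomposition mode.
    static int compareWith(int mode) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(mode);
        return collator.compare(PRECOMPOSED, COMBINING);
    }

    public static void main(String[] args) {
        System.out.println("NO_DECOMPOSITION:        "
            + compareWith(Collator.NO_DECOMPOSITION));
        // With canonical decomposition both spellings reduce to the same sequence.
        System.out.println("CANONICAL_DECOMPOSITION: "
            + compareWith(Collator.CANONICAL_DECOMPOSITION));
    }
}
```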

The RuleBasedCollator Class

The java.text.RuleBasedCollator class is a concrete class that provides a very simple Collator implementation using tables (hence the name). This class uses a set of collation rules to determine the result of comparisons. The rules can take on three different forms:

<modifier>

<relation> <text argument>

<reset> <text argument>

The definitions for each component of these rules are as follows:

Component Description

modifier There is only one modifier, "@", which indicates that all secondary differences are ordered in reverse.

relation There are four relations. The first three--"<", ";", and ","--mean "greater than" for primary, secondary, and tertiary differences, respectively. The fourth relation is "=", which means "equal."

reset There is only one reset, "&", which specifies that the next rule follows the position in which the reset argument would be sorted. The reset argument follows the "&", as in "a < b & a < c". In this case, "a" is the reset argument and "c" is placed in the list after text argument "a", yielding "a < c < b".

text argument Any sequence of characters, excluding "special" characters (whitespace characters and the characters used as modifiers, relations, and resets). To use a special character within a string, place it in single quotation marks (for example, '&').

Here are some simple examples of rules:

Example Comment

"a < b < c"

"a < c & a < b" Equivalent to "a < b < c"

"a < b & b < c" Equivalent to "a < b < c"

"- < a < b < c" '-' is ignorable because it precedes the first relation
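As a rough illustration of how a reset changes the resulting order, the sketch below builds two collators and compares single letters. The class name ResetRuleSketch and the helper sortsBefore() are hypothetical. The rule strings are written with a leading "<" because current JDK releases reject a rule string that starts with a bare text argument; that is an assumption about the JDK you run it on, not something the examples above require.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class ResetRuleSketch {
    // Hypothetical helper: builds a collator from the rules and reports
    // whether x sorts strictly before y under those rules.
    static boolean sortsBefore(String rules, String x, String y) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            return collator.compare(x, y) < 0;
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        String plain = "< a < b < c";
        String withReset = "< a < b & a < c";  // c is placed directly after a
        System.out.println("plain: b before c? " + sortsBefore(plain, "b", "c"));
        System.out.println("reset: c before b? " + sortsBefore(withReset, "c", "b"));
    }
}
```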

Here are some rules that create errors (and thus throw ParseException exceptions):

Example Comment

"a < b & c < d" c has not been put into the rules and thus cannot be used as a reset argument

"w < x y < z" There is no relation between x and y

"w <, x" There is no text argument between the relations < and ,
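A small sketch of detecting a bad rule string at construction time follows. It assumes a released JDK, in which the RuleBasedCollator constructor is declared to throw java.text.ParseException; the class name RuleErrorSketch and the helper isValid() are made up for this illustration, and the rule strings carry a leading "<" as current JDKs expect.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class RuleErrorSketch {
    // Hypothetical helper: true if the rules parse, false if the
    // RuleBasedCollator constructor rejects them.
    static boolean isValid(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // a is already in the rules, so it may be used as a reset argument
        System.out.println("< a < b & a < c : " + isValid("< a < b & a < c"));
        // c has not been placed yet, so resetting to it is an error
        System.out.println("< a < b & c < d : " + isValid("< a < b & c < d"));
    }
}
```

Checking rules this way keeps a malformed rule string from surfacing later as an unexplained sorting failure.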

NOTE: The RuleBasedCollator class has a few restrictions for efficiency:

French secondary ordering (the "@" modifier), if specified, applies to the entire collator object

Any unmentioned Unicode characters come at the end of the collation order

Private-use characters (that is, Unicode characters 0xe800 through 0xf8ff) are all treated as identical

These restrictions should be important only to advanced users of the RuleBasedCollator class; they are mentioned here for completeness.

Table 15.9 lists the important methods in the RuleBasedCollator class.
Table 15.9. Methods in the RuleBasedCollator class.

Method Description

getRules() Returns a string representation of the rules for the object

getCollationElementIterator() Gets a CollationElementIterator object for a given string under the RuleBasedCollator

compare() Compares two strings based on the rules in the RuleBasedCollator

getCollationKey() Gets a CollationKey for a given string under the rules for the RuleBasedCollator

To explain the use of the RuleBasedCollator class, we need a useful example. Many companies assign a "grade level" to their employees; this grade level consists of a letter followed by a number. Our fictional company uses this ordering system: B1, A1, A2, A3, B2, and B3. Additionally, for the A grades, a lowercase a represents someone who is in training for that grade but has not yet reached it. Thus, we define a tertiary difference between a lowercase a grade and an uppercase A grade. Listing 15.6 augments the basic United States RuleBasedCollator rules with our rules and sorts a list of grade levels using these rules (the code is also located on the CD-ROM that accompanies this book). Note that the Collator class treats each grade level as a single grouped character, not as a string of multiple characters.
Listing 15.6. RuleBasedCollatorExample.java: A sample RuleBasedCollator program.

import java.text.*;
import java.util.*;

public class RuleBasedCollatorExample {
    public static void main( String args[] ) {
        try {
            // make a collation with rules from the US
            RuleBasedCollator collUS = (RuleBasedCollator) Collator.getInstance(Locale.US);

            // provide the ordering for levels
            // no need to do the C's because they will have a primary difference
            String newRules = "< B1 < a1, A1 < a2, A2 < a3, A3 < B2 < B3";

            String sampleInput[] = { "B1", "a1", "A3", "A1", "B3", "B2", "a2", "A2", "B1" };

            // append our rules to the US rules
            RuleBasedCollator newColl = new RuleBasedCollator( collUS.getRules() + newRules );
            newColl.setStrength( Collator.TERTIARY );

            CollationKey keys[] = new CollationKey[sampleInput.length];

            // print the original list
            for( int i=0; i<sampleInput.length; i++ ) {
                System.out.print( sampleInput[i] + " " );
                keys[i] = newColl.getCollationKey( sampleInput[i] );
            }
            System.out.println("");

            // sort the list
            for( int i=0; i<sampleInput.length-1; i++ ) {
                for (int j=i+1; j<sampleInput.length; j++ ) {
                    if( keys[i].compareTo(keys[j]) > 0 ) {
                        CollationKey tmpkey = keys[i];
                        keys[i] = keys[j];
                        keys[j] = tmpkey;
                        String tmp = sampleInput[i];
                        sampleInput[i] = sampleInput[j];
                        sampleInput[j] = tmp;
                    }
                }
            }

            // print the sorted list
            for( int i=0; i<sampleInput.length; i++ ) {
                System.out.print( sampleInput[i] + " " );
            }
            System.out.println("");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output of the program is as follows:

B1 a1 A3 A1 B3 B2 a2 A2 B1
B1 B1 a1 A1 a2 A2 A3 B2 B3
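On JDK 1.2 and later, Collator also implements java.util.Comparator, so the hand-written sort loop can be handed to the library instead. The sketch below is one way to do that under stated assumptions: it builds the collator from the grade rules alone (without the US rules), and the names GradeSortSketch, GRADE_RULES, and sortGrades() are illustrative only.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

final class GradeSortSketch {
    // The grade-level rules used in this chapter's example.
    static final String GRADE_RULES = "< B1 < a1, A1 < a2, A2 < a3, A3 < B2 < B3";

    // Hypothetical helper: returns a sorted copy of the grades, using the
    // collator itself as the Comparator (requires JDK 1.2 or later).
    static String[] sortGrades(String[] grades) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(GRADE_RULES);
            String[] sorted = grades.clone();
            Arrays.sort(sorted, collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalStateException("grade rules failed to parse", e);
        }
    }

    public static void main(String[] args) {
        String[] input = { "B1", "a1", "A3", "A1", "B3", "B2", "a2", "A2", "B1" };
        System.out.println(Arrays.toString(sortGrades(input)));
    }
}
```

When the same strings are compared many times, precomputed CollationKey objects, as in the listing, can still be the faster choice.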

Iterators

There are two classes of iterators in the java.text package: the CollationElementIterator, which is used to iterate through each character of an international string; and StringCharacterIterator, which implements the CharacterIterator interface and is used for bidirectional iteration over a given string.
The CollationElementIterator Class

The java.text.CollationElementIterator class allows you to go through each character of an international string and return the ordering priority of the positioned character. The "key" of a character is an integer that comprises the primary, secondary, and tertiary orders for the character. The primary order is of type short (16 bits); the secondary and tertiary orders are of type byte (8 bits). This integer is formed internally to the iterator based on other characters in the string.

NOTE: Java strictly defines the size and sign of the types short and byte. To ensure that the key value for a given character is correct, the function primaryOrder() returns an int (not a short) because we need a true, unsigned, 16-bit number. Similarly, the functions secondaryOrder() and tertiaryOrder() return values of type short (and not byte).
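To make the bit layout concrete, here is a small sketch that packs three orders into one int and reads them back with the static accessor methods. The layout used by the hypothetical pack() helper (primary in the upper 16 bits, secondary in the next 8, tertiary in the low 8) is an assumption that matches the sizes described above; the class name OrderPackingSketch is likewise illustrative.

```java
import java.text.CollationElementIterator;

final class OrderPackingSketch {
    // Hypothetical helper: combines a 16-bit primary order with 8-bit
    // secondary and tertiary orders into a single int key.
    static int pack(int primary, int secondary, int tertiary) {
        return ((primary & 0xFFFF) << 16)
             | ((secondary & 0xFF) << 8)
             | (tertiary & 0xFF);
    }

    public static void main(String[] args) {
        int key = pack(98, 0, 1);
        System.out.println(CollationElementIterator.primaryOrder(key) + ","
            + CollationElementIterator.secondaryOrder(key) + ","
            + CollationElementIterator.tertiaryOrder(key));

        // The largest 16-bit primary order does not fit in a signed short,
        // which is why primaryOrder() returns an int.
        int largest = pack(0xFFFF, 0, 0);
        System.out.println(CollationElementIterator.primaryOrder(largest));
    }
}
```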

Table 15.10 lists the important methods in the CollationElementIterator class.
Table 15.10. Methods in the CollationElementIterator class.

Method Description

reset() Resets the iterator's marker to the beginning of the string

next() Gets the ordering key of the next character in the string

primaryOrder() Extracts the primary order from the given key

secondaryOrder() Extracts the secondary order from the given key

tertiaryOrder() Extracts the tertiary order from the given key

Listing 15.7 provides an example using the CollationElementIterator class. It is similar to Listing 15.6 in that it uses the same RuleBasedCollator object. As usual, you can find this code on the CD-ROM that accompanies this book.
Listing 15.7. CollationElementIteratorExample.java: A sample CollationElementIterator program.

import java.text.*;
import java.util.*;

public class CollationElementIteratorExample {
    public static void main( String args[] ) {
        try {
            // make a collation with rules from the US
            RuleBasedCollator collUS = (RuleBasedCollator) Collator.getInstance(Locale.US);

            // provide the ordering for levels
            String newRules = "< B1 < a1, A1 < a2, A2 < a3, A3 < B2 < B3";

            // sample list
            String sampleInput[] = { "B1a1A3A1B3B2a2A2B1" };

            RuleBasedCollator newColl = new RuleBasedCollator( collUS.getRules() + newRules );

            // step through the collation elements of each string
            for( int i=0; i<sampleInput.length; i++ ) {
                CollationElementIterator iter =
                    newColl.getCollationElementIterator( sampleInput[i] );
                int next;
                int count = 0;
                while( (next = iter.next()) != CollationElementIterator.NULLORDER ) {
                    int pri = CollationElementIterator.primaryOrder( next );
                    int sec = CollationElementIterator.secondaryOrder( next );
                    int ter = CollationElementIterator.tertiaryOrder( next );
                    System.out.println( "orderings for character " + count +
                        " of string " + i + " are " + pri + "," + sec + "," + ter );
                    count++;
                }
                System.out.println("");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output of this program is as follows:

orderings for character 0 of string 0 are 96,0,0
orderings for character 1 of string 0 are 97,0,0
orderings for character 2 of string 0 are 99,0,1
orderings for character 3 of string 0 are 97,0,1
orderings for character 4 of string 0 are 101,0,0
orderings for character 5 of string 0 are 100,0,0
orderings for character 6 of string 0 are 98,0,0
orderings for character 7 of string 0 are 98,0,1
orderings for character 8 of string 0 are 96,0,0

Notice that characters 6 and 7 have the same primary value (the value 98); this is because a2 and A2 are the same as far as primary comparisons go. However, there is a case difference between a2 and A2 as our rules are written; thus, there is a tertiary difference (the third value; character 6 has value 0 and character 7 has value 1).
The StringCharacterIterator Class

The StringCharacterIterator class implements the CharacterIterator interface for strings. The CharacterIterator interface specifies a protocol for bidirectional iteration over text, on a range of character positions bounded by getBeginIndex() and getEndIndex()-1.

Table 15.11 shows four methods from this class that allow you to manipulate and query the indices of this iterator.
Table 15.11. Methods from the StringCharacterIterator class.

Method Description

getBeginIndex() Retrieves the starting index for the given iterator

getEndIndex() Retrieves the ending index

getIndex() Retrieves the index of the character currently being used by the iterator

setIndex() Changes the current index and returns the character at the new index

Three methods allow you to retrieve the actual characters from the text stored in the iterator:

Method Description

current() Returns the character at the current index

previous() Decrements the index by 1 and returns the character at the new index

next() Increments the index by 1 and returns the character at the new index
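The sketch below shows one way these methods combine to walk a bounded range of a string in either direction. It relies on the four-argument StringCharacterIterator constructor (text, begin, end, position); the class name RangeIterationSketch and the helpers forward() and backward() are illustrative names only.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class RangeIterationSketch {
    // Collects the characters in [begin, end) of text, first to last.
    static String forward(String text, int begin, int end) {
        CharacterIterator it = new StringCharacterIterator(text, begin, end, begin);
        StringBuilder out = new StringBuilder();
        for (char c = it.setIndex(it.getBeginIndex());
             c != CharacterIterator.DONE;
             c = it.next()) {
            out.append(c);
        }
        return out.toString();
    }

    // Collects the characters in [begin, end) of text, last to first.
    static String backward(String text, int begin, int end) {
        if (begin == end) {
            return "";  // empty range: there is no last character to start from
        }
        CharacterIterator it = new StringCharacterIterator(text, begin, end, begin);
        StringBuilder out = new StringBuilder();
        for (char c = it.setIndex(it.getEndIndex() - 1);
             c != CharacterIterator.DONE;
             c = it.previous()) {
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "This is my string!";
        System.out.println(forward(text, 5, 10));
        System.out.println(backward(text, 5, 10));
    }
}
```

Both loops stop when the iterator returns CharacterIterator.DONE, which next() and previous() return once they run past either end of the range.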

With these seven methods, you can easily find a combination that enables you to iterate through any given string in any manner you choose. In Listing 15.9, we start at position 7 of a given string and iterate through the string in two directions: from position 7 to the end (forward) and from position 7 to the beginning (backward). Finally, we use the java.text.BreakIterator class to break up the string at each word. You can find this code on the CD-ROM that accompanies this book.
Listing 15.9. StringCharacterIteratorExample.java: A sample StringCharacterIterator program.

import java.text.*;
import java.util.*;

public class StringCharacterIteratorExample {
    public static void main( String args[] ) {
        try {
            String source = "This is my string! Hey there.";
            StringCharacterIterator iter = new StringCharacterIterator( source );
            int pos = 7;

            System.out.println( "Starting from position " + pos );

            System.out.print( "\t" );
            for (char c = iter.setIndex(pos);
                 c != CharacterIterator.DONE && iter.getIndex() <= iter.getEndIndex();
                 c = iter.next()) {
                System.out.print( c );
            }
            System.out.print( "\n" );

            System.out.print( "\t" );
            for (char c = iter.setIndex(pos);
                 c != CharacterIterator.DONE && iter.getIndex() >= iter.getBeginIndex();
                 c = iter.previous()) {
                System.out.print( c );
            }
            System.out.print( "\n" );

            BreakIterator bd = BreakIterator.getWordInstance();
            bd.setText( iter );
            int start = bd.first();
            for ( int end = bd.next();
                  end != BreakIterator.DONE;
                  start = end, end = bd.next() ) {
                System.out.println( source.substring(start, end) );
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The output of this program is as follows:

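Starting from position 7
         my string! Hey there.
         si sihT
This
 
is
 
my
 
string
!
 
Hey
 
there
.

The lines that look empty each contain a single space: the word BreakIterator reports the whitespace between words as a segment of its own.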
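If only the words themselves are wanted, the segments that start with whitespace or punctuation can be filtered out. The sketch below is one possible approach, not the only one; the class name WordListSketch and the helper words() are hypothetical, and the filter simply tests the first character of each segment with Character.isLetterOrDigit().

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordListSketch {
    // Hypothetical helper: returns the segments between word boundaries
    // that begin with a letter or digit, skipping spaces and punctuation.
    static List<String> words(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.US);
        boundary.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("This is my string! Hey there."));
    }
}
```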

Summary

Writing inherently international code is not easy; many large software houses spend a lot of money in their efforts to make localization simpler. What the JDK 1.1 provides in the java.text package is a huge first step toward helping all Java developers create clean, international code. Clean, international code, in turn, will help make Java the global language we all want it to be. As the Web grows, we can no longer assume anything about the demographics of those who use what we publish.

This chapter has given you some insight not only into how to use the package, but also into what it means to make a program international and localizable. As a developer, you must shift your thought processes when you write international code; as you do that, the ideas behind the java.text package will become clearer and you will fully understand how to use all the classes described in this chapter. Good luck! Viel Glück! Bonne chance! Chuk li ho wan! Kou-unn wo Negattemasu! ¡Buena suerte!

©Copyright, Macmillan Computer Publishing. All rights reserved.
