In the previous chapters of Part VI, you learned a variety of approaches to making user interfaces more lively and attractive by incorporating multimedia features. Some of the most interesting new technologies in user interface design allow users and computers to talk to each other. Speech recognition enables a computer to translate a user's speech into commands, data, and text, simplifying the interface between the user and the computer. Speech synthesis enables the computer to provide output to the user via the spoken word. Although these technologies have been available for a few years, they haven't yet been integrated into mainstream software applications. The Java Speech API, which is being developed by Sun and several other companies, will bridge this gap and make speech capabilities standard features in Java applications.
One of the most common devices that we use to speak and listen is the telephone. Mobile devices, such as the Nokia 9000i, are being developed that integrate computer and telephone capabilities. The Java Telephony API is designed to incorporate telephony features into Java applications. This API will let you place and answer calls from within a Java application, provide touch-tone navigation, and manage multiple telephone connections. A number of advanced telephony capabilities are also being planned.
In this chapter, you'll preview the Speech and Telephony APIs and learn about the capabilities they will provide. You'll learn how Java Speech will be used to add speech recognition and synthesis to your programs, and how Java Telephony will be used to develop sophisticated telephony applications. When you finish this chapter, you'll understand what these two important APIs can bring to your Java programs.
The Java Speech API provides the capability to incorporate speech technology (both input and output) into Java applets and applications. When it becomes available, it will support speech-based program navigation, speech-to-text translation, and speech synthesis. The Java Speech API is being developed by Sun in collaboration with IBM, AT&T, Texas Instruments, Philips, Apple, and other companies. At the time of this writing, preliminary specifications had been developed for the Java Speech Grammar Format (JSGF) and the Java Speech Markup Language (JSML), both of which are covered later in this chapter.
These specifications are available at the Java Speech Web site, located at http://java.sun.com:80/products/java-media/speech/index.html.
The Speech API consists of three packages: javax.speech, which provides the classes and interfaces common to all speech engines; javax.speech.recognition, which supports speech recognition; and javax.speech.synthesis, which supports speech synthesis.
The javax.speech package provides the classes and interfaces that are shared by speech recognition and speech synthesis, including the Engine interface, which represents a generic speech engine, and the Central class, which is used to create speech engines.
The following sections cover the javax.speech.recognition and javax.speech.synthesis packages.
Speech recognition allows computers to listen to a user's speech and determine what the user has said. It can range from simple, discrete command recognition to continuous speech translation. Although speech recognition has made much progress over the last few years, most recognition systems still make frequent errors. These errors can be reduced by using better microphones, reducing background noise, and constraining the speech recognition task. Speech recognition constraints are implemented in terms of grammars that limit the variety in user input. The JSGF provides the capability to specify rule grammars, which are used for speech recognition systems that are command- and control-oriented. These systems only recognize speech as it pertains to program operation and do not support general dictation capabilities.
Even with the constraints posed by grammars, errors still occur and must be corrected. Almost all applications that employ speech recognition must provide error-correction facilities.
Speech recognition is supported by the javax.speech.recognition package, which consists of 15 interfaces and 19 classes. These classes and interfaces make up four major groups: Recognizer, Grammar, Rule, and Result.
The Recognizer interface extends the Engine interface to provide access to a speech recognition engine. RecognizerAttributes and RecognizerModeDesc are used to access the attributes and operational modes of the Recognizer. Recognizer objects generate RecognizerEvent objects as they change state during speech processing. The RecognizerListener interface defines methods for handling these events, and the RecognizerAdapter class provides a default implementation of this interface. The AudioLevelEvent is generated as a result of a change in the audio level of a Recognizer. The RecognizerAudioListener interface defines methods for handling this event, and the RecognizerAudioAdapter class provides a default implementation of that interface.
The Grammar interface provides methods for handling the grammars used by a Recognizer. It is extended by RuleGrammar and DictationGrammar, which support rule grammars and dictation grammars, respectively. The GrammarSyntaxDetail class is used to identify errors in a grammar. The GrammarEvent class is used to signal the generation of a Result object that matches a Grammar. The GrammarListener interface defines methods for handling this event, and the GrammarAdapter class provides a default implementation of this interface.
The Rule class encapsulates rules that are used with a RuleGrammar. It is extended by RuleAlternatives, RuleCount, RuleName, RuleParse, RuleSequence, RuleTag, and RuleToken, which specify different aspects of grammar rules.
The Result interface provides access to the recognition results generated by a Recognizer. The FinalResult interface is used for results that have been finalized (accepted or rejected). It is extended by FinalRuleResult and FinalDictationResult to provide additional information for RuleGrammar and DictationGrammar objects. The ResultToken interface provides access to a single word of a Result. The ResultEvent class is used to signal the status of results that are generated by a Recognizer. It is handled via the ResultListener interface and the default ResultAdapter class. The GrammarResultAdapter class is used to handle both ResultEvent and GrammarEvent objects.
The SpeakerManager interface is used to manage speaker profiles for a Recognizer.
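Putting these pieces together, here is a sketch of how a simple command-and-control recognizer might be assembled once the API is available. Because the Speech API was still under development at the time of this writing, any method names that do not appear in this chapter (such as createRecognizer(), loadJSGF(), setEnabled(), commitChanges(), requestFocus(), and resume()) are taken from the draft specifications and should be treated as assumptions that may change.

import java.io.StringReader;
import javax.speech.Central;
import javax.speech.recognition.*;

public class ColorCommander {
  public static void main(String[] args) throws Exception {
    // Create and start a recognizer. Passing null is assumed to select
    // a default recognition engine, paralleling createSynthesizer().
    Recognizer rec = Central.createRecognizer(null);
    rec.allocate();

    // A tiny JSGF rule grammar that constrains what the user can say.
    String jsgf =
      "#JSGF V1.0;\n" +
      "grammar commands;\n" +
      "public <command> = (open | close) [the] (file | window);";
    RuleGrammar grammar = rec.loadJSGF(new StringReader(jsgf));
    grammar.setEnabled(true);

    // Report each accepted result; ResultAdapter supplies empty
    // implementations of the other ResultListener methods.
    rec.addResultListener(new ResultAdapter() {
      public void resultAccepted(ResultEvent e) {
        Result result = (Result) e.getSource();
        ResultToken[] tokens = result.getBestTokens();
        StringBuffer heard = new StringBuffer();
        for (int i = 0; i < tokens.length; i++)
          heard.append(tokens[i].getSpokenText()).append(' ');
        System.out.println("Heard: " + heard);
      }
    });

    // Commit the grammar change and start listening.
    rec.commitChanges();
    rec.requestFocus();
    rec.resume();
  }
}

Because the grammar limits the recognizer to a handful of phrases, such as "open the file" or "close window," recognition errors are far less likely than they would be with unconstrained dictation.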
Speech synthesis is the opposite of speech recognition. It allows computers to generate spoken output to users. It can take the form of bulk text-to-speech translation, or of intricate speech-based responses that are integrated into an application's interface.
Speech synthesis systems must satisfy the two main requirements of understandability and naturalness. Understandability is improved by providing adequate pronunciation information to speech generators. This eliminates "guesses" on the part of the speech synthesizer. JSML is used to provide pronunciation information, as required. Naturalness is improved by using a non-mechanical voice and managing emphasis, intonation, phrasing, and pausing. JSML also provides markup capabilities that control these speech attributes.
When you're synthesizing speech, it is often desirable to select attributes of the voice that is generated. For example, you might want to choose between male and female voices or old and young voices. The Speech API provides control over these features. In addition, text that is to be synthesized can be marked up with event markers that cause events to be generated as they are processed. Event handlers can be designed to manipulate graphical interface components in synchronization with speech synthesis. For example, you can design a speaker's face that changes facial expressions as it "talks."
The flexibility of the synthesis component of the Speech API is provided by JSML. JSML, like HTML, is an SGML-based markup language. JSML allows text to be marked up with synthesis-related information such as the structure of the text (its paragraphs and sentences), the pronunciation of words and phrases, emphasis, pauses, prosody attributes such as speaking rate and volume, and markers that generate events as the text is spoken.
These capabilities may not have your computer reading poetry, but they will allow you to greatly enhance any speech that it generates. Listing 23.1 provides an example of a JSML file.
<?XML version="1.0" encoding="UCS-2"?> <JSML> <PARA><SENT>This is the <EMP>first</EMP> sentence of the first paragraph.</SENT> <SENT>This is the second sentence.</SENT><BREAK SIZE = "large"/><SENT>This is the <EMP>last</EMP> sentence of this paragraph.</SENT></PARA> <PARA><PROS RATE="+10%" VOL=".9"><SENT>This is the second paragraph.</SENT></PARA>
</JSML>
The first line identifies the file as being XML version 1.0. The <JSML> and </JSML> tags surround the JSML markup. Within these tags are two paragraphs marked by the <PARA> and </PARA> tags. The first paragraph consists of three sentences marked by <SENT> and </SENT>. The second paragraph has a single sentence.
The word first is surrounded by <EMP> and </EMP>. This signifies that the word "first" should be emphasized. The <BREAK SIZE="large"/> tag specifies that a long pause should occur between the second and third sentences.
In the second paragraph, the <PROS RATE="+10%" VOL=".9"> tag is an example of a prosody tag. Prosody tags control the timing, intonation, and phrasing of speech. The RATE attribute specifies that the speech rate should be increased by 10%. The VOL attribute specifies that the volume of speech should be set at 90% of its maximum.
The example JSML file illustrates the use of JSML tags. However, the markup language is much richer than indicated by the example. For more information on JSML, download the JSML specification from the Java Speech Web site.
JSML is a subset of the eXtensible Markup Language (XML), which is a subset of the Standard Generalized Markup Language (SGML). JSML looks like HTML with different tags. (HTML is also a subset of SGML.) Figure 23.1 shows the relationship between JSML, XML, HTML, and SGML.
FIGURE 23.1. The relationship between JSML and other markup languages.
The Java Speech API supports speech generation via the javax.speech.synthesis package. This package provides five interfaces and six classes, including the Synthesizer interface and the SynthesizerModeDesc class used in the steps described next.
To use the synthesis package, invoke the createSynthesizer() method of the Central class to create a Synthesizer object. Pass an argument of the SynthesizerModeDesc class to createSynthesizer() to specify the Synthesizer mode. Invoke the allocate() method of the Synthesizer (inherited from Engine) to start up the Synthesizer's engine. After that, use the speak() and speakPlainText() methods to put text in the Synthesizer's input queue.
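The following minimal sketch puts these steps together, assuming an English-language synthesizer is installed. The Locale argument to SynthesizerModeDesc, the second (listener) argument to the speak methods, and the resume(), waitEngineState(), and deallocate() calls are drawn from the draft specification and may change before the API is released.

import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.*;

public class HelloSpeaker {
  public static void main(String[] args) throws Exception {
    // Create a synthesizer for English (the Locale argument is an
    // assumption based on the draft SynthesizerModeDesc class).
    Synthesizer synth =
        Central.createSynthesizer(new SynthesizerModeDesc(Locale.ENGLISH));

    // Start the synthesizer's engine and make sure it is not paused.
    synth.allocate();
    synth.resume();

    // Queue plain text and JSML-marked-up text for output. The second
    // argument is a listener for synthesis events; null means no
    // notification is needed.
    synth.speakPlainText("Hello from the Java Speech API.", null);
    synth.speak("This is the <EMP>important</EMP> part.", null);

    // Wait until the output queue is empty, then release the engine.
    synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
    synth.deallocate();
  }
}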
The Java Telephony API (JTAPI) is a set of APIs that provide telephony capabilities for Java applications. It supports basic telephony capabilities, such as call placement and call answering, and advanced capabilities, such as call centers and media streams. JTAPI provides both direct control over telephony resources and indirect access through networked resources. This means that you can create server applications that provide telephony resources over a network, and client applications that use these resources.
The JTAPI consists of 18 packages, covering everything from basic call placement and answering to call centers and media streams.
Although the number of packages provided with JTAPI may seem daunting, basic telephony applications are constructed using a few common elements of the javax.telephony package. The Terminal interface provides access to a physical hardware device at the endpoint of a telephone connection; for example, a telephone set is accessed as a Terminal object. The TerminalConnection interface provides physical access to a telephone connection, while the Connection interface models a logical connection between a Call object and an Address object. An instance of the Call interface represents an actual telephone call. The Address interface represents a telephone number or, in Internet telephony applications, an IP address combined with other endpoint information. The Provider interface is used to access the software of a telephony service provider.
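As a simple illustration, the following sketch places an outgoing call using only the core interfaces just described. The JtapiPeer and JtapiPeerFactory classes, the getProvider(), getAddress(), getTerminal(), createCall(), and connect() methods, and the provider and telephone number strings are assumptions based on the draft JTAPI specification and a hypothetical service provider; the details will vary with the telephony implementation you use.

import javax.telephony.*;

public class Dialer {
  public static void main(String[] args) throws Exception {
    // Obtain the default JtapiPeer (the platform's JTAPI implementation)
    // and ask it for a Provider. The provider string is implementation
    // dependent; "DefaultProvider" is just a placeholder here.
    JtapiPeer peer = JtapiPeerFactory.getJtapiPeer(null);
    Provider provider = peer.getProvider("DefaultProvider");

    // Look up the local endpoint: the Address (telephone number) and the
    // Terminal (physical phone) that will originate the call. The numbers
    // used here are placeholders.
    Address origAddress = provider.getAddress("5551000");
    Terminal origTerminal = provider.getTerminal("5551000");

    // Create an idle Call object and connect it to the destination
    // number. connect() returns the Connection objects that model the
    // originating and destination ends of the call.
    Call call = provider.createCall();
    Connection[] connections =
        call.connect(origTerminal, origAddress, "5551234");

    System.out.println("Placed a call with " + connections.length +
        " connections.");
  }
}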
The JTAPI home page, located at http://java.sun.com/products/jtapi/index.html, provides information about the current status of the JTAPI project.
In this chapter, you were introduced to the Speech and Telephony APIs and learned about their planned capabilities. You learned how Java Speech will be used to add speech recognition and synthesis to Java programs and how Java Telephony will be used to develop sophisticated telephony applications. This was the last chapter of Part VI. In Part VII, you'll learn how to develop component-based Java software using JavaBeans.