Ch 25 -- Developing Content and Protocol Handlers

Java 1.1 Unleashed

- 25 -
Developing Content and Protocol Handlers

by Mike Fletcher revised by Stephen Ingram

IN THIS CHAPTER

What Are Protocol and Content Handlers?
Creating a Protocol Handler
Creating a Content Handler

Java's URL class gives applets and applications easy access to the World Wide Web using the HTTP protocol. This is fine and dandy if you can get the information you need into a format that a Web server or CGI script can access. However, wouldn't it be nice if your code could talk directly to the server application without going through an intermediary CGI script or some sort of proxy? Wouldn't you like your Java-based Web browser to be able to display your wonderful new image format? This is where protocol and content handlers come in.
What Are Protocol and Content Handlers?

Handlers are classes that extend the capabilities of the standard URL class. The URL class actually hides a complex web of classes that facilitate extensible support for any number of protocols and content types.

The entire protocol scheme is implemented by a tandem of classes: URLStreamHandler and URLConnection. When you create a URL object, the protocol name is parsed out of the string and used to create a URLStreamHandler descendant. By itself, URLStreamHandler is rather unimpressive. What it provides is a layer of abstraction that isolates a protocol's implementation from the search and load architecture of the URL class. Although URLStreamHandler acts much like an interface, because it is a class and not an interface, it can be represented by a physical class file and is thus eligible for dynamic loading. URLStreamHandler provides openConnection()--the bridge method to jump from a standard URL to the implementation. What is returned is the class that performs the actual protocol processing: a descendant of URLConnection. If you examine the URL class methods getContent() and openStream(), you find a two-method sequence that first creates and then calls a URLConnection:

return handler.openConnection().getContent();      // URL code for getContent()
return handler.openConnection().getInputStream();  // URL code for openStream()

In this way, the URLConnection class performs all the protocol-specific work for the URL class. Implementing a protocol handler actually involves implementing both URLStreamHandler and URLConnection descendants. The only class that is aware of the existence of your descendant URLConnection class is your specific URLStreamHandler object.
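The delegation described above can be tried without touching the CLASSPATH by handing a handler directly to the URL constructor. The following is a minimal sketch, not Sun's implementation: the echo protocol and the EchoConnection, EchoHandler, and HandlerDelegationDemo names are invented for illustration, and the code assumes a JDK newer than 1.1 (it uses annotations and UncheckedIOException).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical connection: "retrieves" a resource by echoing the URL path.
final class EchoConnection extends URLConnection {
    EchoConnection(URL u) {
        super(u);
    }

    @Override
    public void connect() {
        connected = true; // nothing to open for this toy protocol
    }

    @Override
    public InputStream getInputStream() {
        return new ByteArrayInputStream(("echo:" + url.getPath()).getBytes());
    }
}

// Hypothetical stream handler: its only job is to hand back the connection.
final class EchoHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) {
        return new EchoConnection(u);
    }
}

final class HandlerDelegationDemo {
    // Reads the whole resource through URL.openStream(), which asks the
    // handler for a connection and then asks the connection for its stream.
    static String fetch(String spec) {
        try {
            URL url = new URL(null, spec, new EchoHandler());
            InputStream in = url.openStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetch("echo://example.org/alice"));
    }
}
```

Note that the URL object never refers to EchoConnection by name; only the handler knows which URLConnection descendant it creates.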

Content handlers work in a way similar to protocol handlers, but because content handlers interpret streams of input into a single Java object (String or Image and so on), all the processing can be isolated to a single descendant of the ContentHandler class. There is no need for the extra layer of abstraction that protocol handlers implement. Protocol handlers allow two-stage interaction--before and after connection. Content handlers have only a single direct access method (getContent()) and so can implement all their processing directly in the handler.

The URL object cannot parse the content type from the input string. Instead, it has to wait until a protocol handler executes and extracts the content type from the resulting stream. How are content types encoded in the data stream? They are represented as MIME types.
MIME Types

MIME (Multipurpose Internet Mail Extensions) is the Internet standard for specifying the type of content a resource contains. As you may have guessed from the name, MIME was originally proposed for the context of enclosing nontextual components in Internet e-mail. MIME allows different platforms (PCs, Macintoshes, UNIX workstations, and others) to exchange multimedia content in a common format.

The MIME standard, described in RFC 1521, defines an extra set of headers similar to those on Internet e-mail. The headers describe attributes such as the method of encoding the content and the MIME content type. MIME types are written as type/subtype, where type is a general category such as text or image and subtype is a more specific description of the format such as html or jpeg. For example, when a Web browser contacts an HTTP daemon to retrieve an HTML file, the daemon's response looks something like this:

Content-type: text/html

<HEAD><TITLE>Document moved</TITLE></HEAD>
<BODY><H1>Document moved</H1>

The Web browser parses the Content-type: header and sees that the data is text/html--an HTML document. If it was a GIF image file, the header would have been Content-type: image/gif.
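To make the type/subtype structure concrete, here is a small sketch of how a client might split a Content-type value into its two parts. The MimeTypeSketch class is hypothetical and deliberately simple: it lowercases the value and ignores any parameters after a semicolon (such as charset).

```java
import java.util.Locale;

final class MimeTypeSketch {
    // Splits a Content-type value such as "text/html; charset=us-ascii"
    // into { type, subtype }. Parameters after ';' are ignored.
    static String[] parse(String headerValue) {
        int semi = headerValue.indexOf(';');
        String essence = (semi >= 0 ? headerValue.substring(0, semi) : headerValue)
                .trim().toLowerCase(Locale.ROOT);
        int slash = essence.indexOf('/');
        if (slash <= 0 || slash == essence.length() - 1) {
            throw new IllegalArgumentException("not a type/subtype pair: " + headerValue);
        }
        return new String[] { essence.substring(0, slash), essence.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("image/gif");
        System.out.println("type=" + parts[0] + ", subtype=" + parts[1]);
    }
}
```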

IANA (Internet Assigned Numbers Authority), the group that maintains the lists of assigned protocol numbers and the like, is responsible for registering new content types. A current copy of the official MIME types is available from ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/. This site also has specifications or pointers to specifications for each type.
Getting Java to Load New Handlers

The exact procedure for loading a protocol or content handler depends on the Java implementation. The following instructions are based on Sun's Java Development Kit and should work for any implementation derived from Sun's. If you have problems, check the documentation for your particular version of Java.

In the JDK implementation, the URL class and helpers look for classes in the sun.net.www package. Protocol handlers should be in a package called sun.net.www.protocol.ProtocolName, where ProtocolName is the name of the protocol (such as ftp or http). The handler class itself should be named Handler. For example, the full name of the HTTP protocol handler class, provided by Sun with the JDK, is sun.net.www.protocol.http.Handler. To load your new protocol handler, you must construct a directory structure corresponding to the package names and add the directory to your CLASSPATH environment variable. Assume that you have a handler for a protocol--let's call it the foo protocol--and that your Java library directory is .../java/lib/ (...\java\lib\ on Windows machines). You must take the following steps to load the foo protocol:

1. Create the directories .../java/lib/sun, .../java/lib/sun/net, and so on. The last directory should be named like this: .../java/lib/sun/net/www/protocol/foo
2. Place your Handler.java file in the last directory. Name it like this: .../java/lib/sun/net/www/protocol/foo/Handler.java

3. Compile the Handler.java file.

If you place the netClass.zip file containing the network classes (located on the CD-ROM that accompanies this book) in your CLASSPATH, the sample handlers should load correctly.
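The naming convention can be summarized in a few lines of code. This sketch only builds the names; it does not load anything. The HandlerNames class is hypothetical, and the content handler rule (slash becomes a period, other non-alphanumeric characters become underscores) is inferred from the tab_separated_values example later in this chapter.

```java
final class HandlerNames {
    // Protocol handlers live at sun.net.www.protocol.<protocol>.Handler
    static String protocolHandlerClass(String protocol) {
        return "sun.net.www.protocol." + protocol + ".Handler";
    }

    // Content handlers live at sun.net.www.content.<type>.<subtype>, with
    // characters that are not legal in a class name replaced by '_'.
    static String contentHandlerClass(String mimeType) {
        StringBuilder sb = new StringBuilder("sun.net.www.content.");
        for (char c : mimeType.toCharArray()) {
            if (c == '/') {
                sb.append('.');
            } else if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            } else {
                sb.append('_');
            }
        }
        return sb.toString();
    }

    // Path of the compiled class, relative to a CLASSPATH directory.
    static String classFilePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        String name = protocolHandlerClass("foo");
        System.out.println(name + " -> " + classFilePath(name));
    }
}
```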
Creating a Protocol Handler

Let's start extending Java with a handler for the finger protocol. The finger protocol is defined in RFC 1288, which supersedes the original RFC 742. The server listens on TCP port 79; it expects either the user name for which you want information followed by ASCII carriage return and linefeed characters, or (if you want information for all users currently logged in) just the carriage return and linefeed characters. The information is returned as ASCII text in a system-dependent format (although most UNIX variants give similar information). We will use an existing class (fingerClient) to handle contacting the finger server and concentrate on developing the protocol handler.
Design

The first decision we must make is how to structure URLs for our protocol. We'll imitate the HTTP URL and specify that finger URLs should be of the following format:

finger://host/user

In this syntax, host is the host to contact, and user is an optional user to ask for information about. If the user name is omitted, we will return information about all users.
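As a rough illustration of this URL format, the sketch below extracts the host and user and builds the request line that a finger server expects. It uses java.net.URI rather than java.net.URL, because URL rejects the finger scheme until a handler is installed. The FingerUrlSketch class and its method names are assumptions made for this example.

```java
import java.net.URI;

final class FingerUrlSketch {
    // Host to contact; an omitted host defaults to localhost.
    static String host(URI u) {
        String h = u.getHost();
        return (h == null || h.isEmpty()) ? "localhost" : h;
    }

    // User to ask about; an empty string means "all users".
    static String user(URI u) {
        String p = u.getPath();
        return (p == null || p.length() <= 1) ? "" : p.substring(1);
    }

    // The finger request: the user name followed by CR LF.
    static String requestLine(String user) {
        return user + "\r\n";
    }

    public static void main(String[] args) {
        URI u = URI.create("finger://example.org/alice");
        System.out.println(host(u) + " " + user(u));
    }
}
```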

Because Sun already provides a fingerClient class (sun.net.www.protocol.finger.fingerClient), we will rely on it to do the actual implementation of the finger protocol. We have to write the subclasses to URLStreamHandler and URLConnection. Our stream handler will use the client object to format the returned information using HTML. The handler will write the content into a StringBuffer object, which will be used to create a StringBufferInputStream. The fingerConnection--a subclass of URLConnection--will take this stream and implement the getInputStream() and getContent() methods.

In our implementation, the protocol stream handler object does all the work of retrieving the remote content; the connection object simply retrieves the data from the stream provided. Usually, the connection object handler would retrieve the content. The openConnection() method would open a connection to the remote location, and the getInputStream() method would return a stream to read the contents. In our case, the protocol is very simple (compared to something as complex as FTP or HTTP), and we can handle everything in the URLStreamHandler descendant.
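For contrast, here is a sketch of the more usual division of labor, in which the connection object opens the socket itself and only when it is first needed. This is not the design used in this chapter, and it does not use Sun's fingerClient; the LazyFingerConnection, LazyFingerHandler, and LazyFingerDemo names are hypothetical, and the code assumes a JDK newer than 1.1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

// Hypothetical connection that contacts the finger server on demand.
final class LazyFingerConnection extends URLConnection {
    private Socket socket;

    LazyFingerConnection(URL u) {
        super(u);
    }

    @Override
    public void connect() throws IOException {
        if (connected) {
            return;
        }
        String host = url.getHost().isEmpty() ? "localhost" : url.getHost();
        int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
        String path = url.getPath();
        String user = (path.length() > 1) ? path.substring(1) : "";
        socket = new Socket(host, port);
        OutputStream out = socket.getOutputStream();
        out.write((user + "\r\n").getBytes(StandardCharsets.US_ASCII));
        out.flush();
        connected = true;
    }

    // Closing the returned stream also closes the underlying socket.
    @Override
    public InputStream getInputStream() throws IOException {
        connect();
        return socket.getInputStream();
    }
}

// The handler creates the connection but performs no network I/O.
final class LazyFingerHandler extends URLStreamHandler {
    @Override
    protected int getDefaultPort() {
        return 79;
    }

    @Override
    protected URLConnection openConnection(URL u) {
        return new LazyFingerConnection(u);
    }
}

final class LazyFingerDemo {
    // Builds a connection object; nothing is sent until connect() is called.
    static URLConnection open(String spec) {
        try {
            return new URL(null, spec, new LazyFingerHandler()).openConnection();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this arrangement, a caller can create the connection and inspect or configure it, and the network round trip happens only in connect() or getInputStream().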
The fingerConnection Source

The source for the fingerConnection class should go in the same file as the Handler class. The constructor stores the InputStream it is passed and calls the URLConnection constructor. It also sets a URLConnection flag to indicate that the connection is read-only; that is, callers cannot write output to it. Listing 25.1 contains the source for this class.
Listing 25.1. The fingerConnection class.

class fingerConnection extends URLConnection {
    InputStream in;

    fingerConnection( URL u, InputStream in ) {
        super( u );
        this.in = in;
        // The handler has already fetched the data; callers can only read it.
        this.setDoOutput( false );
    }

    // Nothing to do: the stream handler did the work before this object existed.
    public void connect( ) {
        return;
    }

    public InputStream getInputStream( ) throws IOException {
        return in;
    }

    // Drain the stream into a single String.
    public Object getContent( ) throws IOException {
        String retval = "";   // must be initialized before the += below
        int nbytes;
        byte buf[] = new byte[ 1024 ];
        try {
            while( (nbytes = in.read( buf, 0, 1024 )) != -1 ) {
                retval += new String( buf, 0, nbytes );
            }
        } catch( Exception e ) {
            System.err.println( "fingerConnection::getContent: Exception\n" + e );
            e.printStackTrace( System.err );
        }
        return retval;
    }
}

NOTE: URLConnections normally go through a two-stage existence. First they are created, then they are connected. Separating the two stages allows a user to interact with the created object to specify input options such as request methods, cache usage, and the like. Once connected, these options can no longer be altered. The fingerConnection class was connected at creation time because of the work performed by the stream handler. Other URLConnection descendants may not operate the same way and so may allow user interaction between the times they are created and connected.
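The two-stage life cycle can be observed with any URLConnection descendant. In the sketch below, an option set before connect() is accepted, and the same setter throws IllegalStateException afterward. The TwoStageConnection and TwoStageDemo names and the demo scheme are invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical connection whose connect() only flips the "connected" flag.
final class TwoStageConnection extends URLConnection {
    TwoStageConnection(URL u) {
        super(u);
    }

    @Override
    public void connect() {
        connected = true;
    }
}

final class TwoStageDemo {
    // Returns true if changing an option after connect() is rejected.
    static boolean rejectsLateOption() {
        try {
            URL u = new URL(null, "demo://host/", new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL url) {
                    return new TwoStageConnection(url);
                }
            });
            URLConnection con = u.openConnection();
            con.setUseCaches(false);      // stage 1: created, still configurable
            con.connect();                // stage 2: connected
            try {
                con.setUseCaches(true);   // too late
                return false;
            } catch (IllegalStateException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("late option rejected: " + rejectsLateOption());
    }
}
```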

Handler Source

Let's rough out the skeleton of the Handler.java file. We need the package statement so that our classes are compiled into the package where the runtime handler will look for them. We also import the fingerClient object here. The outline of the class is shown in Listing 25.2.
Listing 25.2. The protocol handler skeleton.

package sun.net.www.protocol.finger;

import java.io.*;
import java.net.*;
import sun.net.www.protocol.finger.fingerClient;

// fingerConnection source goes here

public class Handler extends URLStreamHandler {
    // openConnection() method
}

NOTE: Because the fingerConnection class appears with default visibility within the handler file, the URLStreamHandler descendant is the only class that has any knowledge of or access to the implementation. All access outside of the handler occurs through virtual methods of URLConnection--fingerConnection's parent class. This is not a requirement; the fingerConnection class could just as easily have existed as an external public class. Often, external existence is necessary to allow specific user input alterations that are not part of the URLConnection base class.

The openConnection() Method

Now let's develop the method responsible for returning an appropriate URLConnection object to retrieve a given URL. The method starts out by allocating a StringBuffer to hold our return data. We also will parse out the host name and user name from the URL argument. If the host was omitted, we default to localhost. The code for openConnection() is given in Listings 25.3 through 25.6.
Listing 25.3. The openConnection() method: Parsing the URL.

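public synchronized URLConnection openConnection( URL u ) {
    StringBuffer sb = new StringBuffer( );
    String host = u.getHost( );
    // getFile() returns "/user", or "" or "/" when no user is given;
    // guard the substring so that finger://host does not throw.
    String file = u.getFile( );
    String user = (file.length( ) > 1) ? file.substring( 1 ) : "";
    if( host.equals( "" ) ) {
        host = "localhost";
    }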

Notice how the handler relies on the URL class for parsing. Other than its function as a gateway to the handlers, parsing is the main feature of the URL class.

Next, the method writes an HTML header into the buffer (see Listing 25.4). This enables a Java-based Web browser to display the finger information in a nice-looking format.
Listing 25.4. The openConnection() method: Writing the HTML header.

    sb.append( "<HTML><head>\n");
    sb.append( "<title>Fingering " );
    sb.append( (user.equals("") ? "everyone" : user) );
    sb.append( "@" + host );
    sb.append( "</title></head>\n" );
    sb.append( "<body>\n" );
    sb.append( "<pre>\n" );

Now we'll use Sun's fingerClient class to get the information into a String and then append it to our buffer. If there is an error while getting the finger information, we will put the error message from the exception into the buffer instead (see Listing 25.5).
Listing 25.5. The openConnection() method: Retrieving the finger information.

    try {
        String info = null;
        info = (new fingerClient( host, user )).getInfo( );
        sb.append( info );
    } catch( Exception e ) {
        sb.append( "Error fingering: " + e );
    }

Finally, we'll close all the open HTML tags and create a fingerConnection object that will be returned to the caller (see Listing 25.6).
Listing 25.6. The openConnection() method: Finishing the HTML and returning a fingerConnection object.

    sb.append( "\n</pre></body>\n</html>\n" );
    return new fingerConnection( u,
        (new StringBufferInputStream( sb.toString( ) ) ) );
}

Using the Handler

Once all the code is compiled and in the right locations, load the urlFetcher applet provided on the CD-ROM that accompanies this book and enter a finger URL. If everything loads right, you should see something like Figure 25.1. If you get an error with a message such as BAD URL "finger://...": unknown protocol, check that you have your CLASSPATH set correctly.

Figure 25.1.

The urlFetcher applet displaying a finger URL.

Creating a Content Handler

The content handler example presented in this section is for the MIME type text/tab-separated-values. If you have ever used a spreadsheet or database program, this type will be familiar. Many applications can import and export data in an ASCII text file, where each column of data in a row is separated by a tab character (\t). The first line is interpreted as the names of the fields, and the remaining lines are the actual data.
Design

Our first design decision is to figure out what type of Java object or objects to use to map the tab-separated values. Because this is textual content, some sort of String object would seem to be the best solution. The spreadsheet characteristics of rows and columns of data can be represented by arrays. Putting these two facts together gives us a data type of String[][], or an array of arrays of String objects. The first array is an array of String[] objects, each representing one row of data. Each of these arrays consists of a String for each cell of the data.

Because we also require some way of breaking the input stream into separate fields, we'll make a subclass of java.io.StreamTokenizer to handle this task. The StreamTokenizer class provides methods for breaking an InputStream into individual tokens.
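Before building the tokenizer, it may help to see the target data structure produced by the simplest possible means. The sketch below is not the approach this chapter takes: it swaps StreamTokenizer for BufferedReader.readLine() and String.split(), purely to show what a String[][] of rows and cells looks like. The TsvSplitSketch class is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TsvSplitSketch {
    // Returns one String[] per line; row 0 holds the field names.
    static String[][] parse(String text) {
        List<String[]> rows = new ArrayList<String[]>();
        try {
            BufferedReader reader = new BufferedReader(new StringReader(text));
            String line;
            while ((line = reader.readLine()) != null) {
                // A limit of -1 keeps trailing empty cells.
                rows.add(line.split("\t", -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String[][] table = parse("name\tage\nalice\t30\n");
        System.out.println(table.length + " rows, first field: " + table[0][0]);
    }
}
```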
Content Handler Skeleton

Content handlers are implemented by subclassing the java.net.ContentHandler class. These subclasses are responsible for implementing a getContent() method. We'll start with the skeleton of the class and then import the networking and I/O packages as well as the java.util.Vector class. We will also define the skeleton for our tabStreamTokenizer class. Listing 25.7 shows the skeleton for this content handler.
Listing 25.7. Content handler skeleton.

/*
 * Handler for text/tab-separated-values MIME type.
 */
// This needs to go in this package for JDK-derived
// Java implementations
package sun.net.www.content.text;

// All imports must come before the first class declaration
import java.net.*;
import java.io.*;
import java.util.Vector;

class tabStreamTokenizer extends StreamTokenizer {
    public static final int TT_TAB = '\t';
    // Constructor
}

public class tab_separated_values extends ContentHandler {
    // getContent method
}

The tabStreamTokenizer Class

Let's first define the class that breaks the input into separate fields. Most of the functionality we require is provided by the StreamTokenizer class, so we only have to define a constructor that specifies the character classes needed to get the behavior we want. For the purposes of this content handler, there are three types of tokens: TT_WORD tokens, which represent fields; TT_EOL tokens, which signal the end of a line (that is, the end of a row of data); and TT_EOF tokens, which signal the end of the input file. The tab character (TT_TAB) is treated as whitespace, so it separates fields without being returned as a token itself. Because this class is relatively simple, it is presented in its entirety in Listing 25.8.
Listing 25.8. The tabStreamTokenizer class.

class tabStreamTokenizer extends StreamTokenizer {
    public static final int TT_TAB = '\t';

    tabStreamTokenizer( InputStream in ) {
        super( new BufferedReader( new InputStreamReader( in ) ) );
        // Undo parseNumbers() and whitespaceChars(0, ' ')
        ordinaryChars( '0', '9' );
        ordinaryChar( '.' );
        ordinaryChar( '-' );
        ordinaryChars( 0, ' ' );
        // Everything but TT_EOL and TT_TAB is a word
        wordChars( 0, ('\t'-1) );
        wordChars( ('\t'+1), 255 );
        // Tabs separate fields and are skipped as whitespace;
        // the newline is returned as its own TT_EOL token.
        whitespaceChars( TT_TAB, TT_TAB );
        ordinaryChar( TT_EOL );
    }
}
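To check what this configuration produces, the class can be fed a small in-memory sample. The sketch below repeats the tokenizer from Listing 25.8 so that it is self-contained and adds a hypothetical TabTokenizerDemo helper that records each token; it assumes a JDK newer than 1.1.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StreamTokenizer;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Same configuration as Listing 25.8.
class tabStreamTokenizer extends StreamTokenizer {
    public static final int TT_TAB = '\t';

    tabStreamTokenizer(InputStream in) {
        super(new BufferedReader(new InputStreamReader(in)));
        ordinaryChars('0', '9');
        ordinaryChar('.');
        ordinaryChar('-');
        ordinaryChars(0, ' ');
        wordChars(0, ('\t' - 1));
        wordChars(('\t' + 1), 255);
        whitespaceChars(TT_TAB, TT_TAB);
        ordinaryChar(TT_EOL);
    }
}

final class TabTokenizerDemo {
    // Returns field values in order, with "<EOL>" marking each end of line.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        try {
            tabStreamTokenizer in = new tabStreamTokenizer(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII)));
            while (in.nextToken() != StreamTokenizer.TT_EOF) {
                if (in.ttype == StreamTokenizer.TT_WORD) {
                    out.add(in.sval);
                } else if (in.ttype == StreamTokenizer.TT_EOL) {
                    out.add("<EOL>");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("name\tage\nalice\t30\n"));
    }
}
```

One consequence of treating the tab as whitespace is that consecutive tabs collapse, so an empty cell in the middle of a row is skipped rather than reported as an empty field. A carriage return is also a word character here, so files with CR LF line endings will carry a trailing \r in the last field of each row.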

The getContent() Method

Subclasses of ContentHandler must provide an implementation of getContent() that returns a reference to an Object. The method takes as its parameter a URLConnection object from which the class can obtain an InputStream to read the resource's data.
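The contract is small enough to demonstrate with a trivial handler. The sketch below defines a hypothetical PlainTextHandler that turns a stream into a String, and an InMemoryConnection stub so that the handler can be called directly without a server. Neither class is part of the chapter's example, and the code assumes a JDK newer than 1.1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ContentHandler;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical handler: reads the connection's stream into one String.
final class PlainTextHandler extends ContentHandler {
    @Override
    public Object getContent(URLConnection con) throws IOException {
        InputStream in = con.getInputStream();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toString("US-ASCII");
    }
}

// Stub connection backed by a byte array; it has no URL of its own.
final class InMemoryConnection extends URLConnection {
    private final byte[] data;

    InMemoryConnection(String text) {
        super(null);
        data = text.getBytes(StandardCharsets.US_ASCII);
    }

    @Override
    public void connect() {
        connected = true;
    }

    @Override
    public InputStream getInputStream() {
        return new ByteArrayInputStream(data);
    }
}

final class ContentHandlerDemo {
    // Calls the handler the way URLConnection.getContent() would.
    static Object load(String text) {
        try {
            return new PlainTextHandler().getContent(new InMemoryConnection(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load("hello"));
    }
}
```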
The getContent() Skeleton

First, let's define the overall structure and method variables. We need a flag (which we'll call done) to signal when we've read all the field names from the first line of text. The number of fields (columns) in each row of data will be determined by the number of fields in the first line of text and will be kept in an int variable called numFields. We also will declare another integer, index, for use while inserting the rows of data into a String[].

We need some method of holding an arbitrary number of objects because we cannot determine the number of data rows in advance. To do this, we'll use the java.util.Vector object, which we'll call lines, to keep each String[] array. Finally, we will declare an instance of our tabStreamTokenizer, using the getInputStream() method from the URLConnection passed as an argument to the constructor. Listing 25.9 shows the skeleton code for the getContent() method.
Listing 25.9. The getContent() skeleton.

public Object getContent( URLConnection con ) throws IOException {
    boolean done = false;
    int numFields = 0;
    int index = 0;
    Vector lines = new Vector();
    tabStreamTokenizer in =
        new tabStreamTokenizer( con.getInputStream( ) );

    // Read in the first line of data (Listing 25.10 & 25.11)
    // Read in the rest of the file (Listing 25.12)
    // Stuff all data into a String[][] (Listing 25.13)
}

Reading the First Line

The first line of the file tells us the number of fields and the names of the fields in each row for the rest of the file. Because we don't know beforehand how many fields there are, we'll keep each field in Vector firstLine. Each TT_WORD token that the tokenizer returns is the name of one field. We know we are done once it returns a TT_EOL token and can set the done flag to true. We use a switch statement on the ttype member of our tabStreamTokenizer to decide what action to take (see Listing 25.10).
Listing 25.10. Reading the first line of data.

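    Vector firstLine = new Vector( );
    // Case labels must be compile-time constants, so the token types are
    // named through the StreamTokenizer class rather than the in variable.
    while( !done && in.nextToken( ) != StreamTokenizer.TT_EOF ) {
        switch( in.ttype ) {
            case StreamTokenizer.TT_WORD:
                firstLine.addElement( in.sval );
                numFields++;
                break;
            case StreamTokenizer.TT_EOL:
                done = true;
                break;
        }
    }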

Now that we have the first line in memory, we have to build an array of String objects from those stored in the Vector. To accomplish this, we'll first allocate the array to the size just determined. Then we will use the copyInto() method to transfer the strings into the array just allocated. Finally, we'll insert the array into lines (see Listing 25.11).
Listing 25.11. Copying field names into an array.

    // Copy first line into array
    String curLine[] = new String[ numFields ];
    firstLine.copyInto( curLine );
    lines.addElement( curLine );

Read the Rest of the File

Before reading the remaining data, we have to allocate a new array to hold the next row. Then we loop until we encounter the end of the file, signified by TT_EOF. Each time we retrieve a TT_WORD, we insert the String into curLine and increment index.

The end of the line lets us know when a row of data is done, at which time we add the current line to the lines Vector. Then we allocate a new String[] to hold the next line and set index back to zero (to insert the next item starting at the first element of the array). The code to implement this is given in Listing 25.12.
Listing 25.12. Reading the rest of the data.

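    curLine = new String[ numFields ];
    while( in.nextToken( ) != StreamTokenizer.TT_EOF ) {
        switch( in.ttype ) {
            case StreamTokenizer.TT_WORD:
                // Ignore extra fields beyond those named in the first line
                if( index < numFields ) {
                    curLine[ index++ ] = in.sval;
                }
                break;
            case StreamTokenizer.TT_EOL:
                lines.addElement( curLine );
                curLine = new String[ numFields ];
                index = 0;
                break;
        }
    }
    // Keep a final row that is not terminated by a newline
    if( index > 0 ) {
        lines.addElement( curLine );
    }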

Stuff All Data into String[][]

At this point in the code, all the data has been read in. All that remains is to copy the data from lines into an array of arrays of String, as shown in Listing 25.13.
Listing 25.13. Returning tab-separated value (TSV) data as String[][].

    String retval[][] = new String[ lines.size() ][];
    lines.copyInto( retval );
    return retval;
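Once a caller has the String[][] in hand, the first row serves as a lookup table for the columns. The following sketch shows one way a client might pull out a column by field name; the TsvTable class and its methods are hypothetical helpers, not part of the content handler.

```java
final class TsvTable {
    // Finds the position of a field name in the header row, or -1.
    static int columnIndex(String[][] table, String name) {
        if (table.length == 0) {
            return -1;
        }
        String[] header = table[0];
        for (int i = 0; i < header.length; i++) {
            if (header[i].equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Returns the values of one column, skipping the header row.
    // Cells missing from a short row come back as null.
    static String[] column(String[][] table, String name) {
        int idx = columnIndex(table, name);
        if (idx < 0) {
            throw new IllegalArgumentException("no such field: " + name);
        }
        String[] values = new String[table.length - 1];
        for (int row = 1; row < table.length; row++) {
            values[row - 1] = table[row][idx];
        }
        return values;
    }

    public static void main(String[] args) {
        String[][] table = { { "name", "age" }, { "alice", "30" }, { "bob", "41" } };
        System.out.println(String.join(",", column(table, "name")));
    }
}
```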

Using the Content Handler

To show how the content handler works, we'll modify the urlFetcher applet (used earlier in this chapter to demonstrate the finger protocol handler). We'll change it to use the getContent() method to retrieve the contents of a resource rather than reading the data from the stream returned by getInputStream(). We'll show the changes to the doFetch() method of the urlFetcher applet necessary to determine what type of Object was returned and to display it correctly. The first change is to call the getContent() method and get an Object back rather than getting an InputStream. Listing 25.14 shows this change.
Listing 25.14. Modified urlFetcher.doFetch() code: Calling getContent() to get an Object.

try {
    boolean displayed = false;
    URLConnection con = target.openConnection();
    Object obj = con.getContent( );

Next we must perform tests using the instanceof operator. We handle String objects and arrays of String objects by placing the text into the TextArea. Arrays are printed item by item. If the object is a subclass of InputStream, we read the data from the stream and display it. Image content is just noted as being an Image. For any other content type, we simply throw our hands up and remark that we cannot display the content (because the urlFetcher applet is not a full-fledged Web browser). The code to do this is shown in Listing 25.15.
Listing 25.15. Modified urlFetcher.doFetch() code: Determining the type of the Object and displaying it.

    if( obj instanceof String ) {
        contentArea.setText( (String) obj );
        displayed = true;
    }
    if( obj instanceof String[] ) {
        String array[] = (String []) obj;
        StringBuffer buf = new StringBuffer( );
        for( int i = 0; i < array.length; i++ )
            buf.append( "item " + i + ": " + array[i] + "\n" );
        contentArea.setText( buf.toString( ) );
        displayed = true;
    }
    if( obj instanceof String[][] ) {
        String array[][] = (String [][]) obj;
        StringBuffer buf = new StringBuffer( );
        for( int i = 0; i < array.length; i++ ) {
            buf.append( "Row " + i + ":\n\t" );
            for( int j = 0; j < array[i].length; j++ )
                buf.append( "item " + j + ": " + array[i][j] + "\t" );
            buf.append( "\n" );
        }
        contentArea.setText( buf.toString() );
        displayed = true;
    }
    if( obj instanceof Image ) {
        contentArea.setText( "Image" );
        displayed = true;
    }
    if( obj instanceof InputStream ) {
        int c;
        StringBuffer buf = new StringBuffer( );
        while( (c = ((InputStream) obj).read( )) != -1 )
            buf.append( (char) c );
        contentArea.setText( buf.toString( ) );
        displayed = true;
    }
    if( !displayed ) {
        contentArea.setText( "Don't know how to display "
            + obj.getClass().getName( ) );
    }
    // Same code to display content type and length
} catch( IOException e ) {
    showStatus( "Error fetching \"" + target + "\": " + e );
    return;
}

The complete modified applet source is on the CD-ROM that accompanies this book as urlFetcher_Mod.java. Figure 25.2 shows what the applet will look like when displaying text/tab-separated-values data. The file displayed in the figure is included on the CD-ROM as example.tsv.

Figure 25.2.

The urlFetcher_Mod applet.

Most HTTP daemons should return the correct content type for files ending in .tsv. Many Web browsers have a menu option that shows you information such as the content type about a URL (for example, the View | Document Info option in Netscape Navigator does this). You can use this feature to see what MIME type the sample data is being returned as. If the data does not show up as text/tab-separated-values, try one of the following suggestions:

Ask your Webmaster to look at the MIME configuration file for your HTTP daemon. The Webmaster will either be able to tell you the proper file suffix or modify the daemon to return the proper type.

If you can install CGI scripts on your Web server, you may want to look at a sample script on the CD-ROM that accompanies this book (named tsv.sh); it returns the content handler example data with the proper content type.

Summary

After reading this chapter, you should have an understanding of how Java can be extended fairly easily to deal with new application protocols and data formats. You now know the classes from which you have to derive your handlers (URLConnection and URLStreamHandler for protocol handlers; ContentHandler for content handlers) and how to get Java to load the new handler classes.

©Copyright, Macmillan Computer Publishing. All rights reserved.
