Chapter 23
Communications and Networking

by David W. Baker

Models around which network protocols are designed Understanding the OSI and TCP/IP models will provide you with a practical framework for understanding network communications.
The implementation of the TCP/IP network layers, including IP, TCP, and UDP Each of the protocols that correspond to the four TCP/IP network layers works in concert to provide structure and integrity to your applications which use network communications.
The network identification scheme of URLs Developed for the World Wide Web, URLs were designed with the foresight to encompass legacy protocols and allow for incorporation of future designs.
Java's URL-oriented objects Classes such as java.net.URL and java.net.URLConnection enable you to quickly begin programming with standard application protocols.

Despite all of its other merits, the rapid embrace of Java by the computing community is primarily due to its powerful integration with Internet networking. The Internet revolution has forever changed the way the personal computer is used, empowering individuals to gather, share, and publish information in a vast resource with millions of participants. Building on top of this foundation, Java could be the next major revolution in computing.

The Java execution environment is designed so that applications can be easily written to efficiently communicate and share processing with remote systems. Much of this functionality is provided with the standard Java API within the java.net package.

This is the first of five chapters which will demonstrate clear and practical uses of the classes within java.net, explaining the programming concepts on which they are based. As a foundation of these discussions, the design of the Internet network protocol suite--TCP/IP--is illustrated within this chapter.

Overview of TCP/IP

TCP/IP is a suite of protocols that interconnects the various systems on the Internet. TCP/IP provides for a common programming interface to diverse and foreign hardware. The suite supports the joining of separate physical networks implementing different network media. To wit, TCP/IP makes a diverse, chaotic, global network like the Internet possible.

Models provide useful abstractions of working systems, ignoring fine detail while enabling a clear perspective on global interactions. Models facilitate a greater understanding of functioning systems and also provide a foundation for extending that system. Understanding the models of network communications is an essential guide to learning TCP/IP fundamentals.

OSI Reference Model

The network protocol architecture known as the Open Systems Interconnect (OSI) Reference Model is often used to describe network systems. The OSI scheme was one part of a larger project by the International Organization for Standardization (ISO). The OSI protocols never proved as successful as TCP/IP, making the Reference Model perhaps the most enduring aspect of this ISO endeavor.

The model consists of seven layers providing specific functionality. Each layer has defined characteristics, and together the whole enables network communication. The software implementation of such a layered model is appropriately termed a protocol stack.

The OSI model is illustrated in Figure 23.1. User applications insert information into one layer and each specially encapsulates the data until the last is reached. The information is then transmitted to the destination, sometimes having the layers translated from the bottom up as the data is transported.

FIG. 23.1
The OSI Reference Model consists of seven layers.

The following layers have specific roles, each refraining from intruding into the domain of the other, all depending upon the others:

Application Layer. Contains network applications with which people interact, such as mail, file transfer, and remote login.
Presentation Layer. Creates common data structures.
Session Layer. Manages connections between network applications.
Transport Layer. Ensures that data is received exactly as it is sent.
Network Layer. Routes data through various physical networks while traveling to a known host.
Data Link Layer. Transmits and receives packets of information reliably across a uniform physical network.
Physical Layer. Defines the physical properties of the network, such as voltage levels, cable types, and interface pins.

TCP/IP Network Model

The OSI model informs an understanding of the TCP/IP communication architecture. When viewed as a layered model, TCP/IP is usually seen as being composed of four layers:

Application
Network
Transport
Link

These layers are illustrated in Figure 23.2. Attempts to map these layers to the OSI model are inexact and confuse matters, so this chapter refrains from such an endeavor.

FIG. 23.2
The TCP/IP Network Model can be broken down into four layers.

As in the OSI model, each TCP/IP layer plays a specific role.

Application Layer Network applications depend upon the definition of a clear dialog. In a client-server system, the client application knows how to request services, and the server knows how to appropriately respond. Protocols that implement this layer include HTTP, FTP, and Telnet.

Transport Layer The Transport Layer allows network applications to obtain messages over clearly defined channels and with specific characteristics. The two protocols within the TCP/IP suite that generally implement this layer are Transmission Control Protocol (TCP) and User Datagram Protocol (UDP).

Network Layer The Network Layer allows information to be transmitted to any machine on the contiguous TCP/IP network, regardless of the different physical networks that intervene. Internet Protocol (IP) is the mechanism for transmitting data within this layer.

Link Layer The Link Layer consists of the low-level protocols used to transmit data to machines on the same physical network. Protocols that aren't part of the TCP/IP suite, such as Ethernet, Token Ring, FDDI, and ATM, implement this layer. Data within these layers is usually encapsulated with a common mechanism: protocols have a header, identifying meta-information such as the source, destination, and other attributes, and a data portion that contains the actual information. The protocols from the upper layers are encapsulated within the data portion of the lower ones. When traveling back up the protocol stack, the information is reconstructed as it is delivered to each layer. Figure 23.3 shows this concept of encapsulation.

FIG. 23.3
As data moves through the TCP/IP layers, it is encapsulated.

TCP/IP Protocols

Three protocols are most commonly used within the TCP/IP scheme, and a closer investigation of their properties is warranted. Understanding how these three protocols (IP, TCP, and UDP) interact is critical to developing network applications.

Internet Protocol (IP)

IP is the keystone of the TCP/IP suite. All data on the Internet flows through IP packets, the basic unit of IP transmissions. IP is termed a connectionless, unreliable protocol. As a connec-tionless protocol, IP does not exchange control information before transmitting data to a remote system--packets are merely sent to the destination with the expectation that they will be treated properly. IP is unreliable because it does not retransmit lost packets or detect corrupted data. These tasks must be implemented by higher level protocols, such as TCP.

IP defines a universal addressing scheme called IP addresses. An IP address is a 32-bit number, and each standard address is unique on the Internet. Given an IP packet, the information can be routed to the destination based upon the IP address defined in the packet header. IP addresses are generally written as four numbers, between 0 and 255, separated by a period (for example, 124.148.157.6).

While a 32-bit number is an appropriate way to address systems for computers, humans understandably have difficulty remembering them. Thus, a system called the Domain Name System (DNS) was developed to map IP addresses to more intuitive identifiers and vice versa. You can use www.netspace.org instead of 128.148.157.6.

It is important to realize that these domain names are not used or understood by IP. When an application wants to transmit data to another machine on the Internet, it must first translate the domain name to an IP address using the DNS. A receiving application can perform a reverse translation, using the DNS to return a domain name given an IP address. There is not a one-to-one correspondence between IP addresses and domain names: a domain name can map to multiple IP addresses, and multiple IP addresses can map to the same domain name.

NOTE: Even more important to note is that the entire body of DNS data cannot be trusted. Varied systems through the world are responsible for maintaining DNS records. DNS servers can be tricked, and servers can be set up that are populated with false information. In fact, a security hole in early Java implementations was created by an inappropriate trust of the DNS.

Transmission Control Protocol (TCP)

Most Internet applications use TCP to implement the transport layer. TCP provides a reliable, connection-oriented, continuous-stream protocol. The implications of these characteristics are:

Reliable. When TCP segments, the smallest unit of TCP transmissions, are lost or corrupted, the TCP implementation will detect this and retransmit necessary segments.
Connection-oriented. TCP sets up a connection with a remote system by transmitting control information, often known as a handshake, before beginning a communication. At the end of the connect, a similar closing handshake ends the transmission.
Continuous-stream. TCP provides a communications medium that allows for an arbitrary number of bytes to be sent and received smoothly; once a connection has been established, TCP segments provide the application layer the appearance of a continuous flow of data.

Because of these characteristics, it is easy to see why TCP would be used by most Internet applications. TCP makes it very easy to create a network application, freeing you from worrying how the data is broken up or about coding error correction routines. However, TCP requires a significant amount of overhead and perhaps you might want to code routines that more efficiently provide reliable transmissions, given the parameters of your application. Furthermore, retransmission of lost data may be inappropriate for your application, because such information's usefulness may have expired. In these instances, UDP serves as an alternative, described in the following section, "User Datagram Protocol (UDP)."

An important addressing scheme which TCP defines is the port. Ports separate various TCP communications streams that are running concurrently on the same system. For server applications, which wait for TCP clients to initiate contact, a specific port can be established from where communications will originate. These concepts come together in a programming abstraction known as sockets.

See "TCP Socket Basics," Chapter 24

User Datagram Protocol (UDP)

UDP is a low-overhead alternative to TCP for host-to-host communications. In contrast to TCP, UDP has the following features:

Unreliable. UDP has no mechanism for detecting errors, nor retransmitting lost or corrupted information.
Connectionless. UDP does not negotiate a connection before transmitting data. Information is sent with the assumption that the recipient will be listening.
Message-oriented. UDP allows applications to send self-contained messages within UDP datagrams, the unit of UDP transmission. The application must package all information within individual datagrams.

For some applications, UDP is more appropriate than TCP. For instance, with the Network Time Protocol (NTP), lost data indicating the current time would be invalid by the time it was retransmitted. In a LAN environment, Network File System (NFS) can more efficiently provide reliability at the application layer and thus uses UDP.

As with TCP, UDP provides the addressing scheme of ports, allowing for many applications to simultaneously send and receive datagrams. UDP ports are distinct from TCP ports. For example, one application can respond to UDP port 512 while another unrelated service handles TCP port 512.

Uniform Resource Locator (URL)

While IP addresses uniquely identify systems on the Internet, and ports identify TCP or UDP services on a system, URLs provide a universal identification scheme at the application level. Anyone who has used a Web browser is familiar with seeing URLs, though their complete syntax may not be self-evident. URLs were developed to create a common format of identifying resources on the Web, but they were designed to be general enough so as to encompass applications that predated the Web by decades. Similarly, the URL syntax is flexible enough so as to accommodate future protocols.

URL Syntax

The primary classification of URLs is the scheme, which usually corresponds to an application protocol. Schemes include http, ftp, telnet, and gopher. The rest of the URL syntax is in a format that depends upon the scheme. These two portions of information are separated by a colon:

scheme-name:scheme-info

Thus, while mailto:dwb@netspace.org indicates "send mail to user dwb at the machine netspace.org," ftp://dwb@netspace.org/ means "open an FTP connection to netspace.org and log in as user dwb."

General URL Format

Most URLs conform to a general format that follows this pattern:

scheme-name://host:port/file-info#internal-reference

Scheme-name is an URL scheme such as HTTP, FTP, or Gopher. Host is the domain name or IP address of the remote system. Port is the port number on which the service is listening; because most application protocols define a standard port, unless a non-standard port is being used, the port and the colon which delimits it from the host are omitted. File-info is the resource requested on the remote system, which often is a file. However, the file portion may actually execute a server program and it usually includes a path to a specific file on the system. The internal-reference is usually the identifier of a named anchor within an HTML page. A named anchor allows a link to target a particular location within an HTML page. Usually this is not used, and this token with the # character that delimits it is omitted.

Realize that this general format is very much an over-simplification which only agrees with common use. For more complete information on URLs, read the following resource:

http://www.netspace.org/users/dwb/url-guide.html

Java and URLs

Java provides a very powerful and elegant mechanism for creating network client applications, allowing you to use relatively few statements to obtain resources from the Internet. The java.net package contains the sources of this power, the URL and URLConnection classes.

NOTE: The Security Manager of Java browsers generally prohibits applets from opening a network connection to a machine other than the one from which the applet was downloaded. This security feature significantly limits what applets can accomplish. This holds true for all Java networking described in this and subsequent chapters. Java applications, however, are under no such restrictions.

The URL Class

This class allows you to easily create a data structure containing all of the necessary information to obtain the remote resource. Once an URL object has been created, you can obtain the various portions of the URL, according to the general format. The URL object also allows you to obtain the remote data.

The URL class has four constructors:

public URL(String spec) throws MalformedURLException;
public URL(String protocol, String host, String file)
   throws MalformedURLException;
public URL(String protocol, String host, int port, String file)
   throws MalformedURLException;
public URL(URL context, String spec)
   throws MalformedURLException;

The first constructor is the most commonly used and allows you to create an URL object with a simple declaration like:

URL myURL = new URL("http://www.yahoo.com/");

The second and third constructors allow you to specify explicitly the various portions of the URL. The last constructor enables you to use relative URLs. A relative URL only contains part of the URL syntax; the rest of the data is completed from the URL to which the resource is relative. This will often be seen in HTML pages, where a reference to merely more.html means "get more.html from the same machine and directory where the current document resides."

Here are examples of these constructors:

URL firstURLObject - new URL("http://www.yahoo.com/");
URL secondURLObject = new URL("http","www.yahoo.com","/");
URL thirdURLObject =     new URL("http","www.yahoo.com",80,"/");
URL fourthURLObject = new URL(firstURLObject,"text/suggest.html");

The first three statements create URL objects that all refer to the Yahoo! home page, while the fourth creates a reference to "text/suggest.html" relative to Yahoo's home page (such as http://www.yahoo.com/text/suggest.html). All of these constructors throw a MalformedURLException, which you will generally want to catch. The example shown later in Listing 23.1 illustrates this. Note that once you create an URL object, you can change to which resource it points. To accomplish this, you must create a new URL object.

Connecting to an URL

Now that you've created an URL object, you will want to actually obtain some useful data. There are two main avenues of so doing: reading directly from the URL object or obtaining an URLConnection instance from it.

Reading directly from the URL object requires less code, but is much less flexible, and it only allows a read-only connection. This is limiting, as many Web services allow you to write information that will be handled by a server application. The URL class has an openStream() method which returns an InputStream object through which the remote resource can be read byte-by-byte.

Handling data as individual bytes is cumbersome, so you will often want to embed the returned InputStream within a DataInputStream object, allowing you to read the input line-by-line. This coding strategy is often referred to as using a decorator, as the DataInputStream decorates the InputStream by providing a more specialized interface. The following code fragment obtains an InputStream directly from the URL object and then decorates that stream:

URL whiteHouse = new URL("http://www.whitehouse.gov/");
InputStream undecoratedInput = whiteHouse.openStream();
DataInputStream decoratedInput =     new DataInputStream(undecoratedInput);

Another more flexible way of connecting to the remote resource is by using the openConnection() method of the URL class. This method returns an URLConnection object that provides a number of very powerful methods that you can use to customize your connection to the remote resource.

For example, unlike the URL class, an URLConnection allows you to obtain both an InputStream and an OutputStream. This has a significant impact upon the HTTP protocol, whose access methods include both GET and POST. With the GET method, an application merely requests a resource and then reads the response. The POST method is often used to provide input to server applications by requesting a resource, writing data to the server with the HTTP request body, and then reading the response. In order to use the POST method, you can write to an OutputStream obtained from the URLConnection prior to reading from the InputStream. If you read first, the GET method will be used and a subsequent write attempt will be invalid.

The following code fragment demonstrates using an URLConnection object to contact a remote server application using the HTTP POST method by writing to an OutputStream decorated by a PrintStream instance. www.javasoft.com makes a CGI server application available to test out these methods. The code connects to a CGI application which reverses the POST data and then reads the reversed data from a decorated InputStream.

URL reverseURL =
   new URL("http://www.javasoft.com/cgi-bin/backwards");
URLConnection reverseConn = reverseURL.openConnection();
PrintStream output =
   new PrintStream(reverseConn.getOutputStream());
DataInputStream input =
   new DataInputStream(reverseConn.getInputStream());
output.println("string=TexttoReverse");
String reversedText = input.readLine();

HTTP-Centric Classes

After reading this overview of the URL and URLConnection classes, you may begin to suspect that the methods of these classes are designed with the HTTP protocol in mind. If you look at the complete class specifications, this notion is confirmed. Though the http scheme is only one of many classifications of URLs, these classes are very HTTP-centric.

HTTP is likely to be the most used standard protocol for your communications on the Web and Internet, so this is not a significant concern. You should be aware, however, that many of these methods are useful only when working with HTTP URLs.

An Example--Customized AltaVista Searching

Now that you've learned the basics of Java networking, it would be nice to do something actually useful. The AltaVista search engine provides a very powerful way of searching for documents on the Web; however, it is designed to only return a few hits at a time to optimize performance. As other astute and sagacious programmers have pointed out, developing a small client to automatically request all of the search results and then present them at once is very useful. The AltaVistaList Java application does just that.

TIP: AltaVista is a powerful search engine that allows you to find documents on the Web. The AltaVista home page is available at: http://www.altavista.digital.com/

When designing this application, first consider what public methods this class should have. It needs:

A method to start the application.
A method to initialize the object.
A method to print out all of the results.
A number of protected methods to be used by the object itself: a method to create a query to send to AltaVista, a method to get a single HTML page from AltaVista, and a method to parse out the hit results from the rest of the returned HTML page.

Listing 23.1 shows the entire code of the AltaVistaList application. When executed with a series of keywords, it returns an HTML page that contains all of the hits returned by AltaVista.

CAUTION:
Note that this application is limited by the specific mechanisms AltaVista uses to receive and present information. These mechanisms are not guaranteed to remain static--thus, if AltaVista changes, this program fails.

Listing 23.1AltaVistaList.java

import java.net.*;   // Import the names of the classes
import java.io.*;    // to be used.

/**
 * This application creates a single, concise HTML page
 * of hits from the AltaVista search engine given a
 * search string.
 * @author David W. Baker
 * @version 1.1
 */
public class AltaVistaList {
   private static final String AGENT_NAME =
      "java-alta-search";
   private static final String AGENT_VERSION = "1.0";
   private static final String SEARCH_URL =
       "http://www.altavista.digital.com/cgi-bin/query";
   private int totalHits = 0;
   private StringBuffer outputList = new StringBuffer();

   /**
    * This starts the application.
    * @param args Program arguments - the search string.
    */
   public static void main(String[] args) {
      if (args.length == 0) {
         System.out.println(
            "Usage: AltaVistaList search string");
         System.exit(1);
      }
      AltaVistaList runApp = new AltaVistaList(args);
      runApp.printOutput(System.out);
      System.exit(0);
   }

   /**
    * This constructor connects to AltaVista and obtains
    * all of the relevant hits.
    * @param args The search tokens.
    */
   public AltaVistaList(String[] args) {
      String hitData;      // Store incoming data.
      int startHits = 0;   // Get the next 10 hits from here.
      String searchSyntax = createQuery(args);

      URLConnection.setDefaultRequestProperty("User-Agent",
         AGENT_NAME + "/" + AGENT_VERSION);
      while (true) {
         hitData = getPage(SEARCH_URL + "?" + searchSyntax +
            startHits); // Go get a page of hits.
         hitData = getHits(hitData);  // Extract the hits.
         // If there were no hits in the page, hitData will
         // be null. If there were hits, append them to
         // the outputList, increment to the next 10 hits,
         // and go through the loop again.
         if (hitData != null) {
            outputList.append(hitData + "\n");
            startHits += 10;
         // Otherwise, break from the loop.
         } else {
            break;
         }
      }
   }

   /**
    * This method builds an AltaVista search query
    * string.
    * @param searchTokens An array of search tokens.
    * @return The search query string built.
    */
   protected String createQuery(String[] searchTokens) {
      StringBuffer searchString = new StringBuffer();

      // Apend the tokens to a single string.
      for(int index = 0; index < searchTokens.length;
            index++) {
         searchString.append(searchTokens[index]);
         // Add a space if there's another token coming up.
         if (index < searchTokens.length-1) {
            searchString.append(" ");
         }
      }
      // URL encode the string.
      String encodedSearchString =
         URLEncoder.encode(searchString.toString());
      // Return the proper query string.
      return "what=web&fmt=c&pg=q&q=" + encodedSearchString
         + "&stq=";
   }

   /**
    * This method obtains a page from the Web.
    * @param url The URL of the page to obtain.
    * @return The page obtained.
    */
   protected String getPage(String url) {
      // A buffer for the incoming page.
      StringBuffer page = new StringBuffer();
      String nextLine;  // The next line in the input stream.

      try {
         URL urlObject = new URL(url);
         URLConnection agent = urlObject.openConnection();
         DataInputStream input =
            new DataInputStream(agent.getInputStream());
         // While readLine() doesn't return null, append
         // then next line to the buffer.
         while((nextLine = input.readLine()) != null) {
            page.append(nextLine+"\n");
         }
         input.close();
      } catch(MalformedURLException excpt) {
         System.out.println("Badly formed URL: " + excpt);
      } catch(IOException excpt) {
         System.out.println("Failed I/O: " + excpt);
      }
      // Convert the buffer to a string and return.
      return page.toString();
   }

   /**
    * This method extracts the list of hits from a returned
    * AltaVista results page.
    * @param hitPage The page returned from AltaVista.
    * @return The list of hits.
    */
   protected String getHits(String hitPage) {
      int first,last;     // Begin/end of a substring.
      int notFound = -1;  // Not found return for indexOf().
      String hitSection = null;  // The hits part of page.

      // Go to the first "<a href=" after "<pre>".
      first = hitPage.indexOf("<pre>") + "<pre>".length();
      first = hitPage.indexOf("<a href=",first);
      // End pointer at "</pre>".
      last = hitPage.indexOf("</pre>");

      // If our beginning is after our end, return.
      if (last < first) {
         return hitSection;
      }
      // If neither substring is found, return.
      if (first == notFound || last == notFound) {
         System.err.println("Bad search page format");
         return hitSection;
      }

      // Cut out the substring.
      hitSection = hitPage.substring(first,last);
      first = last = 0;
      totalHits += 1;   // Found one hit.
      // Go through the page line by line.
      while((last = hitSection.indexOf("\n",first))
               != notFound) {
         // Find the next "<a href=" which should be
         // immediately after the \n.
         first = hitSection.indexOf("<a href=",last);
         // If it's not, return the current substring.
         if (first != (last+1)) {
            return hitSection.substring(0,last);
         // Otherwise, another hit has been found.
         } else {
            totalHits += 1;
         }
      }
      return hitSection; // Return the substring.
   }

   /**
    * This method prints the list of hits obtained from
    * AltaVista.
    * @param sendOutput Where to print the output.
    */
   public void printOutput(PrintStream sendOutput) {
      sendOutput.print("<!DOCTYPE HTML PUBLIC \"-//IETF//" +
         "DTD HTML//EN\">\n<HTML>\n<HEAD>\n<TITLE>" +
         AGENT_NAME + "</TITLE>\n</HEAD>\n<BODY>\n<H1>" +
         "Search Results</H1>\n<P><STRONG>Total number of " +
         "hits: " + totalHits + "</STRONG></P>\n<PRE>\n" +
         outputList + "</PRE>\n</BODY>\n</HTML>\n");
   }
}

The main() Method: Starting the Application First, the application imports the two packages it will be using, allowing its method invocations of the Java API to be more brief. Then the class is declared and a number of private instance variables are initialized: search_URL is the URL to the AltaVista search engine, totalHits maintains a count of the hits returned, while outputList is a buffer for the HTML of the returned hits. The main() method allows this code to be executed as an application. This method checks for an appropriate set of arguments and creates an instance of the AltaVistList class. It then tells that instance to print its output, passing the printOutput() method System.out. System.out is a static reference to a PrintStream object within the java.lang.System class. By passing this PrintStream to printOutput(), it indicates that the AltaVistaList instance should send its data to standard output. Finally, main() exits the application with a return value of 0, indicating that the execution completed normally.

The AltaVistaList Constructor The class' only constructor takes an argument of a reference to an array of String objects. It passes this array reference to the createQuery() method, described below, which builds a query for the AltaVista search engine. The constructor then invokes a static method of the URLConnection class, setDefaultRequestProperty():

URLConnection.setDefaultRequestProperty("User-Agent",
AGENT_NAME + "/" + AGENT_VERSION);

Note that because this is a static method, it is not invoked through an instance of URLConnection. Instead, the method is invoked through the class itself. This method indicates that all HTTP requests should specify the "User-Agent" field as being equal to a string that identifies this application. The HTTP User-Agent is a field that Web clients use to tell servers their program name, allowing servers to keep track of what clients are visiting the site. For instance, Netscape browsers send a User-Agent field that includes "Mozilla," while the Internet Explorer sends "Explorer." Unfortunately, as of the writing of this chapter, the JDK has yet to implement this method, and the User-Agent remains the default Java<version>. This code has been left in with the assumption that the JDK will soon complete its implementation of the Java API.

The constructor then enters an infinite loop. In this loop, it calls the getPage() method with an URL of a page to receive. This URL is the URL of the AltaVista search engine appended by a question mark, the query returned by createQuery(), and a number from where the hit list should start. AltaVista lists hits ten at a time, and the startHits counter allows the AltaVistaList application to increment throughout the entire list of hits. The returned page is passed to getHits(), which strips out the hits from the rest of the HTML page. If what remains is a null String reference, the application breaks from the loop. Otherwise, it appends the data to a StringBuffer and then increments the startHits counter to set up for retrieving the next ten hits.

The createQuery() Method: Building the Query String This protected method is used internally by an instance of the AltaVistaList class, building the query syntax for the AltaVista search engine. This query string follows this format:

what=web&fmt=c&pg=q&q=<search string>&stq=<n>

While this appears confusing, this is merely a set of five parameters separated by ampersands. what=web tells AltaVista to search its index of Web pages. fmt=c asks for output to be returned in a compact format, facilitating parsing and efficient presentation. pq=q indicates that you are doing a simple query using the basic syntax language. q= identifies your search string, which must be encoded in the URL format, encoding spaces, and other special characters. Finally, stq= tells AltaVista which query result item to start from when returning its data. AltaVista returns results ten at a time, and stq= allows you to obtain the results after the first ten.

createQuery() takes an array of String objects and then appends the entire array to a StringBuffer object. The contents of each String are separated by a space. Another static method is used, this time encode() from the URLEncoder class. This is a very useful method:

String encodedSearchString =
    URLEncoder.encode(searchString.toString());

The URLEncoder.encode() static method takes a String as an argument and returns a corresponding String that is in URL encoded format. This format allows spaces and other special characters to be encapsulated within an URL.

createQuery() then returns the appropriate query with this encoded String embedded. Note that createQuery() omits the number for the stq= parameter, as that information is appended within the main() method.

The getPage() Method: Retrieving a Web Page getPage() is a method that demonstrates the concepts learned to this point regarding Java and networking. Passed a String that contains an URL, the method creates an URL instance and then obtains an URLConnection from that instance. It sets up a DataInputStream to read data from that connection line-by-line, and then enters a loop. This while loop reads the next line of data, exiting the loop if the next line is null. The method appends each line to a StringBuffer object, and once completed, uses that class' toString() method to return a String.

The getHits() Method: Parsing Out the Hits The logic of this method is a little hard to follow, but its goal is to take a String containing an HTML page from the AltaVista search engine and strip out the returned hits. It accomplishes this by using two pointers, first and last, to indicate the beginning and end of appropriate substrings within the page. getHits() looks for the first instance of <a href= after the first occurrence of <pre>, which is where the hit list should begin. The end of the hit list should be at the first </pre> tag. If no such string is found, the method returns with a null String reference, indicating that the page was devoid of hits.

Otherwise, the method pares down the HTML page to a substring and iterates through that data. Each hit should be a new line starting with <a href=. While this is the case, the method keeps looping until it comes to the end of the data. Each line that it encounters indicates a new hit has been found, and the method increments an instance variable used to keep track of the total number of hits. Once completed, the method returns the String containing the hits.

The printOutput Method: Displaying the Results The printOutput() method is very simple. It takes a PrintStream as an argument and then prints an HTML page to that stream. Instead of hard-coding System.out, this method is flexible and allows output to be easily directed to some other stream. The HTML page printed includes the total number of hits and then a preformatted section containing each hit on a separate line.

Running AltaVistaList To run the AltaVistaList application, first compile it with javac. Then, execute it with the Java interpreter, passing it an appropriate set of arguments. For instance, to see what additional information is available on the Web with both Java and URL on the same page, use the following command:

java AltaVistaList Java URL

Chapter 23 Communications and Networking