Chapter 11 Searching and CGI

Searching Information on the Web
Most Important Search Engines
Gathering Information on the Internet
Searching Interfaces for the Final User
CGI Work in the Background
Developing a Simple CGI for a White Pages Database
Future Improvements
Summary

This chapter covers the major search engines on the World Wide Web, the search techniques they use, and the role that CGI applications play in them. We will not create a complex search engine; instead, a simple White Pages application presented at the end of the chapter illustrates how CGI applications are used in search engines and why they matter. A White Pages database is a list of e-mail addresses. This one was developed with the CGI specification and the Perl language, and it has a simple Web interface that lets users submit queries.

Searching Information on the Web

Exploring the World Wide Web can be enjoyable, but it becomes frustrating when several hours of searching turn up nothing of value. The Web was designed to provide easy access to all types of information and, like the Internet as a whole, it is a vast information platform. Since its creation in 1990, the Web has grown so quickly that it is nearly impossible to use effectively without specialized tools. These tools, which have developed over time, are generally referred to as search engines; they help users organize and retrieve information.

Most Important Search Engines

Web search engines appeared a few months after the creation of the Web itself and were developed to meet the need for organized information and fast retrieval.

Back in October 1993, when there were about 200 known Web servers, a person could still have a general idea of what was available on the Web. By June 1994, the number of known Web servers had grown to 1,500, and finding information without help was becoming difficult. Search engines appeared as a natural evolution of the World Wide Web and rapidly became some of the most visited sites on the Internet. This is not surprising: finding information through a hierarchical directory or a keyword search is far faster than simple Web surfing, which can last for hours and show no practical results. Today, there are tens of thousands of Web servers, and the need for an organized system of information retrieval is greater than ever.

Lycos, Yahoo!, Excite, Infoseek, and Altavista (see the list of URLs that follows) probably aren't new to you, because these search engines have become well known and widely used. Each search engine has its own qualities, and it is difficult to name one as the best overall, because they differ in the way they gather information and in the way they let you search the resulting database. Yahoo!, for example, is a database to which users submit URLs for later verification by a human or a program. Altavista, on the other hand, uses a special program usually known as a robot (nicknamed Scooter) to gather information automatically from the Web and other Internet resources. These two strategies result in different databases.

The URLs of the search engines mentioned are

Lycos: http://www.lycos.com/
Yahoo!: http://www.yahoo.com/
Altavista: http://altavista.digital.com/
Infoseek: http://www.infoseek.com/
Excite: http://www.excite.com/

Gathering Information on the Internet

As mentioned previously, there are several strategies for gathering information and constructing a database of Uniform Resource Locators (URLs) for documents on the Web and other Internet resources, such as Usenet. "Passive" sites wait for you to enter your own URLs, or they scan special Usenet newsgroups for URLs. "Active" sites search for information themselves, using programs known as robots or spiders. A robot is a program that automatically traverses the Web: it retrieves a document and then follows the links in that document to continue its search. By doing this recursively, robots can index most of the Web (although it may take days or weeks of continuous work).
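The traversal just described can be sketched in a few lines of Perl. This is only an illustration, not the code of any real robot: the %web hash is an invented stand-in for the network (a real robot would fetch each page over HTTP), the example.* URLs are made up, and the link pattern is deliberately simplified. The sketch uses a queue rather than true recursion, which visits the same pages without deep call nesting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "Web": URL => HTML text. A real robot would
# fetch each page over HTTP instead of reading this hash.
my %web = (
    'http://a.example/' => '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    'http://b.example/' => '<a href="http://a.example/">A</a> <a href="http://d.example/">D</a>',
    'http://c.example/' => 'No links here.',
    'http://d.example/' => '<A HREF="http://b.example/">back to B</A>',
);

# Return the absolute http: links found in a page (simplified pattern).
sub extract_links {
    my ($html) = @_;
    my @links = $html =~ /<a\s+href="(http:[^"]+)"/gi;
    return @links;
}

# Traverse from a start URL, visiting each reachable page exactly once.
sub crawl {
    my ($start, $pages) = @_;
    my %seen  = ($start => 1);
    my @queue = ($start);
    my @visited;
    while (@queue) {
        my $url = shift @queue;
        next unless exists $pages->{$url};      # page cannot be retrieved: skip
        push @visited, $url;
        foreach my $link (extract_links($pages->{$url})) {
            next if $seen{$link}++;             # already queued or visited
            push @queue, $link;
        }
    }
    return @visited;
}

print join("\n", crawl('http://a.example/', \%web)), "\n";
```

The %seen hash is what keeps the robot from looping forever when pages link back to each other, as pages A and B do here.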

After retrieving a page, a robot generally passes information to another program responsible for creating an index database in which every word is related to the pages in which it appears. Searching and indexing words on a page may be accomplished by using one of the following techniques:

Search only on titles and/or headings and/or comments
Search the whole document

In the first case, only the titles, headings, or comments within a page are indexed in the database. This saves time, space, and computational resources but can result in a much poorer index, because even the best page title gives only hints about the page contents. The most powerful search engines use the second technique and index all the text within a page.
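To make the difference between the two techniques concrete, here is a minimal sketch of an index in which every word is related to the pages in which it appears. The two sample pages and the function names build_index and lookup are invented for the illustration; a real indexer would also handle HTML markup, common words, and much larger volumes of data.

```perl
use strict;
use warnings;

# Hypothetical sample pages: URL => [title, body text].
my %pages = (
    '/perl.html' => ['Perl tips',    'Writing CGI scripts in Perl'],
    '/cgi.html'  => ['CGI overview', 'How forms call scripts on a server'],
);

# Build an index mapping each lowercase word to the set of pages containing it.
# $whole_document: true  => index title and body
#                  false => index titles only
sub build_index {
    my ($pages, $whole_document) = @_;
    my %index;
    foreach my $url (keys %$pages) {
        my ($title, $body) = @{ $pages->{$url} };
        my $text = $whole_document ? "$title $body" : $title;
        foreach my $word (split /\W+/, lc $text) {
            next if $word eq '';
            $index{$word}{$url} = 1;
        }
    }
    return \%index;
}

# Return the sorted list of pages indexed under a word.
sub lookup {
    my ($index, $word) = @_;
    my $set = $index->{ lc $word } || {};
    my @urls = sort keys %$set;
    return @urls;
}

my $titles_only = build_index(\%pages, 0);
my $full_text   = build_index(\%pages, 1);

print "titles only, 'scripts': ", join(', ', lookup($titles_only, 'scripts')), "\n";
print "full text,   'scripts': ", join(', ', lookup($full_text,   'scripts')), "\n";
```

With the titles-only index, a search for "scripts" finds nothing, because the word appears only in the body of both pages; the full-text index finds both.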

After an index of documents has been built, the URLs in it must be checked periodically to see whether they are still valid. This is done with another program or robot that takes its input from the URL database and checks the existing references for invalid or moved links.
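One way to picture such a checker is as a loop that sorts the URL list into entries that still answer and entries that need attention. In the sketch below, the status lookup is passed in as a code reference; that design, the name check_links, and the simulated status table are all assumptions made so the example runs without a network connection. A real checker would issue an HTTP request for each URL.

```perl
use strict;
use warnings;

# Split a URL list into still-valid and stale entries.
# $get_status is a caller-supplied code reference returning an HTTP status
# code for a URL; it is an assumption of this sketch, standing in for a
# real HTTP request.
sub check_links {
    my ($urls, $get_status) = @_;
    my (@valid, @stale);
    foreach my $url (@$urls) {
        my $status = $get_status->($url);
        if ($status >= 200 && $status < 300) {
            push @valid, $url;
        } else {
            push @stale, $url;      # moved (3xx), missing (4xx), or failing (5xx)
        }
    }
    return (\@valid, \@stale);
}

# Simulated responses, for the demonstration only.
my %fake_status = (
    'http://a.example/'    => 200,
    'http://b.example/old' => 301,
    'http://c.example/'    => 404,
);
my ($valid, $stale) = check_links(
    [ sort keys %fake_status ],
    sub { $fake_status{ $_[0] } },
);
print "valid: @$valid\n";
print "stale: @$stale\n";
```

Moved links are kept separate from valid ones here so that a later step can decide whether to update the stored URL or drop it.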

Gathering information about documents available on the Internet is one side of a search engine. The final aim is to make this information available to users in such a way that retrieval of relevant documents is as easy and complete as possible.

Searching Interfaces for the Final User

The search interfaces are implemented as Web pages that let a user define what he or she wants to search for. These pages are HTML forms in which the main field accepts words or phrases, and optional secondary fields control how the search is performed or how the results are presented. The form contents are passed to a program on the server as soon as you press the Submit button. These server-side programs are usually implemented by using the CGI specification. They receive the user's input (the search word, the case-sensitivity choice, the maximum number of documents to retrieve, and so on), perform some actions in the background, and send the user an HTML page containing references to the documents found. A CGI application always handles the user's input and produces the output, but it can pass the actual search to another program, such as a gateway or query program for a database. If the index database is not very big, it can be stored in plain files, and a single CGI application can handle the user's input and output as well as the search itself.

So, forms are the user's doors to all the information available on a search engine. As there are lots of search engines, there are also lots of search pages. Fortunately, they all are similar and easy to use.

Being able to use a search engine on the World Wide Web is useful, but if you plan to use more than one, you must connect to each search engine and submit your query to each one separately. Wouldn't it be nice to have your own customized form from which you could submit queries to every major search engine? You could even develop this idea further and submit a query to several search engines at the same time, but then you would have to develop a special script to do the submission and collect the results.

To develop your own search form for your favorite search engines, look at each original form to see what its CGI search program expects as input. You do this by viewing the HTML source of the search form and looking for the <FORM...> </FORM> tags. Different engines may also submit the form with different methods (GET or POST). Because a search query does not alter a database, the GET method is generally used, although some sites prefer POST.

In any event, I recommend that you read the copyright statements or usage policies of each search engine before copying any HTML or invoking any CGI application from other servers. In general, you are allowed to call the CGI applications from custom forms (provided they respect the interface of the CGI application, naturally), but you should always check to make sure.

A global search form is only a collection of different search forms available on each search engine. As an example, we will create a custom form for searches on Yahoo! and Lycos:

First, look at the source of http://www.lycos.com/ and copy the source between <FORM ...> and </FORM> tags, removing unwanted images, links, text, or other unimportant tags:

...
<form action="/cgi-bin/pursuit" method=GET>
<b>Find:</b> <input name="query"><input type=submit value="Go Get It">
<br>
<input type=radio name=ab checked value=the_catalog>lycos catalog
<input type=radio name=ab value=a2z>a2z directory
<input type=radio name=ab value=point> point reviews
</form>
...

Do the same thing for Yahoo! (http://www.yahoo.com/) or other search engines you like:

...
<form action="http://search.yahoo.com/bin/search">
<input size=25 name=p>
<input type=submit value=Search>
</form>
...

Finally, combine both forms on a single HTML page and display it in your browser to check that it works. Because the page will be served from your own machine, a relative ACTION such as /cgi-bin/pursuit must be changed to an absolute URL (http://www.lycos.com/cgi-bin/pursuit). See Figure 11.1 for the final page. You can then customize your page (center tables, fields, and so on) and make it more appealing by integrating some graphics used on the remote search engine (most of them permit reuse of their graphics on a search form, but you should check this first, too). The HTML source code for our global search form follows:

Figure 11.1: The Custom search form.

<html>
<head>
<title>My search form</title>
</head>
<body>
<h1 align=center>My search form</h1>
<p>
<h2 align=center>Lycos</h2>
<form action="http://www.lycos.com/cgi-bin/pursuit" method=GET>
<b>Find:</b> <input name="query"><input type=submit value="Go Get It">
<br>
<input type=radio name=ab checked value=the_catalog>lycos catalog
<input type=radio name=ab value=a2z>a2z directory
<input type=radio name=ab value=point> point reviews
</form>
<p>
<h2 align=center>Yahoo</h2>
<form action="http://search.yahoo.com/bin/search">
<input size=25 name=p>
<input type=submit value=Search>
</form>
</body>
</html>

This form can now sit on your server so that users don't need to connect to the original search engine main form in order to perform information searches on the Internet.
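The "special script" idea mentioned earlier starts with building, for each engine, the URL that its form would produce. The following sketch does only that step, under the assumption that both engines accept GET requests with the field names seen in the forms above (query and ab for Lycos, p for Yahoo!). The names url_encode, search_url, and %engines are invented for the example, and fetching and merging the results is left out.

```perl
use strict;
use warnings;

# Encode a string for use in a URL query: spaces become "+", and every
# character outside a small safe set becomes %XX.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.\- ])/sprintf('%%%02X', ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Engine name => [action URL, name of the query field, fixed extra fields].
# The values are copied from the two forms shown above.
my %engines = (
    Lycos => ['http://www.lycos.com/cgi-bin/pursuit', 'query', 'ab=the_catalog'],
    Yahoo => ['http://search.yahoo.com/bin/search',   'p',     ''],
);

# Build the GET URL that submitting the form would produce.
sub search_url {
    my ($engine, $query) = @_;
    my ($action, $field, $extra) = @{ $engines{$engine} };
    my $url = $action . '?' . $field . '=' . url_encode($query);
    $url .= '&' . $extra if $extra ne '';
    return $url;
}

foreach my $engine (sort keys %engines) {
    print "$engine: ", search_url($engine, 'perl cgi'), "\n";
}
```

Keeping the per-engine details in one table means that adding another engine is a matter of reading its form and adding one line.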

CGI Work in the Background

A search engine, in fact, is made of lots of different programs, each one accomplishing a different task:

An information gatherer (either a robot or a Web interface to receive URLs given by the users)
An index creator or information organizer to catalog information
A Web interface to permit information retrieval

The robot and the index creator or organizer can be independent programs: the former speaks HTTP with Web servers around the world, and the latter catalogs information on local disks. Web interfaces, on the other hand, are coupled with CGI applications that process users' input. A Web interface for URL additions must get data about the URL submitted by a user and pass it to a program that either inserts it immediately in the database or puts it in a queue for later processing by a human or a URL-verifier robot. A Web search interface passes its input to a CGI application that searches the database and sends the results back to the user. The application parameters, which originate as fields on the HTML form, define which information appears on-screen. A URL containing the application call with its parameters is usually visible at the top of a results page:

http://www.lycos.com/cgi-bin/pursuit?query=sams&ab=the_catalog

This URL is a CGI program call that tells Lycos to search for "sams" in "the catalog," one of Lycos' databases. On Infoseek, the CGI call is quite similar (just ignore the parameters you don't understand):

http://guide-p.infoseek.com/Titles?qt=sams&col=WW&sv=IS&lk=frames
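On the receiving side, the CGI application has to split the part of the URL after the question mark back into named fields. The sketch below shows the general idea with two small routines, url_decode and parse_query, whose names are chosen for this example; the White Pages script later in the chapter relies on the cgi-lib.pl library for this job instead.

```perl
use strict;
use warnings;

# Decode one URL-encoded component: "+" means space, %XX is a byte in hex.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Split "name=value&name=value" into a hash of decoded fields.
sub parse_query {
    my ($query) = @_;
    my %field;
    foreach my $pair (split /&/, $query) {
        next if $pair eq '';
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        $field{ url_decode($name) } = url_decode($value);
    }
    return %field;
}

my $call = 'http://www.lycos.com/cgi-bin/pursuit?query=sams&ab=the_catalog';
my ($program, $query) = split /\?/, $call, 2;
my %field = parse_query($query);
print "$_ = $field{$_}\n" foreach sort keys %field;
```

Note that this simple version keeps only the last value when a field name is repeated, which is enough for the search forms shown in this chapter.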

When you click on the Submit button on a search form, you are actually sending your query to the CGI script, either by using the POST or the GET method. Because no updating to the database of URLs will happen when you submit your query, the GET method is generally recommended. POST method submissions are generally reserved for long submissions (with many fields) or for submissions that may alter data on the server. Both methods, however, can be used.

On Lycos, the method used is GET:

<form action="http://www.lycos.com/cgi-bin/pursuit" method=GET>

On Excite, the method used is POST:

<FORM ACTION="http://www.excite.com/search.gw" METHOD=POST>
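Inside the CGI application, the two methods differ mainly in where the form data arrives: with GET it is in the QUERY_STRING environment variable, and with POST it is in the request body, whose size is given by CONTENT_LENGTH. The sketch below shows this under one assumption made for convenience: the environment and the input filehandle are passed in as arguments (normally \%ENV and \*STDIN) so that the routine can be tried without a Web server. The name read_form_data and the field name "search" are invented for the example.

```perl
use strict;
use warnings;

# Return the raw, still URL-encoded form data for the current request.
# $env is a reference to the CGI environment (normally \%ENV) and $in is
# the filehandle carrying the request body (normally \*STDIN).
sub read_form_data {
    my ($env, $in) = @_;
    my $method = uc($env->{REQUEST_METHOD} || 'GET');
    if ($method eq 'POST') {
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $data   = '';
        read($in, $data, $length) if $length > 0;
        return $data;                       # fields arrive in the request body
    }
    my $query = $env->{QUERY_STRING};
    return defined $query ? $query : '';    # fields arrive in the URL after "?"
}

# Simulated GET request, as the Lycos form would send it.
my %get_env = (REQUEST_METHOD => 'GET', QUERY_STRING => 'query=sams&ab=the_catalog');
print "GET:  ", read_form_data(\%get_env, \*STDIN), "\n";

# Simulated POST request; the body is read from an in-memory filehandle.
my $body = 'search=sams';
open(my $fh, '<', \$body) or die "Cannot open in-memory file: $!";
my %post_env = (REQUEST_METHOD => 'POST', CONTENT_LENGTH => length($body));
print "POST: ", read_form_data(\%post_env, $fh), "\n";
close($fh);
```

Either way, the application ends up with the same name=value string, which is why a search program can usually support both methods with very little extra code.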

Your query is received by a CGI application that is responsible for either finding the relevant information or passing the arguments to another custom application on the server that will do this task. This application could be, for example, a relational database gateway or query application. Finally, the result is sent back and displayed in your browser's window. The simplicity of this process hides the power associated with a search engine. Behind the scenes, powerful hardware and software work to find and classify the information in an index database that relates to the words you submitted (your query). At Altavista, for example, a set of three Alpha Servers with 6 GB of RAM and 210 GB of hard disk space can search the current 40 GB database in less than a second! And all this power is available to you from a simple Web page.

Take special care with the search algorithm if you plan to develop your own for a custom index database, and especially if you plan to make it available to everyone on the Internet. Your server could get many hits per day, and the resources used by one invocation of the application are multiplied by the number of users submitting queries. This can rapidly bring your server to its knees.

Developing a Simple CGI for a White Pages Database

An electronic White Pages database is an organized list of e-mail addresses. No list contains every valid e-mail address on the Internet, but some lists already contain many. This section presents a CGI application in Perl that offers users a search interface to such a list. See Listing 11.1 for the source of this script.

Listing 11.1. The White Pages application (a CGI Perl script).

#!/usr/bin/perl
###########################################################################
# wp.pl 1.0 - White Pages search script
#
# Antonio Ferreira
# amcf@esoterica.pt
#
# April 1996
###########################################################################

require '/usr/lib/cgi-lib.pl';

######################### Variables ########################
$url = 'http://www.your_domain.com/cgi-bin/wp.pl';   # White Pages URL
$pathBackground = '/bg.gif';
$cat = '/usr/bin/cat';
$tr = '/usr/bin/tr';
$grep = '/usr/bin/grep';
$email_list = '/usr/local/WWW/Docs/WP/email.list';

########################## Start of main program ##########################
&ReadParse(*input);                  # field=value
print &PrintHeader();                # Content-type: text/html\n\n
if (&MethGet() || defined $input{'goback.x'}) {      # GET
    &InitialForm();                  # ... initial form
} else {                             # POST ... other options
    if (defined $input{'addForm.x'}) { &AddForm(); }
    elsif (defined $input{'addEmail.x'}) { &AddEmail(); }
    elsif (defined $input{'help.x'}) { &Help(); }
    else { &Search(); }
}
exit(0);
########################## End of main program ##########################

#################### Subroutines ###################

##### Initial search form #####
# The markup of this form is missing from the listing as it survives. The
# version below is a minimal reconstruction based on the fields the script
# reads (key, addForm, help); the two image file names are placeholders.
sub InitialForm {
    print <<EOM;
<HTML>
<HEAD>
<TITLE>White Pages</TITLE>
</HEAD>
<BODY BACKGROUND="$pathBackground">
<H1 ALIGN=center>White Pages</H1>
<P>
<FORM ACTION="$url" METHOD=post>
<CENTER>
<B>Search for:</B> <INPUT NAME="key" SIZE=40>
<INPUT TYPE=submit VALUE="Search">
<P>
<INPUT TYPE=image SRC="/Images/WP/addwp.gif" NAME=addForm BORDER=0>
<INPUT TYPE=image SRC="/Images/WP/helpwp.gif" NAME=help BORDER=0>
</CENTER>
</FORM>
</BODY>
</HTML>
EOM
}

##### Form for adding an email address #####
sub AddForm {
    print <<EOM;
<HTML>
<HEAD>
<TITLE>White Pages</TITLE>
</HEAD>
<BODY BACKGROUND="$pathBackground">
<H1 ALIGN=center>Add an email address to the White Pages database</H1>
<P>
<FORM ACTION="$url" METHOD=post>
<CENTER>
<PRE>
<B>   Name:</B> <INPUT NAME="name" SIZE=40>
<B>Company:</B> <INPUT NAME="company" SIZE=40>
<B>  Email:</B> <INPUT NAME="email" SIZE=40>
</PRE>
<P>
<INPUT TYPE=image SRC="/Images/WP/additwp.gif" NAME=addEmail BORDER=0>
<INPUT TYPE=image SRC="/Images/WP/retwp.gif" NAME=goback BORDER=0>
</CENTER>
</FORM>
</BODY>
</HTML>
EOM
}

##### Add email address to the list #####
sub AddEmail {
    if ( index($input{'email'},'@') >= 0 ) {
        if ($input{'company'} eq '') { $coment = ">"; }
        else { $coment = " - ".$input{'company'}.">"; }
        # Entry layout: address <name - company>
        $line = $input{'email'}." <".$input{'name'}.$coment;
        open (LIST,">>$email_list") || die "Cannot open $email_list: $!";
        print LIST ("\n$line");
        close(LIST);
        print <<EOM;
<HTML>
<HEAD>
<TITLE>Email address added</TITLE>
</HEAD>
<BODY BACKGROUND="$pathBackground">
<H1 ALIGN=center>Email address added</H1>
<P>
<FORM ACTION="$url" METHOD=post>
Your email address was included in the White Pages database.
<P>
<INPUT TYPE=image SRC="/Images/WP/retwp.gif" NAME=goback BORDER=0>
</FORM>
</BODY>
</HTML>
EOM
    } else {
        print <<EOM;
<HTML>
<HEAD>
<TITLE>Incorrect email address</TITLE>
</HEAD>
<BODY BACKGROUND="$pathBackground">
<H1 ALIGN=center>Incorrect email address</H1>
<P>
<FORM ACTION="$url" METHOD=post>
The email you entered is incorrect. Please try again.
<P>
<INPUT TYPE=image SRC="/Images/WP/retwp.gif" NAME=goback BORDER=0>
</FORM>
</BODY>
</HTML>
EOM
    }
}

##### Search on the email address list with the key given #####
sub Search {
    $search_key = $input{'key'};
    if ($search_key eq '') {
        @final_list = ("The key must contain at least one character!");
    } else {
        $search_key =~ tr/A-Z/a-z/;  # Convert to lower case
        @key = split(" ",$search_key);
        # Read the whole list, converted to lower case
        @initial_list = `$cat $email_list | $tr 'A-Z' 'a-z'`;
        @final_list = ();
        foreach $i (0 .. $#initial_list) {
            # Keep the entries that contain both keys (an absent second
            # key matches every entry)
            if (index($initial_list[$i],$key[0])>=0 &&
                index($initial_list[$i],$key[1])>=0) {
                # Escape HTML special characters before display
                $initial_list[$i] =~ s/&/&amp;/g;
                $initial_list[$i] =~ s/</&lt;/g;
                $initial_list[$i] =~ s/>/&gt;/g;
                $initial_list[$i] =~ s/\n/<BR>\n/g;
                push(@final_list,$initial_list[$i]);
            }
        }
    }
    if ($#final_list == -1) {
        @final_list = ("There isn't any email address corresponding to the key you gave!");
    }
    # Escape the key as well, because it is echoed back in the page
    $shown_key = $search_key;
    $shown_key =~ s/&/&amp;/g;
    $shown_key =~ s/</&lt;/g;
    $shown_key =~ s/>/&gt;/g;
    print <<EOM;
<HTML>
<HEAD>
<TITLE>Results of the White Pages database search</TITLE>
</HEAD>
<BODY BACKGROUND="$pathBackground">
<H1 ALIGN=center>Results of the White Pages database search</H1>
<P>
<FORM ACTION="$url" METHOD=post>
<B>Search for:</B> $shown_key
<P>
<B>Results:</B>
<HR>
@final_list
<HR>
<INPUT TYPE=image SRC="/Images/WP/retwp.gif" NAME=goback BORDER=0>
</FORM>
</BODY>
</HTML>
EOM
}

##### Shows help page #####
sub Help {
    print <<EOM;
<HTML>
<HEAD>
<TITLE>White Pages - Help</TITLE>
</HEAD>
<BODY BACKGROUND="$pathBackground">
<H1 ALIGN=center>White Pages</H1>
<H2 ALIGN=center><I>Help</I></H2>
<P>
<FORM ACTION="$url" METHOD=post>
<UL>
<LI><B>What is an electronic White Pages center?</B><BR>
It's a list of electronic mail addresses on the Internet.
<P>
<LI><B>How does search work?</B><BR>
The list of email addresses contains the real name of people on the
Internet, along with their email address. You can enter up to two words
for the program to search on the list and to retrieve the entries that
contain both words.
</UL>
<P>
<INPUT TYPE=image SRC="/Images/WP/retwp.gif" NAME=goback BORDER=0>
</FORM>
</BODY>
</HTML>
EOM
}

The script also lets users add e-mail addresses to the database. The database is in reality a plain text file containing one e-mail address per line. Other search engines have more complex databases.

Every search engine, including one as simple as the White Pages database, needs not only a search form but also some way to gather information. In the White Pages database, this is done with a form for adding e-mail addresses and also by scanning newsgroups for new addresses. Lots of people post to newsgroups, and each post contains the address of the sender in the From: line. Thus, a program that sequentially browses all posts and extracts the From: line can rapidly build a good e-mail address list. To do this, you need access to a news server, or you need to copy posts to your server by using a good news reader. The White Pages main program is a Perl script, but we have developed a small shell script that gathers information from newsgroups. It is presented later in this chapter (see Listing 11.2) and presumes you have access to a news server spool saved on a local disk (the script uses only the soc.culture.* hierarchy for performance reasons).

The main Perl script is divided into two parts: the add e-mail function and the search function. When it starts for the first time, the GET method is used, and the initial form is displayed. See Figure 11.2 for the White Pages main form. On all later requests (search, e-mail addition, or help), the POST method is used.

Figure 11.2: The initial White Pages screen.

A user can enter one or two search keys (any additional keys are currently ignored), and the search returns the entries that contain every key entered. Uppercase letters in the search key are converted to lowercase so that the comparison with the list of e-mail addresses is case insensitive:

$search_key =~ tr/A-Z/a-z/; # Convert to lower case
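The same matching rule can be written entirely in Perl, without calling cat and tr, and without the two-key limit. The following is a sketch of that alternative, not the code of Listing 11.1: the names search_list and html_escape are invented here, and the sample entries are made up in the list's "address <name - company>" layout. Unlike the listing, it lowercases only a copy of each entry, so the results keep their original capitalization.

```perl
use strict;
use warnings;

# Escape the characters that have a special meaning in HTML.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Return the entries that contain every word of $search_key, ignoring case.
sub search_list {
    my ($search_key, @entries) = @_;
    my @keys = split ' ', lc $search_key;
    return () unless @keys;
    my @found;
    ENTRY: foreach my $entry (@entries) {
        my $lower = lc $entry;
        foreach my $key (@keys) {
            next ENTRY if index($lower, $key) < 0;   # one key missing: reject
        }
        push @found, $entry;
    }
    return @found;
}

# Invented sample entries in the "address <name - company>" layout.
my @list = (
    'rick@example.com <Rick Astley - Example Records>',
    'bill@example.org <Bill Astley>',
    'ann@example.net <Ann Jones>',
);
print html_escape($_), "\n" foreach search_list('ASTLEY rick', @list);
```

Because index() does a plain substring comparison, a key such as "ast" also matches "Astley"; that is the same behavior as the original script.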

A result page is shown in Figure 11.3.

Figure 11.3: The results from a White Pages search on "astley."

The e-mail addition form lets users enter their own e-mail addresses and include them in the list. Before adding an address to the database, the application verifies that it is in the correct form (that is, that it contains an @ symbol somewhere):

if ( index($input{'email'},'@') >= 0 ) {
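A test that only looks for an @ character accepts some input that is clearly not an address, such as a lone "@". If you want something a little stricter, one possible check is sketched below. The pattern and the name looks_like_email are assumptions of this example, and the rule is intentionally loose: it does not try to implement the full address syntax of Internet mail.

```perl
use strict;
use warnings;

# Accept an address only if it has exactly one "@", at least one character
# before it, a dot somewhere after it, and no spaces or angle brackets.
sub looks_like_email {
    my ($address) = @_;
    return $address =~ /^[^\s<>\@]+\@[^\s<>\@]+\.[^\s<>\@]+$/ ? 1 : 0;
}

foreach my $candidate ('amcf@esoterica.pt', '@', 'no-at-sign', 'a b@example.com') {
    printf "%-20s %s\n", $candidate,
        looks_like_email($candidate) ? 'accepted' : 'rejected';
}
```

Rejecting angle brackets has a second benefit in this application, because the list format itself uses < and > to delimit the name.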

Listing 11.2 shows the newsgroups e-mail address gatherer.

Listing 11.2. The newsgroups e-mail address gatherer (shell script) using the soc.culture.* hierarchy on a news server spool directory.

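#!/bin/sh
#
# amcf@esoterica.pt, 1996
#
# Collect the From: lines of every article in the soc.culture.* spool
for f in `find /usr/spool/news/soc/culture -depth -type f`
do
    # Anchor the pattern so that only lines starting with From: match
    grep "^From:" "$f" 2> /dev/null >> from.list
done
# Keep the text after the first colon and append it to the list
cut -f2 -d: from.list >> email.list
# Sort the list and remove duplicate entries
sort -b email.list | uniq > email.list.tmp
mv email.list.tmp email.list
rm from.list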

The email.list file should be kept in a directory on your Web server that the search script can read; its location must match the $email_list variable in Listing 11.1.
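The same gathering step could also be written in Perl, which makes it easier to look only at the article header and to store each sender in the "address <name>" layout that the add e-mail function uses. The sketch below works on article text held in memory; the names sender_entry and gather and the sample articles are invented, and only the two most common From: styles are recognized.

```perl
use strict;
use warnings;

# Extract the sender from one news article and return it in the list's
# "address <name>" layout, or undef if no usable From: header is found.
# Only the header block (everything before the first empty line) is examined.
sub sender_entry {
    my ($article) = @_;
    my ($header) = split /\n\n/, $article, 2;
    return undef unless defined $header && $header =~ /^From:[ \t]*(.+)$/m;
    my $from = $1;
    if ($from =~ /^(\S+\@\S+)\s+\((.*)\)\s*$/) {          # user@host (Real Name)
        return "$1 <$2>";
    }
    if ($from =~ /^"?([^"<]*?)"?\s*<(\S+\@\S+)>\s*$/) {   # Real Name <user@host>
        return "$2 <$1>";
    }
    if ($from =~ /^(\S+\@\S+)\s*$/) {                     # bare address
        return "$1 <>";
    }
    return undef;
}

# Collect unique entries from a set of articles, in sorted order.
sub gather {
    my (@articles) = @_;
    my %unique;
    foreach my $article (@articles) {
        my $entry = sender_entry($article);
        $unique{$entry} = 1 if defined $entry;
    }
    my @entries = sort keys %unique;
    return @entries;
}

# Invented sample articles.
my @articles = (
    "From: rick\@example.com (Rick Astley)\nSubject: Hello\n\nFrom: is quoted in the body\n",
    "Subject: Re: Hello\nFrom: Ann Jones <ann\@example.net>\n\nText\n",
    "From: rick\@example.com (Rick Astley)\nSubject: Again\n\nMore text\n",
);
print "$_\n" foreach gather(@articles);
```

To run it against a real spool, you would read each article file into a string and pass the strings to gather; that part is left out here.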

Future Improvements

The White Pages database could be improved in several ways:

Addition of a description phrase to each e-mail address, along with other information, such as workplace, country (most of the time it can be guessed from the top-level domain), and so on.
Improvement of the e-mail addition form in order to let people submit a photo (indicated by a URL) to put next to their e-mail address.
Automatic e-mail sent to each user added to the database to inform them of the addition and possibly to detect bad e-mail addresses (if the mail is returned).
Better search form, letting users enter not only search keys but also Boolean operators, for example, as in "astley AND NOT bill."
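As a starting point for the last improvement, here is a rough sketch of how a query such as "astley AND NOT bill" could be evaluated against the list. It is only one possible design: it understands AND and NOT, ignores OR and parentheses, and uses invented names (parse_boolean, boolean_search) and invented sample entries.

```perl
use strict;
use warnings;

# Parse a query such as "astley AND NOT bill" into two word lists:
# words that must appear and words that must not appear.
sub parse_boolean {
    my ($query) = @_;
    my (@required, @excluded);
    foreach my $term (split /\s+AND\s+/i, $query) {
        $term =~ s/^\s+|\s+$//g;                # trim surrounding spaces
        next if $term eq '';
        if ($term =~ /^NOT\s+(.+)$/i) {
            push @excluded, lc $1;
        } else {
            push @required, lc $term;
        }
    }
    return (\@required, \@excluded);
}

# Return the entries matching every required word and no excluded word.
sub boolean_search {
    my ($query, @entries) = @_;
    my ($required, $excluded) = parse_boolean($query);
    return () unless @$required || @$excluded;
    my @found;
    ENTRY: foreach my $entry (@entries) {
        my $lower = lc $entry;
        foreach my $word (@$required) {
            next ENTRY if index($lower, $word) < 0;
        }
        foreach my $word (@$excluded) {
            next ENTRY if index($lower, $word) >= 0;
        }
        push @found, $entry;
    }
    return @found;
}

# Invented sample entries.
my @list = (
    'rick@example.com <Rick Astley>',
    'bill@example.org <Bill Astley>',
    'ann@example.net <Ann Jones>',
);
print "$_\n" foreach boolean_search('astley AND NOT bill', @list);
```

Separating the parsing from the matching keeps the door open for adding OR later without touching the loop that scans the list.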

Feel free to use the existing White Pages Perl script code and improve it to fit your needs.

For a general information retrieval and organization system, check out Harvest (http://harvest.cs.colorado.edu/), a valuable tool that can help you build a database of references to information on your server or on other servers, and that can also be used as a cache between client applications and servers (a Web browser and a Web server, for example).

Search engines on the Web have existed for some years and are now indispensable tools for information retrieval. A manual search of the Web or the Internet for a specific topic is unimaginable at a time when terabytes of data flow around the world. As the Web grows, search engines must also grow in both raw power and search and selection capabilities. More powerful servers can (and will) be used, but we also expect improvements in the quality of search algorithms, along with improved search forms (for the use of natural language in queries).

Summary

This chapter surveyed the major search engines on the World Wide Web as well as their search and presentation techniques. As you have seen, most of the work these engines do is accomplished with the help of CGI scripts.

As an example of a simple search engine, we developed the White Pages database. It maintains a list of e-mail addresses that you can search for a person by entering a search key (the person's name, or part of it) in the White Pages main form.
