With the vast amount of information available through the Web, finding exactly what you want can be a major problem. Enter the search engine, a site dedicated to cataloging the information stored in and around Cyberspace. If it weren't for such "library" sites, the Web would be a massive collection of Cyber-eddies and backwaters, some of which you'd never find your way into or out of.
With the number of search engines on the Web increasing almost as quickly as the number of sites, you need a search engine that searches the search engines just to bring things back under control.
All search engines, such as Yahoo!, shown in figure 23.1, rely on a similar interface: you fill out a form and submit it. What happens behind the scenes is also pretty standard.
How the form interfaces with the engine varies, but they all use the same technique: The various configuration options for the engine are passed in as query parameters. In short, each engine uses a form something like this:
<FORM METHOD=GET ACTION="script"> <!-- fields --> </FORM>
where script is the engine-specific script or program that's executed when the form is submitted.
Fields are the form fields the engine needs in order to perform the search. These include the text to search for and (optionally) additional fields that control how the search is to be performed, such as "match exactly" or "search UseNet." It's quite common for several of these fields to be selection lists or hidden fields in order to force a particular search type.
If you know what fields a particular engine requires, what the expected values are, and what the ACTION attribute of the <FORM> tag needs to be, you can create your own front-end for individual engines.
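To make the mechanics concrete, here is a minimal sketch of what happens when a METHOD=GET form is submitted: each field name and value is appended to the ACTION URL as a query string. The helper name buildQueryUrl, the example.com address, and the sample field names are illustrative assumptions, not part of any engine's interface.

```javascript
// Hypothetical helper: mimic how a METHOD=GET form turns its fields
// into a query string appended to the ACTION URL.
function buildQueryUrl(action, fields) {
  var pairs = [];
  for (var i = 0; i < fields.length; i++) {
    // Each field becomes name=value, with special characters encoded.
    // (A browser submitting a form writes a space as "+"; this sketch
    // uses %20, which is the other common way to encode a space.)
    pairs.push(encodeURIComponent(fields[i].name) + "=" +
               encodeURIComponent(fields[i].value));
  }
  return action + "?" + pairs.join("&");
}

// Example: a text field plus a hidden field that forces a search type.
var url = buildQueryUrl("http://www.example.com/cgi-bin/search", [
  { name: "q", value: "java script" },
  { name: "mode", value: "exact" }
]);
console.log(url);
// http://www.example.com/cgi-bin/search?q=java%20script&mode=exact
```

Anything a form can send with GET can therefore also be expressed as a single URL, which is the property the rest of this chapter relies on.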
Therefore, the first step in building a front-end is to determine how to talk to each engine you want to support. Even though most online engines don't print a simple document detailing their interface, you can still quickly figure out what you need by resorting to "Rule 1" of the Web scripter's code:
"When learning how to do something, look over the shoulder
of someone who's already figured it out."
TIP |
Probably the easiest way to figure out what parameters are used by which engines is to "creatively adapt" the interface of another site that's already incorporating what you're looking for. An excellent place to start is search.com from CNET, Inc. (http://www.search.com/). Claiming 250+ search engines under one roof, it has the most extensive listings of search interfaces of any site on the Web. |
To help cut down the learning curve, the following sections look at the form parameters for several of the most popular search engines. What's covered next is just the tip of the iceberg as far as online engines go, but it should serve as a good starting point for designing your own search interface. While many of the engines covered have more form fields than you'll find here, what's listed below are the minimum fields you need to set for executing a search.
Alta Vista (http://www.altavista.digital.com/) is one of
the most popular search
engines on the Web today. Claiming to have indexed over 30 million
pages from almost 270,000 sites, it may well be the most
comprehensive look at Cyberspace under one roof. The <FORM>
tag required to connect to Alta Vista is:
<FORM method=GET action="http://www.altavista.digital.com/cgi-bin/query">
and table 23.1 shows some of the form fields it supports.
Field Name | Field Value |
q | Specifies what's being searched for (as entered by the user). |
pg | Set this to "q." |
what | Controls what is being searched. Possible values are: "Web" (search the Web) and "news" (search UseNet). |
fmt | Controls formatting of search results. Possible values are: "." (standard form), "c" (compact form), and "d" (detailed form). |
Excite (http://www.excite.com/), developed by Architext Software, is an incredibly fast, full-text search engine. The engine is available for a variety of platforms, and can also be found on the companion CD-ROM. From within your own HTML pages, the <FORM> tag to connect to Excite would look like:
<FORM METHOD=POST ACTION="http://www.excite.com/search.gw">
Excite supports several form fields to control the type and breadth of its search, but the only three you need to hook into the engine are:
HotBot (http://www.hotbot.com/) is hosted by HotWired, the "hip" magazine, eZine, and netizen hangout. Based on Inktomi, it takes the following <FORM> tag to access:
<FORM METHOD=GET ACTION="http://www.hotbot.com/search.html" NAME=HSQ>
and table 23.2 details the form fields.
Field Name | Field Value |
MT | Specifies what's being searched for (as entered by the user). |
_v | A version tracking number used by HotBot. Set this to "1.0." |
SM | Defines the kind of search matching desired. Possible values are: "MC" (match all words), "SC" (match any of the words), "phrase" (match the phrase), "name" (find the person), and "url" (find the URL). |
InfoSeek Guide (http://www.infoseek.com/) is a comprehensive and accurate Web list of cataloged and reviewed sites, as well as a subset of a larger, Net-wide commercial service. The <FORM> tag to access InfoSeek looks like:
<FORM METHOD=GET ACTION="http://guide-p.infoseek.com/Titles">
and the necessary form fields are included in table 23.3.
Field Name | Field Value |
qt | Holds what you're searching for, as entered by the user. |
col | Identifies the collection to search. Normally displayed as a <SELECT> tag, possible values are: "WW" (World Wide Web), "WW,cat_+" (Infoseek Select Sites), "NN" (UseNet newsgroups), "CT" (company directory), "EM" (e-mail addresses), "NW" (Timely News), and "FQ" (Web FAQs). |
sv | A tracking field used by InfoSeek. Set this to "A2." |
NOTE |
To make it easier for Web masters to integrate Infoseek into their sites, Infoseek offers a Web Kit where you can select the options you want to use to access the search engine, and they generate the HTML for you. You can access the Web Kit from Infoseek's site (http://www.infoseek.com/). |
Lycos (http://www.lycos.com/) is one of the "granddaddies" of the Web search world, and its age has given it the opportunity to refine its interface. The <FORM> tag to connect to Lycos looks like:
<FORM METHOD=GET ACTION="http://www.lycos.com/cgi-bin/pursuit">
and only one form field is necessary: query, which holds what you're searching for (this is the field name used in listing 23.2).
For software junkies, shareware.com (http://www.shareware.com) is a virtual Mecca (pardon the pun) on the Internet. Originally called the Virtual Shareware Library (VSL) before CNET took over the job of maintaining it, shareware.com boasts a library of over 210,000 files, covering shareware, freeware, demos, games, drivers, and updates for almost every operating system. The <FORM> tag to connect to shareware.com looks like:
<form action="http://search.shareware.com/code/engine/Find">
Because of the vastness of shareware.com's library, its search
engine has more control fields than most, as detailed in table
23.4.
Field Name | Field Value |
search | Specifies what's being searched for (as entered by the user). |
and | Can hold a second search term (also entered by the user). |
logop | Defines how the "search" and "and" fields are related. Possible values are: "or" (match either), and "and" (match both). |
hits | Controls the number of hits to display on a page (such as 25). |
frame | Controls whether the search is returned in framed or nonframed mode. This is most often set to "none." |
cfrom | Defines the type of search. This is normally set to "quick." |
orfile | Almost always set to "True." |
category | Controls what part of the shareware database is searched. Possible values are: "MS-Windows," "MS-Windows3.x," "MS-Windows95," "MS-WindowsNT," "Macintosh," "DOS," "OS2," "PC-Games," "UNIX," "Novell-Netware," "Amiga," "Atari," "Source-Code," and "All-Categories." |
WebCrawler (http://www.webcrawler.com/) is America Online's offering to the Web search world. To connect to WebCrawler, you use the following <FORM> tag:
<FORM METHOD=POST ACTION="http://webcrawler.com/cgi-bin/WebQuery">
and define the following form fields as shown in table 23.5.
Field Name | Field Value |
searchtext | Holds what you're searching for (as entered by the user). |
andOr | Controls how the words in the search string are to be treated. Possible values are: "all" (match all words), and "any" (match any word). |
maxHits | Defines the number of hits to return per page. Common values are 10, 25, or 100. |
Web legend Yahoo! (http://www.yahoo.com/) searches its own descriptions of sites (with their URLs and titles). Connecting to Yahoo! requires the following <FORM> tag:
<FORM METHOD=GET action="http://search.yahoo.com/bin/search">
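and (like Lycos) only one form field: p, which holds what you're searching for (as in the URL http://search.yahoo.com/bin/search?p=JavaScript).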
NOTE |
For the trivia prone, Yahoo! is often described as an acronym for "Yet Another Hierarchical Officious Oracle." |
As you can see, each engine is a little different. For most purposes, you'll probably not be putting every single engine on one search front-end form. However, even though they differ, they all require the user to fill in one field: what to search for. With this information, and a little arbitration (because you can always "hard code" the other necessary fields for an engine to particular settings), you can connect several sites together through one document.
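One way to picture this "arbitration" is as a small lookup table: for each engine, record the ACTION URL, the name of the field that carries the search text, and the fields you've decided to hard code. The sketch below is only an illustration. The field names and values come from tables 23.1 through 23.3 (with "web" written in lowercase, as in listing 23.2), while the ENGINES structure, the searchUrl name, and the particular hard-coded choices are assumptions.

```javascript
// Illustrative engine table. Field names and values are taken from
// tables 23.1 through 23.3; the layout and names here are assumptions.
var ENGINES = {
  AltaVista: {
    action: "http://www.altavista.digital.com/cgi-bin/query",
    termField: "q",
    fixed: { pg: "q", what: "web" }
  },
  HotBot: {
    action: "http://www.hotbot.com/search.html",
    termField: "MT",
    fixed: { _v: "1.0", SM: "MC" }
  },
  InfoSeek: {
    action: "http://guide-p.infoseek.com/Titles",
    termField: "qt",
    fixed: { col: "WW", sv: "A2" }
  }
};

// Build the full search URL: hard-coded fields first, then the one
// field the user actually fills in.
function searchUrl(engineName, term) {
  var engine = ENGINES[engineName];
  if (!engine) {
    throw new Error("Unknown engine: " + engineName);
  }
  var pairs = [];
  for (var name in engine.fixed) {
    pairs.push(name + "=" + encodeURIComponent(engine.fixed[name]));
  }
  pairs.push(engine.termField + "=" + encodeURIComponent(term));
  return engine.action + "?" + pairs.join("&");
}

console.log(searchUrl("HotBot", "search engines"));
// http://www.hotbot.com/search.html?_v=1.0&SM=MC&MT=search%20engines
```

Keeping the per-engine differences in data rather than in code means that supporting another GET-based engine is a matter of adding one more entry to the table.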
What's needed is a generic form that can be set to post to whatever search engine is desired, which creates a problem of its own. JavaScript is the obvious candidate for such manipulation (although this can easily be done through Perl as well), but JavaScript doesn't permit the ACTION attribute of a form to be modified from within the script code.
However, this turns out to be easily overcome, because:
Put these three pieces together, and you have the basis for a "multi-directional" form. Listing 23.1 is an example of such a form configuration.
Listing 23.1 A Basic Multi-Engine Search Form
<FORM method=POST action="">
<B>Search for: </B><INPUT TYPE=TEXT NAME="SearchTerm" VALUE="" SIZE=30><BR>
<B>Search on: </B>
<SELECT NAME="Engine">
<OPTION>AltaVista
<OPTION>Excite
<OPTION>InfoSeek
<OPTION>Lycos
<OPTION>WebCrawler
<OPTION>Yahoo
</SELECT>
<INPUT TYPE=BUTTON VALUE="Search!" onClick="Search(this.form)">
</FORM>
The first thing you should notice about this form is that there is no Submit button, meaning that the form itself is never submitted. However, the Search! button (when clicked) fires its onClick event. In other words, the form has become a kind of "local-submission" form, where the browser, not the server, performs all the work.
The "work" performed by the onClick event itself is straightforward: Construct the proper query, then load the new "page" using the query information. Loading a new page in JavaScript is done by setting the href property of the window's location object to the URL of the new document, as in:
window.location.href = "http://search.yahoo.com/bin/search?p=JavaScript";
which would call up Yahoo! and search for the term JavaScript,
returning the result as the window's new document.
NOTE |
Remember, a URL doesn't have to be just a domain, path, and file name. Query strings, for search engines and scripts, and hash strings, for referencing anchor tags, are also valid parts of the URL definition. |
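To see those parts pulled apart, the sketch below uses the URL class, which is an assumption on two counts: it was added to JavaScript long after the browsers this chapter targets, and the #results anchor is a made-up example rather than something Yahoo! defines.

```javascript
// Split a URL into the pieces named in the note above.
// "#results" is a hypothetical anchor added for illustration.
var parts = new URL("http://search.yahoo.com/bin/search?p=JavaScript#results");

console.log(parts.hostname);              // "search.yahoo.com" (the domain)
console.log(parts.pathname);              // "/bin/search" (path and file name)
console.log(parts.search);                // "?p=JavaScript" (the query string)
console.log(parts.hash);                  // "#results" (the hash string)
console.log(parts.searchParams.get("p")); // "JavaScript"
```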
Once the user has typed in a search string, selected a particular engine, and clicked the Search! button, it's the job of your JavaScript code to manipulate the user's search request into the correct format for the given search engine. As mentioned before, this involves building a new URL consisting of the domain, path, file (the search program), and one or more query terms. Because each engine varies a bit in the names of the form fields it recognizes (and in the number of additional fields required to successfully complete the search), an easy way to construct the new URL is to break it into two parts:
Listing 23.2 is an example of such a Search() function, building the correct URL according to the engine selected as specified by the form in listing 23.1.
Listing 23.2 The Search() Function
function initArray() {
   // Copy each argument into a numbered property, starting at index 1.
   this.length = initArray.arguments.length;
   for(var i = 0; i < this.length; i++) {
      this[i+1] = initArray.arguments[i];
   }
}

// One URL prefix per engine, in the same order as the <OPTION> tags.
var Engines = new initArray (
   "http://altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=&q=",
   "http://www.excite.com/search.gw?search=",
   "http://guide-p.infoseek.com/Titles?sv=A2;col=WW;qt=",
   "http://www.lycos.com/cgi-bin/pursuit?query=",
   "http://query.webcrawler.com/cgi-bin/WebQuery?text=",
   "http://search.yahoo.com/bin/search?p="
);

function Search(form) {
   // Encode the search term, then append it to the selected engine's URL.
   var term = escape(form.SearchTerm.value);
   var eng = parseInt(form.Engine.selectedIndex);
   window.location.href = Engines[eng+1] + term;
}
A couple of things are worth noting about this code fragment. First, it creates an array of search engine URLs using an array-generation trick slightly different from what you've seen in previous chapters. Instead of allocating an array of the appropriate size and then filling it with data, the initArray() function combines both operations into a single step. Because JavaScript functions have an associated arguments array that contains all the function's parameters, it's easy to query the array for the number of strings defined.
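The arguments trick can be tried in isolation. The sketch below uses a hypothetical constructor name, OneBasedList, and made-up data; it follows the same pattern as initArray() and then shows, for comparison, the array literal that later versions of JavaScript provide.

```javascript
// Same idea as initArray(): copy every argument into a numbered
// property starting at 1, and record how many there were.
function OneBasedList() {
  this.length = arguments.length;
  for (var i = 0; i < arguments.length; i++) {
    this[i + 1] = arguments[i];
  }
}

var colors = new OneBasedList("red", "green", "blue");
console.log(colors.length); // 3
console.log(colors[1]);     // "red"
console.log(colors[3]);     // "blue"

// An array literal does the same job in later versions of the
// language, but its indexes start at 0 rather than 1.
var modern = ["red", "green", "blue"];
console.log(modern[0]);     // "red"
```

Because the list is numbered from 1 while a selection list's selectedIndex is numbered from 0, code that looks up an entry has to add 1 to the index.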
Second, the search term, as entered by the user, is fed through JavaScript's escape() function before it's tacked onto the URL. This converts any special characters, such as spaces, to %XX format: a percent sign followed by a two-digit hexadecimal representation of the character (%20 for a space, for example). The encoding is necessary to make certain the URL is processed properly.
NOTE |
In order to be as cross-platform compatible as possible, the Web relies on common delimiters to identify when various fields and data stop and start. The most common delimiter is the space. Therefore, if a URL contains embedded spaces, such as the space separating a sequence of search terms, it's necessary to encode the spaces so they aren't mistaken for end-of-URL markers. |
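A quick experiment shows what the encoding looks like in practice. The sample search term is made up, and encodeURIComponent() is shown only for comparison, on the assumption that you're running a newer JavaScript implementation; it isn't available in the browsers this chapter targets.

```javascript
var term = "C++ & Java tips";

// escape() is the function used in listing 23.2. It encodes the
// spaces and the "&", but leaves "+" untouched.
var viaEscape = escape(term);

// encodeURIComponent() is a later addition to the language that also
// encodes "+".
var viaComponent = encodeURIComponent(term);

console.log(viaEscape);    // "C++%20%26%20Java%20tips"
console.log(viaComponent); // "C%2B%2B%20%26%20Java%20tips"
```

The difference matters because many server-side programs read a literal "+" in a query string as a space, so a term such as "C++" may arrive altered when only escape() is used.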
Finally, once the URL has been constructed, setting the href property of the window's location object executes the search, having the same results as if the user had typed in the URL in the browser's location window.
While JavaScript is the most likely candidate for creating a search engine front-end, you can also achieve the same results through Perl. The major differences between a server-side and client-side interface are:
Listing 23.3 is an example of a Perl version of the search interface.
Listing 23.3 Interfacing to a Search Engine through Perl
#!/usr/local/bin/perl

$CRLF = "\r\n"; # '\r\n' for UNIX, '\n' for NT/95

@Engines = ( "AltaVista", "Excite", "InfoSeek", "Lycos", "WebCrawler", "Yahoo" );

@EngineURLs = (
   "http://altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=&q=",
   "http://www.excite.com/search.gw?search=",
   "http://guide-p.infoseek.com/Titles?sv=A2;col=WW;qt=",
   "http://www.lycos.com/cgi-bin/pursuit?query=",
   "http://query.webcrawler.com/cgi-bin/WebQuery?text=",
   "http://search.yahoo.com/bin/search?p="
);

# Read the POSTed form data and split it into name/value pairs.
if ($ENV{'REQUEST_METHOD'} eq 'POST') {
   read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
   @pairs = split(/&/, $buffer);
   foreach $pair (@pairs) {
      ($name, $value) = split(/=/, $pair, 2);
      $tname = $name;
      $contents{$name} = $value;
   }
}

# Find the selected engine (the form field is named "Engine") and
# redirect the browser to the matching search URL.
$numEngines = @Engines;
for($i=0; $i<$numEngines; $i++) {
   if($contents{'Engine'} =~ /$Engines[$i]/) {
      $term = $contents{'SearchTerm'};
      $term =~ s/ /%20/g;
      print "Location: $EngineURLs[$i]$term",$CRLF,$CRLF;
      exit;
   }
}
Working with two arrays (one whose contents match the <OPTION> tags from the form, the other with the corresponding URLs), the script searches for a match based on the engine the user has selected. Once found, the search term is extracted and encoded. The line:
$term =~ s/ /%20/g;
performs a simple substitution throughout the search term, encoding any embedded spaces into %XX format. After that, the Location: header line instructs the browser to load a new document from the constructed URL.
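One detail worth keeping in mind on the server side is that posted form data normally arrives already encoded, with spaces sent as "+" and other special characters as %XX. Listing 23.3 passes the value through largely as received. If you'd rather decode the fields first and then encode the term yourself, the logic looks roughly like the sketch below. It is written in JavaScript only to stay consistent with the rest of the chapter's examples; the function names and the sample request body are assumptions.

```javascript
// Decode one form-encoded name or value: "+" means space, %XX is a
// hexadecimal character code.
function decodeField(text) {
  return decodeURIComponent(text.replace(/\+/g, " "));
}

// Split a posted body such as "a=1&b=2" into a name/value lookup,
// mirroring the split on "&" and "=" in listing 23.3.
function parseFormBody(body) {
  var contents = {};
  var pairs = body.split("&");
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf("=");
    if (eq < 0) {
      continue; // skip anything that isn't a name=value pair
    }
    var name = decodeField(pairs[i].substring(0, eq));
    contents[name] = decodeField(pairs[i].substring(eq + 1));
  }
  return contents;
}

// Hypothetical request body, as a browser might send it.
var contents = parseFormBody("SearchTerm=java+script%21&Engine=Yahoo");
var redirectTerm = encodeURIComponent(contents.SearchTerm);

console.log(contents.Engine);     // "Yahoo"
console.log(contents.SearchTerm); // "java script!"
console.log(redirectTerm);        // "java%20script!"
```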
This chapter demonstrates how to use scripting to create a "common" interface to a collection of different Web search engines, each with its own parameters and settings, in a manner that appears totally transparent to the user. The trick lies in hiding all the differences between the various engines within the form you display and using a bit of scripting to fill in the gaps.