Until now, this book has focused on content. Part I presented static content: pages were served up exactly as they were written. Part II introduced dynamic content: pages could do things at runtime, based on the CGI scripts behind them. Parts III and IV introduced specific value-added dynamic content, including specialized forms and chat scripts.
Part V focuses on new kinds of value that you can add to your Web site, to make it a full-fledged service. This chapter addresses ways to make the site easier to navigate, by adding indexes and search engines to help visitors find their way around the site. Chapter 17, "How to Keep Portions of the Site Private," shows how to add value by denying access in certain situations. Chapter 18, "How to Query Databases," expands on concepts from this chapter, and shows how to provide access to large databases.
This chapter discusses three themes related to site searches: brute-force searches of a Webmaster-built text file (HTGREP), keyword indexes built from information embedded in each page (site-idx.pl and ALIWEB), and full-index systems such as WAIS and its many descendants.
HTGREP is the simplest of the three types of programs examined in this chapter but is still quite powerful. HTGREP was developed by Oscar Nierstrasz at the University of Berne, Switzerland.
With HTGREP, the Webmaster does the indexing. The search is a brute-force search through a text file, and the results are sent back to the user through a filter that can write HTML on the fly.
For our purposes, think of HTGREP as having four parts: a text file that serves as the database, a wrapper script (htgrep.cgi) that sets the search options, the search engine itself (htgrep.pl), and a back-end filter that turns matching records into HTML.
Here's an example of HTGREP from Nikka Galleria, an online art gallery at http://www.dse.com/nikka/. At any given time, several works are available for purchase. Visitors to the site can find something close to what they're looking for, and then use HTGREP to search the site for similar works.
Here's a portion of the text file that serves as the Nikka database.
#K keywords
#U URL associated with following item
#I image:alt-tag
#T title
#A artist
#S size
#M medium
#P price
#SC size code
#PC price code

K=
U=/nikka/Talent/Works/Crane/tattooEye/tattooEye.shtml
I=/nikka/Talent/Works/Crane/tattooEye/tattooEyeT.gif:Tattoo Stone Eye
U=/nikka/Talent/Works/Crane/tattooEye/tattooEye.shtml
T=Tattoo Stone Eye
U=/nikka/Talent/Painting/3.Crane.shtml
A=Dempsey Crane
M=Mixed Media
S=13 3/4 by 20 inches
SC=Small
P=$2,350
PC=501to2500

K=
U=/nikka/Talent/Works/Crane/earthTribe/earthTribe.shtml
I=/nikka/Talent/Works/Crane/earthTribe/earthTribeT.gif:EarthTribe
U=/nikka/Talent/Works/Crane/earthTribe/earthTribe.shtml
T=Earth Tribe
U=/nikka/Talent/Painting/3.Crane.shtml
A=Dempsey Crane
M=Mixed Media
S=14 by 9 3/4 inches
SC=Small
P=$4,500
PC=Above2500

K=
U=/nikka/Talent/Works/Crane/snakeDance/snakeDance.shtml
I=/nikka/Talent/Works/Crane/snakeDance/snakeDanceT.gif:Tattoo Stone Snake
U=/nikka/Talent/Works/Crane/snakeDance/snakeDance.shtml
T=Tattoo Stone Snakes
U=/nikka/Talent/Painting/3.Crane.shtml
A=Dempsey Crane
S=20 by 16 inches
SC=Small
M=Mixed Media Original
P=$2,000
PC=501to2500

K=
U=/nikka/Talent/Works/Strain/gettysburg/gettysburg.shtml
I=/nikka/Talent/Works/Strain/gettysburg/gettysburgT.gif:On To Gettysburg
U=/nikka/Talent/Works/Strain/gettysburg/gettysburg.shtml
T=On To Gettysburg
U=/nikka/Talent/Painting/6.Strain.shtml
A=John Paul Strain
M=Limited Edition Print (1,400 S/N)
S=19 3/4 inches by 27 inches
SC=Medium
P=$165
PC=101to500
Note: Items with a pound sign (#) in front of them are comments.
The design of the data file is up to the Webmaster. In this case, the Nikka designer has chosen to make the records paragraph-sized, with an extra line between paragraphs to separate records. Each line is flagged with an identifier that's used by the custom back-end to produce the proper HTML.
The search engine for HTGREP is contained in the file htgrep.pl, which is called from the wrapper script htgrep.cgi. Many installers customize the wrapper so that not all of HTGREP's options are available to the user. HTGREP comes with a demo form that gives the user access to most of its options. This form is shown in Figure 16.1.
Here is the customizable portion of the wrapper. Before the line &htgrep'doit; in the htgrep.cgi wrapper, insert the following lines:
$htgrep'tags{'file'} = "/nikka/works.txt";
$htgrep'tags{'boolean'} = "auto";
$htgrep'tags{'style'} = "none";
$htgrep'tags{'max'} = "250";
$htgrep'tags{'filter'} = "nikka";
The first section restricts the user choices. In this case, the designer has locked the choices as follows: the search always runs against the file /nikka/works.txt, Boolean handling is set to auto, no special output style is applied, at most 250 records are returned, and the results are formatted by the custom nikka filter.
Consider the next set of lines:
# Beat QUERY_STRING into suitable format
$_ = $ENV{'QUERY_STRING'};
# Ignore things we don't care about
# MacWeb style
s/&.*=Don%27t%20Care//g;
s/^.*=Don%27t%20Care//g;
# Netscape style
s/&.*=Don%27t\+Care//g;
s/^.*=Don%27t\+Care//g;
s/\+/ /g;
s/&/ /g;
These lines massage the query string. The script finds each occurrence of the string "Don't care" and removes the corresponding search criterion. This step allows the form to have lines like:
<INPUT TYPE=RADIO NAME=A VALUE="Don't Care">

# load the string into the tags array
$htgrep'tags{'isindex'} = $_;
Finally, the modified query string is loaded into the tags array for use by the search engine.
Suppose that the search begins with the form shown in Figure 16.2.
Figure 16.2: The visitor uses the search form to assemble a combination of search components.
If the user selects artist Dempsey Crane, then the query string is the following:
A=Dempsey+Crane&SC=Don%27t+Care&PC=Don%27t+Care
Once the incoming filter has run, the actual query presented to the database is this:
A=Dempsey Crane
Once the search has run, the search engine returns all the records that match the query string. Since each of Mr. Crane's works contains that exact line, we get back the following:
K=
U=/nikka/Talent/Works/Crane/tattooEye/tattooEye.shtml
I=/nikka/Talent/Works/Crane/tattooEye/tattooEyeT.gif:Tattoo Stone Eye
U=/nikka/Talent/Works/Crane/tattooEye/tattooEye.shtml
T=Tattoo Stone Eye
U=/nikka/Talent/Painting/3.Crane.shtml
A=Dempsey Crane
M=Mixed Media
S=13 3/4 by 20 inches
SC=Small
P=$2,350
PC=501to2500

K=
U=/nikka/Talent/Works/Crane/earthTribe/earthTribe.shtml
I=/nikka/Talent/Works/Crane/earthTribe/earthTribeT.gif:EarthTribe
U=/nikka/Talent/Works/Crane/earthTribe/earthTribe.shtml
T=Earth Tribe
U=/nikka/Talent/Painting/3.Crane.shtml
A=Dempsey Crane
M=Mixed Media
S=14 by 9 3/4 inches
SC=Small
P=$4,500
PC=Above2500

K=
U=/nikka/Talent/Works/Crane/snakeDance/snakeDance.shtml
I=/nikka/Talent/Works/Crane/snakeDance/snakeDanceT.gif:Tattoo Stone Snake
U=/nikka/Talent/Works/Crane/snakeDance/snakeDance.shtml
T=Tattoo Stone Snakes
U=/nikka/Talent/Painting/3.Crane.shtml
A=Dempsey Crane
S=20 by 16 inches
SC=Small
M=Mixed Media Original
P=$2,000
PC=501to2500
Now the back-end filter, specified earlier by the filter tag as nikka, runs. Here is the nikka subroutine:
sub htgrep'nikka {
    &accent'html;
    # Delete keywords
    s/^K=.*/<hr>/;
    s/\n.C=.*//g;
    # Set up images
    s/\nI=(.*):(.*)/\nI=<IMG ALT="$2" SRC="$1">/g;
    # Format URLs
    s/\nU=(.*)\n(\w)=(.*)/\n$2=<a href=$1><b>$3<\/b><\/a>/g;
    # Process images
    s/\I=(.*)/\n$1/g;
    # Artist:
    s/\nA=(.*)/\n<br><b>Artist:<\/b> $1/g;
    # Title:
    s/\nT=(.*)/\n<br><b>Title:<\/b> $1/g;
    # Size:
    s/\nS=(.*)/\n<br><b>Size:<\/b> $1/g;
    # Price:
    s/\nP=(.*)/\n<br><b>Price:<\/b> $1/g;
    # Medium:
    s/\nM=(.*)/\n<br><b>Medium:<\/b> $1/g;
    # Delete comments
    s/\n#.*//g;
    s/^#.*//;
}
The first line of this subroutine calls the library subroutine &accent'html, which handles accented characters. Next, we see this:
s/^K=.*/<hr>/;
s/\n.C=.*//g;
The first line replaces the keyword field with the HTML <hr> tag. The second deletes the size-code and price-code lines (SC= and PC=), which are used for searching but not for display.
Consider the next section:
# Set up images
s/\nI=(.*):(.*)/\nI=<IMG ALT="$2" SRC="$1">/g;
# Format URLs
s/\nU=(.*)\n(\w)=(.*)/\n$2=<a href=$1><b>$3<\/b><\/a>/g;
# Process images
s/\I=(.*)/\n$1/g;
These lines look for image lines (lines beginning with I=) and replace them with the corresponding HTML. Because not all users have graphics turned on (even when browsing an art gallery), the script provides alternative text from the field following the colon.
If any line contains an anchor (denoted by U=), the line below it is wrapped up in the anchor tags.
Finally, the I= prefix is stripped and a new line is added before each image to improve readability.
Let's look at the remaining lines:
# Artist:
s/\nA=(.*)/\n<br><b>Artist:<\/b> $1/g;
# Title:
s/\nT=(.*)/\n<br><b>Title:<\/b> $1/g;
# Size:
s/\nS=(.*)/\n<br><b>Size:<\/b> $1/g;
# Price:
s/\nP=(.*)/\n<br><b>Price:<\/b> $1/g;
# Medium:
s/\nM=(.*)/\n<br><b>Medium:<\/b> $1/g;
# Delete comments
s/\n#.*//g;
s/^#.*//;
Each of these commands converts the terse database notation to formatted HTML. For example,
# Title:
s/\nT=(.*)/\n<br><b>Title:<\/b> $1/g;
tells the system to look for occurrences of T= after the newline and replace them with the literal keyword Title: and appropriate HTML fix-ups to make the text presentable.
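Taken in isolation (and ignoring the URL rule, which may already have wrapped the title in an anchor), this substitution turns a database line such as

T=Earth Tribe

into the following HTML:

<br><b>Title:</b> Earth Tribe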
The finished product is shown in Figure 16.3.
For full documentation on HTGREP, including a detailed Frequently Asked Questions list, visit http://iamwww.unibe.ch/~scg/Src/Doc/htgrep.html.
As its name suggests, site-idx.pl is an indexer, but it bases its work on keywords supplied by the Webmaster on each page. The result of running site-idx.pl is an index file that can be submitted to search engines such as ALIWEB. This section introduces a simple ALIWEB-like search engine that can read an index file and serve up pages based on the index file's contents.
site-idx.pl is the work of Robert S. Thau at the Massachusetts Institute of Technology. The program lacks a clever name, but it does its job: it was written to address the indexing needs of ALIWEB (http://web.nexor.co.uk/aliweb/doc/aliweb.html), a search engine similar in concept to Yahoo!, WebCrawler, and others.
Unlike most search engines, ALIWEB relies neither on human classifiers (as Yahoo! does) nor on automated means (as the robot-based search sites do). ALIWEB looks for an index file on each Web site, and uses that file as the basis for its classifications.
The indexing is done by the site developer at the time the page is produced, the search is done by ALIWEB (or a local ALIWEB-like CGI script), and the results are presented by that CGI script.
The index file must be named site.idx, and must contain records in the format used by IAFA-compliant FTP sites. For example, the events-list document on the server at the MIT Artificial Intelligence Laboratory produces the following entry in http://www.ai.mit.edu/site.idx:
Template-Type: DOCUMENT
Title: Events at the MIT AI Lab
URI: /events/events-list.html
Description: MIT AI Lab events, including seminars, conferences, and tours
Keywords: MIT, Artificial Intelligence, seminar, conference
The process of producing site.idx would be tedious if done by hand. Thau's program automates the process by scanning each file on the site, looking for keywords. The recommended way to supply these keywords is with <META> tags in the header. <META> tags have the following general syntax:
<META NAME="..." VALUE="...">
Valid names include description, keywords, resource-type, and distribution.
Remember that the descriptions ultimately appear in a set of search results. Each description should stand alone so that it makes sense in that context. Thau's program uses the HTML <TITLE> tag to generate the document title. Thus, a document at MIT might begin this way:
<TITLE>MIT AI lab publications index</TITLE>
<META NAME="description" VALUE="Search the index of online and hardcopy-only publications at the MIT Artificial Intelligence Laboratory">
<META NAME="keywords" VALUE="Artificial Intelligence, publications">
<META NAME="resource-type" VALUE="service">
By default, site-idx.pl looks for the description, keywords, and resource type in <META> tags. This behavior can be overridden so that any document with a title gets indexed, but the override undoes most of the benefits of using site-idx.pl.
Some pages are not appropriate for promotion outside the site. For these pages, change the distribution to local. The script puts the entry for those pages into a file named local.idx.
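The text does not show the tag itself, but assuming the distribution value is supplied through a <META> tag written in the same style as the others, a page intended only for local use might carry a line like this:

<META NAME="distribution" VALUE="local">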
In addition to announcing site.idx to ALIWEB, the Webmaster can also use a simple ALIWEB-like script (also supplied by Thau) to index the site for local users. This index can point to local.idx alone, or the Webmaster can concatenate site.idx and local.idx into a master index of all pages. The latter approach allows a visitor to search all pages by keyword. Figure 16.4 shows the keywords index field for the Nikka Galleria site at http://www.dse.com/nikka/General/Search.shtml.
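One minimal way to build such a master index is to concatenate the two files from the shell; the output file name all.idx here is only an example:

cat site.idx local.idx > all.idx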
Figure 16.4: Using the local search index, the visitor can find relevant pages by keyword.
Figure 16.5 shows the keywords index field for the Nikka Galleria site set up to search for the keywords painting and artist. Figure 16.6 shows the results of a typical query.
Figure 16.5: Nikka Galleria provides access to aliwebsimple.pl.
Figure 16.6: aliwebsimple.pl runs, finding one page that matches the keyword.
Although ALIWEB is not one of the major search engines, the time it takes to add the <META> tags to each page is small when the work is done as the page is produced. site-idx.pl can be set up to run from the crontab, so a site index can be maintained with very little effort. As search engines continue to evolve, the ability to produce an index from the pages without having to revisit each page in the site will be another factor in keeping your site effective.
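As a sketch of that kind of crontab entry (the paths, the time, and the assumption that site-idx.pl is simply run from the document root are all placeholders to adapt to your own installation), a nightly run might look like this:

# rebuild the site index at 2:30 each morning (paths are examples)
30 2 * * * cd /usr/local/etc/httpd/htdocs && perl /usr/local/lib/site-idx.pl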
Here is where to go if you want more information on site-idx.pl: http://www.ai.mit.edu/tools/site-index.html.
The last category of programs in this chapter is full-index systems. The archetype of this family is the Wide Area Information Server, or WAIS. This section describes WAIS and its numerous cousins, all of which are characterized by automated indexing, powerful search tools, and a gateway between the database and the Web.
Wide Area Information Server (WAIS) is arguably the most sophisticated site indexer described in this chapter. WAIS started life on specialized hardware (the Connection Machine from Thinking Machines Corporation) but now is available in various forms for use in a conventional UNIX environment.
Much of the original work by Thinking Machines was made available for free. It was so successful that Brewster Kahle, the project leader at Thinking Machines, founded WAIS, Inc. to develop WAIS commercially. Since then, it has been customary to refer to the version of WAIS that is freely available as freeWAIS.
Note: To get a good general overview of WAIS, go to the following: http://www.cis.ohio-state.edu/hypertext/faq/usenet/wais-faq/getting-started/
WAIS is a separate service from the Web. WAIS is based on ANSI Standard Z39.50, Version 1 (also known as Z39.50-1988). Clients exist for most platforms, but the most interesting work lies in integrating WAIS databases with the Web.
WAIS, per se, is available in a commercial version and a free version (freeWAIS). WAIS's success, however, has spawned several look-alikes and work-alikes, each of which excels in some aspect.
In general, WAIS-like systems have four components: an indexer that builds the index from the source documents, the index (or database) itself, a server that searches the index in response to queries, and a client or gateway (such as a CGI script) that connects users to the server.
The most advanced version of freeWAIS is freeWAIS-sf, a direct descendant of the original WAIS. Its greatest contribution to the field is its capability to access data in structured fields.
The original freeWAIS and its descendants support free text searches: the ability to search based on all words in the text, and not just "keywords" selected by a human indexer. Free text searches can be a mixed blessing. They are useful when the user is looking for concepts in a block of text such as an abstract or a Web page, but can actually get in the way when a researcher wants to know, for example, which papers have a publication date greater than 1990.
In freeWAIS-sf, structured queries are expressed by a list of search terms separated by spaces. freeWAIS-sf knows about the structure of a document, so the query ti=("information retrieval") looks for the phrase "information retrieval" in the title of the documents it searches, and py>1990 makes sure that the publication year is greater than 1990. freeWAIS-sf is available at: http://ls6-www.informatik.uni-dortmund.de/freeWAIS-sf/README-sf. Using freeWAIS-sf, the site developer can get the best of both worlds. The visitor can search large text fields using free text searches, and can still run structured queries against the fields by name.
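Because a structured query is just a list of search terms separated by spaces, the two examples above can be combined; treat this as a sketch built from the text's own examples rather than a tested query:

ti=("information retrieval") py>1990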
freeWAIS-sf has an indexer named waisindex. Given a set of files, waisindex builds indexes in accordance with guidelines given to it in a configuration file; the installer must make several decisions when setting up this file, beginning with how to define any structured fields.
To set up structured fields (which the freeWAIS-sf documentation also refers to as semantic categories), the installer must build one or more format files with rules about how to convert the document contents to fields. Consider the following example:
region: /^AU: /
  au "author names" SOUNDEX LOCAL TEXT BOTH
end: /^[A-Z][A-Z]:/
This says to the indexer, "For all words in a region that starts with 'AU: ' at the beginning of a line and runs up to a line that starts with two capital letters followed by a colon, put the word in the default category (so it can be found by a free text search) and in the au category (so it can be found in a search for the author). Put its soundex code only in the au category."
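To make the rule concrete, here is a hypothetical fragment of the kind of record it is meant to carve up (the field layout and the title are invented for illustration). The AU: line falls into the au region, which ends where the TI: line matches the end pattern:

AU: Manber, Udi
TI: Approximate pattern matching in large text collections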
The document author(s) and the installer must agree on a document format, so that the format file can prepare a meaningful index. If the document authors routinely use <META> tag keywords in a standardized way, an installer can build a format file to extract the information from those lines.
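A sketch of such a rule, assuming the same region/end syntax as the example above and pages that put each <META> tag on its own line, might look like the following; the category name kw and both patterns are illustrative, not taken from the freeWAIS-sf documentation:

region: /^<META NAME="keywords"/
  kw "page keywords" TEXT BOTH
end: /^</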
freeWAIS provides support for relevance ranking and stemming. Relevance ranking gives extra weight to a document when the search terms appear in the headline, or when they are capitalized. This ranking also looks at how frequently a word occurs in the database in general, and gives extra weight to words that are scarce. waisindex also assigns extra weight if the search terms appear in close proximity to each other, or if they appear many times in a document.
freeWAIS-sf offers proximity or string-search operators. During installation, the configure script asks the following:
Use proximity instead of string search? [n]
A yes answer builds the proximity operators into the system.
The Porter stemming algorithm built into waisindex and waisserver allows a document containing something like "informing" to match a query for the term "informs."
freeWAIS-sf also supports synonyms, and asks the following question during installation:
Do you want to use shm cache? [n]
If the site's synonym file is larger than 10K and the machine supports shared memory, then answering yes speeds up waisserver by a significant factor.
To access a freeWAIS-sf database from the Web, use SFgate, a CGI program that uses waisperl, an adaptation of Perl linked with the freeWAIS-sf libraries. You can find SFgate at http://ls6-www.informatik.uni-dortmund.de/SFgate/SFgate.html.
Because the Web and WAIS use two different protocols (HTTP and Z39.50, respectively), there must be some program or programs between the user and the database to format the query and present the responses. One approach is to use a CGI front-end to a WAIS server. SFgate supports this option, but can go even further. Figure 16.7 shows a typical SFgate installation. SFgate talks to several WAIS servers and integrates their responses. By using waisperl, SFgate can bypass the WAIS server on the local machine and search the database itself, dramatically decreasing the time required to search the database, as shown in Figure 16.8.
Figure 16.7: SFgate can query WAIS servers across the network.
Figure 16.8: SFgate can directly access a local WAIS database, significantly improving performance.
Figure 16.9 shows the demo HTML page supplied with SFgate, and Figure 16.10 shows the results of a sample search. You can find the demo HTML page for SFgate at http://ls6-www.informatik.uni-dortmund.de/SFgate/demo.html.
Figure 16.9: Use the SFgate demo query form to get a feel for what SFgate can do.
Figure 16.10: Here are the results of a query against the demo database.
Notice that, unlike many search systems, SFgate can retrieve multiple documents: the results shown in Figure 16.10 present the structured fields of each document and provide links to the full text of each document.
A simpler WAIS-like program is Simple Web Indexing System for Humans (SWISH), developed by Kevin Hughes. SWISH is available at http://www.eit.com/software/swish/swish.html.
As its name implies, SWISH was designed specifically for indexing Web pages. This means that many of the configuration options available in programs like freeWAIS are gone, simplifying the configuration process and reducing the demands on the system. SWISH produces a single index file about half the size of the index produced by WAIS. The downside of this simplification is that some features, such as stemming and the use of a synonym table, are lost. Unlike freeWAIS, SWISH can only search files on the local machine. For many purposes, SWISH's simplified installation and smaller indexes are well worth the lost capabilities.
To build an index, specify the files and directories to be indexed in the configuration file under the variable IndexDir, and then run SWISH using the -c option to identify the configuration file. SWISH puts the index in the file identified in the configuration file by the variable IndexFile. You should be aware that indexing can take a lot of memory. Run the indexer from the crontab when the load on the system is low. (See Chapter 12, "Forms for Batching Processes," to learn ways to batch large jobs.) If there isn't enough memory to index the entire site, index a few directories at a time, and then merge the results using the command-line option.
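Here is a minimal sketch of such a configuration file, using only the two variables named above; the paths are placeholders for your own server:

# swish.conf - minimal SWISH configuration (paths are examples)
IndexDir /usr/local/etc/httpd/htdocs
IndexFile /usr/local/etc/httpd/indexes/site.swish

The index is then built with a command such as swish -c swish.conf.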
Queries can be run from the command line like this:
swish -f sample.swish -w internet and resources and archie
This query tells SWISH to look for documents with the words "internet" and "resources" and "archie" in the database "sample.swish." SWISH searches are case-insensitive.
There's a command-line option to make SWISH resemble freeWAIS-sf. For example, consider this line:
swish -f sample.swish -w internet and resources and archie -t the
It tells SWISH to look for the three words in specific locations: titles, headers, and emphasized tags.
To make SWISH available from the Web, use a gateway CGI program like wwwwais. Once wwwwais is compiled, follow the online instructions to build a configuration file. The online instructions describe how to set up wwwwais.c for both freeWAIS and SWISH; be sure to follow the correct set of directions. wwwwais is available at http://www.eit.com/software/wwwwais/.
Note: Because wwwwais is a C program, you need access to a compiler for your server. If the server has no native ANSI C compiler, get gcc from the Free Software Foundation. Alternatively, Ready-to-Run Software in Groton, Massachusetts (1-800-743-1723 or 1-508-692-9922) sells executable versions of this compiler for most common UNIX platforms; its catalog is worth having.
Now, build an HTML form like this one:
<FORM METHOD=GET ACTION="/cgi-bin/wwwwais">
Search for <INPUT TYPE=text NAME=keywords SIZE=40><BR>
<INPUT TYPE=Submit VALUE=Search> </FORM>
Any queries entered in the field are passed directly to SWISH.
An HTML page can also call wwwwais using GET and PATH_INFO, so the following lines all work:
/cgi-bin/wwwwais?these+are+keywords
/cgi-bin/wwwwais?keywords=these+are+keywords&maxhits=40
/cgi-bin/wwwwais/host=quake.think.com&port=210
The online documentation provides several pages of examples.
To keep a malicious user from being able to set up her own queries, consider running wwwwais from a shell or Perl wrapper (as shown earlier in this chapter for HTGREP). To specify the parameters from a wrapper, put WWW_ in front of the parameter name (for example, WWW_HOST=quake.think.com, WWW_PORT=210). Strip any attempts to set these parameters out of QUERY_STRING or STDIN, and call wwwwais on the keywords.
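Here is a minimal sketch of such a Perl wrapper. It assumes, based on the description above, that wwwwais picks up the WWW_-prefixed parameters from its environment; the paths and parameter values are placeholders for your own installation.

#!/usr/local/bin/perl
# wwwwais-wrapper.cgi - hedged sketch of a CGI wrapper around wwwwais
# Lock the server parameters so a visitor cannot override them.
$ENV{'WWW_HOST'} = 'quake.think.com';   # example value from the documentation
$ENV{'WWW_PORT'} = '210';

# Strip any attempt to set host or port out of the query string.
$query = $ENV{'QUERY_STRING'};
$query =~ s/(^|&)(host|port)=[^&]*//gi;

# Keep only the keywords, whether supplied as keywords=... or as a bare query.
if ($query =~ /(^|&)keywords=([^&]*)/i) {
    $keywords = $2;
} else {
    $keywords = $query;
}
$keywords =~ s/[^\w%+.,-]//g;           # drop anything suspicious

# Hand the sanitized keywords to wwwwais.
$ENV{'QUERY_STRING'} = "keywords=$keywords";
exec '/usr/local/etc/httpd/cgi-bin/wwwwais';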
One of the most powerful WAIS-like systems available is GLIMPSE (GLobal IMPlicit SEarch). GLIMPSE was developed by the same folks who developed agrep, the search engine for HURL described in Chapter 10, "Integrating Forms with Mailing Lists." You'll find GLIMPSE described in detail at this location: http://glimpse.cs.arizona.edu:1994/glimpse.html.
It includes all features of agrep, including the ability to conduct "approximate" searches.
The GLIMPSE indexer is named glimpseindex. Like the other indexers in this section, it is commonly run from the crontab to keep the indexes for a site up-to-date. Like freeWAIS-sf, GLIMPSE includes support for structured queries. This combination of agrep features with structured queries allows powerful queries to be constructed in just a few words. For example,
glimpse -2 -F html Anestesiology
finds all matches to the misspelled word Anestesiology in files with html (including shtml) in their name, with at most two errors. With this approximate matching, even if users make a typo when entering their query, they are likely to get useful results.
The line
glimpse -F 'mail;1993' 'windsurfing;Arizona'
finds documents that have mail and 1993 in their name and contain the words windsurfing and Arizona.
The line
glimpse -F 'mail;type=Directory-Listing' 'windsurfing;Arizona'
searches among documents that have mail in their name and are of type Directory-Listing for documents containing windsurfing and Arizona.
As a final example, the line
glimpse -F 'mail;1993' 'au=Manber'
searches among files that have mail and 1993 in their name for an author value (in the au field) equal to Manber.
The developers of GLIMPSE are adding a new operator, cast, to allow GLIMPSE to search compressed files. Not only does this option save disk space, but searches of compressed files are significantly faster than searches of an uncompressed source. More information is available at the GLIMPSE Web site referenced above.
A Web gateway to GLIMPSE patterned after wwwwais.c would not be difficult to write. A better solution for sites with complex requirements, however, is to install HARVEST, described in the next section.
The HARVEST project, developed at the University of Colorado, is easily the richest and most complex indexing system described in this chapter. Any Webmaster willing to take the time to set up HARVEST is able to provide a powerful search solution to site visitors.
HARVEST has three major components: the Gatherer, which collects and summarizes documents; the Broker, which indexes those summaries and answers queries; and the Replicator, which propagates indexes to other HARVEST servers.
A simple HARVEST implementation requires only the Gatherer and the Broker.
The Gatherer is the indexer of HARVEST. It can be pointed at a variety of information resources, including FTP, Gopher, the Web, UseNet, and local files. For each of these resources, the Gatherer invokes a summarizer that produces a summary document in Summary Object Interchange Format (SOIF). SOIF is similar to, but more extensive than, the IAFA format used by ALIWEB and described earlier in this chapter. To summarize HTML documents, the Gatherer uses an SGML summarizer and the DTD of HTML. Recall from Chapter 2, "Reducing Site Maintenance Costs Through Testing and Validation," that validators use these same DTDs to report syntax errors in a page's HTML. HARVEST's design, therefore, is picky about syntax. If a site's pages pass validation, configure the Gatherer to run with syntax_check=1 in $HARVEST_HOME/lib/gatherer/SGML.sum. If the HTML on a site is poor, and will not pass validation, leave syntax_check at 0, but be prepared for the summarizer to produce less useful results-or perhaps crash. For best results, go back and fix all pages so that they validate before you attempt to run the SGML summarizer.
The table by which the SGML summarizer builds the SOIF summary from an HTML document is the HARVEST HTML table, located after installation at $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sub.tbl. By default, this table summarizes as shown in Table 16.1.
HTML Element | SOIF Attribute |
<A> | keywords, parent |
<A:HREF> | url-references |
<ADDRESS> | address |
<B> | keywords, parent |
<BODY> | body |
<CITE> | references |
<CODE> | ignore |
<EM> | keywords, parent |
<H1> | headings |
<H2> | headings |
<H3> | headings |
<H4> | headings |
<H5> | headings |
<H6> | headings |
<HEAD> | head |
<I> | keywords, parent |
<META:CONTENT> | $NAME |
<STRONG> | keywords, parent |
<TITLE> | title |
<TT> | keywords, parent |
<UL> | keywords, parent |
The notation "keywords, parent" means that the words in the HTML element (for example, <EM> or <STRONG>) are copied to the SOIF keywords section and are also left in the content of the parent element. This way, the document remains readable.
Notice that the <META> tag gets special handling by the summarizer. If the original page contains
<META NAME="author" CONTENT="Michael Morgan">
then the SOIF summary contains the following:
author{14}: Michael Morgan
If the HTML document has been built following the recommendations given for the IAFA format used with site-idx.pl, then the summarizer finds those <META> tags and transforms something like
<META NAME="keywords" CONTENT="Nikka Galleria, art, art gallery">
to the following:
keywords{32}: Nikka Galleria, art, art gallery
Like other indexers described in this section, the Gatherer must be rerun from the crontab each night to keep the index current. Once started, the Gatherer daemon (gathered) continues to run in the background. Each summary object is assigned a time-to-live value when it's constructed. If a summary expires before it's rebuilt, then it gets removed from the index. Typical time-to-live values range from one to six months.
The HARVEST developers provide the following sample code to demonstrate how to run a specific summarizer from the crontab:
#!/bin/sh
#
# RunGatherer - Runs the ATT 800 Gatherer (from cron)
#
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
export PATH
cd ${HARVEST_HOME}/gatherers/att800
exec Gatherer att800.cf
As an alternative to using cron, a Webmaster using make as described in Chapter 6, "Reducing Maintenance Costs with Server Side Includes," could add a line to the makefile, calling the Gatherer. This way, the index would always be up-to-date.
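A sketch of what that makefile hook might look like, reusing the RunGatherer script shown above (the target name, the PAGES variable, and the paths are placeholders):

HARVEST_HOME = /usr/local/harvest

index: $(PAGES)
	$(HARVEST_HOME)/gatherers/att800/RunGatherer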
The HARVEST developers also recommend running the RunGatherd command whenever the system is started so that the Gatherer's database is exported. Here's a sample RunGatherd script:
#!/bin/sh
#
# RunGatherd - starts up the gatherd process (from /etc/rc.local)
#
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=${HARVEST_HOME}/lib/gatherer:$PATH; export PATH
gatherd -dir ${HARVEST_HOME}/gatherers/att800/data 8001
The Broker is the component of HARVEST responsible for searching the index in response to queries. Just as there can be multiple Gatherers, each looking at a different information resource, a HARVEST site can run more than one Broker, each offering different options. The default search engine for HARVEST is GLIMPSE, but you can build a Broker around WAIS as well.
Queries may take the form of single words, phrases (enclosed in quotation marks), or structured queries. Consider the following example:
(Author:Morgan) AND (Type: HTML) AND HARVEST
This line returns all documents where the author field contains Morgan, the type is HTML, and the document mentions the word HARVEST.
HARVEST includes a Web gateway to the Broker with several demos available. Figures 16.12 and 16.13 show a demo of HARVEST as a front-end for UseNet archives. That demo site is at http://harvest.cs.colorado.edu/Harvest/brokers/Usenet/.
Figure 16.12: Use the HARVEST demo site to become familiar with HARVEST's capabilities.
Figure 16.13: The Broker returns the lines that match the query.
The Replicator is a powerful option in HARVEST. Using the Replicator, one site (the master site) can notify other copies of HARVEST about changes in its database. Suppose that a company maintains its technical support documents online. It can maintain a master copy that's indexed at headquarters and keep a mirrored index at each field office. When the master copy changes, the Replicator propagates those changes to the field offices. A query at any field office is run against the local index. Any documents that are retrieved are fetched from their home site (and, optionally, stored in a cache at the field office). If they are cached, future requests for those documents can be satisfied locally and do not have to go out over the network. Since HARVEST can index documents distributed across the Net, the documents that serve as the basis for the index can reside anywhere on the Net; they do not necessarily have to be on the local server at headquarters.
For more information about HARVEST visit http://harvest.cs.colorado.edu/ and http://harvest.cs.colorado.edu/harvest/FAQ.html.
This chapter describes a series of programs that allow a visitor to search the Web site. These programs range from HTGREP, in which the site developer prepares the index and the program searches the index by brute force, to highly sophisticated indexers and search engines in freeWAIS-sf, GLIMPSE, SWISH, and HARVEST. Coupling a search engine to a Web site increases the value of the site and makes that site a better resource for visitors; the site becomes a place they'll come back to again and again to take advantage of the searchable material.