Platinum Edition Using HTML 4, XML, and Java 1.2: Indexing and Adding an Online Search Engine

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

The sample page in Figure 31.7 demonstrates the two available user interfaces for WebGlimpse.

FIGURE 31.7 The default search forms provided with WebGlimpse enable you to choose the level of search complexity.

The first interface is short, sweet, and perfect for an unobtrusive search facility. The second interface enables the user to select a neighborhood search or a full archive search; to choose case sensitivity, partial match, and spelling-error settings; to optionally jump to the line in a found document; and to control the date and number of documents returned.

CAUTION:
One annoying aspect of the WebGlimpse indexing routine is that it automatically appends the user interface code at the bottom of each page it indexes unless you comment out the appropriate line. Although this feature is a nice service for those who want it, being able to turn it off is a must. My personal preference is to add a link to the search facility rather than the entire user interface. Due to WebGlimpse’s concept of page neighborhood, however, putting this code on every page can make sense.

A page neighborhood is context sensitive: it depends on the page you start from. You can define a page’s neighborhood, for example, as every other page that is within two jumps (that is, every page you can reach by following at most two links). If page A has links to page B and page C, and each of those pages links to one other page (pages BA and CA, respectively), page A’s neighborhood is pages B, C, BA, and CA. However, if you follow the links to page BA, for example, you may find it links to pages D and E, making its neighborhood much different. Because the context determines the neighborhood, you need a unique call to WebGlimpse on each page rather than a generic (search the whole site) search page.

By the same token, if you define a neighborhood as all files in the same directory, the context of the WebGlimpse search changes depending on the starting page.
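The two-jump definition can be made concrete with a short sketch. The following Python example is only an illustration of the idea, not WebGlimpse's own code; the `neighborhood` function and the `links` table are hypothetical names, and the link graph simply mirrors the pages A, B, C, BA, and CA from the example above.

```python
from collections import deque

def neighborhood(links, start, max_hops=2):
    """Return the pages reachable from `start` in at most `max_hops` link jumps."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, hops + 1))
    return found

# Hypothetical link graph matching the example in the text.
links = {
    "A": ["B", "C"],
    "B": ["BA"],
    "C": ["CA"],
    "BA": ["D", "E"],
}

print(neighborhood(links, "A"))   # ['B', 'C', 'BA', 'CA']
print(neighborhood(links, "BA"))  # ['D', 'E']
```

The same function returns a different set for each starting page, which is why a neighborhood search has to be launched from the page the user is reading.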

If your site is massive, or if you want to allow for more context-sensitive searching, you may prefer to have unique calls to WebGlimpse embedded on each page of your site. You might, for example, have a site that offers a number of Web utilities. Each utility is available in a variety of languages and for a variety of operating systems. If a user is reading about one of the programs and wants to know more about its implementation in Perl, he or she doesn’t want to search the entire site and then have to wade through scads of listings for irrelevant utilities. In this instance, a neighborhood search is appropriate. If the site is organized properly, the information should be available either within a few hops or within the same directory.

The output of WebGlimpse looks similar to that shown in Figure 31.8.

This output from WebGlimpse shows that a link is provided to the found document. In addition, context is provided by including all lines in which the search terms are found. WebGlimpse automatically limits the number of found files as well.

An interesting feature of WebGlimpse is its setting for spelling errors. The example given in the documentation is a search for the name “Schwarzkopf.” Many people do not know how to spell this name, so spelling errors may occur both in the user’s search terms and in the documents on the site. Because WebGlimpse uses GLIMPSE, which in turn builds on the powerful agrep, it supports approximate matching that allows for spelling errors. Thus, if the material on your site comes from a variety of sources or varies in grammatical quality, or if your users can’t spell, the capability to be forgiving of spelling errors is a definite plus.
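To see what "allowing for spelling errors" means in practice, consider the following minimal Python sketch. It is not agrep's algorithm (agrep matches patterns approximately within lines of text and is far faster); it simply compares whole words using the classic edit distance, and treats a word as a match when it differs from the search term by no more than a chosen number of single-character errors. The function names and the word list are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def approximate_matches(term, words, max_errors=2):
    """Return the words within `max_errors` edits of `term`, ignoring case."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_errors]

# Hypothetical words as they might appear in documents on a site.
words = ["Schwarzkopf", "Shwarzkopf", "Schwartzkopf", "Schwarzenegger", "search"]
print(approximate_matches("Schwarzkopf", words))
# ['Schwarzkopf', 'Shwarzkopf', 'Schwartzkopf']
```

Raising `max_errors` makes the search more forgiving but also admits more unrelated words, which is the trade-off behind the spelling-error setting on the search form.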

FIGURE 31.8 The search result page from WebGlimpse contains a link to the found page as well as a listing of all found lines.

WebGlimpse uses a modified grepping approach, but applies the grepping to an index rather than to the files themselves. Although the spelling-error tolerance feature offers some flexibility, complex searches are not offered and results are not ranked by confidence level.

WebGlimpse takes the grepping approach just about as far as it can go. To achieve better results, a more complicated search methodology is needed.

Implementing ICE

Christian Neuss’s ICE search engine produces relevance-ranked results and lists the search words that it finds in each file. It is written in Perl.

There are two scripts. The indexing script, ice-idx.pl, creates an index file that ICE can later search. The indexer runs from the UNIX command line as a standard non-CGI program. The search script, ice-form.pl, is a CGI script. It searches the index and displays the results on a Web page.

ICE can use an optional external thesaurus in Thesaurus Interchange Format. Christian Neuss notes that ICE has worked well with small thesauri of a few hundred technical terms, but that anyone who wants to use a large thesaurus should contact him for more information.

You can find the current version of ICE on the Net at these two distribution sites, although the German site generally has a much later version:

http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html

http://ice.cornell-iowa.edu/

Indexing Your Files with ICE

ICE searches the directories that you specify in the script’s configuration section. When ICE indexes a given directory, it also indexes all its subdirectories.

Five configuration items are at the top of the indexer script. You will need to edit three of them:

@SEARCHDIRS=(
  "/home/user/somedir/subdir/",
  "/home/user/thisis/another/",
  "/home/user/andyet/more_stuff/"
);

# File in which the indexer stores its index
$INDEXFILE="/home/user/somedir/index.idx";

# Minimum length of word to be indexed
$MINLEN=3;

The first directory path in @SEARCHDIRS is the default that will appear on the search form. You can add more directory lines in the style of the existing ones, or you can include only one directory if you want to limit what people can see of your files.


Remember that ICE automatically indexes and searches all the subdirectories of the directories you specify.
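The effect of these settings can be illustrated with a rough sketch. The Python code below is not ice-idx.pl; it is a simplified stand-in, written under stated assumptions, that shows the general technique: walk a directory and all of its subdirectories, strip tags crudely, and count every word that is at least a minimum length. The function name, the list of file extensions, and the choice to preserve letter case are all assumptions made for illustration.

```python
import os
import re
from collections import Counter

MINLEN = 3  # plays the same role as $MINLEN in the indexer's configuration

def index_directory(root, minlen=MINLEN, extensions=(".htm", ".html", ".txt")):
    """Map each file under `root` (subdirectories included) to its word counts."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue  # skip files that are not plain text or HTML
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            text = re.sub(r"<[^>]*>", " ", text)       # crude tag stripping
            words = re.findall(r"[A-Za-z0-9]+", text)  # split into words
            index[os.path.relpath(path, root)] = Counter(
                w for w in words if len(w) >= minlen   # drop short words
            )
    return index
```

Calling `index_directory("/home/user/somedir/subdir/")` on a real directory returns a dictionary keyed by each file's path relative to that directory. Because `os.walk` descends into every subdirectory, a file several levels down is indexed just like one at the top, which is the behavior to keep in mind when deciding which directories to list.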

ICE’s index is a plain ASCII text file. The following is a sample from the beginning of an ICE index file:

@f /./bookmark.htm
@t Rod Clark s Bookmarks
@m 823231844
1 ABC
1 AFGHANISTAN
1 AGREP
1 AIP
1 ALTNEWS
1 AND
1 ANIMAL
1 ANU
1 ATM
1 AUSTRALIA
1 AsiaLink
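Because the index is plain text, it is easy to read with a few lines of code. The following Python sketch parses records of the kind shown in the sample. The field meanings are inferred from the sample rather than taken from ICE's documentation: `@f` is assumed to be the file path, `@t` the page title, `@m` a modification timestamp, and each remaining line a count followed by a word. The `rank` function is likewise only a guess at how relevance ranking could work (summing the counts of the matched words), not ICE's actual scoring.

```python
def parse_ice_index(lines):
    """Parse ICE-style index lines into a list of per-file records."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@f "):
            # A new file record begins (assumed meaning of @f).
            current = {"file": line[3:], "title": "", "mtime": None, "words": {}}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first @f line
        elif line.startswith("@t "):
            current["title"] = line[3:]
        elif line.startswith("@m "):
            current["mtime"] = int(line[3:])
        else:
            count, _, word = line.partition(" ")
            if count.isdigit() and word:
                current["words"][word] = int(count)
    return records

def rank(records, terms):
    """Score each file by the summed counts of the query terms it contains."""
    wanted = [t.upper() for t in terms]
    scored = []
    for rec in records:
        counts = {}
        for word, count in rec["words"].items():
            counts[word.upper()] = counts.get(word.upper(), 0) + count
        hits = {t: counts[t] for t in wanted if t in counts}
        if hits:
            scored.append((sum(hits.values()), rec["file"], sorted(hits)))
    return sorted(scored, reverse=True)  # highest score first

sample = """@f /./bookmark.htm
@t Rod Clark s Bookmarks
@m 823231844
1 ABC
1 AGREP
1 AsiaLink
""".splitlines()

records = parse_ice_index(sample)
print(records[0]["title"])              # Rod Clark s Bookmarks
print(rank(records, ["agrep", "xyz"]))  # [(1, '/./bookmark.htm', ['AGREP'])]
```

Each result carries both a score and the list of matched words, which corresponds to the two things the ICE results page reports for every file.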
