Platinum Edition Using HTML 4, XML, and Java 1.2
The sample page in Figure 31.7 demonstrates the two available user interfaces for WebGlimpse.
The first interface is short, sweet, and perfect for an unobtrusive search facility. The second interface lets the user select a neighborhood search or a full archive search; choose case sensitivity, partial match, and spelling-error settings; optionally jump to the matching line in a found document; and control the date and number of documents returned.
A page neighborhood is context sensitive. You can define a page's neighborhood, for example, as every other page that is within two jumps (a link to a page that in turn links to one other page). If page A has a link to page B and page C, and each of those pages links to one other page, pages BA and CA, page A's neighborhood is pages B, C, BA, and CA. However, if you follow the links to page BA, for example, you may find it links to pages D and E, making its neighborhood much different. Because the context determines the neighborhood, you need a unique call to WebGlimpse on each page rather than a generic (search the whole site) search page. By the same token, if you define a neighborhood as all files in the same directory, the context of the WebGlimpse search changes depending on the starting page.
If your site is massive, or if you want to allow for more context-sensitive searching, you may prefer to have unique calls to WebGlimpse embedded on each page of your site. You might, for example, have a site that offers a number of Web utilities. Each utility is available in a variety of languages and for a variety of operating systems. If a user is reading about one of the programs and wants to know more about its implementation in Perl, he or she doesn't want to search the entire site and then have to wade through scads of listings for irrelevant utilities. In this instance, a neighborhood search is appropriate. If the site is organized properly, the information should be available either within a few hops or within the same directory.
The output of WebGlimpse looks similar to that shown in Figure 31.8. The output provides a link to each found document. In addition, it gives context by including all lines in which the search terms are found. WebGlimpse automatically limits the number of found files as well. An interesting feature of WebGlimpse is its setting for spelling errors.
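The two-jump neighborhood described above can be sketched as a breadth-first walk over a link table. This is only an illustration in Python, not WebGlimpse's own code: the function name `neighborhood`, the `links` dictionary, and the `max_jumps` parameter are assumptions made for the example, and the link table reproduces the pages A, B, C, BA, CA, D, and E from the text.

```python
from collections import deque

def neighborhood(links, start, max_jumps=2):
    """Return the pages reachable from `start` in at most `max_jumps` link hops.

    `links` is a hypothetical table mapping each page to the pages it links to.
    """
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        page, depth = queue.popleft()
        if depth == max_jumps:
            continue  # do not follow links beyond the jump limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, depth + 1))
    return found

# Link structure taken from the example in the text.
links = {
    "A": ["B", "C"],
    "B": ["BA"],
    "C": ["CA"],
    "BA": ["D", "E"],
}

print(neighborhood(links, "A"))   # ['B', 'C', 'BA', 'CA']
print(neighborhood(links, "BA"))  # ['D', 'E']
```

The two calls return different sets because the starting page differs, which is why a neighborhood search has to be invoked from each page rather than from one site-wide form.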
The example the WebGlimpse documentation gives for its spelling-error setting is a search for the name Schwarzkopf. Many people do not know how to spell this name, so spelling errors may occur both in the user's search terms and in the documents on the site. Because WebGlimpse uses GLIMPSE, which in turn builds on the powerful agrep, it supports approximate matching that allows for spelling errors. Thus, if the material on your site comes from a variety of sources or varies in grammatical quality, or if your users can't spell, the capability to forgive spelling errors is a definite plus.
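To make the idea of error-tolerant matching concrete, here is a minimal Python sketch that accepts a word when it is within a small edit distance of the search term. It is a plain dynamic-programming stand-in, not the algorithm agrep uses (agrep relies on a faster bit-parallel method), and the names `edit_distance`, `approx_lines`, and `max_errors` are assumptions chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def approx_lines(term, lines, max_errors=2):
    """Return the lines containing a word within `max_errors` edits of `term`."""
    term = term.lower()
    hits = []
    for line in lines:
        words = (w.strip(".,;:!?") for w in line.lower().split())
        if any(edit_distance(term, w) <= max_errors for w in words):
            hits.append(line)
    return hits

documents = [
    "General Schwarzkopf spoke.",
    "Nothing here.",
]
# A misspelled query still finds the correctly spelled name.
print(approx_lines("Shwarzkopf", documents))
```

Raising `max_errors` makes the search more forgiving but also admits more unrelated words, so a small limit is usually the safer starting point.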
WebGlimpse basically uses a modified grepping approach, but applies the grepping to an index. Although the spelling-error tolerance feature offers some flexibility, complex searches are not supported and results are not ranked by confidence level. WebGlimpse takes the grepping approach just about as far as it can go. To achieve better results, a more complicated search methodology is needed.
Implementing ICE
Christian Neuss's ICE search engine produces relevance-ranked results and lists the search words that it finds in each file. It is written in Perl. There are two scripts. The indexing script, ice-idx.pl, creates an index file that ICE can later search. The indexer runs from the UNIX command line as a standard non-CGI program. The search script, ice-form.pl, is a CGI script. It searches the index and displays the results on a Web page.
ICE can use an optional external thesaurus in Thesaurus Interchange Format. Christian Neuss notes that ICE has worked well with small thesauri of a few hundred technical terms, but that anyone who wants to use a large thesaurus should contact him for more information.
You can find the current version of ICE on the Net at these two distribution sites, although the German site generally has a much later version:
Indexing Your Files with ICE
ICE searches the directories that you specify in the script's configuration section. When ICE indexes a given directory, it also indexes all its subdirectories. Five configuration items are at the top of the indexer script. You will need to edit three of them:
    @SEARCHDIRS=(
      "/home/user/somedir/subdir/",
      "/home/user/thisis/another/",
      "/home/user/andyet/more_stuff/"
    );
    $INDEXFILE="/user/home/somedir/index.idx";
    # Minimum length of word to be indexed
    $MINLEN=3;
The first directory path in @SEARCHDIRS is the default that will appear on the search form.
You can add more directory lines in the style of the existing ones, or you can include only one directory if you want to limit what people can see of your files.
ICE's index is a plain ASCII text file. The following is a sample from the beginning of an ICE index file:
    @f /./bookmark.htm
    @t Rod Clark's Bookmarks
    @m 823231844
    1 ABC
    1 AFGHANISTAN
    1 AGREP
    1 AIP
    1 ALTNEWS
    1 AND
    1 ANIMAL
    1 ANU
    1 ATM
    1 AUSTRALIA
    1 AsiaLink
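Because the index is plain text, it is easy to inspect with a short script. The Python sketch below reads records shaped like the sample above. The layout is inferred from that sample alone and is an assumption: one record per file, with `@f` giving the path, `@t` the title, `@m` a number that is presumably a modification timestamp, and each remaining line holding a count followed by a word. The function name `parse_ice_index` and the dictionary keys are likewise made up for this illustration, and ICE's own Perl code may treat the file differently.

```python
def parse_ice_index(text):
    """Parse index text into a list of per-file records (assumed layout)."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@f "):
            # A new file record starts here.
            current = {"file": line[3:], "title": None, "mtime": None, "words": {}}
            entries.append(current)
        elif current is None:
            continue  # ignore anything before the first @f line
        elif line.startswith("@t "):
            current["title"] = line[3:]
        elif line.startswith("@m "):
            current["mtime"] = int(line[3:])  # assumed to be a timestamp
        else:
            count, word = line.split(None, 1)
            current["words"][word] = int(count)
    return entries

sample = """\
@f /./bookmark.htm
@t Rod Clark's Bookmarks
@m 823231844
1 ABC
1 AFGHANISTAN
1 AGREP
"""

entries = parse_ice_index(sample)
print(entries[0]["file"], entries[0]["title"])
print(sorted(entries[0]["words"]))
```

Keeping the word counts in a dictionary per file makes it straightforward to check which files mention a term, which is the kind of lookup a search script has to perform against the index.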