Platinum Edition Using HTML 4, XML, and Java 1.2
After you have set the configuration variables, run the script from the command line to create the index. Whenever you want to update the index, run the ice-idx.pl script again. It overwrites the existing index with the new one.
Searching from a Web Browser with ICE

The search form presents a choice of directories in a drop-down selection box. You can specify these directories in the script. Listing 31.16 shows how to accomplish this task.

Listing 31.16 A Sample ICE Search Script

    # Title or name of your server:
    local($title)="ICE Indexing Gateway";
    # search directories to present in the search dialogue
    local(@directories)=(
        "Public HTML Directory",
        "Another HTML Directory"
    );

Now you can install the script in your CGI directory and call it from your Web browser.

Implementing SWISH-E (Simple Web Indexing System for Humans-Enhanced)

SWISH-E is easy to set up and offers fast, reliable searching for Web sites. In indexing HTML files, SWISH-E can ignore data in most tags while giving higher relevance to information in header and title tags. You can also limit your search to words in HTML titles, comments, emphasized tags, and META tags. SWISH-E creates a small and portable index consisting of a single file averaging around 1% to 5% of the size of the original source files.

Kevin Hughes wrote the original SWISH program in C for UNIX Web servers. In autumn 1996, The Library of UC Berkeley received permission from Kevin Hughes to implement bug fixes and enhancements to the original binary. SWISH-E is freeware, available from the Berkeley Digital Library Sunsite.

Installing SWISH-E is straightforward. After uncompressing and untarring the source files, you edit the src/config.h file and compile SWISH-E for your system.

Configuring SWISH-E isn't very hard either. You set up a configuration file, swish.conf, which the indexer uses. Listing 31.17 shows a sample SWISH-E configuration file.

Listing 31.17 swish.conf: A Sample SWISH-E Configuration File

    # SWISH-E configuration file

    IndexDir /home/rclark/public_html/
    # This is a space-separated list of files and directories you
    # want indexed. You can specify more than one of these directives.

    IndexFile index.swish
    # This is what the generated index file will be.

    IndexName "Index of Small Hours files"
    IndexDescription "General index of the Small Hours Web site"
    IndexPointer http://www.aa.net/~rclark/
    IndexAdmin "Rod Clark (rclark@aa.net)"
    # Extra information you can include in the index file.

    IndexOnly .html .txt .gif .xbm .jpg
    # Only files with these suffixes will be indexed.

    IndexReport 3
    # This is how detailed you want reporting. You can specify numbers
    # 0 to 3 - 0 is totally silent, 3 is the most verbose.

    FollowSymLinks yes
    # Put "yes" to follow symbolic links in indexing, else "no".

    NoContents .gif .xbm .jpg
    # Files with these suffixes will not have their contents indexed -
    # only their file names will be indexed.

    ReplaceRules replace /home/rclark/public_html/ http://www.aa.net/~rclark/
    # ReplaceRules allows you to make changes to file path names
    # before they're indexed.

    FileRules pathname contains test newsmap
    FileRules filename is index.html rename chk lst bit
    FileRules filename contains ~ .bak .orig .000 .001 .old old. .map .cgi .bit .test test log- .log
    FileRules title contains test Test
    FileRules directory contains .htaccess
    # Files matching the above criteria will *not* be indexed.

    IgnoreLimit 80 50
    # This automatically omits words that appear too often in the files
    # (these words are called stopwords). Specify a whole percentage
    # and a number, such as "80 256". This omits words that occur in
    # over 80% of the files and appear in over 256 files. Comment out
    # to turn off autostopwording.

    IgnoreWords SwishDefault
    # The IgnoreWords option allows you to specify words to ignore.
    # Comment out for no stopwords; the word "SwishDefault" will
    # include a list of default stopwords. Words should be separated
    # by spaces and may span multiple directives.

After you set up SWISH-E for your site, create the indexes by running SWISH-E from the command line:

    swish -c swish.conf

You can use cron to update the indexes regularly or run the job manually when needed.
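The IgnoreLimit comment in Listing 31.17 describes a two-part test: a word is dropped only if it occurs in more than the given percentage of files and also in more than the given number of files. The following is a minimal Python sketch of that rule as the comment states it, not SWISH-E's actual C code. The function name auto_stopwords, the whitespace tokenization, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def auto_stopwords(docs, percent, min_files):
    """Return words found in more than `percent`% of the files
    AND in more than `min_files` files (the rule as described in
    the IgnoreLimit comment; hypothetical helper, not SWISH-E code)."""
    total = len(docs)
    # Count each word once per file, so the tally is
    # "number of files that contain the word".
    file_counts = Counter()
    for text in docs:
        file_counts.update(set(text.lower().split()))
    return {
        word
        for word, n in file_counts.items()
        if n > min_files and n * 100 > percent * total
    }


# Hypothetical toy corpus of four "files". The file-count threshold is
# lowered to 2 so that the rule can trigger on such a small sample.
docs = [
    "the small hours site",
    "the index of files",
    "the search form",
    "the swish index",
]
stopwords = auto_stopwords(docs, 80, 2)
print(sorted(stopwords))
```

With the listing's own setting of 80 50, the same four-file corpus would yield no stopwords, because no word can appear in more than 50 files. The second number therefore keeps small sites from losing common words too aggressively.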
Alternatively, you can use the AutoSWISH script, which is part of the distribution and automates the indexing process from an HTML form. Now that you have your indexes, you need a CGI script to search them. The distribution includes a sample script, which is also available on the accompanying CD-ROM as swish.cgi.
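As a rough illustration of what such a gateway does, the Python sketch below builds the command a script might run to query the index. It is not the swish.cgi script from the distribution. It assumes the swish binary is on the PATH and that the -f (index file), -w (search words), and -m (maximum results) switches behave as in the SWISH-E documentation. The function names and default values are hypothetical.

```python
import subprocess


def build_search_command(words, index_file="index.swish", max_results=25):
    """Build the argument list for a search (hypothetical helper).

    The query is passed as a single list element and never through a
    shell, so text typed into a Web form cannot be interpreted as
    shell syntax.
    """
    return ["swish", "-f", index_file, "-w", words, "-m", str(max_results)]


def search(words, index_file="index.swish", max_results=25):
    """Run the search and return the raw text output.

    Assumes a swish binary is installed; not called in this sketch.
    """
    cmd = build_search_command(words, index_file, max_results)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


command = build_search_command("small hours")
print(command)
```

Keeping command construction separate from execution makes the argument handling easy to check without an index on hand.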
SWISH-E provides relevance scores, but the scoring algorithm seems to favor small files with little text, among which keywords loom large. Because SWISH-E reports file sizes, it is possible to add a routine to Swish-Web to sort SWISH-E's output by file size. Another useful addition would be a second relevance ranking option that weights file size more heavily. A selection box on the form to limit the results to the first 10, 25, 50, 100, or 250 (or all) results might be another useful addition.
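The size-sorting and result-limit ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of Swish-Web: it assumes each result line has the form rank, path, quoted title, and size in bytes, and it silently skips any line that does not match. The sample output, paths, and function names are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed result-line shape: rank path "title" size
RESULT_LINE = re.compile(r'^(\d+)\s+(\S+)\s+"(.*)"\s+(\d+)$')


class Hit(NamedTuple):
    rank: int
    path: str
    title: str
    size: int


def parse_hits(output):
    """Collect the lines that look like results; ignore everything else."""
    hits = []
    for line in output.splitlines():
        m = RESULT_LINE.match(line.strip())
        if m:
            hits.append(Hit(int(m.group(1)), m.group(2), m.group(3), int(m.group(4))))
    return hits


def sort_by_size(hits, limit=None):
    """Order hits largest file first; keep only `limit` hits if given."""
    ordered = sorted(hits, key=lambda h: h.size, reverse=True)
    return ordered if limit is None else ordered[:limit]


# Hypothetical search output, invented for this example.
sample = """# header line that the parser skips
1000 /docs/a.html "Short Note" 512
800 /docs/b.html "Long Essay" 20480
600 /docs/c.html "Medium Page" 4096
.
"""
hits = parse_hits(sample)
top = sort_by_size(hits, limit=2)
for hit in top:
    print(hit.size, hit.rank, hit.title)
```

Passing limit=None returns every hit, which corresponds to the "all" choice in the proposed selection box.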
Use of this site is subject to certain Terms & Conditions, Copyright © 1996-2000 EarthWeb Inc. All rights reserved. Reproduction whole or in part in any form or medium without express written permission of EarthWeb is prohibited. Read EarthWeb's privacy statement.