Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98



After you have set the configuration variables, run the script from the command line to create the index. Whenever you want to update the index, run the ice-idx.pl script again. It overwrites the existing index with the new one.


You can use the UNIX cron utility to schedule your index updates.
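One way to schedule the update is a small wrapper script that cron calls. The sketch below is an illustration only: the function name run_indexer, the INDEX_LOG variable, and every path shown are assumptions, not part of ICE.

```shell
#!/bin/sh
# Hypothetical wrapper: run an indexing command and log the outcome.
# INDEX_LOG and all paths here are assumptions; adjust them for your site.
INDEX_LOG="${INDEX_LOG:-/tmp/reindex.log}"

run_indexer() {
    # "$@" is the indexing command, for example: perl /path/to/ice-idx.pl
    if "$@" >>"$INDEX_LOG" 2>&1; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') index rebuilt" >>"$INDEX_LOG"
    else
        status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') index FAILED (exit $status)" >>"$INDEX_LOG"
        return "$status"
    fi
}

# When the script is called with arguments, treat them as the command to run.
if [ "$#" -gt 0 ]; then
    run_indexer "$@"
fi

# Example crontab entry (edit with "crontab -e"): rebuild nightly at 02:30.
# 30 2 * * * /usr/local/bin/reindex.sh perl /home/user/cgi-bin/ice-idx.pl
```

Logging the exit status makes a failed overnight run visible the next morning instead of leaving a stale index unnoticed.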

Searching from a Web Browser with ICE

The search form presents a choice of directories in a drop-down selection box. You can specify these directories in the script. Listing 31.16 shows how to accomplish this task.

Listing 31.16 A Sample ICE Search Script


# Title or name of your server:
local($title)="ICE Indexing Gateway";

# search directories to present in the search dialogue
local(@directories)=(
    "Public HTML Directory",
    "Another HTML Directory"
);

Now you can install the script in your CGI directory and call it from your Web browser.

Implementing SWISH-E (Simple Web Indexing System for Humans-Enhanced)

SWISH-E is easy to set up and offers fast, reliable searching for Web sites. In indexing HTML files, SWISH-E can ignore data in most tags while giving higher relevance to information in header and title tags. You can also limit your search to words in HTML titles, comments, emphasized tags, and META tags. SWISH-E creates a small and portable index consisting of a single file averaging around 1% to 5% of the size of the original source files.

Kevin Hughes wrote the original SWISH program in C for UNIX Web servers. In autumn 1996, The Library of UC Berkeley received permission from Kevin Hughes to implement bug fixes and enhancements to the original binary. SWISH-E is freeware, available from the Berkeley Digital Library Sunsite at

http://sunsite.berkeley.edu/SWISH-E/

Installing SWISH-E is straightforward. After uncompressing and untarring the source files, edit the src/config.h file and compile SWISH-E for your system.

Configuring SWISH-E isn’t hard either. You set up a configuration file, swish.conf, which the indexer reads. Listing 31.17 shows a sample SWISH-E configuration file.

Listing 31.17 swish.conf: A Sample SWISH-E Configuration File


# SWISH-E configuration file

IndexDir /home/rclark/public_html/
# This is a space-separated list of files and directories you
# want indexed. You can specify more than one of these directives.

IndexFile index.swish
# This is what the generated index file will be.

IndexName "Index of Small Hours files"
IndexDescription "General index of the Small Hours Web site"
IndexPointer "http://www.aa.net/~rclark/"
IndexAdmin "Rod Clark (rclark@aa.net)"
# Extra information you can include in the index file.

IndexOnly .html .txt .gif .xbm .jpg
# Only files with these suffixes will be indexed.

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".

NoContents .gif .xbm .jpg
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.

ReplaceRules replace "/home/rclark/public_html/" "http://www.aa.net/~rclark/"
# ReplaceRules allows you to make changes to file path names
# before they’re indexed.

FileRules pathname contains test newsmap
FileRules filename is index.html rename chk lst bit
FileRules filename contains ~ .bak .orig .000 .001 .old old. .map .cgi .bit .test test log- .log
FileRules title contains test Test
FileRules directory contains .htaccess
# Files matching the above criteria will *not* be indexed.

IgnoreLimit 80 50
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off autostopwording.

IgnoreWords SwishDefault

# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated
# by spaces and may span multiple directives.

After you set up SWISH-E for your site, create the indexes by running SWISH-E from the command line:

swish -c swish.conf

You can use cron to update the indexes regularly, or run the job manually when needed. Alternatively, the AutoSWISH script included in the distribution automates the indexing process from an HTML form.

Now that you have your indexes, you need a CGI script to access them. The distribution includes a sample script, which is also available on the accompanying CD-ROM as swish.cgi.

Using Swish-Web, a SWISH-E Gateway

Swish-Web is in the public domain. If you would like to practice a little programming on it, here are a few ideas for additions to the script.

NOTE: The complete Perl source code for the Swish-Web gateway is on the accompanying CD-ROM as swishweb.cgi. It is an example of a Web gateway for a UNIX command-line program.
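Any gateway to a command-line program has to clean the user's input before it reaches the command line. The shell sketch below illustrates that one step and is not taken from swishweb.cgi. The function names clean_query and search_index are hypothetical, and the -f (index file) and -w (search words) options are assumed from SWISH-E's usage message, so check them against your installed version.

```shell
#!/bin/sh
# Keep only letters, digits, and spaces; every other character is dropped,
# so shell metacharacters and leading dashes never reach the command line.
clean_query() {
    printf '%s' "$1" | tr -cd 'A-Za-z0-9 '
}

# Hypothetical gateway step: $1 is the index file, $2 the raw form input.
search_index() {
    index="$1"
    words="$(clean_query "$2")"
    # Refuse to search when no usable word survives the cleaning.
    case "$words" in
        *[A-Za-z0-9]*) ;;
        *) return 1 ;;
    esac
    # $words is left unquoted on purpose so each word becomes one argument.
    # The -f and -w options are assumptions; verify them for your build.
    swish -f "$index" -w $words
}
```

Stripping to a small set of allowed characters is stricter than escaping, but it is easier to reason about in a script that hands text to another program.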

SWISH-E provides relevance scores, but the scoring algorithm seems to favor small files with little text, among which keywords loom large. Because SWISH-E reports file sizes, it is possible to add a routine to Swish-Web to sort SWISH-E’s output by file size. Another useful addition would be a second relevance ranking option that weights file size more heavily.
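The sort by file size could be prototyped as a filter on SWISH-E's output before it is built into the gateway. This sketch assumes each hit is one line of the form: rank, path, quoted title, size in bytes, with the size as the last field. That layout and the function name sort_by_size are assumptions; compare them with the output of your own SWISH-E build.

```shell
#!/bin/sh
# Read SWISH-E results on standard input and print the hit lines
# ordered by file size, largest first.
# Assumption: hit lines start with a digit and end with the size in bytes.
sort_by_size() {
    grep '^[0-9]' |
    awk '{ print $NF "\t" $0 }' |   # copy the size to the front as a sort key
    sort -n -r -k1,1 |              # numeric sort on the key, descending
    cut -f2-                        # remove the key, leaving the original line
}
```

Header lines and the end-of-results marker are dropped by the first grep, so the caller receives only the reordered hits.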

A selection box on the form to limit the results to the first 10, 25, 50, 100, or 250 (or all) results might be another useful addition.
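The value from such a selection box could be applied as a filter on the result list. The sketch below is one possible approach: the function name limit_results is hypothetical, and it assumes hit lines are the ones that start with a digit, which should be checked against your SWISH-E output.

```shell
#!/bin/sh
# Pass SWISH-E output through, keeping at most $1 hit lines.
# $1 is a whole number, or the word "all" for no limit.
# Lines that are not hits (headers, end marker) are always kept.
limit_results() {
    max="$1"
    case "$max" in
        all)
            cat
            ;;
        ''|*[!0-9]*)
            # Reject anything that is not a plain number; the value
            # comes from a Web form and cannot be trusted.
            echo "limit_results: bad limit" >&2
            return 2
            ;;
        *)
            awk -v max="$max" '/^[0-9]/ { if (++n > max) next } { print }'
            ;;
    esac
}
```

For example, piping the search output through limit_results 25 keeps the first 25 hits and discards the rest.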


