Unfortunately, a side effect of this context approach is that multiple paragraphs from each found page can be returned. Although this may help further guide the user, many users may find it an annoyance. You could modify the Htgrep code to make it proceed to the next file upon finding a search hit, but doing so might cause the search to skip particularly relevant material. What is really needed is a more sophisticated approach that evaluates the fitness of a document by other rules, such as the number of hits per document and the proximity of the words found in a multiple-word search. It is difficult to add this level of sophistication to a grepping search engine. As discussed later in this chapter, you can find such features in some indexing search engines.
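The short Perl sketch below illustrates one such fitness rule. It is not part of Htgrep; the score_document routine, the ten-word proximity window, and the point weights are all illustrative assumptions. It scores a document by the total number of hits across the query terms and adds a bonus whenever two different terms appear close together.

#!/usr/bin/perl
# Hypothetical ranking sketch -- not part of Htgrep.
# Scores a document by total term hits plus a bonus when two
# different query terms occur within ten words of each other.
use strict;

sub score_document {
    my ($text, @terms) = @_;
    my @words = split /\s+/, lc $text;
    my %positions;                        # term => list of word positions
    for my $i (0 .. $#words) {
        for my $term (@terms) {
            push @{ $positions{$term} }, $i if index($words[$i], lc $term) >= 0;
        }
    }
    # base score: one point per hit, for every term
    my $score = 0;
    $score += scalar @{ $positions{$_} || [] } for @terms;
    # proximity bonus: five points for each pair of different terms
    # found within ten words of each other
    for my $a (@terms) {
        for my $b (@terms) {
            next if $a ge $b;             # consider each pair only once
            for my $pa (@{ $positions{$a} || [] }) {
                for my $pb (@{ $positions{$b} || [] }) {
                    $score += 5 if abs($pa - $pb) <= 10;
                }
            }
        }
    }
    return $score;
}

# usage: rank two candidate pages for the query "search engine"
print score_document("A grepping search engine scans every file.",
                     "search", "engine"), "\n";      # prints 7
print score_document("This page mentions search once.",
                     "search", "engine"), "\n";      # prints 1

An indexing engine can apply rules of this kind against a prebuilt index instead of rescanning every file on each query, which is why such features are practical there and awkward in a grepping engine.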
Htgrep also enables you to set the maximum number of records to return. This is an important feature because Htgrep makes no provision for ignoring stop words. Unfortunately, there is also no way to prevent Htgrep from returning very long records. Assume, for example, that you define <P> as your record delimiter. If you add a new document that uses <p> for paragraphs, or if you have long material contained within <PRE> tags, huge amounts of text can be returned on the results page. To solve this problem, you can modify the code to include a word counter that aborts the paragraph retrieval if the record is longer than 200 words. The following code fragment contains this modification:

# this is where Htgrep actually searches the file
while (<FILE>) {
    # call the subroutine that evaluates the search terms
    $queryCommand
    # optional filter definition
    $filter
    # remove all the nasty tags that can disturb paragraph display
    s/\<table/\<p/g;
    s/\<hr/\<p/g;
    s/\<HR/\<p/g;
    s/\<IMG/\<p/g;
    s/\<img/\<p/g;
    # transform relative URLs in found pages to full URLs
    if ((/\<A HREF/) && !(/http/) && !(/home/)) {
        s/\<A HREF \= /\<A HREF \= $dirname/g;
    }
    print $url;
    # count the number of words in the record
    @words = split(' ', $_);
    $wordcount = 0;
    foreach $word (@words) {
        $wordcount++;
    }
    # if it's too long, don't print the record
    if ($wordcount >= 200) {
        print "<H4>Excerpt would be greater than 200 \n";
        print "words. Select link above to see entire \n";
        print "page.</H4>\n";
        # skip to the next record
        next;
    }
    # otherwise print out the record
    print;
    # if you've printed up to the limit, stop
    last if (++$count == $maxcount);
}

Another side effect of returning the whole paragraph concerns what else besides text is returned. Because Htgrep grabs the whole paragraph, it also grabs links to images, bits of Java, JavaScript, or ActiveX code, and anything else contained in the paragraph. This is probably not what the user wants from a search engine. The resulting hits page can contain dozens of large GIFs and take a long time to download.
Because of this limitation, I modified the Htgrep script to remove all <IMG> tags. I must confess, I did this in a decidedly low-tech way by replacing all instances of <IMG with <P in all found paragraphs (see the previous example). It is crude, but effective. The resulting hits page is devoid of image tags (see Figure 31.6).
You will notice that another script modification produces a hyperlink to the found page, something that the base Htgrep script provides only if you elect plain-text formatting. Using Htgrep also introduces a security problem that you must take care of in the wrapper script. Because the search string can be a Perl regular expression, Htgrep executes it using Perl's eval function, which can allow your users to execute arbitrary commands on your Web server. To prevent this, be sure to prescreen search terms for dangerous characters or expressions, especially !sh, in the CGI wrapper that you use to call Htgrep.
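The fragment below is a minimal sketch of such a prescreen, assuming a wrapper that has already read the search string into $query. The safe_query routine, the length limit, and the exact blocklist are illustrative assumptions, not part of Htgrep.

#!/usr/bin/perl
# Hypothetical prescreen for a CGI wrapper around Htgrep (illustrative only).
use strict;

# reject search strings that could smuggle shell commands or Perl code
# into the eval that Htgrep performs on the search expression
sub safe_query {
    my ($query) = @_;
    return 0 if length($query) > 100;             # unreasonably long query
    return 0 if $query =~ /!sh/i;                 # block the !sh escape
    return 0 if $query =~ /[`;&|\$\\(){}<>]/;     # shell and Perl metacharacters
    return 1;
}

my $query = "html editors";    # in the real wrapper this comes from the form input
if (safe_query($query)) {
    print "OK to pass to Htgrep: $query\n";
} else {
    print "Rejected potentially dangerous search string.\n";
}

The blocklist is deliberately conservative, so a few legitimate regular expressions are refused along with the dangerous ones; when screening input that will be handed to eval, it is better to err on the side of rejection.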
Another nice feature of Htgrep is that, on NCSA servers, it ignores any directories that contain an access control file (.htaccess). Chances are, you don't want users searching these directories anyway. If you want finer control over which directories are searched, you can put a .htaccess file in your backup, administration, or internal directories. Other search engines require you to explicitly exclude such directories from the search, which adds administrative overhead for the poor Webmaster.

Implementing the Hukilau 2 Search Engine

Hukilau is a search script that searches through all the files in a directory. It can be very slow, so it is not practical for every site. You may use this script for noncommercial purposes free of charge. A single-site commercial license is available for $250.