Unfortunately, a side effect of this context approach is that multiple paragraphs from each found page can be returned. Although this may help further guide the user, many users may find it an annoyance. You could modify the Htgrep code to make it proceed to the next file upon finding a search hit, but doing so might cause the search to skip particularly relevant material. What is really needed is a more sophisticated approach that evaluates the fitness of a document by other rules, such as the number of hits per document and the proximity of the words found in a multiple-word search. It is difficult to add this level of sophistication to a grepping search engine. As discussed later in this chapter, you can find such features in some indexing search engines.
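The short Perl sketch below illustrates one such fitness rule. It is not part of Htgrep; the score_document routine, the ten-word proximity window, and the point weights are all illustrative assumptions. It scores a document by the total number of hits across the query terms and adds a bonus whenever two different terms appear close together.

#!/usr/bin/perl
# Hypothetical ranking sketch -- not part of Htgrep.
# Scores a document by total term hits plus a bonus when two
# different query terms occur within ten words of each other.
use strict;

sub score_document {
    my ($text, @terms) = @_;
    my @words = split /\s+/, lc $text;
    my %positions;                        # term => list of word positions
    for my $i (0 .. $#words) {
        for my $term (@terms) {
            push @{ $positions{$term} }, $i if index($words[$i], lc $term) >= 0;
        }
    }
    # base score: one point per hit, for every term
    my $score = 0;
    $score += scalar @{ $positions{$_} || [] } for @terms;
    # proximity bonus: five points for each pair of different terms
    # found within ten words of each other
    for my $a (@terms) {
        for my $b (@terms) {
            next if $a ge $b;             # consider each pair only once
            for my $pa (@{ $positions{$a} || [] }) {
                for my $pb (@{ $positions{$b} || [] }) {
                    $score += 5 if abs($pa - $pb) <= 10;
                }
            }
        }
    }
    return $score;
}

# usage: rank two candidate pages for the query "search engine"
print score_document("A grepping search engine scans every file.",
                     "search", "engine"), "\n";      # prints 7
print score_document("This page mentions search once.",
                     "search", "engine"), "\n";      # prints 1

An indexing engine can apply rules of this kind against a prebuilt index instead of rescanning every file on each query, which is why such features are practical there and awkward in a grepping engine.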
Htgrep also enables you to set the maximum number of records to return. This is an important feature because Htgrep makes no provision for ignoring stop words. Unfortunately, there is also no way to prevent Htgrep from returning very long records. Assume, for example, that you define <P> as your record delimiter. If you add a new document that uses <p> for paragraphs, or if you have long material contained within <PRE> tags, huge amounts of text can be returned on the results page. To solve this problem, you can modify the code to include a word counter that aborts the paragraph retrieval if the record is longer than 200 words. The following code fragment contains this modification:

# this is where Htgrep actually searches the file
while (<FILE>) {
    # call the subroutine that evaluates the search terms
    $queryCommand
    # optional filter definition
    $filter
    # remove all the nasty tags that can disturb paragraph display
    s/\<table/\<p/g;
    s/\<hr/\<p/g;
    s/\<HR/\<p/g;
    s/\<IMG/\<p/g;
    s/\<img/\<p/g;
    # transform relative URLs in found pages to full URLs
    if ((/\<A HREF/) && !(/http/) && !(/home/)) {
        s/\<A HREF \= /\<A HREF \= $dirname/g;
    }
    print $url;
    # count the number of words in the record
    @words = split(' ', $_);
    $wordcount = 0;
    foreach $word (@words) {
        $wordcount++;
    }
    # if it's too long, don't print the record
    if ($wordcount >= 200) {
        print "<H4>Excerpt would be greater than 200 \n";
        print "words. Select link above to see entire \n";
        print "page.</H4>\n";
        # skip to the next record
        next;
    }
    # otherwise print out the record
    print;
    # if you've printed up to the limit, stop
    last if (++$count == $maxcount);
}

Another side effect of returning the whole paragraph concerns what else besides text is returned. Because Htgrep grabs the whole paragraph, it also grabs links to images, bits of Java, JavaScript, or ActiveX code, and anything else contained in the paragraph. This is probably not what the user wants from a search engine. The resulting hits page can contain dozens of large GIFs and take a long time to download.
Because of this limitation, I modified the Htgrep script to remove all <IMG> tags. I must confess, I did this in a decidedly low-tech way by replacing all instances of <IMG with <P in all found paragraphs (see the previous example). It is crude, but effective. The resulting hits page is devoid of image tags (see Figure 31.6).
You will notice that another script modification produces a hyperlink to the found page, something that the base Htgrep script provides only if you elect plain-text formatting. Using Htgrep also introduces a security problem that you must take care of in the wrapper script. Because the search string can be a Perl regular expression, Htgrep executes it using Perl's eval function, which can allow your users to execute arbitrary commands on your Web server. To prevent this, be sure to prescreen search terms for dangerous characters or expressions, especially !sh, in the CGI wrapper that you use to call Htgrep.
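The fragment below is a minimal sketch of such a prescreen, assuming a wrapper that has already read the search string into $query. The safe_query routine, the length limit, and the exact blocklist are illustrative assumptions, not part of Htgrep.

#!/usr/bin/perl
# Hypothetical prescreen for a CGI wrapper around Htgrep (illustrative only).
use strict;

# reject search strings that could smuggle shell commands or Perl code
# into the eval that Htgrep performs on the search expression
sub safe_query {
    my ($query) = @_;
    return 0 if length($query) > 100;             # unreasonably long query
    return 0 if $query =~ /!sh/i;                 # block the !sh escape
    return 0 if $query =~ /[`;&|\$\\(){}<>]/;     # shell and Perl metacharacters
    return 1;
}

my $query = "html editors";    # in the real wrapper this comes from the form input
if (safe_query($query)) {
    print "OK to pass to Htgrep: $query\n";
} else {
    print "Rejected potentially dangerous search string.\n";
}

The blocklist is deliberately conservative, so a few legitimate regular expressions are refused along with the dangerous ones; when screening input that will be handed to eval, it is better to err on the side of rejection.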
Another nice feature of Htgrep is that, on NCSA servers, it ignores any directories that contain an access control file (.htaccess). Chances are, you don't want users searching these directories anyway. If you want finer control over which directories are searched, you can put a .htaccess file in your backup, administration, or internal directories. Other search engines require you to explicitly exclude such directories from the search, which adds administrative overhead for the poor Webmaster.

Implementing the Hukilau 2 Search Engine

Hukilau is a search script that searches through all the files in a directory. It can be very slow, so it is not practical for every site. You may use this script for noncommercial purposes free of charge. A single-site commercial license is available for $250.