Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98



This script works: it finds instances of a search string in every file in a directory tree. But it ignores some problems and lacks several features. It would be nice, for example, to choose whether the search is case sensitive and whether multiple words are combined with a Boolean AND or OR. The display does not provide links to the found files. The context of each search hit is also missing: you know that the search terms appear in these files, but you have no idea whether their use is trivial or important. You don’t know how many times the search string was found, and you have no way to evaluate the relevance of a file.


NOTE:  Rarely on a site is there a directory tree in which every HTML file and directory is available to the public. On my own site, many protected directories require a user ID and password to access. In addition, a number of experimental files, backup files, or other files are not linked to the main site and are not for public consumption. This rudimentary script searches all files on the site regardless of whether they are protected.

Implementing a Third-Party Grepping Search Engine

Several very popular grepping search engines are available on the Web. The following sections examine three of them:

  Matt’s Simple Search Engine by Matthew M. Wright, author of the famous Matt’s Perl Script Archive
  Htgrep by Oscar Nierstrasz
  Hukilau 2 from Adams Communications

All are written in Perl, and each has a little something to recommend it. All solve many of the problems mentioned in the last section and provide added functionality.

Implementing Matt’s Simple Search Engine

You can find Matt’s Simple Search Engine in Matt’s Script Archive at http://www.worldwidemart.com/scripts/, one of the most popular Perl script archives on the Web.

Implementing Matt’s search engine is fairly simple: get the distribution archive, install it on your site, configure it, and create a search form. To configure the script, you need to edit several lines at the top. Set the base directory, which tells the script where your files live on the server, and the base URL for the site, which is used to create links to the found pages. You also need to insert a title to put on the resulting page and furnish links for the home page and search page.

Because Matt’s script does not recurse into subdirectories, you also need to specify every subdirectory you want searched. This list can be tedious to maintain as your site changes, so you may want to modify the file-finding script from the previous example and combine it with calls to Matt’s engine to perform the search.
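One way to avoid maintaining the directory list by hand is to generate it. The following is a minimal sketch in Python rather than Perl, and it is not part of Matt’s distribution. The function name `list_search_dirs` and the `exclude` parameter are hypothetical, introduced only for illustration.

```python
import os

def list_search_dirs(base_dir, exclude=()):
    """Return every directory under base_dir (relative paths), skipping excluded names.

    Hypothetical helper: the output could be pasted into a search script's
    directory list instead of being maintained by hand.
    """
    found = []
    for current, subdirs, _files in os.walk(base_dir):
        # Prune excluded directories in place so os.walk never descends into them.
        subdirs[:] = sorted(d for d in subdirs if d not in exclude)
        found.append(os.path.relpath(current, base_dir))
    return found
```

Passing the names of protected or experimental directories in `exclude` keeps them out of the generated list, which also addresses the concern raised in the earlier note about searching files that are not meant for the public.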

After you finish configuring, you need to create a page that incorporates something similar to Listing 31.11.

Listing 31.11 mattform.txt—A Simple Form Allowing the Selection of Search Parameters


<FORM method="POST"
        action="http://worldwidemart.com/scripts/cgi-bin/demos/search.cgi">
<CENTER><TABLE border>
<TR>
<TH>Text to Search For: </TH>
<TH><INPUT type="text" name="terms" size="40"><BR></TH>
</TR><TR>
<TH>Boolean: <SELECT name="boolean">
<OPTION>AND
<OPTION>OR
</SELECT> </TH><TH>Case <SELECT name="case">
<OPTION>Insensitive
<OPTION>Sensitive
</SELECT><BR></TH>
</TR><TR>
<TH colspan="2"><INPUT type="submit" value="Search!">
<INPUT type="reset"><BR></TH>
</TR></TABLE></CENTER></FORM>
<HR size="7" width="75%"><P>

This form produces a Web page similar to that shown in Figure 31.3.

You may wish to design your own search interface. If so, your form needs to present the following three parameters to the search script:

  Terms—A text string containing one or more words
  Boolean—The Boolean AND or OR
  Case—Whether the search should be case sensitive or case insensitive
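To make the role of the three parameters concrete, here is a minimal Python sketch of how a grepping search might apply them to the text of one file. It is an assumption about the general technique, not Matt’s actual Perl code. The function name `matches` is hypothetical, and the option values mirror those in the form in Listing 31.11.

```python
def matches(text, terms, boolean="AND", case="Insensitive"):
    """Decide whether one file's text satisfies the search parameters.

    terms   -- string of one or more space-separated words
    boolean -- "AND" (all words required) or "OR" (any word suffices)
    case    -- "Insensitive" or "Sensitive"
    """
    words = terms.split()
    if not words:
        return False  # an empty search matches nothing
    if case == "Insensitive":
        # Fold both sides to lowercase so the comparison ignores case.
        text = text.lower()
        words = [w.lower() for w in words]
    hits = [w in text for w in words]
    return all(hits) if boolean == "AND" else any(hits)
```

Note that this sketch uses plain substring tests, so it shares the partial-match behavior of a simple grepping engine.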


FIGURE 31.3  You can use the generic form provided with Matt’s Search Engine to allow user input.


Make sure you use the POST method to call Matt’s Simple Search Engine. If you use GET, the script won’t work because Matt’s script reads form input from <STDIN>.
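The difference comes from how CGI delivers form data: a POST request arrives on standard input, with its length given in the `CONTENT_LENGTH` environment variable, whereas a GET request arrives in the `QUERY_STRING` environment variable. The Python sketch below illustrates both paths. It is a generic illustration with a hypothetical function name, `read_form`, and is not how Matt’s script is written; a script that reads only standard input finds nothing there when called with GET.

```python
from urllib.parse import parse_qs

def read_form(environ, stdin):
    """Return form fields as a dict, reading from the place CGI puts them."""
    if environ.get("REQUEST_METHOD") == "POST":
        # POST: the encoded form is the request body, delivered on stdin.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the encoded form is appended to the URL and exposed here.
        raw = environ.get("QUERY_STRING", "")
    # parse_qs returns a list per field; keep the first value of each.
    return {name: values[0] for name, values in parse_qs(raw).items()}
```

In a real CGI program the arguments would be `os.environ` and `sys.stdin`.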

The result of a search using Matt’s Simple Search Engine interface will look similar to that shown in Figure 31.4.


FIGURE 31.4  The results page from Matt’s Search Engine provides links to the found pages.

Notice that each found page is represented by a link to that page. The search terms are also provided, along with the Boolean and case sensitivity settings.

Matt’s script works fine and is fairly fast. It took 3 CPU seconds and about 10 elapsed seconds to search about 250 files on my site.

Some desirable features are lacking, however—for example, only the titles of found files are displayed. No context indicates whether the search term is merely mentioned in the file or whether significant information about the term is contained in it. When presented with a list of dozens of files as the result of a search, with no way to distinguish between them, users may become weary of trying to find the information and visit a different site.
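Showing context does not require much machinery. The following Python sketch, which is not a feature of Matt’s script, extracts a short excerpt around the first occurrence of a term. The function name `snippet` and the `width` parameter are hypothetical.

```python
def snippet(text, term, width=40):
    """Return up to `width` characters on each side of the first hit, or "" if none."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    # Mark truncation so the reader can tell the excerpt is partial.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```

Displayed under each title, an excerpt like this lets users judge whether a hit is worth following.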

File titles are presented in no particular order, which is not very helpful in determining their relevance.

The results also do not indicate how many times a search term was found in a particular file or, in the case of multiple-word search terms, whether the words were found in close proximity. The user has no control over partial matches such as finding “state” within “estate” and “intestate.” Whatever the user types becomes the search string.
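Hit counts, whole-word matching, and a relevance ordering can be sketched briefly. The Python example below is an illustration of the general idea, not code from any of the engines discussed here. The names `count_hits` and `rank` are hypothetical, and ranking by raw hit count is only one simple notion of relevance.

```python
import re

def count_hits(text, term, whole_word=True, ignore_case=True):
    """Count occurrences of term; with whole_word, "state" skips "estate"."""
    pattern = re.escape(term)
    if whole_word:
        # \b anchors the match at word boundaries on both sides.
        pattern = r"\b" + pattern + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))

def rank(pages, term):
    """pages maps title -> text; return (title, hits) pairs, most hits first."""
    scored = [(title, count_hits(text, term)) for title, text in pages.items()]
    # Drop pages with no hits; break ties alphabetically by title.
    return sorted((s for s in scored if s[1] > 0), key=lambda s: (-s[1], s[0]))
```

For example, `count_hits("The state of the estate; intestate State.", "state")` returns 2, while the same call with `whole_word=False` returns 4.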

In addition, various implementation problems exist with this simple search engine. Because it does not support recursion, control over which directories are searched rests entirely in the hands of the Webmaster, who must remember to add new directories to the variable in the script file. Files or directories also are not easily excluded from a search. In addition, no limit is placed on the number of files that can be returned, nor are stop words ignored. Given the way that directories must be explicitly specified, this may not seem to be a big drawback, but what if you have painstakingly added all directories on your site to the script and someone searches for the word “the”? A better way is definitely needed to control the directories that are searched.
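Stop words and a cap on results are both small additions. The Python sketch below shows one way they might look. The stop-word list, the default limit of 25, and the function names are all illustrative assumptions rather than settings from any of the engines discussed here.

```python
# Illustrative stop-word list; a real site would tune this to its own content.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to"}

def prepare_terms(terms):
    """Split the search string and drop words too common to be useful."""
    return [w for w in terms.split() if w.lower() not in STOP_WORDS]

def limit_results(results, max_results=25):
    """Return at most max_results entries from an already ordered result list."""
    return results[:max_results]
```

With this in place, a search for “the” alone yields an empty term list, so the script can return a short message instead of scanning every file.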

Fortunately, Htgrep satisfies many of these objections.

