Any time you add a piece of software to your site, you need to be concerned with its impact on site security. Can the software be overwhelmed by an attack and provide direct access to the site? Does it offer a way for users to execute programs on your server? Before releasing a search engine for production use, you may want to experiment with it: try to overwhelm it or get it to produce unpredictable results.


CAUTION:  

Be aware of security concerns regarding implementations of Perl on Windows NT. See http://www.perl.com/perl/news/latro-announce.html for more information.


The possibility that users could use your search engine to execute arbitrary code on your Web server is obviously a serious security concern. If the search engine uses the Perl eval command to perform the search, be sure to screen search terms and strip potentially harmful characters and code before passing them to the search engine. On UNIX systems, this means preventing the user from entering a search term containing the shell escape symbol (!) or anything that could be used to invoke a command interpreter (!sh, for example).
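
To make this concrete, the following fragment shows one way such screening might look in Perl. It is only a rough sketch, not code from this book's listings: the subroutine name and the set of rejected characters are illustrative assumptions that a production script would adapt to its own environment.

#!/usr/bin/perl -w
# Rough sketch: screen a search term before it reaches eval or a regular
# expression. The rejected-character set here is illustrative, not complete.
use strict;

sub sanitize_term {
    my ($term) = @_;

    # Reject terms containing shell or Perl metacharacters outright.
    return undef if $term =~ /[;&|`!<>\$\\]/;

    # Escape any remaining regex metacharacters so the term matches literally.
    return quotemeta($term);
}

my $raw = shift @ARGV;
defined $raw or die "Usage: $0 search-term\n";

my $safe = sanitize_term($raw);
defined $safe or die "Unsafe search term rejected\n";
print "Safe pattern: $safe\n";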

Even if your search engine doesn't open a security hole of this kind, you still need to be sure that users can't see information on your site that they are normally prevented from seeing. On sites using the NCSA Web server, for example, it is common to use access control files (typically .htaccess) to restrict access to sensitive directories. If the search engine ignores these access control files, it can return links to, or summaries of, the files in protected directories. At best, your users will be frustrated at seeing links they are not allowed to follow. At worst, file summaries can compromise the confidentiality of protected information.
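
A simple precaution is to have the search script skip any directory that carries an access control file before searching or indexing anything in it. The sketch below illustrates the idea only; the directory list is a placeholder, and a real script would derive it from its own configuration.

#!/usr/bin/perl -w
# Rough sketch: skip any directory carrying an NCSA-style .htaccess file so
# protected content never shows up in search results. Directory names are
# placeholders for illustration.
use strict;

my @candidate_dirs = ('/usr/local/www/docs', '/usr/local/www/docs/private');

foreach my $dir (@candidate_dirs) {
    if (-e "$dir/.htaccess") {
        warn "Skipping protected directory: $dir\n";
        next;
    }
    print "Would search files in: $dir\n";
}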

And finally, a security concern that is really a resource concern: You may want to limit the amount of resources any one user of your search engine can consume, or the number of searches that can run at the same time. A malicious user can bring your server to its knees by launching a large number of time-consuming searches. Most search engines do provide a method of controlling access in this way; if yours does not, you may need to use other system-management tools to regulate search engine use.
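
One low-tech way to enforce such a limit is to keep a count of running searches in a lock-protected file and refuse new searches beyond a threshold. The sketch below only illustrates the idea; the file path and the limit are assumptions, and a real script would also decrement the counter when each search finishes or dies.

#!/usr/bin/perl -w
# Rough sketch: cap simultaneous searches with a flock-protected counter
# file. The path and limit are illustrative assumptions.
use strict;
use Fcntl qw(:flock);

my $counter_file = '/tmp/search_count';
my $max_searches = 5;

open(my $fh, '+<', $counter_file)
    or open($fh, '+>', $counter_file)
    or die "Cannot open counter file: $!\n";
flock($fh, LOCK_EX) or die "Cannot lock counter file: $!\n";

seek($fh, 0, 0);
my $running = <$fh>;
$running = 0 unless defined $running;
chomp $running;

if ($running >= $max_searches) {
    print "Content-type: text/plain\n\nServer busy; please try again shortly.\n";
    exit;
}

# Record one more running search, then release the lock and run the search.
seek($fh, 0, 0);
truncate($fh, 0);
print $fh $running + 1, "\n";
close($fh);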

Making the Decision

Which search engine you select depends in part on whether you prefer the grepping approach, which always reflects the current files but is resource-hungry, or the faster, CPU-friendly indexing approach. Regardless of the approach you pick, you should evaluate several requirements before selecting your engine:

  How easy is it to maintain?
Indexing engines take more maintenance by their very nature. But if maintaining your search engine means remembering to update variables or rerun indexes when new information is added, you need to decide if you’re willing to spend the time. By the same token, if your grepping engine looks at all directories on your site, you need to keep that in mind when creating new directories. The best search engine is probably one that you can set and forget.
  Does it automatically recurse directories?
This is a security question closely related to maintenance concerns. If the engine needs to be told explicitly what to search for, you will spend more time maintaining it. If it automatically searches new directories, you need to be aware of sensitive or password-protected information when creating new ones.
  Does it honor access control files?
These files are a simple way to control access to information on your site, but if your search tool gives users access to the protected files or to summaries of them, that security is breached. At best, users are frustrated when files they cannot access turn up in the search results.
  Does it reject searches for garbage, noise, or stop words?
No matter which type of engine you select, you don’t want to waste resources running down every instance of the word “the” on your site (a minimal stop-word check is sketched after this list).
  Does it allow for complex searches?
A good search engine will at least allow for Boolean searches and searches that are not case-sensitive. The capability to search for regular expressions is also desirable. More sophisticated engines evaluate word proximity or enable users to search on concepts.
  Does it index offsite links?
You have to make up your own mind as to whether you want your engine to index such links.
  Does it provide a context so that the user can evaluate the suitability of the found file?
At a minimum, the search engine needs to offer a hyperlink to the relevant file. It is more helpful, however, if a summary of the file is available, especially if the files are large.
  Does it present search results in small groups or in one big list?
To avoid overwhelming your users with a huge results page that takes forever to download, some control is needed. The engine can either present the results in small groups, offering a link to the next set, or enable the user or Webmaster to control the number of files returned by any one search.
  Does it enable you to capture information about what users are searching for?
You can better design your site to serve your users if you know just what they are looking for. Data on user searches can be a very important tool in determining the organization of your site.
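
To illustrate the stop-word check mentioned above, the short routine below refuses to search when a term contains nothing but noise words. It is a sketch only; the stop-word list is a small illustrative sample, not a canonical set.

#!/usr/bin/perl -w
# Rough sketch of a stop-word filter: refuse to search when no meaningful
# words remain in the term. The stop-word list is a small sample.
use strict;

my %stop_words = map { $_ => 1 } qw(the a an and or of to in is it);

sub is_worth_searching {
    my ($term) = @_;
    my @real_words = grep { length($_) && !$stop_words{lc $_} }
                     split /\W+/, $term;
    return scalar @real_words;
}

print is_worth_searching('the and of')        ? "search\n" : "rejected\n";
print is_worth_searching('the search engine') ? "search\n" : "rejected\n";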

These are just some of the questions you should ask yourself as you plan to add a search capability to your site. The discussion that follows examines how well various approaches satisfy these requirements.

Implementing a Grepping Search Engine

Grepping search engines share a common methodology: Start at a chosen point in the directory tree, open each HTML file in that directory, and search it for the search term. Optionally, the engine can recursively descend into each subdirectory it encounters and repeat the search process.

By itself, this approach supports only unsophisticated searches, although it is possible to add support for searches based on regular expressions.
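
At its heart, the per-file search is just a loop over the lines of each file. As a rough sketch (the file name and pattern below are placeholders, and the search term is assumed to have been screened already), it might look like this:

#!/usr/bin/perl -w
# Rough sketch of the per-file step in a grepping engine: scan one HTML
# file line by line for a case-insensitive pattern. The file name and
# pattern are illustrative placeholders.
use strict;

my $file    = 'index.html';
my $pattern = 'search\s+engine';   # user-supplied term, already screened

open(my $html, '<', $file) or die "Cannot open $file: $!\n";
while (my $line = <$html>) {
    if ($line =~ /$pattern/i) {
        print "$file (line $.): $line";
        last;    # one hit per file is enough for a results list
    }
}
close($html);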

Building Your Own Grepping Search Engine

To help you better understand how grepping search engines work, this section shows how you can use the Perl language to build your own. In building your own grepping search engine, you will need to tackle two problems: finding files to search, and searching those files for search terms.

First, consider the problem of finding files to search. Using a couple of key Perl capabilities, it is easy to build a recursive routine that identifies the files of interest within a directory, performs an operation on them, and continues the process with each subdirectory. The Perl script in Listing 31.7 demonstrates this approach.
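
As a rough sketch of the same recursive idea (this is not the book's Listing 31.7, and the starting directory is a placeholder), the standard File::Find module can walk a directory tree and collect every HTML file it finds:

#!/usr/bin/perl -w
# Rough sketch (not Listing 31.7): walk a directory tree with the standard
# File::Find module and collect every HTML file found. The starting
# directory is a placeholder.
use strict;
use File::Find;

my $start_dir = '/usr/local/www/docs';
my @html_files;

find(sub {
    # $_ is the current file name; $File::Find::name is its full path.
    push @html_files, $File::Find::name if -f $_ && /\.html?$/i;
}, $start_dir);

print "$_\n" for @html_files;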

