
Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98



You may also worry about the security aspects of turning over so much information about your site to a third party. In reality, however, many other third parties already index your site (AltaVista, Lycos, Excite, and all the rest). If you are going to worry, you might as well worry about them. When a third party provides an important service such as this for your site, you are giving up a lot of control. You are trusting the third party to maintain the search engine, as well as to only index those sections of the site that you want your users to see. You are also trusting them to maintain a timely index of your site and to be available to your users constantly. If these compromises work for you, this approach is quick and easy. If, on the other hand, you don’t want to give up control of such an important function of your site, you should consider implementing your own search engine.

Evaluating Complexity of Searching

Your choice of a search engine depends, in part, on how complex the searches on your site are likely to be. If relevant documents can be found with a single keyword, little difference exists between the grepping and the indexing approaches. If, on the other hand, the average user wants to run multiple-word searches or searches involving concepts other than keywords, an indexing search engine is the better choice.

In general, indexing search engines can accomplish more complex searches than grepping engines. A grepping engine essentially performs string comparisons. It may support regular expressions, wildcards, and both fuzzy and normal Boolean matching, but it is difficult to implement more sophisticated context matching or concept searching in this type of engine. Because a grepping engine must scan the pages themselves for every query, its sheer overhead makes it difficult to do multipass searching of any kind.
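To make the grepping approach concrete, here is a minimal sketch in Python. The function name `grep_search` and the sample pages are assumptions made for illustration, and a real grepping engine would read files from disk rather than from a dictionary. Notice that every query scans the full text of every page.

```python
import re

def grep_search(pages, pattern):
    """Scan the text of every page for a regular expression.

    Returns the names of the pages that match, in the order they were given.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name, text in pages.items() if regex.search(text)]

# Hypothetical in-memory "site" used only for this example.
pages = {
    "index.html": "Welcome to our catalog of garden tools.",
    "spades.html": "Spades and shovels for every garden.",
    "contact.html": "Write to us with questions.",
}

garden_pages = grep_search(pages, r"garden")
shovel_pages = grep_search(pages, r"shovels?|rakes?")
```

The cost of each call grows with the total size of the site, which is why features that need several passes over the text are hard to add to this kind of engine.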

Using an indexing approach, a search engine can spend more time examining the relationships between search terms and found pages. Because the engine doesn’t need to burn processor time churning through all the pages in a site, it can offer nice features such as relevancy ranking and concept searching.
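The following sketch shows one simple way an indexing engine might work, assuming a small in-memory index. The names `build_index` and `search` are hypothetical, and the ranking used here (total number of occurrences of the query words) is only a stand-in for the more sophisticated relevancy ranking that real engines provide.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to a Counter of {page name: number of occurrences}."""
    index = defaultdict(Counter)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word][name] += 1
    return index

def search(index, query):
    """Return pages containing every query word, most occurrences first."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, Counter()) for word in words]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    scores = {name: sum(p[name] for p in postings) for name in candidates}
    # Sort by descending score, then by name so ties have a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))

# Hypothetical pages used only for this example.
pages = {
    "a.html": "garden tools and garden hoses",
    "b.html": "garden furniture",
    "c.html": "power tools",
}
index = build_index(pages)
both_words = search(index, "garden tools")
one_word = search(index, "garden")
```

The index is built once, ahead of time. At query time the engine looks up only the words in the query, so the time saved can be spent on ranking the results.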

Understanding Indexing Issues

Issues to consider when evaluating an indexing search engine include the following:

  Resource usage—The size of the index, the speed of search, and the impact on the CPU
  Handling of “stop” words—How the engine deals with commonly occurring words such as “the,” “a,” and “an”
  Control over indexed material—How files to index are included or excluded from the process
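The third issue, control over indexed material, often comes down to a set of include and exclude rules. The sketch below assumes shell-style wildcard patterns and an exclude-wins policy. Both are assumptions made for illustration, because each engine has its own configuration syntax, and the patterns and file names shown are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; they are not taken from any particular engine.
INCLUDE = ["*.html", "*.txt"]
EXCLUDE = ["private/*", "*/drafts/*"]

def should_index(path, include=INCLUDE, exclude=EXCLUDE):
    """Return True if path matches an include pattern and no exclude pattern."""
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(path, pattern) for pattern in include)

site_files = [
    "index.html",
    "private/salaries.html",
    "docs/drafts/new.html",
    "images/logo.gif",
    "docs/readme.txt",
]
to_index = [path for path in site_files if should_index(path)]
```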

Comparing Index Size and Speed

Typically, the larger the index, the longer it takes to search. Most indexing search engines create indexes that are a small fraction of the size of the material to be searched (usually between 10% and 20%). If your site is massive, however, you need to consider whether the index can fit in memory all at once or whether your server needs to swap it in and out as the engine does its searching. Excessive disk thrashing dramatically slows the search process and may even affect overall server performance.
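As a rough worked example of the 10% to 20% rule of thumb, the following sketch estimates the index size for a site and checks it against the memory you can spare. The 500MB site and the 64MB memory budget are made-up figures, and the function names are hypothetical.

```python
MB = 1024 * 1024

def estimate_index_bytes(content_bytes, percent):
    """Estimate index size as a whole-number percentage of the content size."""
    return content_bytes * percent // 100

def fits_in_memory(content_bytes, available_bytes, percent=20):
    """Check whether the estimated index fits in the available memory."""
    return estimate_index_bytes(content_bytes, percent) <= available_bytes

content = 500 * MB   # assumed size of the material to be searched
budget = 64 * MB     # assumed memory available for the index

low_estimate = estimate_index_bytes(content, 10) // MB
high_estimate = estimate_index_bytes(content, 20) // MB
fits_best_case = fits_in_memory(content, budget, percent=10)
fits_worst_case = fits_in_memory(content, budget, percent=20)
```

With these figures the index falls somewhere between 50MB and 100MB, so it fits in the 64MB budget only at the low end of the range. In a case like this, it is safer to plan for the larger estimate.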

If yours is a high-volume site and your users do a lot of searches, you may need to consider holding the index in memory or even limiting the number of simultaneous searches.
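One way to limit the number of simultaneous searches is to guard the search function with a counting semaphore. The sketch below is a minimal illustration, assuming a threaded server. The class name `SearchLimiter` is hypothetical, and returning `None` when the server is busy is just one possible policy.

```python
import threading

class SearchLimiter:
    """Allow at most max_concurrent searches to run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, search_fn, query):
        """Run search_fn(query) if a slot is free; otherwise return None."""
        if not self._slots.acquire(blocking=False):
            return None  # the caller can show a "server busy" message
        try:
            return search_fn(query)
        finally:
            self._slots.release()

limiter = SearchLimiter(max_concurrent=1)

# A search that runs alone gets the only slot.
alone = limiter.run(lambda q: q.upper(), "java")

# A second search attempted while the first still holds the slot is refused.
nested = limiter.run(lambda q: limiter.run(lambda q2: "inner", q), "java")
```

Because `None` doubles as the busy signal here, a search function that can legitimately return `None` would need a different signal, such as raising an exception.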

One way to reduce the size of the index on your site is to exclude certain common words from it. By default, most indexing engines exclude words known as stop words (commonly occurring articles and pronouns, for example). Your site may also contain additional noise words that you do not want in your index (the name of your organization, for example). Excluding words such as these reduces the size of the index and improves searching efficiency.

Understanding Stop Words

Most indexing search engines have some capability to ignore stop words, also known as garbage or noise words. These are commonly occurring words such as articles, pronouns, and many adjectives. The indexing engine should ignore such words when indexing, and the query engine should discard them from the search terms when performing a search. Table 31.1 lists commonly used stop words.
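The sketch below shows how the same stop word filter can serve both the indexing engine and the query engine. The short `STOP_WORDS` set is an illustrative sample only and is not the contents of Table 31.1. The `extra_noise` parameter is a hypothetical way to add site-specific noise words, such as the name of your organization.

```python
import re

# A small illustrative sample, not a complete stop word list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def significant_words(text, extra_noise=()):
    """Return the lowercase words in text, minus stop words and extra noise words."""
    noise = STOP_WORDS | {word.lower() for word in extra_noise}
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in noise]

# Used by the indexer on page text ("Acme" is a made-up organization name).
page_words = significant_words("Acme guide to the Acme catalog", extra_noise=["Acme"])

# Used by the query engine on the user's search terms.
query_words = significant_words("The history of the Java language")
```

Applying one function in both places keeps the index and the queries consistent: a word that was never indexed is also never searched for.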


