Searching with O’Reilly WebSite Server

O’Reilly’s WebSite server for Windows NT includes the company’s WebIndex indexing and WebFind searching tools. WebIndex can index the full text of every page in the server’s directory structure, or only selected parts of the directories. WebFind runs as a CGI program and provides a conventional keyword search, with support for AND and OR operators.

O’Reilly publishes a book, Building Your Own WebSite, that goes into considerable detail about setting up and using its WebSite server. Before you install the company’s software, you can read all about it at

http://www.ora.com/

The following site is running WebSite and has set up several search databases:

http://www.videoflicks.com/

Searching with Netscape SuiteSpot Servers

Netscape SuiteSpot Standard and Professional Editions run on Windows NT and UNIX. They include a built-in indexing and searching system, although Netscape’s lower-priced FastTrack Server does not. You can find out more about Netscape’s servers at

http://home.netscape.com/servers/index.html

Searching with Microsoft Index Server

Designed for zero maintenance and complete Web-site indexing, Microsoft’s Index Server search engine supports multiple languages (Dutch, U.S. and International English, French, German, Italian, Spanish, and Swedish) and indexes by content type as well as content. It can index documents in several formats: text in a Microsoft Word document, statistics in a Microsoft Excel spreadsheet, or the content of an HTML page. Index Server enables the user to search using both keywords and content types. You can read about Index Server and download a free copy at

http://www.microsoft.com/ntserver/info/indexserver.htm

Index Server requires NT 4.0 and is designed to work with Microsoft’s Internet Information Server (IIS).

Considerations when Adding a Search Engine to Your Site

For purposes of this discussion, assume you have evaluated the alternatives and decided that you need to implement a site-resident search engine. Perhaps you don’t like the idea of sending your users off to a commercial index, your Web server doesn’t have a built-in search capability, or you simply want more control. Before you get started, however, you should consider the type of search engine you want to use.

Indexing Versus Grepping Search Engines

Two main approaches can be used to create an online search facility for your Web site:

  Indexing—Using this method, you periodically run a process that examines every document on the Web site and pulls out its keywords. The main advantage of this method is speed: when a user does a search, the search engine needs to look only at the index instead of searching every file on the site. The disadvantage is timeliness, because a user’s search can be only as current as your last index.
  Grepping—Using this method, you provide a search engine that searches every file on your site each time the user performs a search. The term grepping comes from the UNIX grep utility, which lets users search for keywords within files. Timeliness is the major advantage of this approach: because the user is searching the actual files on your site, any changes are reflected automatically. The major disadvantages are performance and high resource utilization; because each search touches every file on your site, searches can run for a long time and consume significant server resources. (A minimal sketch of this brute-force approach appears after this list.)
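
To make the cost of grepping concrete, the following minimal sketch in Java shows the kind of scan such an engine repeats for every query. The class name GrepSearch, the command-line interface, and the restriction to .html files are assumptions made for this example; a real CGI-based grepping engine would also parse form input, strip HTML markup, and format a results page.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Walk a document tree and report every HTML file that contains the keyword.
public class GrepSearch {

    public static void main(String[] args) throws IOException {
        File docRoot = new File(args[0]);          // root directory of the site
        String keyword = args[1].toLowerCase();    // term the user is searching for

        List<File> matches = new ArrayList<File>();
        search(docRoot, keyword, matches);

        for (File match : matches) {
            System.out.println(match.getPath());
        }
    }

    // Recursively scan every .html file; every search repeats all of this I/O.
    private static void search(File dir, String keyword, List<File> matches)
            throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                search(entry, keyword, matches);
            } else if (entry.getName().toLowerCase().endsWith(".html")) {
                if (fileContains(entry, keyword)) {
                    matches.add(entry);
                }
            }
        }
    }

    // Read the file line by line and stop at the first hit.
    private static boolean fileContains(File file, String keyword) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.toLowerCase().indexOf(keyword) >= 0) {
                    return true;
                }
            }
            return false;
        } finally {
            in.close();
        }
    }
}

Every query walks the entire document tree and reopens every file, which is why the resource consumption described above grows with both the size of the site and the number of simultaneous searches.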

Indexing search engines predigest your Web site and create indexes containing all its words. The major commercial Web search engines, such as AltaVista, Lycos, Excite, and WebCrawler, are all indexing engines. In fact, it is not practical to search the whole Web with the grepping method; to do so, the search engine would have to either add the full text of every site to a database or search every site in real time.

With an indexing search engine, when a user requests a search, the engine needs to refer only to the index to find relevant pages. Because indexes are often a small fraction of the size of the documents they cover, this takes much less time. More important, this approach is what makes the major commercial search engines practical: they need to store only the indexes of sites rather than complete copies of the sites.
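
To illustrate the predigesting step, here is a minimal sketch in Java of an inverted index: a map from each word to the set of pages that contain it. The class name SiteIndex and its methods are assumptions made for this example; production indexers also handle stemming, stop words, relevance ranking, and incremental updates.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Build a word -> pages map once, then answer queries from the map alone.
public class SiteIndex {

    private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    // Index every .html file under the document root (run periodically, not per query).
    public void build(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                build(entry);
            } else if (entry.getName().toLowerCase().endsWith(".html")) {
                indexFile(entry);
            }
        }
    }

    private void indexFile(File file) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Treat any run of letters or digits as a word.
                for (String word : line.toLowerCase().split("[^a-z0-9]+")) {
                    if (word.length() == 0) {
                        continue;
                    }
                    Set<String> pages = index.get(word);
                    if (pages == null) {
                        pages = new TreeSet<String>();
                        index.put(word, pages);
                    }
                    pages.add(file.getPath());
                }
            }
        } finally {
            in.close();
        }
    }

    // AND query: intersect the page sets for each keyword; no files are read here.
    public Set<String> search(String[] keywords) {
        Set<String> result = null;
        for (String keyword : keywords) {
            Set<String> pages = index.get(keyword.toLowerCase());
            if (pages == null) {
                return Collections.emptySet();
            }
            if (result == null) {
                result = new TreeSet<String>(pages);
            } else {
                result.retainAll(pages);
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }
}

A nightly job could call build() on the document root and save the result; search() then answers an AND query by intersecting page sets without opening a single document, which is why index-based searches are so much cheaper than grepping. The trade-offs are the disk space the index consumes and the fact that results are only as current as the last indexing run.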

Indexing search engines generally employ more sophisticated searching algorithms to improve their chances of returning relevant documents.

Although easy to implement, most grepping search engines are somewhat limited in the types of search queries they support. Grepping, after all, is a rather brute-force method of searching. Each file is opened and then scanned for the search terms. The amount of system resources consumed by these activities can limit the sophistication of the search strategies. Most grepping engines use only simple keyword searches, although some offer searching via regular expressions.
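
As a small illustration, the following self-contained Java fragment contrasts a plain keyword (substring) test with a regular-expression test; the sample line and the pattern are invented for the example.

import java.util.regex.Pattern;

// Contrast a plain keyword (substring) test with a regular-expression test.
public class MatchDemo {
    public static void main(String[] args) {
        String line = "The colour options for this product are listed below.";

        // Keyword test: matches only the exact spelling the user typed.
        boolean keywordHit = line.toLowerCase().indexOf("color") >= 0;

        // Regular-expression test: "colou?r" matches either spelling.
        Pattern pattern = Pattern.compile("colou?r", Pattern.CASE_INSENSITIVE);
        boolean regexHit = pattern.matcher(line).find();

        System.out.println("Keyword match: " + keywordHit);  // prints false
        System.out.println("Regex match:   " + regexHit);    // prints true
    }
}

A grepping engine that supports regular expressions simply applies a test like the second one to every line of every file, at a correspondingly higher processing cost.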

To determine which searching method to employ, you must first decide what kinds of search services you want to offer and how many resources—both disk space and processor time—to dedicate to those services.

Evaluating Performance and Processor Efficiency

As you might imagine, a big difference exists between the performance and efficiency of grepping and indexing search engines.


NOTE:  Performing a grepping search on one section of my site, which contains about 600 average-size files, takes 8 CPU seconds and 40 elapsed seconds on a Sun SPARCstation 20. Because our user load is not very high, this is an acceptable amount of overhead for a search. If your site is very busy, however, with a hundred simultaneous users, for example, it is probably not feasible to dedicate this amount of resources to user searching.

In contrast, performing an index-based search on the same site takes about one CPU second and five or six elapsed seconds. Because the site is not large, about 90MB, and indexes average between 10% and 20% of a site's total size (here, roughly 9MB to 18MB of index data), the amount of disk overhead is acceptable. Because the information on our site doesn’t change much from day to day, I can run the indexing software overnight and provide a day-old index for our users to search.


One approach that adds no disk overhead and only a small amount of processor overhead is to have someone else maintain your index and run the search process. An example of this approach is Pinpoint, from NetCreations (http://www.netcreations.com/pinpoint/). This commercial service sends its robot to your site about once a month; NetCreations maintains the site index on its own servers and also maintains and runs the search engine. You maintain a query form on your site that points to the Pinpoint URL. Some trade-offs, of course, exist with this type of solution: you give up a lot of control over what is indexed, how it is indexed, and how often the index is updated. In addition, search queries are likely to be slower because they are conducted over the Internet.

