Platinum Edition Using HTML 4, XML, and Java 1.2: Indexing and Adding an Online Search Engine

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

Evaluating Search Results

When you perform a search, you want not just information but information relevant to the question at hand. In a perfect world, every search returns only information that completely satisfies your request, no more and no less. But this is not a perfect world, so how do you compare the effectiveness of various searches? Two major parameters are commonly used to judge the results of a search:

• Recall indicates what fraction of the relevant documents are retrieved by the query.

• Precision indicates what fraction of the retrieved documents are relevant to the query.

Each query can be graded as a fraction on each measure, with a perfect score being 1.00. In that mythical perfect world, every search would score a 1.00 on both measures because every relevant document would be retrieved (perfect recall) and nothing irrelevant would come back with it (perfect precision). Assume that you have a site containing 100 documents, for example, and of these 100, 10 are about search engines. If a query is made for “Perl-based search engines,” the query might retrieve four documents about search engines and two others about Perl. In this case, the search would have a precision of 0.67 (four of the six documents returned were relevant) and a recall of 0.40 (4 of the 10 relevant documents were returned).
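The arithmetic in this example can be checked with a short Python sketch. The function name and the document identifiers are made up for illustration; only the counts (10 relevant documents, 6 retrieved, 4 of them relevant) come from the example above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the search returned
    relevant:  documents that actually answer the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document names: 10 documents about search engines exist,
# and the query returns 4 of them plus 2 documents that are only about Perl.
relevant = {f"engine-{i}" for i in range(1, 11)}
retrieved = {"engine-1", "engine-2", "engine-3", "engine-4", "perl-1", "perl-2"}

precision, recall = precision_recall(retrieved, relevant)
print(round(precision, 2), round(recall, 2))  # 0.67 0.4
```

Returning 0.0 when nothing is retrieved (or nothing is relevant) is a convention chosen here to avoid division by zero; other conventions are possible.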

Search engines use a variety of search strategies to increase recall and precision, and some of them are quite complex. The following sections examine various search techniques used to find potential matches, as well as weighting methods used to rank these results in an attempt to present the most likely matches first.

Some of the following material is adapted from Rod Clark’s excellent discussion in Special Edition Using CGI (Que Publishing, 1996).

Understanding Search Techniques

People think and remember in imprecise terms. You might ask a friend, “Tell me everything about piranha mating,” or “Who were the six actors who portrayed James Bond?” Depending on the company you keep, you might get accurate answers. You are likely to get unsatisfactory answers, however, from today’s search engines. Unfortunately, conventional query syntax follows very precise rules, even for simple queries. Search engines are evolving toward being capable of handling natural language queries, but there is a long way to go. Toward that end, search engines commonly use the following search techniques to enhance recall and precision.

Substringing

Suppose a friend mentions a reference to “dogs romping in a field.” It could be that what he actually saw, months ago, was the phrase “while three collies merrily romped in an open field.” In a very literal search system, searching for “dogs romping” would turn up nothing at all. “Dogs” are not “collies,” and “romping” is not “romped.”

If you entered the query “romp field,” however, you might get the exact reference if the search tool understands substrings. A substring is part of a string, but figuring out which part is meaningful isn’t easy. The search engine takes the word “romp” and searches for it, as well as its variants: romps, romped, and romping. A plain substring match also picks up unrelated words that happen to contain the same letters, such as Brompton. Obviously, language-specific rules are required to generate the variants from the root word.
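As a rough illustration, a plain substring match can be sketched in a few lines of Python. The function name and the sample sentence (the collie phrase with a mention of Brompton appended) are assumptions made for this example, not part of any particular search engine.

```python
import re


def substring_matches(term, text):
    """Return every word in text that contains term as a substring."""
    term = term.lower()
    words = re.findall(r"[a-z]+", text.lower())  # crude English-only tokenizer
    return [word for word in words if term in word]


text = "While three collies merrily romped in an open field near Brompton."
print(substring_matches("romp", text))  # ['romped', 'brompton']
```

The match on "brompton" shows the cost of the approach: the search finds the inflected form it was after, but it cannot tell a variant of the root from an unrelated word that merely contains the same letters.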

Stemming

Some search engines, but by no means all, offer stemming. Stemming is related to substringing, but involves an even greater understanding of the language. Rather than requiring the user to enter root terms in a query, stemming involves trimming a query word to its root and then looking for other words that match the same root. The word “wallpaper,” for example, has “wall” as its root word; so does “wallboard,” which the user might never have entered as a separate query. When a user enters “wallpaper,” a stemmed search might serve up unwanted additional references to “wallflower,” “wallbanger,” “Wally,” and “walled city,” but it would also catch “wall” and “wallboard” and probably provide useful information that way.

Stemming has at least the following two advantages over plain substring searching:

• It doesn’t require the user to mentally determine and then manually enter the root words.

• It allows assigning higher relevance scores to results that exactly match the entered query and lower relevance scores to the other stemmed variants.

But stemming is also language specific: The rules of stemming in English, for example, are quite different from those for German or Finnish. Human languages are complex, and a search program can’t just trim English suffixes from words in another language.
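The two advantages listed above can be sketched in Python. This is a deliberately naive suffix stripper written for illustration, not a real stemming algorithm; the suffix list, the minimum root length of three letters, and the 1.0 and 0.5 scores are all assumptions.

```python
# English-only suffixes, tried in this order (illustrative, far from complete).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Trim one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def relevance(query_word, doc_word, exact=1.0, stemmed=0.5):
    """Score an exact match above a match that only shares a stem."""
    if doc_word.lower() == query_word.lower():
        return exact
    if naive_stem(doc_word) == naive_stem(query_word):
        return stemmed
    return 0.0


for doc_word in ("romping", "romped", "romps", "field"):
    print(doc_word, relevance("romping", doc_word))
```

The user types "romping" and never has to work out the root, and the exact match outranks the stemmed variants. Because the suffix list is English-only, the same function applied to German or Finnish text would trim the wrong endings, which is the limitation the paragraph above describes.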

Thesauri

One way to broaden the reach of a search is to use a thesaurus, a separate file that links words with lists of their common equivalents. Most thesauri enable you to add special words and terms, either linked to a dictionary or directly to synonyms. A thesaurus-based search engine automatically looks up words related to the terms in your submitted query and then searches for those related words. If you publish several technical briefs on cellular mitosis, for example, a thesaurus-based search engine would show your articles under biology and physiology as well as cytology.
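A minimal Python sketch of thesaurus-based query expansion follows. The thesaurus entries, the document name, and the function names are hypothetical; a real thesaurus file would be far larger and would usually be maintained separately from the code.

```python
# Hypothetical thesaurus: each term maps to related terms to search as well.
THESAURUS = {
    "biology": ["cytology", "mitosis"],
    "physiology": ["mitosis"],
    "cytology": ["mitosis"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(query, documents, thesaurus=THESAURUS):
    """Return names of documents containing any expanded query term."""
    terms = expand_query(query.lower().split(), thesaurus)
    return [
        name
        for name, text in documents.items()
        if any(term in text.lower() for term in terms)
    ]


documents = {"brief-1": "Phases of cellular mitosis"}  # hypothetical brief
print(search("biology", documents))  # ['brief-1']
print(search("perl", documents))     # []
```

The brief never mentions biology, yet a search for "biology" finds it because the thesaurus links that term to "mitosis".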

Pattern Matching

Building specific language rules into a search engine is difficult. What happens when the program encounters documents in a language that it hasn’t seen before?

Several newer search engines concentrate on some more general techniques that are not language based. Some of these tools can analyze a file, even if it is in an unknown language or file format, and then search for similar files. The key to this kind of search is matching patterns within the files rather than matching the contents of the files.
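The text does not say which pattern-matching method these engines use. As one possible illustration, the Python sketch below compares files by the byte trigrams they share (Jaccard similarity), a technique that needs no knowledge of the language or file format. The function names and sample data are assumptions made for this example.

```python
def trigrams(data):
    """Return the set of 3-byte sequences that occur in data."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


def similarity(a, b):
    """Jaccard similarity of two byte strings' trigram sets (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both inputs too short to contain a trigram
    return len(grams_a & grams_b) / len(union)


sample = b"three collies romped in an open field"
close = b"three collies romping in open fields"
unrelated = bytes(range(16))  # arbitrary binary data

print(similarity(sample, close) > similarity(sample, unrelated))  # True
```

Because the comparison works on raw bytes, the same code runs unchanged on text in any language or on binary formats, at the price of knowing nothing about what the matched patterns mean.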
