Platinum Edition Using HTML 4, XML, and Java 1.2
Evaluating Search Results
When you perform a search, you want not just information but information relevant to the question at hand. In a perfect world, every search would return only information that completely satisfies your request, no more, no less. But this is not a perfect world, so how do you compare the effectiveness of various searches? Two major parameters are commonly used to judge the results of a search: precision, the fraction of the returned documents that are relevant, and recall, the fraction of the relevant documents that are returned.
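As set ratios, precision is the number of relevant documents retrieved divided by the total number retrieved, and recall is the number of relevant documents retrieved divided by the total number of relevant documents. The following is a minimal sketch in Python, not taken from any particular search engine; the function name `precision_recall` and the document IDs are assumptions chosen for illustration, and the sample numbers mirror this section's example of a 100-document site with 10 relevant documents.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical site: documents 0..99, of which 0..9 are about search engines.
relevant = set(range(10))
# A query returns four relevant documents plus two off-topic ones.
retrieved = {0, 1, 2, 3, 50, 51}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.40
```

Returning 0.0 for an empty result set is a design choice made here to avoid dividing by zero; other conventions are possible.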
Each query can be graded as a fraction on each measure, with a perfect score being 1.00. In that mythical perfect world, every search would score 1.00 on both measures because only relevant documents would be retrieved, and all of the relevant documents would be among them. Assume, for example, that you have a site containing 100 documents, and of these 100, 10 are about search engines. If a query is made for "Perl-based search engines," the query might retrieve four documents about search engines and two others about Perl. In this case, the search would have a precision of 0.67 (4 of the 6 documents returned were relevant) and a recall of 0.40 (4 of a possible 10 relevant documents were returned).

Search engines use a variety of search strategies to increase recall and precision, and some of them are quite complex. The following sections examine various search techniques used to find potential matches, as well as weighting methods used to rank these results in an attempt to present the most likely matches first. Some of the following material is adapted from Rod Clark's excellent discussion in Special Edition Using CGI (Que Publishing, 1996).

Understanding Search Techniques
People think and remember in imprecise terms. You might ask a friend, "Tell me everything about piranha mating," or "Who were the six actors who portrayed James Bond?" Depending on the company you keep, you might get accurate answers. You are likely to get unsatisfactory answers, however, from today's search engines. Unfortunately, conventional query syntax follows very precise rules, even for simple queries. Search engines are evolving toward being capable of handling natural language queries, but there is a long way to go. Toward that end, search engines commonly use the following search techniques to enhance recall and precision.

Substringing
Suppose a friend mentions a reference to dogs romping in a field.
It could be that what he actually saw, months ago, was the phrase "while three collies merrily romped in an open field." In a very literal search system, searching for "dogs romping" would turn up nothing at all. Dogs are not collies, and romping is not romped. If you entered the query "romp field," however, you might get the exact reference if the search tool understands substrings. A substring is part of a string, but figuring out which part is meaningful isn't easy. The search engine takes the word romp and searches for it, as well as its variants: romps, romped, and romping, and even Brompton. Obviously, language-specific rules are required to generate the variants from the root word.

Stemming
Some search engines, but by no means all, offer stemming. Stemming is related to substringing, but involves an even greater understanding of the language. Rather than requiring the user to enter root terms in a query, stemming involves trimming a query word to its root and then looking for other words that match the same root. The word wallpaper, for example, has wall as its root word; so does wallboard, which the user might never have entered as a separate query. When a user enters wallpaper, a stemmed search might serve up unwanted additional references to wallflower, wallbanger, Wally, and walled city, but it would also catch wall and wallboard and probably provide useful information that way. Stemming has at least the following two advantages over plain substring searching:
But stemming is also language specific: the rules of stemming in English, for example, are quite different from those for German or Finnish. Human languages are complex, and a search program can't just trim English suffixes from words in another language.

Thesauri
One way to broaden the reach of a search is to use a thesaurus, a separate file that links words with lists of their common equivalents. Most thesauri enable you to add special words and terms, either linked to a dictionary or directly to synonyms. A thesaurus-based search engine automatically looks up words related to the terms in your submitted query and then searches for those related words. If you publish several technical briefs on cellular mitosis, for example, a thesaurus-based search engine would show your articles under biology and physiology as well as cytology.

Pattern Matching
Building specific language rules into a search engine is difficult. What happens when the program encounters documents in a language that it hasn't seen before? Several newer search engines concentrate on more general techniques that are not language based. Some of these tools can analyze a file, even if it is in an unknown language or file format, and then search for similar files. The key to this kind of search is matching patterns within the files rather than matching the contents of the files.
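The substring, stemming, and thesaurus techniques described in this section can be contrasted in a few lines of Python. This is only a rough sketch under stated assumptions: the suffix list `SUFFIXES`, the tiny `THESAURUS` table, and all function names are hypothetical, and the "stemmer" merely strips a few English suffixes, which is far cruder than the language-aware stemming a real search engine would need.

```python
import re

# Hypothetical suffix list; real stemmers use richer, language-specific rules.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical thesaurus linking a root word to some common equivalents.
THESAURUS = {"dog": ["collie", "hound"]}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip one known suffix, keeping at least three letters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def substring_match(term, text):
    """True if any word in text contains term anywhere inside it."""
    return any(term in word for word in tokenize(text))


def stem_match(term, text):
    """True if any word in text reduces to the same stem as term."""
    root = stem(term)
    return any(stem(word) == root for word in tokenize(text))


def thesaurus_match(term, text):
    """True if term, or one of its listed equivalents, stem-matches text."""
    terms = [term] + THESAURUS.get(stem(term), [])
    return any(stem_match(t, text) for t in terms)


phrase = "while three collies merrily romped in an open field"
print(substring_match("romp", phrase))           # True: "romped" contains "romp"
print(substring_match("romp", "Brompton Road"))  # True: a false hit on "brompton"
print(stem_match("romping", phrase))             # True: both reduce to "romp"
print(stem_match("romp", "Brompton Road"))       # False: stems differ
print(stem_match("dogs", phrase))                # False: no word stems to "dog"
print(thesaurus_match("dogs", phrase))           # True: "collie" is listed for "dog"
```

The Brompton case shows why plain substring matching can hurt precision, while the thesaurus lookup shows one way recall might be raised for a query such as "dogs romping."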
Copyright © 1996-2000 EarthWeb Inc. All rights reserved. Reproduction whole or in part in any form or medium without express written permission of EarthWeb is prohibited.