Platinum Edition Using HTML 4, XML, and Java 1.2: Indexing and Adding an Online Search Engine


Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98


Even if you don’t know or can’t explain the rules for constructing the patterns that you see—whether those patterns are in human language, graphics, or binary code—you can still rank them for similarity. “Yes, this one matches.” Or “No, that one doesn’t. This one is very similar, but not exact. This one matches a little. This one is more exact than that one.” To analyze files for content similarity, keyword nearness, and other such qualities, some of the newer search engines look for patterns. Such engines use fuzzy logic and a variety of weighting schemes.

The theory behind sophisticated pattern analysis is far beyond the scope of this book. A good explanation of just the algorithms, sans theory, would cover several chapters. You should be aware that these techniques exist, however, and that some of the indexing engines you will encounter use variants of these techniques to enhance their searching power.

Understanding Weighting Methods

The search engine has done its job, but it has brought back dozens, hundreds, or thousands of items that might be what you’re looking for. Scrolling through all this material looking for something truly relevant is probably not what you would like to spend your time doing. It is common, therefore, for indexing search engines to assign confidence factors or weights to the documents returned from a search and to use these measures to rank the list of documents. That way, if you’re lucky, what you’re looking for is close at hand, and not at the end of a long list.

Common methods for establishing weights include evaluating adjacency, frequency, and relevance feedback.

Adjacency

Adjacency is a phrase-searching method that examines the relationship between the words in the search phrase. The search engine increases the relevance score according to how close together the words of the search phrase occur in the target document. If you search for the phrase “hearing aids,” the search engine can use adjacency to determine that you aren’t interested in documents containing the phrase “Senate hearing on medical research on AIDS.”

Obviously, adjacency only comes into play when more than one search term is used. Yet, findings by Webcrawler (see http://info.webcrawler.com/bp/WWW94.html) indicate that the average search comprises only 1.5 words. If you can encourage your users to specify search phrases, however, a good indexing engine can employ adjacency to increase the effectiveness of the search.
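The adjacency idea can be sketched in a few lines of Python. This is only an illustration, not the method of any particular engine: the function names (tokenize, smallest_window, adjacency_score) and the scoring formula (the number of query words divided by the width of the smallest run of text that contains them all) are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def smallest_window(terms, tokens):
    """Return the width, in words, of the narrowest run of tokens that
    contains every term, or None if some term never occurs."""
    needed = set(terms)
    if not needed:
        return None
    counts = {}   # occurrences of needed terms inside the current window
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink from the left for as long as the window covers every term.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leaving = tokens[left]
            if leaving in counts:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    del counts[leaving]
            left += 1
    return best


def adjacency_score(query, document):
    """Hypothetical score: 1.0 when the query words sit side by side,
    smaller as they drift apart, 0.0 when any query word is missing."""
    terms = set(tokenize(query))
    width = smallest_window(terms, tokenize(document))
    if width is None:
        return 0.0
    return len(terms) / width


if __name__ == "__main__":
    for text in ("New hearing aids for seniors",
                 "Senate hearing on medical research on AIDS"):
        print(round(adjacency_score("hearing aids", text), 2), text)
```

Under these assumptions, the page about hearing aids receives the top score, and the Senate page scores much lower because six words separate the start and end of its smallest matching span. A real engine would presumably combine such a measure with other weights.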

Frequency

Indexing search engines can use the frequency of hits on search terms within a page to increase the page’s relevance score. If you’re like most fans of the Blue Devils, you are far more likely to be interested in a page that mentions “Duke Blue Devils” seven times than in a page that mentions the phrase only once. The former page is much more likely to be an article about the subject; the latter could be just a listing of teams or a passing mention.
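A minimal Python sketch of frequency weighting follows. It simply counts how often the search phrase occurs in each page and sorts the pages by that count. The function names, the page names, and the sample text are hypothetical, and the use of a raw count as the weight is an assumption made to keep the example short.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_frequency(phrase, document):
    """Count how many times the words of the phrase occur consecutively."""
    words = tokenize(phrase)
    tokens = tokenize(document)
    size = len(words)
    if size == 0:
        return 0
    return sum(1 for start in range(len(tokens) - size + 1)
               if tokens[start:start + size] == words)


def rank_by_frequency(phrase, pages):
    """Return (page name, hit count) pairs, most hits first.

    `pages` maps a page name to its text. Pages with no hits are dropped.
    """
    scored = [(name, phrase_frequency(phrase, text))
              for name, text in pages.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical pages, invented for this example.
    pages = {
        "listing": "Teams: Duke Blue Devils, North Carolina Tar Heels",
        "article": "The Duke Blue Devils won again. "
                   "Fans of the Duke Blue Devils cheered.",
        "unrelated": "Senate budget debate continues",
    }
    print(rank_by_frequency("Duke Blue Devils", pages))
```

A raw count tends to favor long pages, so an engine built along these lines would probably also take page length into account.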

Relevance Feedback

Relevance feedback is a form of query by example. With this method, a user first performs a search using normal search terms. The user samples one or more of the found documents and determines whether a particular document is close to what he or she wants. The user can then tell the search engine to “find more documents like this one.” The search engine parses the relevant document and uses its profile to perform another search.

Relevance feedback can be an especially powerful means of searching. Instead of relying on the one or two search terms the user originally provided, the engine performs the new search with all the keywords from the selected document.
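The “find more documents like this one” step might be sketched in Python as follows. The sketch builds a profile from the most frequent words of the document the user picked and then scores every other document by how often those words occur in it. The stopword list, the function names, the `limit` parameter, and the sample documents are all assumptions made for illustration; actual engines build much richer profiles.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; real engines use far longer ones.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
             "the", "to", "with"}


def term_counts(text):
    """Count the non-stopword words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS)


def more_like_this(example_name, documents, limit=10):
    """Rank the other documents against the profile of the chosen one.

    `documents` maps a document name to its text. The profile is the
    `limit` most frequent words of the chosen document (an assumption).
    """
    profile = [word for word, _ in
               term_counts(documents[example_name]).most_common(limit)]
    scores = {}
    for name, text in documents.items():
        if name == example_name:
            continue
        counts = term_counts(text)
        score = sum(counts[word] for word in profile)
        if score > 0:
            scores[name] = score
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical documents, invented for this example.
    documents = {
        "a": "Hearing aids and hearing tests for seniors",
        "b": "Modern hearing aids use digital hearing technology",
        "c": "Senate budget debate continues",
    }
    print(more_like_this("a", documents))
```

Because the second search uses the whole profile instead of the user’s original one or two terms, documents that share several of the chosen document’s keywords rise to the top of the list.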

Indexing Your Own Site

So far in this chapter, you have learned the theory behind site searching and have seen the techniques used by search engines to improve search effectiveness. In this section, you learn how to improve the accuracy of site indexing. This enables you to maximize the effectiveness of the search engine, whether you use an external, commercial engine, or implement your own.

Using Keywords

Before you start studying indexing programs and individual search engines, you need to examine the kinds of information that you can provide for the indexers to index. Some of the code examples and supporting text in this section are adapted from Rod Clark’s excellent discussion in Special Edition Using CGI.

Adding keywords to files is particularly important when using simple search tools, many of which are very literal. These tools need all the help they can get.

Manually adding keywords to existing files is a slow and tedious process. Doing so isn’t particularly practical when you are faced with a mountain of seldom-read archival documents. When you first create new documents that you know people will search online, however, you can stamp them with an appropriate set of keywords. This stamping (or keying) provides a consistent set of words that people can use to search for the material in related texts, in case the exact wording in each text doesn’t happen to include some of the relevant general keywords. Using equivalent nontechnical terminology that users are likely to understand also helps.

Sophisticated search engines can yield good results when searching documents with little or no intentional keying, but well-keyed files produce better and more-focused results with these search tools. Even the best search engines, when they set out to catch all the random, scattered, unkeyed documents that you want to find, return information that is liberally diluted with noise—irrelevant data. Keying your files helps keep them from being missed in relevant lists for closely related topics.

Using Keywords in Plain Text

To help search tools find your HTML pages, you can add an inconspicuous line at the bottom of each page that lists the keywords you want, like this:

Poland Czechoslovakia Czech Republic Slovakia Romania Rumania

This line is useful, but ugly and distracting. Also, many search engines assign a higher relevance to words in titles, headings, emphasized text, <A NAME=...> tags and other areas that stand out from a document’s body. The next few sections consider how to key your files in more sophisticated and effective ways.

Using Keywords and Descriptions in HTML META Tags

You can put more information than simply the page title in an HTML page’s <HEAD>...</HEAD> section. Specifically, you can include a description of your page, or a standard Keywords list in a META tag.

Some confusion seems to exist about how to implement these tags. META tags that include HTTP-EQUIV as part of the statement in the tag are considered to be part of the HTTP header. You can insert these tags in your HEAD section, and the browser is supposed to interpret them as if they were a true HTTP header. To change the date a page expires, for example, add a META tag similar to the following:

<META HTTP-EQUIV="Expires" CONTENT="Thu, 01 Jan 1998 12:00:00 GMT">
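To see how an indexing program might read such tags, consider the following Python sketch, which uses the standard library’s html.parser module. It keeps META tags that carry a NAME attribute (such as keywords and description) apart from those that carry HTTP-EQUIV. The class name MetaCollector, the function read_meta, the comma-separated keyword format, and the sample page are assumptions made for the example, not part of any particular indexer.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect META tags, keeping NAME and HTTP-EQUIV entries apart."""

    def __init__(self):
        super().__init__()
        self.named = {}        # for example, keywords and description
        self.http_equiv = {}   # for example, expires

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content") or ""
        if attributes.get("http-equiv"):
            self.http_equiv[attributes["http-equiv"].lower()] = content
        elif attributes.get("name"):
            self.named[attributes["name"].lower()] = content


def read_meta(html_text):
    """Return the keywords, description, and HTTP-EQUIV values of a page."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    # Assumes the keywords are separated by commas.
    raw_keywords = collector.named.get("keywords", "")
    keywords = [word.strip() for word in raw_keywords.split(",")
                if word.strip()]
    return {
        "keywords": keywords,
        "description": collector.named.get("description", ""),
        "http_equiv": collector.http_equiv,
    }


# A hypothetical page, invented for this example.
SAMPLE_PAGE = """<HTML><HEAD>
<TITLE>Central European Travel Notes</TITLE>
<META NAME="keywords" CONTENT="Poland, Czech Republic, Slovakia, Romania">
<META NAME="description" CONTENT="Travel notes for Central Europe.">
<META HTTP-EQUIV="Expires" CONTENT="Thu, 01 Jan 1998 12:00:00 GMT">
</HEAD><BODY>...</BODY></HTML>"""

if __name__ == "__main__":
    print(read_meta(SAMPLE_PAGE))
```

An indexer built this way could add the keywords and description to its index and ignore the HTTP-EQUIV entries, which concern how the page is delivered, not what it is about.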
