Platinum Edition Using HTML 4, XML, and Java 1.2: Indexing and Adding an Online Search Engine

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

CHAPTER 31
Indexing and Adding an Online Search Engine

by Mike Ellsworth and Melissa Niles

In this chapter

Understanding Searching

Indexing Your Own Site

Leveraging Commercial Indexes

Using Web Servers’ Built-In Search Tools

Considerations when Adding a Search Engine to Your Site

Making the Decision

Implementing a Grepping Search Engine

Implementing an Indexing Search Engine

Understanding Searching

How can you find what you’re looking for? If you are looking for something on the Web, chances are you will use a search engine instead of plodding through dozens or hundreds of pages in hopes of uncovering items of interest. These search services are made possible by programs known as robots, spiders, Web crawlers, or worms; they are on the job 24 hours a day, 365 days a year. They do nothing but wander around from site to site, reading and cataloging whatever they find. They store the results of their searches in huge databases, which anyone can access.

That’s the easy part: assembling the haystack. But how to find that needle? This section examines search techniques used by search engines to sift through masses of information to bring back results that satisfy your request.

Understanding Literal Searching

Many search engines use a technique called full-text indexing and retrieving. Full-text refers to the fact that each word in each document scanned becomes part of the index. Listings 31.1, 31.2, and 31.3 show three files that might be included in an index.

Listing 31.1 Holidays.txt—Sample Text File #1

Holiday Schedule
New Year’s Day, Monday, January 2.
Memorial Day, Monday 29 May
July 4th, Independence Day, Thursday

Listing 31.2 Birthdays.txt—Sample Text File #2

John, Jan 17 (Thursday this year)
Mary, May 29

Listing 31.3 Taxes.txt—Sample Text File #3

Fiscal year ends 31 December
Expect big write-off in May or June
Estimates due July 1

Suppose you search for any file containing the word “Jan.” The search engine would return Holidays.txt (which has “January”) and Birthdays.txt (which has “Jan”). You would not see Taxes.txt because “Jan” doesn’t appear anywhere in it. If you ask for “May,” you will get back all three files because all three contain the word “May.”
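The lookup just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the names FILES, build_index, and search are hypothetical, the index maps each lowercased word to the files containing it, and a term is assumed to match any indexed word that contains it (which is why “Jan” finds “January”).

```python
import re
from collections import defaultdict

# Contents of Listings 31.1 through 31.3, keyed by file name.
FILES = {
    "Holidays.txt": "Holiday Schedule\nNew Year's Day, Monday, January 2.\n"
                    "Memorial Day, Monday 29 May\n"
                    "July 4th, Independence Day, Thursday",
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29",
    "Taxes.txt": "Fiscal year ends 31 December\n"
                 "Expect big write-off in May or June\nEstimates due July 1",
}

def build_index(files):
    """Map every lowercased word to the set of files that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, term):
    """Return the files having at least one indexed word that contains term."""
    term = term.lower()
    hits = set()
    for word, names in index.items():
        if term in word:  # assumption: substring match, so "jan" hits "january"
            hits |= names
    return hits

index = build_index(FILES)
print(sorted(search(index, "Jan")))  # ['Birthdays.txt', 'Holidays.txt']
print(sorted(search(index, "May")))  # ['Birthdays.txt', 'Holidays.txt', 'Taxes.txt']
```

Because every word is indexed once up front, each query only has to scan the word list rather than reread the files.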

Commonly Used Search Qualifiers
“” (quotation marks)—Specify that documents must contain the exact phrase within the quotation marks: “history of computing”

AND—Indicates a search for documents containing all the terms joined by the operator: UNIX AND Solaris

OR—Indicates a search for documents containing any of the terms joined by the operator: SCO OR HP/UX

NOT—Excludes documents containing the term that follows the operator: UNIX NOT Solaris

NEAR—Finds documents in which two words appear within a certain number of words of each other: Linux NEAR UNIX

+ (plus sign)—Placed at the beginning of a word to indicate that the word is required: UNIX +NetBSD

– (minus sign)—Placed at the beginning of a word to indicate that the word must be excluded: UNIX –SCO
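As a rough illustration of how the plus and minus qualifiers might be read from a query string, the following Python sketch sorts the terms into three groups. The function name parse_query and the returned dictionary keys are hypothetical, and the handling of unmarked terms as “optional” is an assumption; real search engines differ in the details.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for token in query.split():
        if len(token) > 1 and token[0] == "+":
            parsed["required"].append(token[1:])
        elif len(token) > 1 and token[0] in "-\u2013":
            # Accept the ASCII hyphen as well as the typeset en dash.
            parsed["excluded"].append(token[1:])
        else:
            parsed["optional"].append(token)
    return parsed

print(parse_query("UNIX +NetBSD"))
# {'required': ['NetBSD'], 'excluded': [], 'optional': ['UNIX']}
print(parse_query("UNIX -SCO"))
# {'required': [], 'excluded': ['SCO'], 'optional': ['UNIX']}
```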

If you ask for anything containing either “February” or “tax,” the search engine will return Taxes.txt. None of the files contains the word “February,” but Taxes.txt has “tax” in its title (as the start of “Taxes”), which satisfies a request for either the first word or the second word. This kind of search is called a Boolean OR.

If you search for both “May” and “29,” you will see Birthdays.txt and Holidays.txt. At this point, you will find out that Mary’s birthday is on Memorial Day this year. Taxes.txt contains the word “May,” but not the number “29,” so the file fails the “find files with the first term and the second term” test. This kind of search is called a Boolean AND.
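One way to picture Boolean OR and AND is as set union and set intersection over the files that match each term. The Python sketch below assumes this set-based view; the names FILES and files_matching are hypothetical, and the file name is searched along with the text so that “tax” can match the title of Taxes.txt.

```python
import re

# Contents of Listings 31.1 through 31.3, keyed by file name.
FILES = {
    "Holidays.txt": "Holiday Schedule\nNew Year's Day, Monday, January 2.\n"
                    "Memorial Day, Monday 29 May\n"
                    "July 4th, Independence Day, Thursday",
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29",
    "Taxes.txt": "Fiscal year ends 31 December\n"
                 "Expect big write-off in May or June\nEstimates due July 1",
}

def files_matching(term):
    """Files whose title or text has a word containing term (case-insensitive)."""
    term = term.lower()
    hits = set()
    for name, text in FILES.items():
        words = re.findall(r"[a-z0-9]+", (name + " " + text).lower())
        if any(term in word for word in words):
            hits.add(name)
    return hits

either = files_matching("February") | files_matching("tax")  # Boolean OR
both = files_matching("May") & files_matching("29")          # Boolean AND

print(sorted(either))  # ['Taxes.txt']
print(sorted(both))    # ['Birthdays.txt', 'Holidays.txt']
```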

You can stretch a bit by asking for only files that have both “May” and “29,” but not “Mary.” A Boolean expression might state this search as follows:

((May AND 29) AND (NOT Mary))

This search first finds files matching the first term (it must have both “May” and “29”), and then excludes files having “Mary,” leaving only Holidays.txt as the result. Suppose the search expression had been:

((May AND 29) OR (NOT Mary))

The search engine would have found all three files under this search expression. The Holidays.txt file is included because it has both “May” and “29”; the Birthdays.txt file is included for the same reason; and the Taxes.txt file shows up because it doesn’t have the word “Mary.”
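The two expressions can be checked with a small recursive evaluator. This is a sketch under assumptions rather than a real query parser: expressions are written by hand as nested tuples, NOT is treated as “every file except those matching,” and the names FILES, files_matching, and evaluate are hypothetical.

```python
import re

# Contents of Listings 31.1 through 31.3, keyed by file name.
FILES = {
    "Holidays.txt": "Holiday Schedule\nNew Year's Day, Monday, January 2.\n"
                    "Memorial Day, Monday 29 May\n"
                    "July 4th, Independence Day, Thursday",
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29",
    "Taxes.txt": "Fiscal year ends 31 December\n"
                 "Expect big write-off in May or June\nEstimates due July 1",
}

def files_matching(term):
    """Files whose text has a word containing term (case-insensitive)."""
    term = term.lower()
    return {
        name for name, text in FILES.items()
        if any(term in word for word in re.findall(r"[a-z0-9]+", text.lower()))
    }

def evaluate(expr):
    """Evaluate a term or a nested (operator, operand, ...) tuple to a file set."""
    if isinstance(expr, str):
        return files_matching(expr)
    operator = expr[0]
    if operator == "NOT":
        return set(FILES) - evaluate(expr[1])  # complement within the collection
    left, right = evaluate(expr[1]), evaluate(expr[2])
    if operator == "AND":
        return left & right
    if operator == "OR":
        return left | right
    raise ValueError("unknown operator: " + str(operator))

first = ("AND", ("AND", "May", "29"), ("NOT", "Mary"))
second = ("OR", ("AND", "May", "29"), ("NOT", "Mary"))

print(sorted(evaluate(first)))   # ['Holidays.txt']
print(sorted(evaluate(second)))  # ['Birthdays.txt', 'Holidays.txt', 'Taxes.txt']
```

Changing a single operator from AND to OR widens the result from one file to the whole collection, which is why the grouping of terms matters so much in larger databases.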

A full-text index is obviously very powerful. Even in this limited example, you can clearly see the usefulness and flexibility of this kind of tool. Yet in a large database of files, thousands might include the word “May.” If the database includes source code files, hundreds of thousands of references to “29” might be found. Wouldn’t it be nice to find only dates that look like birthdays, or the word “May,” but only if it is near the word “29,” and not in any source code files?

Advanced search engines go one step beyond literal Boolean searches and give you the means to do more.
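The idea of finding “May” only when it is near “29” can be sketched as a comparison of word positions. In the Python example below, the function near, its max_gap parameter, and the far_apart sample sentence are all hypothetical; the first two samples are lines taken from Listings 31.1 and 31.2. Unlike a plain Boolean AND, the test fails when both words are present but far apart.

```python
import re

def near(text, first, second, max_gap=2):
    """True if first and second occur within max_gap words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_at = [i for i, word in enumerate(words) if word == first.lower()]
    second_at = [i for i, word in enumerate(words) if word == second.lower()]
    return any(abs(a - b) <= max_gap for a in first_at for b in second_at)

holidays = "Memorial Day, Monday 29 May"   # from Listing 31.1
birthdays = "Mary, May 29"                 # from Listing 31.2
# Hypothetical text in which both words occur, but eleven words apart.
far_apart = "May budget review was moved; see the revised figures on page 29"

print(near(holidays, "May", "29"))   # True
print(near(birthdays, "May", "29"))  # True
print(near(far_apart, "May", "29"))  # False
```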

Advanced Searches

Advanced searching techniques go beyond literal matching. A search that doesn’t rely on exact matches is often called a fuzzy search. It is not based on Boolean algebra, with its mixture of AND, OR, and NOT operators, although these might come into play if appropriate. Instead, it tries to identify concepts and patterns and deal with information rather than data.
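One narrow kind of inexact matching, tolerance for misspelled search terms, can be sketched with the difflib module in Python’s standard library. This is only a spelling-similarity comparison, far short of the concept-based searching described here, and it is offered as an assumption about what a simple fuzzy match might look like. The names FILES and fuzzy_search and the 0.8 similarity cutoff are hypothetical choices.

```python
import difflib
import re

# Contents of Listings 31.1 through 31.3, keyed by file name.
FILES = {
    "Holidays.txt": "Holiday Schedule\nNew Year's Day, Monday, January 2.\n"
                    "Memorial Day, Monday 29 May\n"
                    "July 4th, Independence Day, Thursday",
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29",
    "Taxes.txt": "Fiscal year ends 31 December\n"
                 "Expect big write-off in May or June\nEstimates due July 1",
}

def fuzzy_search(term, cutoff=0.8):
    """Files containing a word whose spelling is similar enough to term."""
    term = term.lower()
    hits = set()
    for name, text in FILES.items():
        words = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
        # get_close_matches returns the words scoring at least cutoff (0 to 1).
        if difflib.get_close_matches(term, words, n=1, cutoff=cutoff):
            hits.add(name)
    return hits

print(sorted(fuzzy_search("Janury")))   # ['Holidays.txt']
print(sorted(fuzzy_search("Decembr")))  # ['Taxes.txt']
```

A literal search for either misspelling would return nothing, because neither string appears in any of the files.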

Feel the Heat
Information is data that has been assigned meaning by a human. In a simple example, “It’s 98 degrees” is data, whereas “It’s hot” is information. As the amount of data on the Internet grows, the importance of distinguishing information from data skyrockets.

The ultimate artificial-intelligence search engine would have a DWIM, or “Do What I Mean” command. Putting data in context with other data is one way to derive information. Human language abounds with contextual references and implied scopes.

When you say “It’s hot,” for example, you probably don’t mean “Somewhere in the world the temperature is such that someone might refer to it as hot.”

You mean that you’re feeling hot right now, regardless of the actual temperature.

The context and scope of your original statement are implied; the concomitant associations derive from the context, your knowledge of human behavior in general, and your behavior in particular.

If you searched the Internet for “Hot Babes” (not that you would ever do so), you would be disappointed if you got back pointers to the National Weather Service’s reports mingled with articles about infant care. How can search engines figure out what kind of “hot” you mean? Can DWIM ever be achieved?

This question is a hot topic—the basis for an ongoing and bitter debate among philologists, linguists, artificial-intelligence theorists, and natural-language programmers. Almost as many sides exist as participants in the debate, and no one view clearly outstrips the rest. If you are interested in this sort of debate, check out the comp.ai.fuzzy newsgroup on Usenet, or stop by your local library or favorite online search engine and find references to AI and natural language.
