Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98
CHAPTER 31 Indexing and Adding an Online Search Engine
by Mike Ellsworth and Melissa Niles
- In this chapter
- Understanding Searching
- Indexing Your Own Site
- Leveraging Commercial Indexes
- Using Web Servers' Built-In Search Tools
- Considerations when Adding a Search Engine to Your Site
- Making the Decision
- Implementing a Grepping Search Engine
- Implementing an Indexing Search Engine
Understanding Searching
How can you find what you're looking for? If you are looking for something on the Web, chances are you will use a search engine instead of plodding through dozens or hundreds of pages in hopes of uncovering items of interest. These search services are made possible by programs known as robots, spiders, Web crawlers, or worms; they are on the job 24 hours a day, 365 days a year. They do nothing but wander around from site to site, reading and cataloging whatever they find. They store the results of their searches in huge databases, which anyone can access.
That's the easy part: assembling the haystack. But how to find that needle? This section examines search techniques used by search engines to sift through masses of information to bring back results that satisfy your request.
Understanding Literal Searching
Many search engines use a technique called full-text indexing and retrieving. Full-text refers to the fact that each word in each document scanned becomes part of the index. Listings 31.1, 31.2, and 31.3 show three files that might be included in an index.
Listing 31.1 Holidays.txt: Sample Text File #1
Holiday Schedule
New Year's Day, Monday, January 2.
Memorial Day, Monday 29 May
July 4th, Independence Day, Thursday
Listing 31.2 Birthdays.txt: Sample Text File #2
John, Jan 17 (Thursday this year)
Mary, May 29
Listing 31.3 Taxes.txt: Sample Text File #3
Fiscal year ends 31 December
Expect big write-off in May or June
Estimates due July 1
Suppose you search for any file containing the word Jan. The search engine would return Holidays.txt (which has January) and Birthdays.txt (which has Jan). You would not see Taxes.txt because Jan doesn't appear anywhere in it. If you ask for May, you will get back all three files because all three contain the word May.
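The lookups just described can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine is built: the names FILES, build_index, and search are hypothetical, and prefix matching is an assumption chosen so that Jan also finds January, as in the example above.

```python
import re
from collections import defaultdict

# Contents of Listings 31.1 through 31.3, keyed by file name.
FILES = {
    "Holidays.txt": "Holiday Schedule\nNew Year's Day, Monday, January 2.\n"
                    "Memorial Day, Monday 29 May\n"
                    "July 4th, Independence Day, Thursday",
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29",
    "Taxes.txt": "Fiscal year ends 31 December\n"
                 "Expect big write-off in May or June\n"
                 "Estimates due July 1",
}

def build_index(files):
    """Map every lowercased word to the set of files that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, term):
    """Return the files holding any indexed word that starts with term.

    Prefix matching is an assumption made for this sketch.
    """
    term = term.lower()
    hits = set()
    for word, names in index.items():
        if word.startswith(term):
            hits |= names
    return sorted(hits)

index = build_index(FILES)
print(search(index, "Jan"))  # ['Birthdays.txt', 'Holidays.txt']
print(search(index, "May"))  # ['Birthdays.txt', 'Holidays.txt', 'Taxes.txt']
```

Because every word is recorded once when the index is built, a query only has to consult the index rather than reread the files.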
Commonly Used Search Qualifiers
" " (quotation marks): Specify that documents must contain the exact phrase within the quotation marks: "history of computing"
AND: Indicates a search for documents containing all the terms joined by the operator: UNIX AND Solaris
OR: Indicates a search for documents containing any of the terms joined by the operator: SCO OR HP/UX
NOT: Excludes documents containing the term that follows the operator: UNIX NOT Solaris
NEAR: Finds documents in which two words appear within a certain number of words of each other: Linux NEAR UNIX
+ (plus sign): Placed at the beginning of a word to indicate that the word is required: UNIX +NetBSD
- (minus sign): Placed at the beginning of a word to indicate that the word must be excluded: UNIX -SCO
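Of the qualifiers listed above, NEAR is the one that depends on word positions rather than mere presence. A rough Python sketch follows; the function name near, the max_distance default of 5, and the two sample sentences are all assumptions made for illustration, since the "certain number of words" varies from one search engine to another.

```python
import re

def near(text, first, second, max_distance=5):
    """Return True if first and second occur within max_distance words.

    max_distance is a hypothetical default; real engines pick their own.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

# Two made-up documents for the query: Linux NEAR UNIX
doc_a = "Linux is a free UNIX clone"
doc_b = ("Linux runs on many machines. Decades earlier, researchers "
         "at Bell Labs wrote the first version of UNIX.")

print(near(doc_a, "Linux", "UNIX"))  # True: the words are 4 positions apart
print(near(doc_b, "Linux", "UNIX"))  # False: the words are 16 positions apart
```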
If you ask for anything containing either February or tax, the search engine will return Taxes.txt. Although none of the files contains the word February, the Taxes.txt file contains the word tax as part of its title (the file name). This satisfies the request for either the first word or the second word. This kind of search is called a Boolean OR.
If you search for both May and 29, you will see Birthdays.txt and Holidays.txt. At this point, you will find out that Mary's birthday is on Memorial Day this year. Taxes.txt contains the word May, but not the number 29, so the file fails the "find files with the first term and the second term" test. This kind of search is called a Boolean AND.
You can stretch a bit by asking for only files that have both May and 29, but not Mary. A Boolean expression might state this search as follows:
((May AND 29) AND (NOT Mary))
This search first finds files matching the first term (it must have both May and 29), and then excludes files having Mary, leaving only Holidays.txt as the result. Suppose the search expression had been:
((May AND 29) OR (NOT Mary))
The search engine would have found all three files under this search expression. The Holidays.txt file is included because it has both May and 29; the Birthdays.txt file is included for the same reason; and the Taxes.txt file shows up because it doesn't have the word Mary.
A full-text index is obviously very powerful. Even in this limited example, you can clearly see the usefulness and flexibility of this kind of tool. Yet in a large database of files, thousands might include the word May. If the database includes source code files, hundreds of thousands of references to 29 might be found. Wouldn't it be nice to find only dates that look like birthdays, or the word May, but only if it is near the word 29, and not in any source code files?
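Both Boolean expressions map directly onto set operations: AND is intersection, OR is union, and NOT is the complement against the set of all files. The Python sketch below checks the results stated above; FILES, ALL_FILES, and files_with are hypothetical names, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Contents of Listings 31.1 through 31.3, keyed by file name.
FILES = {
    "Holidays.txt": "Holiday Schedule\nNew Year's Day, Monday, January 2.\n"
                    "Memorial Day, Monday 29 May\n"
                    "July 4th, Independence Day, Thursday",
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29",
    "Taxes.txt": "Fiscal year ends 31 December\n"
                 "Expect big write-off in May or June\n"
                 "Estimates due July 1",
}
ALL_FILES = set(FILES)

def files_with(term):
    """Return the set of files containing term as a whole word."""
    term = term.lower()
    return {
        name
        for name, text in FILES.items()
        if term in re.findall(r"[a-z0-9]+", text.lower())
    }

may, twenty_nine, mary = files_with("May"), files_with("29"), files_with("Mary")

# ((May AND 29) AND (NOT Mary))
first = (may & twenty_nine) & (ALL_FILES - mary)
print(sorted(first))   # ['Holidays.txt']

# ((May AND 29) OR (NOT Mary))
second = (may & twenty_nine) | (ALL_FILES - mary)
print(sorted(second))  # ['Birthdays.txt', 'Holidays.txt', 'Taxes.txt']
```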
Advanced search engines go one step beyond literal Boolean searches and give you the means to do more.
Advanced Searches
Advanced searching techniques go beyond literal matching. A search that doesn't rely on exact matches is often called a fuzzy search. It is not based on Boolean algebra, with its mixture of AND, OR, and NOT operators, although these might come into play if appropriate. Instead, it tries to identify concepts and patterns and deal with information rather than data.
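The chapter does not say how a fuzzy search is implemented, and concept-level matching goes well beyond anything a few lines can show. As one very simple illustration of a search that tolerates inexact input, the Python sketch below substitutes approximate string matching from the standard difflib module. The vocabulary, the function name fuzzy_lookup, and the 0.8 similarity cutoff are assumptions made for this example.

```python
import difflib

# A small, made-up vocabulary of indexed words.
VOCABULARY = ["holiday", "schedule", "memorial", "independence",
              "birthdays", "fiscal", "december", "estimates"]

def fuzzy_lookup(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return indexed words that closely resemble term.

    cutoff is a similarity ratio between 0 and 1; 0.8 is an arbitrary
    choice for this sketch, not a recommended setting.
    """
    return difflib.get_close_matches(term.lower(), vocabulary,
                                     n=3, cutoff=cutoff)

# Misspelled queries still find the intended words.
print(fuzzy_lookup("decembr"))
print(fuzzy_lookup("birthdy"))
```

A literal search for decembr would return nothing; the approximate match recovers december because the two strings differ by a single character.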
Feel the Heat
Information is data that has been assigned meaning by a human. In a simple example, "It's 98 degrees" is data, whereas "It's hot" is information. As the amount of data on the Internet grows, the importance of distinguishing information from data skyrockets.
The ultimate artificial-intelligence search engine would have a DWIM, or "Do What I Mean," command. Putting data in context with other data is one way to derive information. Human language abounds with contextual references and implied scopes.
When you say "It's hot," for example, you probably don't mean "Somewhere in the world the temperature is such that someone might refer to it as hot."
You mean that you're feeling hot right now, regardless of the actual temperature.
The context and scope of your original statement are implied; the concomitant associations derive from the context, your knowledge of human behavior in general, and your behavior in particular.
If you searched the Internet for "Hot Babes" (not that you would ever do so), you would be disappointed if you got back pointers to the National Weather Service's reports mingled with articles about infant care. How can search engines figure out what kind of "hot" you mean? Can DWIM ever be achieved?
This question is a hot topic, the basis for an ongoing and bitter debate among philologists, linguists, artificial-intelligence theorists, and natural-language programmers. Almost as many sides exist as participants in the debate, and no one view clearly outstrips the rest. If you are interested in this sort of debate, check out the comp.ai.fuzzy newsgroup on Usenet, or stop by your local library or favorite online search engine and find references to AI and natural language.