Platinum Edition Using HTML 4, XML, and Java 1.2
The sections that follow focus on implementing the following five indexing search engines:
Implementing WebGlimpse
GLIMPSE (which stands for GLobal IMPlicit SEarch) and its Web companion, WebGlimpse, are projects of the University of Arizona's Computer Science Department. WebGlimpse is free for nonprofit use; commercial users pay a small licensing fee. The University has recently developed a new program called the Search Broker. The Search Broker forwards your query to a search engine dealing specifically with the subject of your question, which you specify as the first word of your query. A recent search of the Web turned up hundreds of sites that use WebGlimpse or its precursor, GlimpseHTTP. A partial list of sites is available at http://glimpse.cs.arizona.edu/ghttp/sites.html. GLIMPSE also serves as the basis for the Harvest Information Discovery and Access System (http://harvest.cs.colorado.edu/). As the name implies, the program displays glimpses (context samples) from the files. This makes it a particularly useful tool, even though it doesn't offer relevance ranking. GLIMPSE is available at You can obtain WebGlimpse at
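
The first-word routing that the Search Broker performs can be pictured with a small sketch. This is only an illustration of the idea, not the Search Broker's actual code: the subject names, the engine host names, and the function route_query are all hypothetical.

```python
# Hypothetical subject-to-engine table. The real Search Broker's list of
# subjects and engines is not given in the text.
ENGINES = {
    "recipes": "cooking-search.example",
    "stocks": "finance-search.example",
}


def route_query(query):
    """Treat the first word as the subject and return (engine, remaining query)."""
    subject, _, rest = query.strip().partition(" ")
    engine = ENGINES.get(subject.lower())
    if engine is None:
        raise KeyError(f"no engine registered for subject {subject!r}")
    return engine, rest.strip()


engine, remaining = route_query("stocks IBM price")
print(engine, "<-", remaining)
```

Here the word "stocks" selects the engine, and only "IBM price" is forwarded to it.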
The distribution comprises GLIMPSE, written in C; glimpseindex, another C program that creates the index; the webglimpse script itself, written in Perl; and an assortment of Perl utilities that you use to create and manage your indexes. Installation is mostly automated but definitely not foolproof, and it sometimes takes several attempts to complete cleanly. After installation, you need to run a Perl script that creates the WebGlimpse index using glimpseindex. GLIMPSE can build indexes of several sizes, from tiny (about 1% of the size of the source files) to large (up to 30% of the size of the source files). Even small indexes are practical and offer good performance. Other welcome features include the capability to index only those pages that have been added since the last indexing run, a facility to index offsite links, the capability to set a tolerance for spelling errors, and the capability to establish neighborhoods. A neighborhood is defined either as all pages that can be reached within an arbitrary number of link hops from a page or as all pages within a directory.
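
To make the hop-based definition of a neighborhood concrete, here is a minimal sketch that collects every page reachable within a given number of hops using a breadth-first search. It is not WebGlimpse's own code; the function name neighborhood, the links dictionary, and the page names are assumptions made for illustration.

```python
from collections import deque


def neighborhood(links, start, max_hops):
    """Return every page reachable from start in at most max_hops link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen


# Hypothetical link structure: page -> pages it links to.
links = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "c.html": ["d.html"],
}
print(sorted(neighborhood(links, "index.html", 2)))
# prints ['a.html', 'b.html', 'c.html', 'index.html']
```

With a limit of two hops, d.html is excluded because it is three links away from index.html.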
After the index has been established, you can use a cron job (a task that the cron daemon runs for you at defined times) to rerun the indexing script periodically and keep the index current. The installation routine even creates the job for you. Using the WebGlimpse Perl script (created by the install) to perform searches is easy. After you alias the proper directory on your Web server, you call the script with a parameter that indicates where the index resides. The user sees a basic search form if the script is called directly. Alternatively, you can include either of two code fragments in your Web pages to provide a nicer looking interface. The two interface styles are created using the HTML code fragments in Listing 31.15.
Listing 31.15 Glimform.txt: Two Forms for Calling WebGlimpse
<H2>Basic WebGlimpse Interface</H2>
<CENTER>
<TABLE border=5><TR border=0>
<TD align=center valign=middle>
<A HREF="http://glimpse.cs.arizona.edu/webglimpse">
<IMG src="/images/glimpse-eye.jpg" alt="WG" align=middle width=50><BR>
<FONT size="-3">WebGlimpse</FONT></A></TD>
<TD>
<FORM method=get ACTION="/$CGIBIN/webglimpse$ARCHIVEPWD">
<INPUT NAME=query size=20>
<INPUT TYPE=submit VALUE=Search>
<INPUT name=file type=hidden value="$FILE">
<A HREF="/$CGIBIN/webglimpse-fullsearch$ARCHIVEPWD?file=$FILE">
Search Options</A></TD></TR>
<TR><TD colspan=2>
Search: <INPUT TYPE=radio NAME=scope VALUE=neighbor CHECKED>
The neighborhood of this page
<INPUT TYPE=radio NAME=scope VALUE=full>The full archive
</TD></TR></FORM></TABLE></CENTER><HR>
<H2>Full-Featured WebGlimpse Interface</H2>
<TABLE border=5>
<TR><TD align=center valign=middle>
<A HREF="http://glimpse.cs.arizona.edu/webglimpse">
<IMG src="/images/glimpse-eye.jpg" align=middle></A></TD>
<TD align=center valign=middle>
<FONT size="+3"><A HREF="http://glimpse.cs.arizona.edu/webglimpse">
WebGlimpse</A> Search<BR></FONT></TD>
</TR>
<TR><TD colspan=2>
<FORM method=get ACTION="">
<INPUT name=file type=hidden value="/home/msmith/public_html/big/index.html">
Search: <INPUT TYPE=radio NAME=scope VALUE=neighbor>
The neighborhood of <A HREF="">the ACNielsen Web Site</A>
<INPUT TYPE=radio NAME=scope VALUE=full CHECKED>The full archive:
<A HREF="">the ACNielsen Site including links offsite</A>
</TD></TR>
<TR><TD colspan=2>
String to search for: <INPUT NAME=query size=30>
<INPUT TYPE=submit VALUE=Submit>
<BR>
<CENTER>
<INPUT NAME=case TYPE=checkbox>Case&#160;sensitive
<!-- SPACES -->&#160;&#160;&#160;
<INPUT NAME=whole TYPE=checkbox>Partial&#160;match
<!-- SPACES -->&#160;&#160;&#160;
<INPUT NAME=lines TYPE=checkbox>Jump&#160;to&#160;line
<!-- SPACES -->&#160;&#160;&#160;
<SELECT NAME=errors align=right>
<OPTION>0
<OPTION>1
<OPTION>2
</SELECT>
misspellings&#160;allowed
<BR>
</CENTER>
Return only files modified within the last
<INPUT NAME=age size=5> days.
<BR>
Maximum number of files returned:
<SELECT NAME=maxfiles>
<OPTION>10
<OPTION selected>50
<OPTION>100
<OPTION>1000
</SELECT>
<BR>Maximum number of matches per file returned:
<SELECT NAME=maxlines>
<OPTION>10
<OPTION selected>30
<OPTION>50
<OPTION>500
</SELECT>
<BR>
</FORM>
</TD></TR>
<TR><TD colspan=2>
<CENTER>
<FONT size="-2"><A HREF="http://glimpse.cs.arizona.edu">Glimpse</A> and
<A HREF="http://glimpse.cs.arizona.edu/webglimpse">WebGlimpse</A>,
Copyright &copy; 1996, Arizona Board of Regents.
</FONT></CENTER></TD></TR>
</TABLE>
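
Because both forms use method=get, submitting one simply requests the script with the field values appended as a query string. The sketch below builds such a URL from the field names that appear in Listing 31.15 (query, file, scope, errors, maxfiles, maxlines). The function search_url is hypothetical, and the cgi-bin directory and archive path in the usage lines are assumed stand-ins for the $CGIBIN and $ARCHIVEPWD placeholders, which the install script fills in.

```python
from urllib.parse import urlencode


def search_url(cgi_bin, archive_dir, file, query,
               scope="full", errors=0, maxfiles=50, maxlines=30):
    """Build the GET URL a browser would request when the form is submitted."""
    params = {
        "query": query,        # the search string typed by the user
        "file": file,          # the page the search was launched from
        "scope": scope,        # "neighbor" or "full", as in the radio buttons
        "errors": errors,      # misspellings allowed
        "maxfiles": maxfiles,  # maximum number of files returned
        "maxlines": maxlines,  # maximum matches per file
    }
    return f"/{cgi_bin}/webglimpse{archive_dir}?{urlencode(params)}"


# Assumed values for illustration only.
url = search_url("cgi-bin", "/home/msmith/public_html/big",
                 "/home/msmith/public_html/big/index.html", "search engine")
print(url)
```

The defaults mirror the options marked selected or CHECKED in the full-featured form; urlencode takes care of escaping spaces and slashes in the values.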
Copyright © 1996-2000 EarthWeb Inc. All rights reserved. Reproduction whole or in part in any form or medium without express written permission of EarthWeb is prohibited.