Platinum Edition Using HTML 4, XML, and Java 1.2:Indexing and Adding an Online Search Engine

Platinum Edition Using HTML 4, XML, and Java 1.2
(Publisher: Macmillan Computer Publishing)
Author(s): Eric Ladd
ISBN: 078971759x
Publication Date: 11/01/98

Implementing the Htgrep Search Engine

Htgrep, written by Oscar Nierstrasz, can be obtained at http://iamwww.unibe.ch/~scg/Src/Doc/htgrep.html or in the Software Composition Group Software Archives at http://iamwww.unibe.ch/~scg/Src/.

The major differences between Htgrep and Matt’s script are that Htgrep automatically recurses into subdirectories and that it supports Boolean AND searches as well as case-sensitive searches.

After you have installed the Perl script htgrep.pl and the associated scripts find.pl, html.pl, and bib.pl, you configure the base directory by changing a variable at the beginning of htgrep.pl. Other variables you configure include the path to users’ public HTML directories and any pseudo URLs (URLs that have been aliased) that you want included in the search.

Included in the package are a basic search form and a basic CGI wrapper script that you can use to control the behavior of htgrep.pl. The CGI wrapper appears on the accompanying CD-ROM as htgrep.cgi.

You will need to modify the wrapper to configure the location of your Perl library files. The CGI wrapper assumes that find.pl, which was used in an earlier example, is located in the library. You can find the find.pl program in the Htgrep distribution, if you don’t already have it.


The Htgrep wrapper script enables you to use either the POST method or the GET method to process the form. It first looks for information from a POST, using $ENV{'PATH_INFO'}, and then from a GET, using $ENV{'QUERY_STRING'}.
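As a rough illustration of where those two values come from, the sketch below reads both environment variables the way a CGI program would. It is written in Python rather than the wrapper's Perl, and the helper name read_cgi_request is hypothetical; it is not part of Htgrep. In standard CGI, PATH_INFO holds the extra path that follows the script name in the URL and QUERY_STRING holds everything after the question mark; the body of a POST request itself arrives on standard input, which this sketch does not read.

```python
import os
from urllib.parse import parse_qs


def read_cgi_request(environ=os.environ):
    """Return (path_info, query_fields) from a CGI-style environment.

    Hypothetical helper for illustration only; not part of Htgrep.
    """
    # Extra path that follows the script name in the URL, if any.
    path_info = environ.get("PATH_INFO", "")
    # Everything after the "?" in the URL, decoded into a dict of lists.
    query_fields = parse_qs(environ.get("QUERY_STRING", ""))
    return path_info, query_fields


if __name__ == "__main__":
    # A made-up environment standing in for what a Web server would set.
    fake_env = {
        "PATH_INFO": "/~scg/Src/Doc/htgrep.html",
        "QUERY_STRING": "isindex=perl&case=yes",
    }
    print(read_cgi_request(fake_env))
```

Passing a plain dictionary instead of the real environment makes the behavior easy to try out without a Web server.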

After you have configured the CGI wrapper, you need to build a form for your users to specify parameters. The form provided with the distribution appears in Listing 31.12.

Listing 31.12 Htform.txt—Sample Form for Use with HTGREP

<H2>Generic Form</H2>

<FORM ACTION="/~scg/cgi-bin/htgrep.cgi">
<P>
<INPUT
     NAME="file"
     SIZE=30
     VALUE="/~scg/Src/Doc/htgrep.html"
>
<!--
     VALUE="/~scg/Src/Doc/htgrep.html"
-->
<B>File to search</B> (relative to WWW home)
<BR>
<INPUT NAME="isindex" SIZE=30>
<B>Query</B>
<INPUT TYPE="submit" VALUE="Submit">
<INPUT TYPE="reset" VALUE="Reset">

<DL>

<DT><B>Query style:</B>
<DD>
<INPUT type="checkbox" name="case" value="yes">
Case Sensitive
<DD>
<INPUT type="radio" name="boolean" value="auto" checked="yes">
Automatic Keyword/Regex
<INPUT type="radio" name="boolean" value="yes">
Multiple Keywords
<INPUT type="radio" name="boolean" value="no">
Regular Expression

<DT><B>HTML Files:</B>
<DD>
<INPUT type="radio" name="style" value="none" checked="yes">
Ordinary Paragraphs
<INPUT type="radio" name="style" value="ol">
Numbered list
<INPUT type="radio" name="style" value="ul">
Bullet list
<INPUT type="radio" name="style" value="dl">
Description list

<DT><B>Plain Text:</B>
<INPUT type="radio" name="style" value="pre">
(preformatted)
<DD>
<INPUT type="checkbox" name="grab" value="yes">
Make URLs live (works with plain text only)

<DT><B>Refer Bibliography files:</B>
<INPUT type="checkbox" name="refer" value="yes">
<DD>
<INPUT type="checkbox" name="abstract" value="yes">
Show Abstract
<INPUT type="checkbox" name="ftpstyle" value="dir">
Link to directories, not files (for refer files)
<DD>
<INPUT type="radio" name="style" value="ul">
Bullet list (instead of numbered)

<DT><B>Max records to return:</B>
<INPUT NAME="max" VALUE="250" SIZE=10>
</DL>

</FORM>

This code produces a form similar to the one shown in Figure 31.5.

FIGURE 31.5 You can use the generic form provided with Htgrep to allow user input.
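Because the sample form has no METHOD attribute, a browser submits it with GET and appends the field values to the ACTION URL as a query string. The following Python sketch shows roughly what that request looks like when the defaults are left alone. The field names and default values come from Listing 31.12; the search term "perl" and the helper name build_query_url are made up for illustration. Unchecked check boxes are omitted because browsers do not send them.

```python
from urllib.parse import parse_qsl, urlencode

# Field names and defaults from the sample form; "perl" is a made-up query.
SAMPLE_FIELDS = [
    ("file", "/~scg/Src/Doc/htgrep.html"),
    ("isindex", "perl"),
    ("boolean", "auto"),
    ("style", "none"),
    ("max", "250"),
]


def build_query_url(action, fields):
    """Join the form's ACTION URL and its URL-encoded fields (hypothetical helper)."""
    return action + "?" + urlencode(fields)


if __name__ == "__main__":
    url = build_query_url("/~scg/cgi-bin/htgrep.cgi", SAMPLE_FIELDS)
    print(url)
    # Decoding the query string gives back the original name/value pairs.
    print(parse_qsl(url.split("?", 1)[1]))
```

The part after the question mark is what the CGI wrapper receives in QUERY_STRING.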

A welcome feature of Htgrep is its support for regular expressions. Although most users are probably not well versed in regular expressions, most can at least understand using a wildcard to fill out portions of words (in regular-expression syntax, .* matches any run of characters). Additionally, unless you use regular expressions, Htgrep searches on whole words, which keeps a short search term from matching in the middle of longer, unrelated words.
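The difference between a whole-word search and a regular-expression search can be sketched in a few lines of Python. This is only an illustration of the two matching styles, using word-boundary anchors for the whole-word case; whether Htgrep implements whole-word matching this way is an assumption, and both function names are hypothetical.

```python
import re


def whole_word_match(term, text, case_sensitive=False):
    """True if `term` occurs in `text` as a complete word."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape treats the term literally; \b anchors it at word boundaries.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags) is not None


def regex_match(pattern, text, case_sensitive=False):
    """True if the regular expression `pattern` matches anywhere in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, text, flags) is not None


if __name__ == "__main__":
    line = "A searchable index of the site"
    print(whole_word_match("search", line))   # no whole-word hit in "searchable"
    print(regex_match(r"search\w*", line))    # the regex matches "searchable"
```

A keyword search for "search" skips "searchable", while the pattern search\w* picks it up.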

Using the default search form, you can also determine the format of the resulting hits page—either full paragraphs or various types of listings.

NOTE: The capability to return full paragraphs was the key in my decision to use Htgrep on my site. Because a high proportion of the words that users are likely to search for occur in many documents on the site, I felt it was important to provide this context to help guide users to relevant pages quickly and easily.

To return entire paragraphs from a search, Htgrep takes a different approach to finding text in files. Rather than assembling one huge string from all the lines in the files, Htgrep lets you specify a record delimiter and then searches each record in a file. You may decide, for example, that you want HTML paragraph tags (<P>) to be your record delimiter. This record orientation is what allows Htgrep to return the context for a search hit: it returns the entire record in which it found the search term, so the user sees the whole paragraph and can better determine whether the page meets his or her needs. Htgrep does this by setting Perl’s input record separator variable, $/, as the following code fragment demonstrates:

# the default record separator is a blank line
#$separator = "";
$separator = "<P>";
[. . .]
     # normally records are separated by blank lines
     # if linemode is set, there is one record per line
     if ($tags{'linemode'} =~ /yes/i) { $/ = "\n"; }
     else { $/ = $separator; }
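The same record-oriented idea can be sketched outside Perl. The Python function below splits a document on a separator and returns every record that contains the search term, so each hit comes back with its surrounding paragraph. It is a minimal sketch under stated assumptions, not Htgrep's code: the name search_records and its parameters are hypothetical, and an empty separator is treated as "records are separated by blank lines" to mirror the default described in the fragment above.

```python
import re


def search_records(text, term, separator="<P>", case_sensitive=False):
    """Return each separator-delimited record of `text` that contains `term`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    needle = re.compile(re.escape(term), flags)
    if separator == "":
        # Assumed equivalent of blank-line separation: split on empty lines.
        records = re.split(r"\n{2,}", text)
    else:
        records = text.split(separator)
    # Keep non-empty records that contain the term, trimmed of whitespace.
    return [r.strip() for r in records if r.strip() and needle.search(r)]


if __name__ == "__main__":
    page = (
        "<P>Htgrep supports regular expressions."
        "<P>It returns whole records."
        "<P>Perl sets the record separator."
    )
    for hit in search_records(page, "record"):
        print(hit)
```

Unlike Perl's record reading, Python's split discards the separator itself, so the returned records do not include the <P> tag.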
