by Don Benage
This chapter introduces you to two relatively new members of the Microsoft BackOffice family. Index Server works hand in hand with Internet Information Server (IIS) by building indexes of the content that is published by IIS. Queries, both simple and advanced, can then be created by simply filling out a form using your Web browser. The contents of the form are processed by IIS and a search is made for matching content. The results are sorted, formatted, and returned to the user.
The Content Replication System (CRS) provides an important capability: copying Web content from one server to another. There are many reasons why this is necessary in a typical corporate intranet or professional Web site. These reasons are outlined, and the features provided by CRS are explored. In addition, this chapter provides the key planning concepts and network architecture of CRS. The procedures for testing, monitoring, and maintaining CRS are also provided.
Index Server provides content indexing for both IIS and Peer Web Services (PWS), the Web server component provided by Windows NT Workstation. After it has been installed and configured, it automatically maintains up-to-date indexes of the content stored on a Web server. Index Server is largely self-maintaining. There are no complicated maintenance procedures, and the product is designed to run unattended 24 hours a day, seven days a week. As new content is added to the Web server, Index Server updates its indexes in an incremental fashion, incorporating any new entries required without having to re-index all content on the Web server.
Index Server will not only index Hypertext Markup Language (HTML) Web pages, but also documents created with Microsoft Office products. It is capable of "seeing" the contents of the following types of documents:
In addition, it will index binary files based on their ActiveX properties, although it cannot, of course, scan the contents of such files. Other document types that are not supported by the standard filters included with the product can be indexed by creating custom document filters.
The resulting index is based not only on keywords in the text, but also on the Microsoft Office properties (summary and custom) or ActiveX properties of the file or document. All Microsoft Office document formats include a number of standard properties, such as Title, Subject, Author, Category, and Keywords. In general, the properties of a Microsoft Office document can be viewed by selecting File, Properties. In addition to the standard properties for a document, custom properties can be added by users.
Several sample query forms are provided that can be used to search for content that contains a particular keyword or phrase (see Figure 21.1). This sample query shows only the most rudimentary form of searching that is available. Custom query forms can be created that make it easy for users to formulate complex queries. These queries can perform searches using document properties, including custom properties that are tailored to specific applications if desired.
FIG. 21.1
Sample query forms can be used without modification to search for documents
or Web pages on your indexed servers.
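At its heart, such a form is ordinary HTML that submits the user's search text to Index Server for processing. The following fragment is only a sketch; the script path and the CiRestriction field name are modeled on the sample files shipped with Index Server and may differ on your installation.

<FORM ACTION="/scripts/samples/search/query.idq" METHOD="GET">
  Enter your query: <INPUT TYPE="TEXT" NAME="CiRestriction" SIZE="60">
  <INPUT TYPE="SUBMIT" VALUE="Search">
</FORM>

When the form is submitted, IIS hands the request to Index Server, which resolves it as described later in this chapter.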
Index Server is designed to run virtually maintenance free. Unless you need to create filters for custom document types, you may never need to worry about the underlying processes that make this product work. However, at some point something may go wrong that requires troubleshooting, and an understanding of the way indexes are created and queries are resolved can be useful. In addition, you may simply be curious about how content and property-based indexing is accomplished. An overview is presented here with additional details provided in the following sections. Some details are suppressed for clarity. See the product documentation for full details on all processes.
The process begins when a document is added to an indexed directory on a server. A scanning process recognizes that a new file has been added and invokes another process called CiDaemon. The document is analyzed and then filtered using an appropriate filter dynamic-link library (DLL) and a word-breaker DLL. (Index Server handles multiple languages, and uses different language-dependent rules to determine what to index.) The filtering process identifies keywords and properties, which are extracted from the document and added to word lists that reside in random-access memory (RAM). These word lists are subsequently incorporated into shadow indexes, which are eventually incorporated into the master index.
At some point, a user will open a query form and submit a query to the Index Server engine. The query is processed using information from the query form and a special type of file called an Internet data query file (which carries an .idq extension). The returned results are formatted as an HTML page with the aid of another type of file called an HTML extension file (.htx file), and are presented to the user's Web browser for display.
By creating custom query forms or custom .idq and .htx files, the query process and the format of the results can be tailored to suit particular needs. The default forms and files are suitable for general purpose indexing and reporting.
Anyone who has spent time "surfing" the World Wide Web has probably had an opportunity to use one of the search engines. These professionally run sites provide sophisticated searching capabilities based on the same type of indexing that is possible with Index Server. Search engines, such as Yahoo, Lycos, WebCrawler, and AltaVista, provide content indexes covering thousands of Web sites on the Internet. Often, after a Web site is found with one of these large search engines, you still need to find a particular page or document of interest.
Web sites that have a local search engine capability make it much easier to find exactly the subject matter you are after. You can use Index Server to provide such a search capability for both public Web sites and intranet sites that you manage. As the amount of subject matter that you include in your site grows, this capability will quickly become a necessity rather than a nice extra feature.
If you have used a public search engine or have other database experience, you already have a good idea what queries are all about. Information is entered into a form specifying what you are interested in finding. This form is then submitted to the Index Server engine for processing. The queries must be expressed in a query language that has many powerful features. Index Server's query language supports the following capabilities:
This list provides the basics of forming queries that Index Server can resolve. It is possible to create custom forms that simplify the process of formulating queries for a particular subject matter area. This is especially desirable if the user community at your organization is unfamiliar with query processing. However, many sites will not need to customize the query process at all.
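To give a feel for the syntax, a few illustrative restrictions follow. The operators and property names shown are drawn from the standard set documented in the Index Server online guide, and the values are purely hypothetical.

  apple AND pear              both words must appear in the document
  apple NEAR pear             the words must occur close to each other
  "content replication"       the exact phrase must appear
  @size > 100000              a property query: files larger than 100,000 bytes
  @DocAuthor = "Don Benage"   a property query against a standard Office summary property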
The mechanics of actually resolving the query involve the use of some special files. The original query is combined with information in an Internet data query (.idq) file. This file specifies how the query is to be processed. There are two possible sections in an .idq file: the names section and the query section. The names section is optional and is used only to define nonstandard column names that can be referred to in the query. This section is not needed for standard query processing. The query section is used to specify parameters that are used when the query is processed.
Parameters in the .idq file are specified in a variable=value format. A variety of parameters are available to control the behavior of query processing. For example, the location of your catalog that contains all indexes created by Index Server is specified in a variable called CiCatalog. Another variable, CiMaxRecordsInResultSet, controls how much information can be returned as results. The variable CiColumns controls the columns that are returned in the results page, and should match the columns referenced in the .htx file that is used to format the results.
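To make this concrete, here is a rough sketch of the query section of an .idq file, modeled on the sample files; the specific values and the template path are illustrative and assume a default installation.

  [Query]
  CiColumns=filename,size,rank,characterization,vpath,DocTitle,write
  CiRestriction=%CiRestriction%
  CiMaxRecordsInResultSet=200
  CiMaxRecordsPerPage=10
  CiScope=/
  CiFlags=DEEP
  CiTemplate=/scripts/samples/search/query.htx

The CiRestriction line picks up the query text entered on the form, and CiTemplate names the .htx file (described next) that formats the results.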
HTML extension files (with an .htx extension) are used to format the results. These files are created using HTML with conditional statements based on the variables defined and created in the .idq file that was used for the query being handled. Depending on your interest in customizing the query process, it can be enlightening to review the sample files provided to see how they are designed. The most basic query is handled with the following files, assuming you have accepted all defaults during installation:
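As an illustration, the portion of an .htx file that renders each matching document might look something like the following skeletal fragment; the column names are assumptions that correspond to the CiColumns setting shown earlier.

  <%begindetail%>
    <a href="<%vpath%>"><%DocTitle%></a> (<%size%> bytes)<br>
    <%characterization%><p>
  <%enddetail%>

Everything between <%begindetail%> and <%enddetail%> is repeated once for each row in the result set.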
Now that you have been introduced to the query process, the action that occurs behind the scenes to index the content on your Web servers is described. The next few sections track the various actions that occur when new content is added to an indexed directory on your Web server, culminating in a set of entries in the master index.
Indexing starts with a scanning process. By default, all virtual roots defined on your IIS server will be indexed. You can add additional virtual roots and include them in the indexing process as needed. You can also exclude virtual roots from indexing if you desire. These virtual roots can be directories on the Web server machine itself, or shared directories on other servers.
Windows NT supports automatic change notification and will initiate the scanning process when new files are added to the server. Other file servers (e.g., Windows 95 and Novell NetWare) do not support this feature, so directories on those servers must wait for a periodic, scheduled scan to occur. A Registry entry (ForcedNetPathScanInterval), which can be configured by the administrator, controls the frequency of these scans.
There are two types of scans that are performed by Index Server: incremental scan and full scan. The first time a directory is scanned, a full scan of all contents is performed. Thereafter, only an incremental scan is necessary to accommodate the changes that have occurred. Occasionally, an additional full scan could be necessary. For example, after a server suffers a catastrophic failure, a full scan is needed. No administrator intervention is normally needed, even for this type of recovery operation. Index Server has been designed to recover from failures automatically unless an unusual circumstance should occur and go undetected (e.g., Registry corruption). A full scan or an incremental scan can be forced at any time by the administrator.
Once a directory has been scanned, a three-step filtering process takes place, described as follows:
1. A filter DLL examines the document and extracts the text and properties that are accessible. For many binary files (those without a corresponding custom filter), only properties can be extracted.
2. A second DLL, known as a word-breaker DLL, parses the text and any textual properties into words.
3. The resulting list of words is compared with the list of noise words, which are removed and discarded. The remaining words will be processed and included in the index.
Filtering occurs under the direction of the CiDaemon process, which is spawned by the Index Server engine (see the following Note). CiDaemon analyzes the list of documents that have been scanned and queued for indexing, and determines the appropriate filter DLL and word-breaker DLL for each. As previously mentioned, different filter DLLs and word-breaker DLLs are required to handle documents of different types and in different languages.
NOTE: Daemon is another name for a background process that runs without requiring user intervention. The term is most commonly used in UNIX environments. In a Windows NT environment, the term service has roughly the same meaning and is used much more frequently.
In addition to generating words to be merged, the filtering process also generates a characterization. This is a short summary of the item being indexed that can aid the user in deciding if this is a document or file that is of interest. The Registry key GenerateCharacterization is set to 1 (by default). If this entry is set to 0, characterizations will not be generated.
Index Server is designed to be operational 24 hours a day, seven days a week. Therefore, it does most of its work in the background and attempts to work only when the server is idle. It also closes any documents that it is processing as quickly as possible if they are requested by another user or application. Filtering of that document will automatically be retried later.
CAUTION: The capability to quickly release files that are needed by another user or process is not available when indexing directories on shared network drives (on another server). In that case, the filtering process may temporarily hold a lock on a document while it is being filtered. Use discretion when deciding which directories should be indexed.
To avoid interfering with other more urgent processes on a server, the CiDaemon process runs in the idle priority class by default. In other words, it filters documents only when there is no other work of a higher priority to perform. If you intend to use Index Server on a fairly busy computer, this could result in lengthy delays before documents are filtered, and a subsequent backlog will occur. To increase the priority of the CiDaemon process (understanding that this may impact the throughput of other work on this server), you can set the ThreadPriorityFilter Registry key to THREAD_PRIORITY_NORMAL and the ThreadClassFilter to NORMAL_PRIORITY_CLASS.
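The Index Server parameters mentioned in this chapter (including ForcedNetPathScanInterval and GenerateCharacterization) live under the ContentIndex Registry key shown below. The following sketch, in REGEDIT4 (.reg) format, shows roughly what the priority change just described would look like; the numeric values shown are the standard Win32 constants, and you should confirm the value names and numbers against the product documentation for your release before making any change.

  REGEDIT4

  ; Sketch only - verify against the Index Server documentation before merging.
  [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex]
  "ThreadPriorityFilter"=dword:00000000
  "ThreadClassFilter"=dword:00000020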
CAUTION: Changing any of the Registry entries as described in this chapter, especially the priority level, should be approached with extreme care. Improperly editing the Registry can result in corruption of information and the need to reinstall the operating system and restore the most recent backup. This operation should be attempted only by experienced administrators, and only after a current backup is made and the RDISK utility is run to create an updated repair disk.
When the filtering process is complete, the resulting word lists are merged as described in the next section.
Indexes are used in many different computer applications. There are many different types of indexes for different purposes. The indexes created by Index Server are designed for the purpose of rapidly resolving the search queries used when trying to locate documents or other content on Web servers. The words and properties that have been extracted during the filtering process by CiDaemon are merged into a permanent index that is stored on disk.
Index Server must operate in an environment in which many other activities are (potentially) being performed on the same machine concurrently with its operations, including the need to resolve queries based on the current content and the indexes that already exist. For this reason, a multi-step process is used that culminates in a single, up-to-date index. Depending on the load placed on the server and the amount of new information that is being added, there are intermediate stages that result in a more complex state than a single index.
As already described, the filtering process results in word lists. These can be thought of as mini-indexes for a small collection of documents. They are stored in RAM as they await further processing. If a power loss occurs, these word lists are lost, but Index Server is designed to recover automatically from such an event. Because they exist in RAM, the creation of a word list is very fast.
Word lists are merged to form shadow indexes. These are stored on disk and will, therefore, survive a power loss. More than one shadow index can exist in the catalog, which is the directory containing all indexes for an Index Server. The process of merging word lists (and occasionally other shadow indexes) to form a shadow index is called a shadow merge. During the shadow merge process, additional compression is performed on the information stored in word lists to further optimize storage and retrieval.
A master merge is eventually performed to create the master index. During this process, all shadow indexes and the current master index are merged to create a new master index. At any given moment, there is only one master index. If the server has had an opportunity to "get caught up," then there will not be any shadow indexes or word lists, just the master index. In other words, if there is sufficient processing power and no new documents are added for a period of time, the natural progression of things will result in a single master index and no other index structures. As new documents are added, the process starts again. Index Server is capable of operating properly in any intermediate state, but is most efficient when working with just the (complete) master index.
The total number of indexes on a very busy server can grow as high as 255. If the server is so busy that even more shadow indexes would be required, the server will fail and some reconfiguration will be required to provide faster disk subsystems, additional CPU power, or other additional resources so that it can accommodate the load required. A master merge on a very active machine can be a complex and lengthy process. The automatic recovery capability of Index Server includes even this complex operation. System failure in the midst of a master merge operation is fully and automatically recoverable.
Now that you know how Index Server operates, it is time to learn how to install and use this powerful tool. The following procedure assumes that you have already installed Windows NT Server and IIS. In addition, you should be logged on with administrative rights to the machine that is being set up as an Index Server. If you want to index the contents of other file servers or Web servers, you should define additional virtual directories on the IIS WWW service using the Internet Service Manager. You can add additional virtual directories at a later time if you prefer. For more information about defining virtual directories, see Chapter 18, "Building a Web with Internet Information Server (IIS)."
To install Index Server, follow these steps:
FIG. 21.2
This dialog box is used to specify the location of IIS scripts.
Once you have installed Index Server, you naturally will be eager to test its functionality. If you already have an operational Web server with an interesting collection of content, all you need to do is wait. Index Server's operations are automatic, and the scanning, filtering, merging, and index creation process occurs without further intervention. Depending on the amount of content and the load placed on the server, you should allow anywhere from 10 minutes to several hours for the indexing process to produce useful results.
You can start by reviewing the online Index Server Guide. This is accessible by choosing Start, Programs, Microsoft Index Server, Index Server Online Documentation (see Figure 21.3). Alternatively, you can connect to this Web page by manually entering the URL (http://<servername>/srchadm/help/default.htm by default).
FIG. 21.3
Index Server includes online documentation in HTML format.
Once an appropriate period of time has elapsed, you are ready to try a search. Choose Start, Programs, Microsoft Index Server, Index Server Sample Query Form (see Figure 21.4). This Web page is the sample query form that is discussed in the introduction to this chapter. It enables you to test the search capabilities of Index Server.
FIG. 21.4
The sample query form provided with Index Server is a working search page that can be used without modification or customized to meet specific needs.
To test your Index Server, follow these steps:
Source control systems, used in the past to coordinate the orderly interaction of a group of programmers working together on a body of computer code, have been pressed into duty to manage Web content. These systems not only enable users to "check out" and "check in" files, but also track revisions and even restore an older version if the latest becomes corrupted. Microsoft's SourceSafe is an example of this genre, and a white paper is available on the Microsoft Web site (http://www.microsoft.com/ssafe) describing the use of this product for Web content management.
A new entry in the Web management repertoire is the Microsoft Content Replication System (CRS). This product is designed to move Web content from one computer to another. There are a variety of scenarios in which the product can be used, and a number of methods that are supported. This section does the following:
Although the product could be used to replicate arbitrary information, it is specifically engineered to be used in a Web server environment. For example, one option enables you to specify the content to be replicated by providing a Uniform Resource Locator (URL), basically a Web address.
In this section, a variety of scenarios involving the need for content replication are explored. In addition, the role that CRS can play is outlined in order to familiarize you with the type of work this product can perform. Because this is a relatively new product category that is much less familiar than others (e.g., word processing), it is important to spend a little time understanding how it is used before actually deploying it in a production environment.
If you have a modest Web site with under 50 pages managed by a single author, these scenarios will not reflect your environment. When your site grows to hundreds or thousands of pages with dozens of authors, some content management tools are necessary.
A simple example is presented first. Only two computers are involved in this scenario, although it could be a component in a much larger architecture. Figure 21.7 represents a Web content developer's desktop computer linked to a Web server. The content developer may be using a variety of tools to create Web pages and test them on a Peer Web Server, such as that provided by Windows NT Workstation. The tools used to create the pages are unimportant in terms of CRS.
FIG. 21.7
In this simple content replication scenario, Web content is pulled from a desktop computer to a server.
On the same network is a Web server: a Windows NT Server running IIS. The same server is also running CRS. This server would typically be a more powerful machine than the Web developer's desktop computer, and may be locked in a wiring closet for security reasons. Because of its greater power and security, it is a suitable platform for sharing information with a large number of concurrent users and, therefore, an appropriate target to place the Web content that has been developed.
At regular intervals, perhaps every night or once a week, content is pulled (copied) from the developer's computer to the Web server. The only action required by the developer is to make sure he has placed finished Web pages at the designated link location specified in the replication. This could even be a simple "under construction" page, but it should never be content that yields errors. A partially finished page or series of pages that are actively being developed should be kept in a different location during construction. The developer can point his own browser at this temporary location for viewing and testing links, then copy to the pickup location when he is satisfied with the results.
All activities performed by CRS are managed as projects. Even a very complex architecture involving many servers and other computers in locations around the world can be broken down into a collection of projects. In this first scenario, all that is needed is a simple pull replication project. To create the project, the URL of the Peer Web Server being used by the content developer is provided, along with the destination to which the content should be copied. This location should be an active URL on the Web server that will act as the final destination of this content.
One of the key advantages of using CRS, even in this very simple scenario, is the capability to automate the administrative task of moving the content to the active Web server. The Web developer is presumably involved in this activity regularly, perhaps even full time, but it is a nuisance to require intervention by an administrator in order to move the content to the server. However, it may be unacceptable to provide administrative access to the server to a group of Web developers. By automating the process, the content is moved to the right place, simply and securely, at a regularly designated interval.
The next scenario builds on the first, and demonstrates a situation in which CRS plays a more valuable role. In this example, shown in Figure 21.8, the content on a single Web server is replicated to three additional servers. Depending on the network architecture and links between the servers, it may be possible to copy the information to all three servers simultaneously. This is a feature of CRS that can be useful if the links between the servers are deemed reliable.
FIG. 21.8
Content from one server can be replicated to multiple target servers. This can occur simultaneously in some situations.
If the links between servers are subject to regular outages, a frame copy can be used. With this feature, content is broken down into frames and sent with error correcting protocols that can detect if a frame has become garbled during transmission. In addition, replication projects that are interrupted can be restarted at the point of disruption; they do not need to restart at the beginning as required when using standard file transfer protocols. This can be a critical feature if you are replicating very large files or a large collection of small files over the Internet or private WAN.
Figure 21.9 shows this scenario developed even further. The content is replicated from a single originating server to a staging server. From here, it is replicated over the Internet in a series of discrete projects to three other Web servers in different locations. This distribution is done for the purpose of providing content at a location near its intended audience. Although in theory any Internet user can reach any public server, reliability and throughput will be enhanced by local availability if you intend to serve a large population of users.
FIG. 21.9
A relatively sophisticated replication scenario involving multiple locations with multiple servers at each is shown here. A staging server is used as a distribution point.
At each of these locations, the content is further replicated to multiple local servers for the purpose of providing scalability and redundancy. If a single server fails, the site as a whole is still available due to the availability of one or more backup servers. In addition, the load can be distributed among the servers to provide improved responsiveness during peak access times.
The final scenario is an example involving a very large corporate Web site. In Figure 21.10, each department is responsible for creating Web content describing its products or services. A central group of Webmasters creates the primary home page (www.companyname.com) and manages the overall infrastructure of servers and replication projects.
FIG. 21.10
The maintenance of a large corporate Web site can be automated by using CRS.
Each department can be assigned a primary URL that corresponds to a virtual root on the corporate Web server. This location is referenced by a link from the main home page. It is also the target of a content replication project that moves content from a departmental staging server to the primary Web servers at regular intervals. The department's own Web content developers can build whatever structure they want within their own pages and can refer to other well-defined URLs in other departmental pages. If references are made to another department's content other than the virtual root, care must be taken to ensure that the URLs don't change without notification.
The primary administrative chore then becomes the maintenance of CRS projects. CRS lends itself to automatic monitoring through its support of Performance Monitor counters. Alerts can be created in Performance Monitor to notify the appropriate administrators by launching a batch file or application. You could use this mechanism to send an e-mail message or trigger a beeper.
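As a simple illustration, the program launched by such an alert could be nothing more than a short batch file like the following sketch. The workstation name WEBADMIN and the log path are placeholders, and the notification mechanism should be whatever your organization already relies on.

  @echo off
  rem Hypothetical alert handler launched by a Performance Monitor alert.
  rem Log the event and notify an administrator's workstation.
  echo CRS replication alert on %COMPUTERNAME% >> C:\CrsAlerts.log
  net send WEBADMIN "CRS replication alert on %COMPUTERNAME%"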
There are additional features for automatic server and process monitoring built into SQL Server and Exchange Server that can be used to monitor the services that implement CRS, create entries in the Windows NT event log, and restart services or even reboot servers automatically. For more information about automating network administration tasks, see Chapter 50, "Proactive Network Administration."
In this section, you learn how to install CRS on a server. If you intend to use the command-line interface exclusively, then any Windows NT server with sufficiently powerful components (processor, disk drive subsystem, and network interface) will suffice. CRS can be run on either Windows NT Server or Windows NT Workstation. Clearly, if the CRS system is also intended to act as a Web server (as opposed to just a staging server for information that is being moved), it must be running IIS.
Most people will want to take advantage of the CRS Web Administration tool, even if they occasionally use the command-line interface for auxiliary tasks or to confirm the status of a project. This tool is somewhat different from other BackOffice administration tools. It is Web browser-based, an approach that a growing number of BackOffice products are adopting but that is still fairly new. It also uses a different style of buttons and controls than you may be used to if you manage other BackOffice products. You may also find it necessary to use the Refresh button to update the display more often with this tool than with most others in the BackOffice administration suite, primarily because of its Web-based design. It is, however, relatively easy to learn and use.
NOTE: In order to use the Web Administration tool, you must install CRS on a Windows NT 4.0 system that has been configured to use the NTFS file system. In order to ensure security, the CRS Web Administration tool is not supported on disk drives configured with FAT partitions.
As with all BackOffice products, administrative tasks can be initiated from client computers connected to the server over the network. In order to run the CRS Web Administration tool on a client computer, you must be running either Microsoft Internet Explorer 3.01 (or later) or Netscape Navigator 3.0 (or later).
In order to administer CRS, you must be able to access the Web Administration tool through IIS. This is subject to IIS security restrictions as described in Chapter 18, "Building a Web With Internet Information Server (IIS)." In addition, you must be an administrator for the Windows NT system that runs CRS. You should also create a service account to run CRS services using the User Manager for Domains utility. For more information about Windows NT security, see Chapter 7, "Administering Windows NT Server."
To install CRS, follow these steps:
The CRS Start Page provides an introduction to CRS and links you to the Web Administration tool.
You are now ready to use CRS to move Web content on your network. This is described in the next section.
There are two interfaces that can be used to control CRS. The first is a Web browser-based interface that uses JScript applications to create and monitor the status of projects. The second is a comprehensive command-line interface. Both of these interfaces are described in this section.
Remember that all replication events are managed as projects, no matter how complicated an architecture you want to create. These projects operate on pairs of servers, either pulling or pushing information from one server to the other. Even large replication architectures spanning global networks are based on this simple concept. In the next two sections you will learn how to create both push and pull projects, the basic ingredients for all replication scenarios.
It is a good idea to experiment with CRS in a lab environment before deploying it on your production servers. This is generally true of all BackOffice products, but it can be especially important with a product like CRS that is capable of moving very large amounts of information, and consequently having a big impact on network bandwidth and server performance. When it comes time to implement CRS in your production environment, you should have a good idea what will happen, based on the tests you have performed. You want to know how long a typical operation will take, the best time of day to perform that operation, and the impact a replication project will have on active Web users.
A pull project is designed to connect to a source URL (a Web address) and copy the content it finds there to a specified target directory on the server that is running the pull project. The server "pulls" the content from the source, hence the name.
You can either request that all content at the source be pulled, or only a specified number of levels be copied. If you specify a limit of two levels, for example, the initial page (typically default.htm or index.htm, depending on how the server is configured) will be copied, and the links on that page will be followed. The content of the linked pages will be copied, and the links contained in them will be followed and copied. This is what is meant by two levels deep.
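For example, with a hypothetical site laid out as follows, a two-level pull would copy the starting page and the pages it links to directly, but not pages linked only from those pages:

  default.htm                 level 1 (starting page) - copied
      products.htm            level 2 (linked from default.htm) - copied
          pricelist.htm       level 3 (linked only from products.htm) - not copied
      contact.htm             level 2 (linked from default.htm) - copied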
To create a pull project using the Web Administration tool, follow these steps:
A push project operates in a reverse manner from a pull project. It is initiated on the source server, and "pushes" content to one or more destination servers. In addition, a push project must be defined on both the source and destination servers, although only the source server is configured for a target (destination).
To create a push project using the Web Administration tool, follow these steps:
This brief introduction to the Content Replication System should give you a solid basis on which to build additional knowledge and experience with this new member of the Microsoft BackOffice family. This product is only in its first release, and it will undoubtedly continue to be enhanced with additional features and capabilities. With the growing use of the Web, and the challenges inherent in managing the large volume of content necessary for a great Web site, the importance of this type of tool will multiply.
This chapter presented the features available in two Microsoft BackOffice family components: Index Server and the Content Replication System. It provided detailed information on installing, configuring, and using these components to help manage your Web site. For more information about the topics addressed in this chapter, see the following chapters:
© Copyright, Macmillan Computer Publishing. All rights reserved.