Special Edition Using Microsoft BackOffice, Volume I

Chapter 21 Implementing Index Server and the Content Replication System

by Don Benage

Index Server overview

Discover the features of Index Server and the manner in which it works with Internet Information Server (IIS).

Creating a query

Learn the procedures for creating queriesóboth simple and advanced.

Index Server architecture and administration

Although this product is largely self-maintaining, there are a few things you should know about configuring Index Server and how it operates.

Content Replication System (CRS) overview

Review a description of the Content Replication System and its operational characteristics.

CRS network architecture

Explore the proper way to deploy CRS in a small corporate intranet or the international Internet environment.

CRS maintenance

After installing and testing CRS, learn how to monitor and maintain it during daily operations.

This chapter introduces you to two relatively new members of the Microsoft BackOffice family. Index Server works hand in hand with Internet Information Server (IIS) by building indexes of the content that is published by IIS. Queries, both simple and advanced, can then be created by simply filling out a form using your Web browser. The contents of the form are processed by IIS and a search is made for matching content. The results are sorted, formatted, and returned to the user.

The Content Replication System (CRS) provides an important capability: copying Web content from one server to another. There are many reasons why this is necessary in a typical corporate intranet or professional Web site. These reasons are outlined, and the features provided by CRS are explored. In addition, this chapter provides the key planning concepts and network architecture of CRS. The procedures for testing, monitoring, and maintaining CRS are also provided.

An Index Server Overview

Index Server provides content indexing for both IIS and Peer Web Services (PWS), the Web server component provided by Windows NT Workstation. After it has been installed and configured, it will automatically maintain up-to-date indexes of the content that is stored on a Web server. Index Server is largely self maintaining. There are no complicated maintenance procedures, and the product is designed to run unattended 24 hours a day, seven days a week. As new content is added to the Web server, it automatically updates its indexes in a incremental fashion by incorporating any new entries required without having to re-index all content on the Web server.

Index Server will not only index Hypertext Markup Language (HTML) Web pages, but also documents created with Microsoft Office products. It is capable of "seeing" the contents of the following types of documents:

Microsoft Word
Microsoft Excel
Microsoft PowerPoint
HTML
Plain (ASCII) text

In addition, it will index binary files based on their ActiveX properties but, of course, it cannot scan the contents of such files. Other document types can be indexed by creating custom document filters, but they are not supported by the standard filters included with the product.

The resulting index that is created is based not only on keywords in the text, but also on the Microsoft Office properties (summary and custom) or ActiveX properties of the file or document. All Microsoft Office document formats include a number of standard properties, such as Title, Subject, Author, Category, and Keywords. In general, the properties of a Microsoft Office document can be viewed by selecting File, Properties. In addition to the standard properties for a document, custom properties can be added by users.

Several sample query forms are provided that can be used to search for content that contains a particular keyword or phrase (see Figure 21.1). This sample query shows only the most rudimentary form of searching that is available. Custom query forms can be created that make it easy for users to formulate complex queries. These queries can perform searches using document properties, including custom properties that are tailored to specific applications if desired.

FIG. 21.1
Sample query forms can be used without modification to search for documents or Web pages on your indexed servers.

Index Server Architecture

Index Server is designed to run virtually maintenance free. Unless you need to create filters for custom document types, you may never need to worry about the underlying processes that make this product work. However, at some time something may go wrong that requires troubleshooting and an understanding of the way indexes are created and queries are resolved can be useful. In addition, you may simply be curious about how content and property-based indexing is accomplished. An overview is presented here with additional details provided in the following sections. Some details are suppressed for clarity. See the product documentation for full details on all processes.

The process begins when a document is added to an indexed directory on a server. A scanning process recognizes that a new file has been added and invokes another process called CiDaemon. The document is analyzed and then filtered using an appropriate filter dynamic-link library (DLL) and a word-breaker DLL. (Index Server handles multiple languages, and uses different language-dependent rules to determine what to index.) The filtering process identifies keywords and properties that are extracted out of the document and added to the random-access memory (RAM) resident word lists. These word lists are subsequently incorporated into shadow indexes, which are eventually incorporated into the master index.

At some time, a user will open a query form and submit a query to the Index Server engine. The query is processed using information from the query form and a special type of file called an Internet data query file (which carries an .idq extension). The query is processed, and the returned results are formatted as an Hypertext Markup Language (HTML) page with the aid of another type of file called an HTML extension file (.htx file). The results are presented to the user's Web browser for display.

By creating custom query forms or custom .idq and .htx files, the query process and the format of the results can be tailored to suit particular needs. The default forms and files are suitable for general purpose indexing and reporting.

Queries

Anyone who has spent time "surfing" the World Wide Web has probably had an opportunity to use one of the search engines. These professionally run sites provide sophisticated searching capabilities based on the same type of indexing that is possible with Index Server. Search engines, such as Yahoo, Lycos, WebCrawler, and AltaVista, provide content indexes at thousands of Web sites on the Internet. Often, after a Web site is found with one of these large search engines, you still need to find a particular page or document of interest.

Web sites that have a local search engine capability make it much easier to find exactly the subject matter you are after. You can use Index Server to provide such a search capability for both public Web sites and intranet sites that you manage. As the amount of subject matter that you include in your site grows, this capability will quickly become a necessity rather than a nice extra feature.

If you have used a public search engine or have other database experience, you already have a good idea what queries are all about. Information is entered into a form specifying what you are interested in finding. This form is then submitted to the Index Server engine for processing. The queries must be expressed in a query language that has many powerful features. Index Server's query language supports the following capabilities:

Boolean operators (AND, OR, and NOT) can be used. For example, robot AND human finds documents with both words; robot OR human finds documents with either word.
The proximity operator (NEAR) can be used to specify that two search items must appear in the same area of the document. For example, fiscal NEAR responsibility should find documents that address fiscal responsibility, even if the explicit phrase does not appear in the document.
Queries are case-insensitive (i.e. uppercase and lowercase letters are treated the same). For example, Robot, ROBOT, and robot are all treated the same in a query.
The wild card (*) can be used to match a variable number of arbitrary characters at the end of a word. For example, pre* matches prehistoric, prefix, preparation, prevention, and all other words beginning with pre. (This is an example of a query that probably needs to be refined to be useful).
An additional wild card operator (**) will find words based on the same stem word. For example, write** will match write, wrote, written, and writing. Note that while most of these words would match writ* (using the standard wild card), wrote would not. This operator is very useful for matching words that change form as tense changes (fall versus fell, for example) and other similar situations that confound matching with the standard wild card.
Punctuation (e.g., period, comma, and semicolon) is ignored.
Certain words are designated as noise words (e.g. and, the, it), and are completely ignored. The list of noise words (which is in <systemroot>\system32\noise.enu for the English language by default) can be edited if desired. Noise words, also called stop words, are never indexed and, therefore, will never result in a hit if included in a query.
You can use noise words as part of a phrase by enclosing the phrase in quotation marks. The word and is both a noise word and a Boolean operator. If you want to search for the phrase salt and pepper, you should enclose the phrase in quotation marks ("salt and pepper"). Without the quotation marks, salt and pepper will match documents that have both words in them, but do not necessarily contain the phrase. Because noise words are not indexed, they can never be found unless they are part of a phrase.
Free-text queries are supported. These are queries that ask a question in English (or some other language). To enter a free-text query, you must precede the query with the $contents operator. For example, the query $contents how do birds fly? will attempt to find documents that mention birds and flying.
Queries based on properties take the form @property operator value. For example, to find all documents smaller than 4096 bytes, you would enter @size < 4096. See the product documentation for a list of supported property names.

This list provides the basics of forming queries that Index Server can resolve. It is possible to create custom forms that simplify the process of formulating queries for a particular subject matter area. This is especially desirable if the user community at your organization is unfamiliar with query processing. However, many sites will not need to customize the query process at all.

The mechanics of actually resolving the query involve the use of some special files. The original query is combined with information in an Internet data query (.idq) file. This file specifies how the query is to be processed. There are two possible sections in an .idq file: the names section and the query section. The names section is optional and is used only to define nonstandard column names that can be referred to in the query. This section is not needed for standard query processing. The query section is used to specify parameters that are used when the query is processed.

Parameters in the .IDQ file are specified in a variable=value format. A variety of parameters are available to control the behavior of query processing. For example, the location of your catalog that contains all indexes created by Index Server is specified in a variable called CiCatalog. Another variable, CiMaxRecordsInResultSet, controls how much information can be returned as results. The variable CiColumns controls the columns that are returned in the results page, and should match the columns referenced in the .htx file that is used to format the results.

HTML extension files (with a .HTX extension) are used to format the results. These files are created using HTML with conditional statements based on the variables defined and created in the .idq file that was used for the query being handled. Depending on your interest in customizing the query process, it can be enlightening to review the sample files provided to see how they are designed. The most basic query is handled with the following files, assuming you have accepted all defaults during installation:

c:\InetPub\wwwroot\samples\Search\query.htmóThis is the basic query form that is presented as a sample for you to test the use of Index Server. By default, it is not linked into any home page on your Web server. You must add a link to this page or connect to it by entering the URL directly into your browser ( http://<servername>/samples/search/query.htm) if you want to use it.
c:\InetPub\scripts\samples\Search\query.idqóThis is the Internet data query file that corresponds to the basic query form. You can read the comments provided, which describe the variables used and make suggestions for configuration changes that a Webmaster might make.
c:\InetPub\wwwroot\samples\Search\query.htxóThis is the HTML extension file for the results that are returned by queries created with the two preceding files. This can also be customized to change the way the information is presented.

Now that you have been introduced to the query process, the action that occurs behind the scenes to index the content on your Web servers is described. The next few sections track the various actions that occur when new content is added to an indexed directory on your Web server, culminating in a set of entries in the master index.

Scanning

Indexing starts with a scanning process. By default, all virtual roots defined on your IIS server will be indexed. You can add additional virtual roots and include them in the indexing process as needed. You can also exclude virtual roots from indexing if you desire. These virtual roots can be directories on the Web server machine itself, or shared directories on other servers.

Windows NT supports automatic change notification and will initiate the scanning process when new files are added to the server. Other file servers (e.g., Windows 95 and Novell NetWare) do not support this feature, and must wait for a periodic, scheduled scan to occur. A Registry entry (ForcedNetPathScanInterval), which can be configured by the administrator, controls the frequency of these scans.

There are two types of scans that are performed by Index Server: incremental scan and full scan. The first time a directory is scanned, a full scan of all contents is performed. Thereafter, only an incremental scan is necessary to accommodate the changes that have occurred. Occasionally, an additional full scan could be necessary. For example, after a server suffers a catastrophic failure, a full scan is needed. No administrator intervention is normally needed, even for this type of recovery operation. Index Server has been designed to recover from failures automatically unless an unusual circumstance should occur and go undetected (e.g., Registry corruption). A full scan or an incremental scan can be forced at any time by the administrator.

Filtering

Once a directory has been scanned, a three-step filtering process takes place, described as follows:

1. A filter DLL examines the document and extracts the text and properties that are accessible. For many binary files (that don't have a corresponding custom filter), only properties would be able to be extracted.

2. A second DLL, known as a word-breaker DLL, parses the text and any textual properties into words.

3. The resulting list of words is compared with the list of noise words, which are removed and discarded. The remaining words will be processed and included in the index.

Filtering occurs under the direction of the CiDaemon process, which is spawned by the Index Server engine (see the following Note). It must analyze the list of documents that have been scanned and sent for indexing to determine which is the appropriate filter DLL and which is the appropriate word-breaker DLL. As previously mentioned, different filter DLLs and word-breaker DLLs are required to handle documents of different types and in different languages.

NOTE: Daemon is another name for a background process that runs without requiring user intervention. The term is most commonly used in UNIX environments. In a Windows NT environment, the term service has roughly the same meaning and is used much more frequently.

In addition to generating words to be merged, the filtering process also generates a characterization. This is a short summary of the item being indexed that can aid the user in deciding if this is a document or file that is of interest. The Registry key GenerateCharacterization is set to 1 (by default). If this entry is set to 0, characterizations will not be generated.

Index Server is designed to be operational 24 hours a day, seven days a week. Therefore, it does most of its work in the background and attempts to work only when the server is idle. It also closes any documents that it is processing as quickly as possible if they are requested by another user or application. Filtering of that document will automatically be retried later.

CAUTION: If directories on shared network drives (on another server) are being indexed, the files that are opened by Index Server will not be quickly released when another user or process requests them. This feature (quickly releasing files needed elsewhere) is not available when indexing shared network drives. Therefore, the filtering process may temporarily hold a file lock on a document while it is being filtered. Use discretion when deciding which directories should be indexed.

To avoid interfering with other more urgent processes on a server, the CiDaemon process runs in the idle priority class by default. In other words, it filters documents only when there is no other work of a higher priority to perform. If you intend to use Index Server on a fairly busy computer, this could result in lengthy delays before documents are filtered and a subsequent backlog will occur. To increase the priority of the CiDaemon processóunderstanding that this may impact the throughput of other work on this serveróyou can set the ThreadPriorityFilter Registry key to THREAD_PRIORITY_NORMAL and the ThreadClassFilter to NORMAL_PRIORITY_CLASS.

CAUTION: Changing any of the Registry entries as described in this chapter, especially the priority level, should be approached with extreme care. Improperly editing the Registry can result in corruption of information and the need to reinstall the operating system and restore the most recent backup. This operation should be attempted only by experienced administrators, and only after a current backup is made and the RDISK utility is run to create an updated repair disk.

When the filtering process is complete, the resulting word lists are merged as described in the next section.

Merging and Index Creation

Indexes are used in many different computer applications. There are many different types of indexes for different purposes. The indexes created by Index Server are designed for the purpose of rapidly resolving the search queries used when trying to locate documents or other content on Web servers. The words and properties that have been extracted during the filtering process by CiDaemon are merged into a permanent index that is stored on disk.

Because Index Server must operate in an environment in which many other activities are being performed (potentially) on the same machine concurrently with its operationsóincluding the need to resolve queries based on the current content and indexes that already existóa multi-step process is used that culminates in a single, up-to-date index. Depending on the load placed on the server and the amount of new information that is being added, there are intermediate stages that result in a more complex state than a single index.

As already described, the filtering process results in word lists. These can be thought of as mini-indexes for a small collection of documents. They are stored in RAM as they await further processing. If a power loss occurs, these word lists are lost, but Index Server is designed to recover automatically from such an event. Because they exist in RAM, the creation of a word list is very fast.

Word lists are merged to form shadow indexes. These are stored on disk and will, therefore, survive a power loss. More than one shadow index can exist in the catalog, which is the directory containing all indexes for an Index Server. The process of merging word lists (and occasionally other shadow indexes) to form a shadow index is called a shadow merge. During the shadow merge process, additional compression is performed on the information stored in word lists to further optimize storage and retrieval.

A master merge is eventually performed to create the master index. During this process, all shadow indexes and the current master index are merged to create a new master index. At any given moment, there is only one master index. If the server has had an opportunity to "get caught up," then there will not be any shadow indexes or word listsójust the master index. In other words, if there is sufficient processing power and no new documents are added for a period of time, the natural progression of things will result in a single master index and no other index structures. As new documents are added, the process starts again. Index Server is capable of operating properly in any intermediate state, but is most efficient when working with just the (complete) master index.

The total number of indexes on a very busy server can grow as high as 255. If the server is so busy that even more shadow indexes would be required, the server will fail and some reconfiguration will be required to provide faster disk subsystems, additional CPU power, or other additional resources so that it can accommodate the load required. A master merge on a very active machine can be a complex and lengthy process. The automatic recovery capability of Index Server includes even this complex operation. System failure in the midst of a master merge operation is fully and automatically recoverable.

Installing and Using Index Server

Now that you know how Index Server operates, it is time to learn how to install and use this powerful tool. The following procedure assumes that you have already installed Windows NT Server and IIS. In addition, you should be logged on with administrative rights to the machine, which is set up as an Index Server. If you want to index the contents of other file servers or Web servers, you should define additional virtual directories on the IIS WWW service using the Internet Service Manager. You can add additional virtual directories at a later time if you prefer. For more information about defining virtual directories, see Chapter 18, "Building a Web with Internet Information Server (IIS)."

To install Index Server, follow these steps:

Insert the appropriate CD or connect to a shared network directory containing the Index Server distribution file.
Launch the executable distribution file (is11enu.exe for the English language version).
A Welcome dialog box is displayed. Click the Continue button.
Another dialog box is displayed (see Figure 21.2). This dialog box requests the location (the full physical path) of the IIS scripts directory so that it can install Index Server sample scripts. By default the directory is c:\InetPub\scripts. A virtual directory can also be used. Either accept the default, or override it with your preferred location. Click the Continue button.
FIG. 21.2
This dialog box is used to specify the location of IIS scripts.
Another similar dialog box is presented. Enter the location of the IIS virtual root. By default, this is c:\InetPub\wwwroot. Again, either accept the default, or override it with your preferred location. Click the Continue button.
A final dialog box requests the location to store the Index Server catalog (the collection of indexes including the master index). By default, this is c:\ISIndex. Choose a location for the catalog to be stored, or accept the default. Click the Continue button.
A file copying progress bar is displayed. When all the files have been copied, you are informed that the process is complete, and the URL for the sample search page (also called a query form) is provided. By default, this is http://<servername>/Samples/Search/queryhit.htm. Index server has been successfully installed.

Once you have installed Index Server, you naturally will be eager to test its functionality. If you already have an operational Web server with an interesting collection of content, all you need to do is wait. Index Server's operations are automatic, and the scanning, filtering, merging, and index creation process occurs without further intervention. Depending on the amount of content and the load placed on the server, you should allow anywhere from 10 minutes to several hours for the indexing process to produce useful results.

You can start by reviewing the online Index Server Guide. This is accessible by choosing Start, Program, Microsoft Index Server, Index Server Online Documentation (see Figure 21.3). Alternatively you can connect to this Web page by manually entering the URL (http://<servername>/srchadm/help/default.htm by default).

FIG. 21.3

Index Server includes online documentation in HTML format.

Once an appropriate period of time has elapsed, you are ready to try a search. Choose Start, Program, Microsoft Index Server, Index Server Sample Query Form (see Figure 21.4). This Web page is the sample query form that is discussed in the introduction to this chapter. It enables you to test the search capabilities of Index Server.

FIG. 21.4

The sample query form provided with Index Server is a working search page that can be used without modification or customized to meet specific needs.

To test your Index Server, follow these steps:

Open the sample query form as described in the preceding paragraph.
Enter a keyword or phrase that you believe will result in a successful query result. Click the Execute Query button.
The results of your query are returned and displayed in your browser (see Figure 21.5). In addition to listing the documents or pages that matched, a characterization, or short summary of the document, is automatically generated.

FIG. 21.5
The results of a query are formatted in HTML and returned to your browser for display.
Depending on the type of documents returned in your result set, the manner in which your browser is configured and the software installed on your computer, you may be able to click a hyperlink in the results and directly open a matching document. In Figure 21.6, a Microsoft Word document has been opened in the browser window.
FIG. 21.6
If properly configured, the Microsoft Internet Explorer (version 3.01 for Windows NT shown here) can host documents within its window, such as this Microsoft Word document.
Choose Go, Back, or click the Back toolbar button. This takes you back to the query results page. At the bottom of this page, you can click the New Query hyperlink to return to the query form (or, in this example, you could just click the Back button one more time).
You can continue refining your search by entering new queries until you find the document you want. Simply close your browser when you are done.

You are now familiar with the main procedures for using Index Server. By design, it is easy to use and manage. With the growth of most Web sites, indexing is becoming a critical feature that is needed to help people find the information they need. Index Server fills this important role with a minimum of work on your part. The next section discusses another member of the BackOffice family called the Content Replication System.

The Content Replication System

The rapid growth of the Internet, and particularly the World Wide Web, took most people by surprise. In a very short time, the Web has grown from an interesting novelty to a useful tool for communicating with the world. This is not to discount the important role it has played for years in the academic and defense communities. In these environments, which have large populations of highly trained individuals, the Internet has long been a widely used and powerful tool. But it wasn't until the invention of the graphical Web browser and the concurrent improvements in network and computer hardware that it gained wide appeal to "ordinary" people.

Thanks to a growing audience, the Internet has now become an excellent platform for organizations to communicate with potential constituents, customers, and other businesses (to the lament of some early Internet users). With this growth has come a rapid maturation in the process and tools used to build and manage Web sites and the content that drives them.

A natural metamorphosis has taken place in many organizations as the sophistication of their Web sites has evolved. Starting with a single interested experimenter, these efforts now may involve one or more Webmasters coordinating the work of graphic artists, musical and sound clip editors, animation specialists, and other special purpose content experts. What is more, all of this content must be organized, and (even worse) constantly updated. In order to cope with these challenges, Webmasters have begun to use some traditional tools for new purposes and to foster the development of an entirely new set of special purpose tools.

Source control systems, used in the past to coordinate the orderly interaction of a group of programmers working together on a body of computer code, have been pressed into duty to manage Web content. These systems not only enable users to "check out" and "check in" files, but also track revisions and even restore an older version if the latest becomes corrupted. Microsoft's SourceSafe is an example of this genre, and a white paper is available on the Microsoft Web site (http://www.microsoft.com/ssafe) describing the use of this product for Web content management.

A new entry in the Web management repertoire is the Microsoft Content Replication System (CRS). This product is designed to move Web content from one computer to another. There are a variety of scenarios in which the product can be used, and a number of methods that are supported. This section does the following:

Introduces the product in more detail by outlining some scenarios in which the product would be useful.
Describes how to set up CRS.
Walks you through a sample replication.

Although the product could be used to replicate arbitrary information, it is specifically engineered to be used in a Web server environment. For example, one option enables you to specify the content to be replicated by providing a Uniform Resource Locator (URL)óbasically, a Web address.

Content Replication Scenarios

In this section, a variety of scenarios involving the need for content replication are explored. In addition, the role that CRS can play is outlined in order to familiarize you with the type of work this product can perform. Because this is a relatively new product category that is much less familiar than others (e.g., word processing), it is important to spend a little time understanding how it is used before actually deploying it in a production environment.

If you have a modest Web site with under 50 pages managed by a single author, these scenarios will not reflect your environment. When your site grows to hundreds or thousands of pages with dozens of authors, some content management tools are necessary.

A simple example is presented first. Only two computers are involved in this scenario, although it could be a component in a much larger architecture. Figure 21.7 represents a Web content developer's desktop computer linked to a Web server. The content developer may be using a variety of tools to create Web pages and test them on a Peer Web Server, such as that provided by Windows NT Workstation. The tools used to create the pages is unimportant in terms of CRS.

FIG. 21.7

In this simple content replication scenario, Web content is pulled from a desktop computer to a server.

On the same network is a Web server: a Windows NT Server running IIS. The same server is also running CRS. This server would typically be a more powerful machine than the Web developer's desktop computer, and may be locked in a wiring closet for security reasons. Because of its greater power and security, it is a suitable platform for sharing information with a large number of concurrent users and, therefore, an appropriate target to place the Web content that has been developed.

At regular intervalsóperhaps every night or once a weekócontent is pulled (copied) from the developer's computer to the Web server. The only action required by the developer is to make sure he has placed finished Web pages at the designated link location specified in the replication. This could even be a simple "under construction" page, but it should never be content that yields errors. A partially finished page or series of pages that are actively being developed should be kept in a different location during construction. The developer can point his own browser at this temporary location for viewing and testing links, then copy to the pickup location when he is satisfied with the results.

All activities performed by CRS are managed as projects. Even a very complex architecture involving many servers and other computers in locations around the world can be broken down into a collection of projects. In this first scenario, all that is needed is a simple pull replication project. To create the project, the URL of the Peer Web Server being used by the content developer is provided and the destination to which the content should be copied. This location should be an active URL on the Web server that will act as the final destination of this content.

One of the key advantages of using CRS, even in this very simple scenario, is the capability to automate the administrative task of moving the content to the active Web server. The Web developer is presumably involved in this activity regularlyóperhaps even full timeóbut it is a nuisance to require intervention by an administrator in order to move the content to the server. However, it may be unacceptable to provide administrative access to the server to a group of Web developers. By automating the process, the content is moved to the right place, simply and securely, at a regularly designated interval.

The next scenario builds on the first, and demonstrates a situation in which CRS plays a more valuable role. In this example, shown in Figure 21.8, the content on a single Web server is replicated to three additional servers. Depending on the network architecture and links between the servers, it may be possible to copy the information simultaneously to the three servers at once. This is a feature of CRS that can be useful if the links between the servers are deemed reliable.

FIG. 21.8

Content from one server can be replicated to multiple target servers. This can occur simultaneously in some situations.

If the links between servers are subject to regular outages, a frame copy can be used. With this feature, content is broken down into frames and sent with error correcting protocols that can detect if a frame has become garbled during transmission. In addition, replication projects that are interrupted can be restarted at the point of disruption; they do not need to restart at the beginning as required when using standard file transfer protocols. This can be a critical feature if you are replicating very large files or a large collection of small files over the Internet or private WAN.

Figure 21.9 shows this scenario developed even further. The content is replicated from a single originating server to a staging server. From here, it is replicated over the Internet in a series of discrete projects to three other Web servers in different locations. This distribution is done for the purpose of providing content at a location near its intended audience. Although in theory any Internet user can reach any public server, reliability and throughput will be enhanced by local availability if you intend to serve a large population of users.

FIG. 21.9

A relatively sophisticated replication scenario involving multiple locations with multiple servers at each is shown here. A staging server is used as a distribution point.

At each of these locations, the content is further replicated to multiple local servers for the purpose of providing scalability and redundancy. If a single server fails, the site as a whole is still available due to the availability of one or more backup servers. In addition, the load can be distributed among the servers to provide improved responsiveness during peak access times.

The final scenario is an example involving a very large corporate Web site. In Figure 21.10, each department is responsible for creating Web content describing its products or services. A central group of Webmasters creates the primary home page (www.companyname.com) and manages the overall infrastructure of servers and replication projects.

FIG. 21.10

The maintenance of a large corporate Web site can be automated by using CRS.

Each department can be assigned a primary URL that corresponds to a virtual root on the corporate Web server. This location is referenced by a link from the main home page. It is also the target of a content replication project that moves content from a departmental staging server to the primary Web servers at regular intervals. The department's own Web content developers can build whatever structure they want within their own pages and can refer to other well-defined URLs in other departmental pages. If references are made to another department's content other than the virtual root, care must be taken to ensure that the URLs don't change without notification.

The primary administrative chore then becomes the maintenance of CRS projects. CRS lends itself to automatic monitoring through its support of Performance Monitor counters. Alerts can be created in Performance Monitor to notify the appropriate administrators by launching a batch file or application. You could use this mechanism to send an e-mail message or trigger a beeper.

There are additional features for automatic server and process monitoring built into SQL Server and Exchange Server that can be used to monitor the services that implement CRS, create entries in the Windows NT event log, and restart services or even reboot servers automatically. For more information about automating network administration tasks, see Chapter 50, "Proactive Network Administration."

Setting Up the Content Replication System

In this section, you learn how to install CRS on a server. If you intend to use the command-line interface exclusively, then any Windows NT server with sufficiently powerful components (processor, disk drive subsystem, and network interface) will suffice. CRS can be run on either Windows NT Server or Windows NT Workstation. Clearly, if the CRS system is also intended to act as a Web server (as opposed to just a staging server for information that is being moved), it must be running IIS.

Most people will want to take advantage of the CRS Web Administration tool, even if they occasionally use the command-line interface for auxiliary tasks or to confirm the status of a project. This tool is somewhat different from other BackOffice administration tools. It is Web browser-based, which a growing number of BackOffice products are adding, but this is still fairly new. Also, it uses a different style of buttons and controls than you may be used to if you manage other BackOffice products. You may also find it necessary to use the Refresh button to update the display more often with this tool than most others in the BackOffice administration suite, primarily because of its Web-based design. It is, however, relatively easy to learn and use.

NOTE: In order to use the Web Administration tool, you must install CRS on a Windows NT 4.0 system that has been configured to use the NTFS file system. In order to ensure security, the CRS Web Administration tool is not supported on disk drives configured with FAT partitions.

As with all BackOffice products, administrative tasks can be initiated from client computers connected to the server over the network. In order to run the CRS Web Administration tool on a client computer, you must be running either Microsoft Internet Explorer 3.01 (or later) or Netscape Navigator 3.0 (or later).

In order to administer CRS, you must be able to access the Web Administration tool through IIS. This is subject to IIS security restrictions as described in Chapter 18, "Building a Web With Internet Information Server (IIS)." In addition, you must be an administrator for the Windows NT system that runs CRS. You should also create a service account to run CRS services using the User Manager for Domains utility. For more information about Windows NT security, see Chapter 7, "Administering Windows NT Server."

To install CRS, follow these steps:

Insert the distribution media (usually a CD) into a drive on your Windows NT computer. Find the Setup program and launch it. You will see the dialog box shown in Figure 21.11. Click the OK button.

FIG. 21.11
The InstallShield Self-extracting EXE dialog box begins the installation of CRS.
After another intermediate dialog box that requires no response (but simply updates you on the preparation process), you are presented with the Welcome dialog box for the Setup program (see Figure 21.12). Click the Next button.

FIG. 21.12
This is the Welcome dialog box for the Content Replication System.
A license agreement dialog box is presented next. Carefully read the license and click Yes to accept the license agreement. The next dialog box displays the names of any services that must be stopped in order to complete the installation (see Figure 21.13). Click Yes to stop the services and continue installing CRS.

FIG. 21.13
This dialog box lists the Web-related services that must be stopped before the installation process can continue.
A series of messages informing you about services being stopped follows; then, a Registration dialog box is displayed. Enter your Name, and your Company (if appropriate). Click the Next button.
You are now prompted to select the directory in which you want CRS installed. Select the directory you want to use, or accept the default (which is automatically created), and click the Next button.
The Web Administration and Document Directory dialog box asks you for the virtual root on your Web server that will contain the Web Administration tool and online documentation (see Figure 21.14). Click the Next button.

FIG. 21.14
The Web Administration and Document Directory dialog box is used to select the location for the Web Administration tool and the product documentation.
You must now choose the setup type, either Typical or Custom. If you select Typical, the Setup program will install standard components without further input. If you select Custom, the dialog box shown in Figure 21.15 is displayed.

FIG. 21.15
The Select Components dialog box is used to customize your CRS installation.
You can select those elements of the system you would like to install, and ensure that adequate disk space is available. When you are satisfied with your selections, click the Next button. The CRS Service Account dialog box is displayed (see Figure 21.16).

FIG. 21.16
Use the CRS Service Account dialog box to select the service account that will be used by CRS.
The service account establishes the security context that the CRS service will have during its execution. Select the account you created earlier (or task switch to User Manager for Domains and quickly create the account). Click the Next button to continue.
The Schedule Service Account dialog box is presented. This is typically the same account used for the CRS service. Select an appropriate account and click the Next button to continue.
Select the folder to contain the CRS shortcut icons to start the Web Administration tool and review online documentation. Either choose a folder or accept the default.
A confirmation dialog box gives you an opportunity to review all your choices. Click Next to begin the actual installation. When setup is complete, a final dialog box offers you an opportunity to view the CRS Start Page. Click the Finish button to complete setup and view the Start Page (see Figure 21.17).

FIG. 21.17

The CRS Start Page provides an introduction to CRS and links you to the Web Administration tool.

You are now ready to use CRS to move Web content on your network. This is described in the next section.

Using CRS to Move Web Content

There are two interfaces that can be used to control CRS. The first is a Web browser-based interface that uses JScript applications to create and monitor the status of projects. The second is a comprehensive command-line interface. Both of these interfaces are described in this section.

Remember that all replication events are managed as projects óno matter how complicated an architecture you want to create. These projects operate on pairs of servers, either pulling or pushing information from one server to the other. Even large replication architectures spanning global networks are based on this simple concept. In the next two sections you will learn how to create both push and pull projects, the basic ingredients for all replication scenarios.

It is a good idea to experiment with CRS in a lab environment before deploying it on your production servers. This is generally true of all BackOffice products, but can be especially important with a product like CRS that is capable of moving very large amounts of information, and consequently having a big impact on network bandwidth and server performance. When it comes time to implement CRS on your production environment, you should have a good idea what will happen, based on the tests you have performed. You want to know how long a typical operation will take, the best time of day to perform that operation, and the impact a replication project will have on active Web users.

Pull Projects

A pull project is designed to connect to a source URL (a Web address) and copy the content is finds there to a specified target directory on the server that is running the pull project. The server "pulls" the content from the source, hence the name.

You can either request that all content at the source be pulled, or only a specified number of levels be copied. If you specify a limit of two levels, for example, the initial page (typically default.htm or index.htm, depending on how the server is configured) will be copied, and the links on that page will be followed. The content of the linked pages will be copied, and the links contained in them will be followed and copied. This is what is meant by two levels deep.

To create a pull project using the Web Administration tool, follow these steps:

Launch the Web Administration tool. This can be done by using the icon from Start, Programs, or directly by entering the URL into your browser (http://servername/crs/admin/crs.pgi by default). This opens the CRS Web Administration tool's initial page (see Figure 21.18).

FIG. 21.18
This figure depicts the first page of the Web Administration tool for CRS, a browser-based administration utility that lets you manage Servers, Projects, Routes, or configuration Settings.
Click the Add Project link.
An Add Project dialog box is displayed (see Figure 21.19). Enter a Name for the project, and select the type (Pull replication in this example). Click the OK button.

FIG. 21.19
The Add Project dialog box enables you to create a project and select the type of replication that you want to use.
The Project Source/Target page is displayed (see Figure 21.20). Enter the Source URL from which content will be pulled, then enter the number of levels the replication should pull (the depth of links that it should follow). Click the All button to pull the entire Web site from that virtual root.

FIG. 21.20
The Source/Target page is used to specify the source and destination for your project.
Enter the Target directory. This is the fully qualified path name on the local server (the server that is running this CRS project). You can optionally enter a user name, password, and proxy server name if required to access the source server. Click Save, and then click the Projects link to return to the main Projects page. Click the Schedule link. This presents the Schedule page.
Select the Replicate Automatically button, or click Add Schedule to add a specific scheduled date and time (see Figure 21.21).

FIG. 21.21
The Add Schedule dialog enables you to specify when the replication occurs. Select the days of the week, and the time of day.
Select the schedule for this project and click the OK button to return to the Schedule page and continue.
Click the Email/Scripts link. This will present the Email/Scripts page (see Figure 21.22). E-mail notification can automatically be sent on success or failure of a replication event, or every time the event executes. You can also run a script before any content is received on the server, or after it is all received. This script may be used for notification, for routine maintenance chores, or other application specific purposes. Click Save, and then click the Projects link again.

FIG. 21.22
The Email/Scripts page is used primarily for notification purposes
You should now see your project added to the list on the Projects page (see Figure 21.23). It should have a status of Idle. Click the Idle link.

FIG. 21.23
The Projects page will list all active projects, and show their current status. You can remove a project by clicking the waste can icon to the left of the project's name.
A Project Status dialog box will be displayed (see Figure 21.24). This dialog can be used to start a project immediately, regardless of the scheduled times that may have already been set for this project. Click Start if you want to test your new project. The State displayed on the dialog box will be updated from Idle to Running to reflect the new status of the system.

FIG. 21.24
In addition to starting a project, the Project Status dialog box can also be used to stop an executing project, or to rollback (undo) the effects of an earlier replication.
Click OK when you are done with this project. You can enter additional projects to monitor, or exit the CRS Administration tool by choosing File, Close from the menu.

Push Projects

A push project operates in a reverse manner from a pull project. It is initiated on the source server, and "pushes" content to one or more destination servers. In addition, a push project must be defined on both the source and destination servers, although only the source server is configured for a target (destination).

To create a push project using the Web Administration tool, follow these steps:

Launch the Web Administration tool. This can be done by using the icon from Start, Programs, or directly by entering the URL into your browser (http://servername/crs/admin/crs.pgi, by default). This opens the CRS Web Administration tool's initial page (refer to Figure 21.18).
Click the Add Project link. An Add Project dialog box is displayed.
Enter a Name for the project, and select the type (Push replication in this example). Click the OK button.
The Project Source/Target page is displayed (see Figure 21.25). Enter the Project Directory from which content will be pushed, and then select an option button to indicate which subdirectories (if any) should be copied with the main project directory. Click the Add Target link.

FIG. 21.25
The Source/Target page for a push project is used to specify the project directory to be replicated, whether or not to include subdirectories, and to add Targets (destinations) for the push event.
The Add Target dialog is displayed (see Figure 21.26). Enter the name of the Target machine (e.g. HQSRV2). You can also specify that a route (a predefined chain of linked servers) should be used as the target, although this is beyond the scope of this book. Click OK and then Save. You can now click the Projects link to return to the main Projects page.

FIG. 21.26
The Add Target dialog box is used to specify the target of this replication event. Only one machine can be specified in a target, but multiple targets may be defined for the same project.
You should now see your project added to the list on the Projects page. It should have a status of Idle.
In order to complete the configuration of the push replication project, you must repeat steps 2 and 3 on the target server. Click the Servers link and enter the name of the destination server in the Select Server dialog box. You will be connected to the destination server, and any currently defined projects will be displayed.
Click the Add Project link. An Add Project dialog box is displayed.
Enter the same project name that you used on the originating server, and a project directory that will receive the replicated contentóbut you need not specify a target server since this server is the final destination.
Click Save and return to the Projects page to verify that the new project is visible.
You have now completed the configuration of the push project. In order to start the project, you should once again connect to the originating server, go to the Projects page, and click the Idle link to open a Project Status dialog. Then click the Start button to start the project.
Click OK when you are done with this project. You can enter additional projects to monitor, or exit the CRS Administration tool by choosing File, Close from the menu.

This brief introduction to the Content Replication System should give you a solid basis on which to build additional knowledge and experience with this new member of the Microsoft BackOffice family. This product is only in its first release, and it will undoubtedly continue to be enhanced with additional features and capabilities. With the growing use of the Web, and the challenges inherent in managing the large volume of content necessary for a great Web site, the importance of this type of tool will multiply.

From Here...

This chapter presented the features available in two Microsoft BackOffice family components: Index Server and the Content Replication System. It provided detailed information on installing, configuring, and using these components to help manage your Web site. For more information about the topics addressed in this chapter, see the following chapters:

For information on other Internet components available for Microsoft BackOffice, see Chapter 16, "The BackOffice I-Net Toolbox."
For information on using the Internet and building internal intranets to serve the needs of your organization, see Chapter 17, "I-Net Tools and Techniques."
For information on Internet Information Server, see Chapter 18, "Building a Web with Internet Information Server (IIS)."
To learn about client applications that let you access the Internet's World Wide Web, see Chapter 19, "Web Browsers."
For information on creating content for your IIS servers, see Chapter 20, "Using Microsoft FrontPage 97."
To learn about Proxy Server to improve I-net access performance and enhance the security of your IIS site, see Chapter 22, "Implementing Microsoft Proxy Server."
To explore how the Internet can be used in business, see Chapter 47, "Building BackOffice Applications." in Special Edition Using Microsoft BackOffice, Volume 2