Chapter 28

Connecting the Intranet and the Internet


Sooner or later, you are going to want to take advantage of what you now know about building an internal Web to help your organization establish a presence on the World Wide Web. Whether to help sell your products, share research, publish news, market software, provide customer support, or solicit customer feedback, a site on the WWW can help your organization further its goals.

Until recently, few people thought it was possible to run a decent Web server on Windows NT. I hope this book has convinced you, if it has done nothing else, how untrue that myth is. In fact, I have found Windows NT to be an excellent Web platform in terms of performance, cost, and ease of use. And if you want to build a Web site without spending years learning UNIX system administration, this chapter can help you get started.

The aim of this chapter is to cover a few assorted topics that you will likely come across when you connect your Intranet to the Internet. This chapter is by no means complete; whole books are written on the subject of how to build an Internet server, but few of those are devoted to Windows NT Web sites. One in particular is Web Site Construction Kit for Windows NT, written by Christopher L. T. Brown and yours truly. That book provides a complete treatment of all aspects of building an Internet server with Windows NT 4.

This chapter will address using NT and a modem as a router, how to deal with Internet robots, and why you will want to consider building a firewall between your Intranet and the Internet. Most of the material in this chapter is not about running an Internet Web site; rather, it is about connecting your LAN to the Internet in the first place.

Configuring Windows NT as a Router

Let's say you want to connect a small Local Area Network (LAN) to the Internet using Windows NT, Remote Access Service (RAS), and a modem. Your Intranet customers are telling you that they all want access to the World Wide Web for research and software downloading, but you would like to avoid buying separate modems and phone lines for everyone. If you use Windows NT as a router, you only need to buy one modem, one connection to the Internet Service Provider (ISP), and one extra phone line. Of course, bandwidth is a consideration, but you can always scale up to ISDN, Frame Relay, or T1 as your needs dictate. Let's be very clear about this: using a 28.8 Kbps RAS modem connection to the Internet does not make for a high-performance router.

If the idea behind this sounds appealing to you, be prepared for the fact that the path to success can be strewn with land mines. This section will present a summary of the steps that I took to make this work on my small LAN. The ideas presented will reflect my experiences, but they are based on advice from the Windows NT Webserver mailing list, conversations with experts (including my ISP), and assorted books and magazine articles (listed in the Bibliography).

Here is an overview of the various tasks that await you as you connect your LAN to the Internet:

  1. Set the Registry parameters that allow Windows NT to route between the LAN and the RAS connection.
  2. Assign IP addresses and Default Gateway addresses to the NICs on the LAN.
  3. Choose subnets and subnet masks so that the RAS connection and the LAN sit on separate subnets.
  4. Add static TCP/IP routes on the NT Server for each workstation.
  5. Establish the RAS dial-up connection to your ISP.

Once you have accomplished all this, you are ready to surf the Web from the client computers as well as from the server.

Windows NT Registry Settings for the Router

Based on documentation from various sources, including the Microsoft TechNet CD, the following Registry parameters are a superset of what is strictly required for NT to function as a router between the LAN and the Internet. The reason I say "superset" is that your particular situation will determine whether or not you will need all of these. Future versions of NT may introduce other parameters enabling further optimization. Experimentation may be necessary to achieve the best results. All of the entries are documented in the Windows NT Resource Kit.

Now it's time to fire up RegEdit. To avoid repeating a large portion of the name of each value, I will tell you up front that they are all located under the following key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services

The following values can be found underneath this Services key. If they are not already present, you will need to add them as follows:
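Drawing on the TechNet and Resource Kit documentation mentioned above, the three values most often cited for RAS-based routing are sketched below. All are REG_DWORD entries; confirm each one against your own documentation before relying on it.

Tcpip\Parameters\IPEnableRouter = 1
Turns on IP forwarding, allowing packets to move between the NIC and the RAS connection.

RasArp\Parameters\DisableOtherSrcPackets = 0
Permits packets whose source address belongs to a LAN client, rather than to the server itself, to be sent out over the RAS link.

RasMan\PPP\IPCP\PriorityBasedOnSubNetwork = 1
Routes traffic over the RAS link even when the LAN adapter and the RAS connection share a network number.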

Assigning Default Gateway Addresses on the LAN

The next trick is to assign one IP address to your RAS dial-up connection and a different IP address to each of the NICs on your LAN, including the Windows NT Server. It is very likely that you have already dealt with this issue as part of the overall configuration of TCP/IP on your Intranet.

Once you have assigned a unique IP address to each NIC on the LAN, set the Default Gateway on each of the client machines to the address of the NIC in the NT router. The Default Gateway of the NIC on the NT Server itself should be left blank. These steps are very important, so make sure that you apply them accurately. For reasons that I will get into in a moment, you must not take a shortcut by sending the client machines' packets straight to the IP address of the RAS connection; the Default Gateway on the clients must be the NIC on the NT Server.
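To make this concrete, suppose the LAN uses the illustrative 200.9.9.x addresses that appear again later in this chapter. The settings would look like this:

Machine             IP Address        Default Gateway
NT Server NIC       200.9.9.1         (left blank)
Client 1 NIC        200.9.9.2         200.9.9.1
Client 2 NIC        200.9.9.3         200.9.9.1
RAS connection      assigned by ISP   (see below)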

It might help if we develop a mental picture of what's going on here. Each machine on the LAN needs to be told where to send outgoing packets so that they will find their way out to the Internet. Otherwise, outbound packets would have no way to get off the LAN. The NT Server with the RAS connection is the gateway (or router). Since the NT Server is a dual-homed host (supporting a NIC and running RAS), packets from the NICs on the LAN that are sent to the IP address of the NIC on the NT Server will be routed by the NT box from its NIC over to its RAS connection. This will be accomplished within the TCP/IP software running in Windows NT. From there, the packets will travel through the modem to the ISP and on out to the Internet.

The Default Gateway of the RAS TCP/IP connection on the NT Server appears to be non-configurable in NT 4, but that is okay because RAS conveniently seems to assume that its Default Gateway is on the network it is dialing into, namely your ISP. For example, if you type ipconfig after establishing a RAS connection to your ISP, you will see that the Default Gateway is the same as the IP address assigned to RAS.
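Here is a hypothetical example of that output; the adapter names and addresses are illustrative only. Note that the RAS adapter lists its own IP address as its Default Gateway, while the Ethernet adapter's Default Gateway is left blank:

Ethernet adapter Elnk31:

        IP Address. . . . . . . . . : 200.9.9.1
        Subnet Mask . . . . . . . . : 255.255.255.0
        Default Gateway . . . . . . :

NdisWan adapter NdisWan4:

        IP Address. . . . . . . . . : 206.1.2.3
        Subnet Mask . . . . . . . . : 255.255.255.0
        Default Gateway . . . . . . : 206.1.2.3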

Assigning Subnets

Subnets can be a rather mysterious area of TCP/IP. The most important point about subnetting, as it relates to this discussion, is that the IP address of the RAS connection must be in a separate subnet from the IP address of the NIC in the NT Server. That is why the clients cannot list the RAS IP address as their Default Gateway: technically, they can't see it.

TCP/IP Addressing
For all TCP/IP issues great and small I highly recommend Internetworking with TCP/IP, Volume I, Third Edition, by Douglas E. Comer as an excellent permanent reference book. There is no way I can possibly give a complete treatment of IP addressing here, so I'll just take a stab at the basics. A Class C TCP/IP subnet can contain up to 254 separate IP addresses. The addresses are of the form X.A.B.C, where X, A, and B are fixed by the ISP, and C ranges between 0 and 255 for each machine on your LAN. For a Class C address, the X in the above address will always be between 192 and 223. X numbers less than 128 are Class A addresses. X numbers between 128 and 191 are Class B addresses.

Each of the numbers in a TCP/IP address consists of 8 bits, providing 256 possibilities, ranging between 0 and 255 in decimal. By convention, any TCP/IP subnet reserves the address with all zeros to mean the address of the network itself, and the address with all ones to mean the broadcast address. That is why each subnet may contain two fewer computers than the size of the subnet.

If you can afford to buy one IP address for your RAS connection and a complete (separate) Class C address for the rest of your LAN, you will have a very easy time connecting your LAN to the Internet. In that case, you will simply use a subnet mask of 255.255.255.0 for each IP address on your LAN. The subnet mask for RAS would also be 255.255.255.0 because it would already be on a different network segment.

If you want to perform a bit pattern subnet, things definitely get more involved. Let's suppose you have six client computers to connect to the Internet. You might think that you can get by with a subnet-8 account, as leased by many ISPs. But remember, you need one IP address for RAS, a different IP address for the NIC in the NT Server, and the all zeros and all ones addresses on the subnet will be skipped by convention (although in certain cases you can violate that rule).

In other words, you need a subnet big enough to hold a total of ten IP addresses. At this point you would think that all you need is a subnet-16, again an easy thing to lease from an ISP. Unfortunately, it's not even that simple. Remember, the RAS IP address must be in a different subnet. Each bit you use for a subnet divides the range of usable addresses in half.

You could divide the subnet-16 in half. The lower half (addresses 0 through 7) could serve the RAS side: address 0 is the subnet address itself, address 1 goes to RAS, address 7 is the broadcast address, and addresses 2 through 6 are simply wasted. You would then give the upper half (addresses 8 through 15) to the six client machines, skipping 8 and 15 as per convention. That would work except that we have left out the NIC in the NT Server. Therefore, you would either need to step up to a block of 32 addresses or you would need to drop one client machine.
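To make the split concrete, here is how a hypothetical subnet-16 spanning 200.9.9.0 through 200.9.9.15 would be carved up (the addresses are illustrative):

200.9.9.0         subnet address of the lower half (skipped)
200.9.9.1         RAS connection
200.9.9.2-.6      unused (wasted)
200.9.9.7         broadcast address of the lower half (skipped)
200.9.9.8         subnet address of the upper half (skipped)
200.9.9.9-.14     the six client machines
200.9.9.15        broadcast address of the upper half (skipped)

Each half is a block of eight addresses, which corresponds to a subnet mask of 255.255.255.248; the low-order three bits of the fourth octet select the host. As noted above, that leaves nothing for the NIC in the NT Server.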

Whatever method you use, your ISP should be able to help you establish the bit pattern for the subnet mask on the clients. If you can't afford a full Class C address for the LAN, the fourth octet of the subnet mask becomes a binary number whose low-order zero bits span the number of addresses needed. The subnet mask on the RAS connection can simply be a Class C mask in all cases: 255.255.255.0. I believe this is because RAS always behaves as a LAN-to-WAN router, possibly allowing it to ignore the subnet mask.

Adding Static TCP/IP Routes for the Workstations

There is just one more step to get your LAN packets to travel out to the Internet. Actually, the purpose of this step is to let the data packets returning from the Internet find their way from the RAS connection on the NT Server over to the client machine that initiated the transaction.

You need to add static routes to the route table on the NT Server. This is done at the DOS command prompt using the route add command.

I'll explain this by example. Suppose you have given the NIC in the NT Server an IP address of 200.9.9.1 and the NIC address of the first client machine is 200.9.9.2. The procedure must be duplicated for each client machine.

The following command would provide a static route from the NT Server to the first client:

route -p add 200.9.9.2 200.9.9.1

The -p switch adds this as a persistent route; note that it appears before the add keyword. NT stores persistent routes in the Registry so they are retained across subsequent reboots.
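For example, a LAN with three clients at 200.9.9.2 through 200.9.9.4 would need one command per client, and you can confirm the results afterward with the route print command:

route -p add 200.9.9.2 200.9.9.1
route -p add 200.9.9.3 200.9.9.1
route -p add 200.9.9.4 200.9.9.1
route print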

Finally, if you find that your client machines can ping Internet addresses by TCP/IP number only, but DNS name resolutions and Web browsing do not work, it could be that your ISP has not provided static routes on their end to pass network traffic into your RAS connection when the destination address is one in your subnet.

Internet Robots

World Wide Web robots, sometimes called wanderers or spiders, are programs that traverse the Web automatically. A robot's job is to retrieve information about the documents that are available on the Web and then store that information in some kind of master index of the Web. Usually, the robot is limited by its author to hunt for a particular topic or segment of the Web.

At the very least, most robots are programmed to look at the <TITLE> and <H1> tags in the HTML documents they discover. Then they scan the contents of the file looking for <A HREF> tags to other documents. A typical robot might store the URLs of those documents in a data structure called a tree, which it uses to continue the search whenever it reaches a dead-end (more technically called a leaf-node). I am oversimplifying this a bit; the larger robots probably use much more sophisticated algorithms. But the basic principles are the same.
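To make the traversal loop concrete, here is a minimal sketch in Perl (the language introduced in the next chapter), assuming the LWP::UserAgent and HTML::LinkExtor modules from CPAN are installed. It uses a simple queue rather than a tree, the robot name ExampleRobot and the seed URL are hypothetical, and a well-behaved robot would also honor robots.txt before fetching anything:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my @queue = ('http://www.yourco.com/');    # hypothetical seed URL
my %seen;                                  # URLs already visited
my $ua = LWP::UserAgent->new(agent => 'ExampleRobot/0.1');

while (my $url = shift @queue) {
    next if $seen{$url}++;                 # skip anything seen before
    my $res = $ua->get($url);
    next unless $res->is_success and $res->content_type eq 'text/html';

    print "Visited: $url\n";               # a real robot would index <TITLE> and <H1> here

    # Collect <A HREF> links, resolved to absolute URLs against this page.
    my $extor = HTML::LinkExtor->new(undef, $url);
    $extor->parse($res->decoded_content);
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        push @queue, $attr{href} if $tag eq 'a' and defined $attr{href};
    }
    sleep 1;                               # be polite: pause between requests
}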

The idea behind this is that the index built by the robot will make life easier for us humans who would like a quick hop to information sources on the Internet.

The good news is that most robots are successful at this and do help make subsequent search and retrieval of those documents more efficient. This is important in terms of Internet traffic. If a robot spends several hours looking for documents, but thousands (or even millions) of users take advantage of the index that is generated, it will save all those users from tapping their own means of discovering the links, potentially saving a great amount of network bandwidth.

The bad news is that some robots inefficiently revisit the same site more than once, or they submit rapid-fire requests to the same site in such a frenzy that the server can't keep up. This is obviously a cause for concern for Webmasters. Robot authors are as upset as the rest of the Internet community when they find out that a poorly behaved robot has been unleashed, but usually such problems are found in only a few poorly written robots.

Fortunately, guidelines have been developed for robot authors, and most robots are compliant. For an excellent online resource about robots, on which much of this material is based, see "World Wide Web Robots, Wanderers, and Spiders" by Martijn Koster at http://info.webcrawler.com/mak/projects/robots/robots.html. It contains links to documents describing robot guidelines, the standard for robot exclusion, and an in-depth collection of information about known robots.

Tip
The Internet community puts up with robots because robots give something back to all of us. A private robot, on the other hand, is one that you might customize to search a limited realm of interest to you or your organization. Private robots are frowned upon because they consume Internet resources but offer value to only a single user in return. If you are looking for your own Internet robot, however, you can check out the Verity Inc. home page at http://www.verity.com/. Please remember that one of the guidelines of robot design is to first analyze carefully whether a new robot is really called for.

A good understanding of Web robots and how to use or exclude them will aid you in your Web ventures; in fact, it could help to keep your server alive.

Excluding Robots

There are lots of reasons to want to exclude robots from visiting your site. One reason is that rapid-fire requests from buggy robots could drag your server down. Also, your site might contain data that you do not want to be indexed by outside sources. Whatever the reason, there is an obvious need for a method for robot exclusion. Be aware, however, that it wouldn't be helpful to the Internet community if all robots were excluded from all sites.

On the Internet Web-related newsgroups and listservers, you will often see a new Web site administrator ask the question "What is robots.txt and why are people looking for it?" This question often comes up after the administrator looks at his or her Web access logs and sees the following line:

Tue Jun 06 17:36:36 1995 204.252.2.5 192.100.81.115 GET /robots.txt HTTP/1.0

Knowing that they don't have a file robots.txt in the root directory, most administrators are puzzled.

The answer is that robots.txt is part of the Standard for Robot Exclusion. The standard was agreed to in June 1994 on the robots mailing list (robots-request@webcrawler.com) by the majority of robot authors and other people with an interest in robots. The information on these pages is based on the working draft of the exclusion standard, which can be found at this URL:

http://info.webcrawler.com/mak/projects/robots/norobots.html

Some of the things to take into account concerning the Standard for Robot Exclusion are:

  1. It is not an official standard backed by a standards body, and it is not enforced; compliance is voluntary on the part of each robot author.
  2. There can be only one robots.txt file on a server, and it must reside in the document root, so individual document owners cannot maintain their own exclusion lists.
  3. The file is publicly retrievable, so it also advertises the very areas of your site you would rather not have explored.

In addition to using the exclusion described below, there are a few other simple steps you can follow if you discover an unwanted robot visiting your site:

  1. Check your Web server log files to detect the frequency of document retrievals.
  2. Try to determine where the robot originated so that you can contact the author. You can identify the author by looking at the User-agent and From fields in the request, or by looking up the host domain in the list of robots.
  3. If the robot is annoying in some fashion, let the robot author know about it. Ask the author to visit http://info.webcrawler.com/mak/projects/robots/robots.html so he or she can read the guidelines for robot authors and the standard for exclusion.

The Method

The method used to exclude robots from a server is to create a file on the server that specifies an access policy for robots. This file must be named robots.txt, and it must reside in the HTML document root directory.

The file must be accessible via HTTP, with the contents as specified here. The format and semantics of the file are as follows:
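The file consists of one or more records, separated by blank lines, and each record is made up of lines of the form field: value. Two fields are defined:

User-agent: the name of the robot to which the record applies. A value of * means the record applies to any robot not matched by a more specific record.

Disallow: a partial URL that is not to be visited. Any URL beginning with this value will not be retrieved. For example, Disallow: /help excludes both /help.html and /help/index.html, whereas Disallow: /help/ excludes /help/index.html but not /help.html.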

An empty Disallow value indicates that all URLs can be retrieved, and at least one Disallow field needs to be present in a record. The presence of an empty /robots.txt file has no explicit associated semantics; it will be treated as if it were not present, meaning all robots will consider themselves welcome.

Examples

Here is a sample robots.txt for http://www.yourco.com/ that specifies no robots should visit any URL starting with /yourco/cgi-bin/ or /tmp/:

User-agent: *
Disallow: /yourco/cgi-bin/
Disallow: /tmp/

Here is an example that indicates no robots should visit the current site:

User-agent: *
Disallow: /
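
A record can also single out one robot by name. For example, to turn away a hypothetical robot called BadBot while leaving the site open to all others:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: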

Firewalls

If you intend to maintain an Internet connection and you truly want a secure site, you should consider getting firewall protection. A firewall can be software, hardware, or a combination of the two. Commercial firewall packages cost a lot more than loose change; prices usually range anywhere from $1,000 to $100,000. If you are using NT, RAS, and a modem as a software-based router, the first step in building a firewall is to change from a RAS-based connection to an Ethernet/router hardware-based connection.

Note
For a much more thorough treatment of firewalls and Internet security, please see Internet Firewalls and Network Security by Karanjit Siyan and Chris Hare, published by New Riders Publishing.

There aren't yet many software-only firewalls for Windows NT, as most are based on UNIX. In the meantime, you might consider running a freeware version of UNIX for the purpose of including a firewall in your network.

Note
If the cost of a firewall has you worried, consider a much cheaper and more secure solution. Namely, avoid connecting your LAN to the Internet Web server altogether. Obviously, there are drawbacks to this approach, but it is the most secure approach. For one thing, it assumes your client machines won't need a connection to browse the Web. For another, you have to use sneaker-net (hand-carried floppy disks) to modify your HTML files on the Web server. This inconvenience can be reduced by physically reconnecting the network cable to the back of the Web server machine for a limited time during the day when you need to access it.

A firewall usually includes several software tools. For example, it might include separate proxy servers for e-mail, FTP, Gopher, Telnet, Web, and WAIS. The firewall can also filter certain outbound ICMP (Internet Control Message Protocol) packets so your server won't be capable of divulging network information.

Figure 28.1 shows a network diagram of a typical LAN connection to the Internet including a Web server and a firewall. Note that the Web server, LAN server, and firewall server could all be rolled into one machine if the budget is tight, but separating them as shown here is considered a safer environment.

Figure 28.1: Using a firewall/proxy server on a LAN.

The proxy server is used to mask all of your LAN IP addresses on outbound packets so they look like they all originated at the proxy server itself. Each of the client machines on your LAN must use the proxy server whenever they connect to the Internet for FTP, Telnet, Gopher, or the Web. The reason for doing this is to prevent outside detection of the structure of your network. Otherwise, hackers monitoring your outbound traffic would eventually be able to determine your individual IP addresses and then use IP spoofing to feed those back to your server when they want to appear as a known client.

Another purpose of a firewall is to perform IP filtering of incoming packets. Let's say that you have been monitoring the log files on your Web server and you keep noticing some unusual or unwanted activity originating from IP address x.3.5.9. After checking with the whois program (available on the Internet, for example, at http://www.winsite.com), you determine the domain name is bad.com, and you don't have any known reason to be doing business with them. You can configure the IP filter to block any connection attempts originating from bad.com while still allowing packets from the friendly good.com to proceed.

Caution
Many people think IP packet filtering is worthless if implemented only in software. They may advise you that packet filtering is useful only in a router or a hardware firewall solution with the capability to filter at the Link Layer, as opposed to the Network or Transport layers where the TCP/IP software operates. However, even if you do packet filtering via software, the trick is to filter on both the source IP address and the interface. That is, if a packet bearing one of your internal IP addresses as its source shows up on the external interface of your router (that is, arriving from the Internet side), you should drop the packet and log it for immediate attention.

Summary

This chapter has given you a brief taste of some of the issues involved in connecting the LAN to the Internet. This has by no means been a comprehensive treatment of the subject. I hope that it has been enough, however, to get you started with building your own site on the World Wide Web or with providing Internet access for all customers who need to go beyond the Intranet.

In the next and final chapter I will continue with the idea of building your own Web site by showing you a very handy Perl script that can be used to collect statistics of your visitors. When you think about it, you could even use the script on the Intranet. The chapter also provides a quick introduction to the Perl language and discusses other applications to which a Webmaster can apply Perl.