CGI can do a lot, but not everything. Not all data resides on the local server or even on a UNIX box. Sometimes it is necessary to give the user an IOU for the data and then go produce it during off-peak hours or use the resources of other computers that are not always available. This chapter describes how to build batch jobs and submit them from CGI scripts.
This chapter describes how to run batch jobs on UNIX computers and also shows how to submit a batch job over a TCP/IP network to a midrange computer like an AS/400. These same principles can be used to send remote jobs over other networks, like those based on the LU 6.2 protocol, and to other computers, like mainframes running the MVS operating system.
Once upon a time there were no interactive terminals and there was no network. If you wanted a computer to do work for you, you prepared a deck of 80-column Hollerith cards, presented them to the high priests of the computer room, and went away and waited, and waited, and waited. Several hours later, you would find the printout from your "job" in a pigeon-hole next to the computer room and you could see whether everything had run as you expected.
Today, of course, we're very sophisticated. If we want the computer to do work for us, we tell it (with commands, or menu choices, or maybe through an HTML form) what to do, present it directly to the computer, and wait, and wait, and wait. Oh, sometimes we don't have to wait very long. But sometimes we do.
Some tasks are inherently slower than we would like them to be, and all the indications are that our expectations for speed will grow faster than the computer engineers' ability to produce faster computers. Some tasks don't have to be done in a hurry. Lots of back-office functions, like posting accounts receivable, can be run late at night when the system is lightly loaded.
On the Web, we've gotten used to having information at our fingertips. Every now and again, however, we find some process that takes so much time that we just can't ask the user to wait or we don't need the user to wait. In these cases, it makes sense to invoke batch processing. We work with the user to prepare a batch job and then submit the job and let the user go on about his business. If the user is expecting a response, we can send it to him by e-mail or put it in an FTP directory to be accessed later.
The well-known ratings company A.C. Nielsen has made good use of this technique because its product database is enormous. Queries can take many minutes, so Nielsen encourages its subscribers to log in to the site and request a report. Nielsen runs the report and places it in an FTP location from which the user can pick it up. Nielsen's site is located at http://www.nielsen.com/. Figure 12.1 shows a typical Nielsen query form. Figure 12.2 shows the response.
Figure 12.1 : Nielsen uses a query form to start the batch process.
Figure 12.2 : The batch job produces a detailed report which can be picked up by FTP.
Most large mainframes and mid-range computers have excellent batch-processing facilities. After all, these machines were developed primarily to service back-office functions such as accounting. IBM mainframes have an elaborate Remote Job Entry (RJE) facility. IBM AS/400 computers support multiple batch queues. A large part of the job of late-night computer operators is to keep jobs moving through the queues and to handle errors in real-time.
UNIX's batch-processing system is not nearly as sophisticated as those on mainframes or AS/400s. On some versions of UNIX, it's virtually nonexistent. In this section, we describe the batch command available in many versions of UNIX, particularly those descended from AT&T's System V.4. We also present the utilities at and cron, for those users whose version of UNIX does not support the batch command.
Users will never be as happy waiting for their output as they will be getting it right away. If the job is likely to complete quickly, just send the data back to the user. Recall that most users will not wait more than a few seconds for a response to a request. To find out how long a user is likely to wait, run the script from the command line and time it. Depending on the version of UNIX, the time or timex commands will give the desired information. For example,
export REQUEST_METHOD=POST
export CONTENT_LENGTH=1024
time /usr/local/etc/httpd/cgi-bin/bigjob.cgi < bigjobInput.dat > /dev/null
will return this information
real 0m1.34s
user 0m0.94s
sys 0m0.40s
The top line, real 0m1.34s, reports how long it took the program to complete in real-world time. The next two lines show where the time went. Don't worry about the division of time here. Tuning the sort of programs that are likely to consume large amounts of time (for example, database queries or large simulations) is beyond the scope of this book. If the user and sys times do not add up to something very close to the real time, the script spent the difference waiting, most likely because the system is already heavily loaded and other processes had to run first.
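The comparison of user plus sys against real can be reduced to a single number. The following is a minimal sketch, not part of the original scripts: the function names to_seconds and cpu_share are our own, and the sketch assumes the 0m1.34s format shown above (some versions of time print a different format).

```shell
#!/bin/sh
# Convert a time value such as 0m1.34s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Print the percentage of real time the script spent on the CPU.
# Arguments: the real, user, and sys values reported by time.
cpu_share() {
    real=$(to_seconds "$1")
    user=$(to_seconds "$2")
    sys=$(to_seconds "$3")
    awk -v r="$real" -v u="$user" -v s="$sys" \
        'BEGIN { if (r > 0) printf "%d\n", (u + s) / r * 100 + 0.5; else print 0 }'
}

# The figures from the example above
cpu_share 0m1.34s 0m0.94s 0m0.40s
```

A result close to 100 means the script had the processor nearly to itself. A much lower figure suggests the script spent most of its real time waiting.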
NOTE: Also, don't worry too much about whether the script is in a compiled language like C or C++ or an interpreted language like Perl. For small programs, the time difference is inconsequential. For larger programs most of the time is spent in third-party applications like database managers. If the application is spending many seconds inside a Perl script, chances are there's a better way to design the application.
If the real time from the time command reports that the process takes more than a second or so, the script is a good candidate for batching. Remember that batching doesn't just change the response time of the user who submitted the request. Recall that the server only has a certain number of copies of the httpd daemon to allocate. During the time one of those daemons is holding the channel open for a long transaction, it is not available to service other, shorter requests.
The basic idea behind batching is to run long complex jobs at a time when they can have the machine more or less to themselves. So the first task is to predict the load over the course of the day. This task is more complex than it sounds if the machine is used as a Web server. The Web is indeed a worldwide system. A popular site may be getting hits at all hours of the day and night. For this reason, it is best to keep heavyweight jobs off of the Web server.
Once you have identified the machine on which you want the batch job to run, the next step is to decide whether you will pick the time to run or whether you can let the system pick it. The UNIX utility that allows (simple) batch submission is called batch. batch accepts jobs from all users and puts them into a queue. The job at the head of the queue runs. When it completes, the next job runs and so forth. In this way, the system capacity is less likely to be exceeded, even if many users are submitting large jobs.
To find out if your target system has batch, log into that system and type
which batch
Expect the system to reply with a pathname, like /usr/bin/batch. If it does, the system has batch. If it doesn't, you will have to pick the time that the job will run. Even if your system has batch, you may want to explicitly defer the job so that it runs during non-peak hours. For example, your job may require end-of-the-day statistics that are not available until after local midnight.
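The same check can be made from inside a script. Here is a small sketch, assuming a POSIX shell; the function names have_command and choose_submitter are illustrative and do not appear elsewhere in this chapter.

```shell
#!/bin/sh
# Succeed if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Report which facility is available, in order of preference.
choose_submitter() {
    if have_command batch; then
        echo batch
    elif have_command at; then
        echo at
    else
        echo cron
    fi
}

choose_submitter
```

The script prints batch, at, or cron, depending on what the target system provides.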
The easiest way to submit a job to the UNIX batch facility from inside a Perl script is to put the commands to be batched into their own script. For this example, call that script bigjob.pl. Now, from inside the CGI script, write the parameters for bigjob.pl to a file with a unique name, such as bigjob.dat.84. (To make a unique name, consider using the counter script from Chapter 7 "Extending HTML's Capabilities with CGI.") For this example, assume that a unique identifier has been written into the variable $ID. The call from Perl is
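# Copy the parameter file to the batch host.
system("rcp bigjob.dat.$ID host2:/home/myproject/bigjob.dat.$ID");
# batch reads the commands to run from its standard input, so echo the
# command line to it. The single quotes make the redirections and the
# pipe take effect on host2 rather than on the Web server.
# Validate $FORM{email} before this point; it is passed to a shell.
exec("rsh host2 'echo \"bigjob.pl < /home/myproject/bigjob.dat.$ID 2> /home/myproject/bigjob.err.$ID | mail $FORM{email}\" | batch'");
# We only get here if exec fails. Write to standard error so the
# Webmaster sees this message in the error log.
die("Could not exec. Stopped: $!");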
The call to exec replaces the current script with rsh, which hands the job to batch on the remote host. This construct is very efficient, but don't make this call until all the HTML has been sent back to the client. If exec succeeds, it never returns; therefore, nothing after exec executes.
Notice that we have used rcp and rsh to copy the command file to the remote machine and have it read into the bigjob script. If the remote host is not a UNIX machine, use another protocol such as ftp (and its associated macro).
Because the command runs on the remote machine, any errors will not be redirected back to the server's error log. Instead, we explicitly redirect standard error to a log file. The Webmaster can read those error logs or post-process them with a filter.
If the parameters are few, pass them on the command line, like the following:
exec("rsh host2 'echo \"bigjob.pl $FORM{query} $FORM{database} 2> /home/myproject/bigjob.err.$ID | mail $FORM{email}\" | batch'");
CAUTION: When batch runs, it uses the directory and environment of the user who invoked it. When it is invoked by the Web server, it is usually running as user nobody. Be careful about making assumptions about the environment, the shell, or the user privileges.
Some systems have software that's better than batch; that's not such a tall order. For example, AIX (IBM's flavor of UNIX) has a general queueing facility called qdaemon. Although it is generally thought of as a print-queue system, it can have any program installed at its head, such as a database client or a compiler. These facilities are usually far superior to batch; they allow multiple queues with priorities and provide the ability to start, stop, and restart jobs.
Sometimes, however, a system will have no batch facilities. In this case, the Webmaster can use other UNIX utilities, such as at and cron. The at utility runs a command once, at a specified time. The cron utility reads a table (crontab) that contains a schedule, and runs each job in the table on a regular basis. To help determine when to run each job, look at the system load over time.
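To make the difference concrete, here is a brief sketch. The helper crontab_line is our own invention for illustration, and the path /home/myproject/bigjob.pl is assumed from the earlier examples.

```shell
#!/bin/sh
# One-time job: at reads the command to run from its standard input.
#   echo "/home/myproject/bigjob.pl" | at 0133
#
# Recurring job: build the crontab line for a run at the same time each day.
# Arguments: hour, minute, command.
crontab_line() {
    hour=$1
    minute=$2
    command=$3
    echo "$minute $hour * * * $command"
}

crontab_line 1 33 /home/myproject/bigjob.pl
```

The crontab fields are minute, hour, day of the month, month, and day of the week, so the line printed here runs the job at 1:33 a.m. every day.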
First, look to see when other users on the machine schedule their batch jobs. Then avoid those times. To see what the job queue looks like, type
atq
The system will respond with a list of queued jobs. Look at the scheduled execution date. (It also includes the scheduled time.) If you see that 1 a.m. and 3 a.m. are popular, consider running your jobs at 2 a.m. or 4 a.m. or even 1:33 a.m.
On many systems you won't be able to see jobs other users have put in their personal crontabs. As a simple first cut, pick odd times like 1:33 a.m. or 2:19 a.m. For a better look at system load over time, work with the system administrator to run a performance-measurement system like sar or sa. sa is commonly found on versions of UNIX that descend from BSD UNIX; sar is more common on systems that trace their roots to AT&T's System V.4. The two standards are merging. Some systems support both sa and sar. If you have a choice, use sar. It's the more comprehensive of the two and it's easier to get sar to produce periodic reports.
To get sa to log system load every hour, enter a script like Listing 12.1 and tell cron to run it every hour.
Listing 12.1 List121.sh-A Shell Script to Produce Incremental sa Reports
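#!/bin/sh
# Name each report after the time it was taken, for example acct.dat.0133
filename=acct.dat.`date +%H%M`
sa=/usr/etc/sa
myDirectory=/home/me/acct
pathname=$myDirectory/$filename
# Start the file with a timestamp, then append the incremental report
date > $pathname
$sa -i >> $pathname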
The -i option on sa tells it to produce an incremental report (it does not average the latest data with the summary data).
Each morning after setting this command to run, look at the system load (CPU time, average I/O operations, and average physical memory) during each hour. Each line represents a summary of the commands called during that hour. The first line is supposed to be a summary of all commands but it is not accurate in all versions of UNIX. Here is a typical line from an sa report.
0 0.00re 0.00cpu 0avio 0k
If your operating system has sar, use it to get the same data. The system administrator must activate accounting and arrange for the data collector sadc to run when the system is booted. The script sa1 is already supplied by the operating system vendor to collect statistics. To collect hourly statistics, put the following line in the crontab:
0 * * * * /usr/lib/sa/sa1
Now let sa1 run overnight. In the morning just type sar.
You will see a table showing every hour sadc logged. Next to each hour, the log shows user, system, I/O, and idle time, as shown in Table 12.1.
Finally, armed with sar or sa data, pick a time that appears to be lightly loaded. You may want to rerun the report over several nights and at a finer grain (say, every 20 minutes). Once you have found your time slot, set up your application to read and process each input file, one after the other, and have it start by cron at the appointed hour. Thus,
exec("rcp bigjob.dat.$ID host2:/home/myproject/bigjob.dat.$ID");
copies the data file to the remote machine. Arrange for the user's e-mail address to be included in the file. Then when cron runs bigjob, bigjob loops, looking into the directory /home/myproject/ for files of the form bigjob.dat.nn. When it finds such a file, it opens it, pulls out the administrative material, such as the return address, and runs the remaining elements (for example, a query against a database). It finishes by sending the output by e-mail to the user's e-mail address and then loops again. The program exits when there are no more input files.
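The loop just described might be sketched as follows. This is an illustration under stated assumptions, not the actual bigjob: we assume the return address is on the first line of each file and the query follows it, and run_query and deliver are hypothetical stand-ins for the real database client and for mail.

```shell
#!/bin/sh
# Hypothetical stand-ins. A real job would run the database query here
# and pipe its result to mail instead of echoing it.
run_query() { sed 's/^/result for: /'; }
deliver()   { echo "to $1:"; cat; }

# Process every bigjob.dat.nn file in the directory, then return.
process_queue() {
    dir=$1
    for f in "$dir"/bigjob.dat.*; do
        [ -f "$f" ] || continue        # the pattern matched nothing
        address=$(sed -n '1p' "$f")    # line 1: the return address (assumed)
        sed '1d' "$f" | run_query | deliver "$address"
        rm -f "$f"                     # done; do not process it again
    done
}

# Usage: process_queue /home/myproject
```

Removing each input file after it has been handled keeps the next night's run from sending the same report twice.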
Here's an example of how a full batch-processing system might work. Suppose GSH Real Estate arranged with the local Multiple Listing Service (MLS) to have its Web server place queries against the MLS database during the off hours. GSH is running on a UNIX box; in this example, MLS uses an IBM AS/400 that has TCP/IP installed on it.
During the day, GSH's clients could contact the Web server and fill out a form describing properties they want to find. When they submit the form, a CGI script runs and assembles a query file. GSH's system administrator has agreed with the system administrator of the AS/400 that they will start sending queries at 1 a.m., shortly after the MLS system pulls its daily backup and posts all changes to listings for that day. Because the AS/400 does not support the UNIX rcp and rsh commands, the system administrator uses FTP to transfer the file and invoke remote commands.
The AS/400 security officer sets up two accounts on the AS/400 for GSH's computer. One account, UNIX2400, is used by GSH's UNIX machine to send the files to the AS/400 and to initiate remote processing. The other account, AS2UNIX, picks up the report files produced by the AS/400.
The processing is started by a crontab entry on GSH's machine:
0 1 * * * /home/mls/upload.sh
The script upload.sh is shown in Listing 12.2.
Listing 12.2 upload.sh-A Korn Shell Script to Copy Query Files to the AS/400
#!/bin/ksh
# uncomment the following line to debug
#set -x
DIR=/home/mls
PATH=$PATH:$DIR

# lockout other processes--alas, not atomic in the shell
# make sure umask is set so others cannot write to my files
touch /tmp/upload.lock
if [[ ! -w /tmp/upload.lock ]]; then
  exit 1
fi

# make sure log file exists
touch /tmp/mls.CMD.log
print "Begin upload_$(date)" >> /tmp/mls.CMD.log

# make sure there is enough disk space, where enough = 100 * size of UXQURY
# we make the assumption here that for every line in UXQURY we may get
# 100 lines of response
requiredSpace.ksh UXQURY | enoughDisk.ksh /tmp
if [[ $? -ne 0 ]]
then
  exit 1
fi
requiredSpace.ksh UXQURY | enoughDisk.ksh /var
if [[ $? -ne 0 ]]
then
  exit 1
fi
requiredSpace.ksh UXQURY | enoughDisk.ksh /home
if [[ $? -ne 0 ]]
then
  exit 1
fi

# upload queries to the host
cp netrc4 .netrc
print "Begin FTP\t$(date)" >> /tmp/mls.CMD.log
time ftp as400 > ftp.out 2>&1
print "Done FTP\t$(date)" >> /tmp/mls.CMD.log
This script checks to make sure there is enough disk capacity to handle the replies. If not, it logs an error and exits, so the queries may be run by hand by the system administrator after he has freed up some disk space. Note that if there's not enough disk space to handle the response, you don't risk querying the database and losing the response.
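The helper scripts requiredSpace.ksh and enoughDisk.ksh are not listed in this chapter. As a rough sketch of what such checks could look like, the functions below estimate the space needed as 100 times the size of the query file and compare it with the free space that df reports. The function names are ours, and the sketch assumes a df that accepts the POSIX -k and -P options.

```shell
#!/bin/sh
# Kilobytes needed for the responses: 100 times the size of the query file.
required_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes * 100 + 1023) / 1024 ))
}

# Kilobytes available on the file system that holds the given directory.
available_kb() {
    df -kP "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the directory ($2) has at least $1 kilobytes free.
enough_disk() {
    [ "$(available_kb "$2")" -ge "$1" ]
}

# Usage: enough_disk "$(required_kb UXQURY)" /tmp || exit 1
```

Passing the requirement as an argument, rather than through a pipe as upload.sh does, is only a simplification for the sketch.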
Next, the script copies a canned FTP script (netrc4, shown in Listing 12.3) into position for use by FTP. When FTP runs, it looks for the file .netrc. If it finds it, and if it matches the host it is trying to reach, FTP follows the script in .netrc. Here is netrc4, which is used as .netrc during the upload:
Listing 12.3 netrc4-An rc File Telling FTP to Move the Query File and Run a Remote Command
machine as400 login UNIX2400 password xyzzy
macdef init
put /home/mls/data/uxqury UXQURY.UXQURY
quote RCMD UXCONV CONV(*FROMUNIX) OPT(1) REPLACE(*YES)
quit
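This file says that when FTP is told to log into a machine called as400, it should do so using the specified account and password. Once logged in, FTP runs the macro named init, which does three things: it puts the query file /home/mls/data/uxqury into UXQURY.UXQURY on the AS/400, it sends a remote command (RCMD) that runs UXCONV to convert the file from UNIX format, replacing any previous copy, and it quits.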
At 3 a.m., the cron daemon on GSH's UNIX machine runs download.sh, which pulls the response files back from the MLS AS/400 to the Web server. Listing 12.4 shows download.sh.
Listing 12.4 download.sh-This Script Retrieves Responses from the AS/400
#!/bin/ksh
#set -x
DIR=/home/mls
PATH=$PATH:$DIR

# lockout other processes--alas, not atomic in the shell
# make sure umask is set so others cannot write to my files
touch /tmp/download.lock
if [[ ! -w /tmp/download.lock ]]
then
  exit 1
fi

# make sure log file exists
touch /tmp/mls.CMD.log
print "Begin download_$(date)" >> /tmp/mls.CMD.log

# clean up some from before starting, need all the room we can get
rm -f ./data/uxqury

# ensure there is a current copy
touch UXRESP

# download from the host
cp netrc5 .netrc
print "Begin FTP\t$(date)" >> /tmp/mls.CMD.log
time ftp as400 > ftp.out 2>&1
print "Done FTP\t$(date)" >> /tmp/mls.CMD.log

# create as though downloaded if not downloaded
touch ./data/UXRESP

# prove a useful download occurred (maybe ftp timed out);
# -s is true only if the file exists and is not empty
if [[ ! -s ./data/UXRESP ]]; then
  print "Download failed to download responses"
  exit 1
fi

nice mailout.pl < ./data/UXRESP

# record all commands going to mailout.pl for safety and validation
cat ./data/UXRESP >> /tmp/mls.CMD.log
rm /tmp/download.lock
print "Done download_$(date)" > junk
cat junk >> /tmp/mls.CMD.log
In this case, the FTP macro file, shown in Listing 12.5, says
Listing 12.5 netrc5-A Macro File to Retrieve the Responses
machine as400 login AS2UNIX password xyzzy
macdef init
quote RCMD UXCONV CONV(*TOUNIX) OPT(1) REPLACE(*YES)
get UXRESP.UXRESP /home/mls/data/UXRESP
quit
This macro is similar to the upload macro. The UNIX machine logs in to the AS/400 using FTP and sends a remote command to convert the file named by option 1 (UXRESP) to UNIX format, replacing any previous copy of the file. Then it invokes FTP's get command to retrieve the file from the AS/400 and transfer it to a particular path on the UNIX machine. It is from this location that the Perl script mailout.pl reads that file line by line and sends e-mail to the named users.
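Both upload.sh and download.sh remark that their touch-based lock is not atomic. One common alternative, sketched here as a suggestion rather than as part of the original scripts, relies on mkdir, which either creates the directory or fails in a single step. The function names and the lock path in the usage line are illustrative.

```shell
#!/bin/sh
# Try to take the lock. mkdir fails if the directory already exists,
# so only one process can succeed.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# Give the lock back so the next run can proceed.
release_lock() {
    rmdir "$1"
}

# Usage:
#   acquire_lock /tmp/download.lock.d || exit 1
#   ... do the work ...
#   release_lock /tmp/download.lock.d
```

If a run is interrupted before it releases the lock, the directory stays behind and must be removed by hand, so it is worth logging when the lock cannot be taken.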
Here's the file format for UXQURY. In this notation, X denotes any alphanumeric character, (nn) denotes a field width, and 9 denotes any digit. Thus X(20) is an alphanumeric field 20 characters wide. A numeric field two characters wide is shown as 99.
UserRealName                 X(20)
UserEMail                    X(30)
DesiredSection               99
Bedrooms                     9
Bath                         99
AskingPrice (in hundreds)    99999
Here's the file format for UXRESP:
UserRealName                 X(20)
UserEMail                    X(30)
Section                      99
Address                      X(30)
MLS Number                   X(10)
Bedrooms                     9
Bath                         99
AskingPrice (in hundreds)    99999
So a Web user can fill out a form like the one in Figure 12.3 and generate a line in that night's query file like this one
Figure 12.3 : The visitor can fill out a form to submit a batch job.
John R. Jones       JJones@xyz.com                2732501235
which says that John R. Jones (JJones@xyz.com) is looking for a three-bedroom, 2.5 bath home in section 27 of town. He is willing to consider homes priced as high as $123,500.
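To see how the fixed-width layout maps onto that reading, here is a small sketch that splits a UXQURY record by column position. The function parse_query is our own, and it assumes the record is padded exactly as the file format specifies (20 characters of name, 30 of e-mail address, then the numeric fields).

```shell
#!/bin/sh
# Split one UXQURY record into its fields, using the widths given in
# the file format: X(20) X(30) 99 9 99 99999.
parse_query() {
    printf '%s\n' "$1" | awk '{
        name  = substr($0, 1, 20);  sub(/ +$/, "", name)
        email = substr($0, 21, 30); sub(/ +$/, "", email)
        # Bath is stored in tenths; AskingPrice is stored in hundreds
        printf "name=%s\nemail=%s\nsection=%d\nbedrooms=%d\nbaths=%.1f\nprice=%d\n",
            name, email, substr($0, 51, 2), substr($0, 53, 1),
            substr($0, 54, 2) / 10, substr($0, 56, 5) * 100
    }'
}

# Build the sample record with the proper padding and parse it
record=$(printf '%-20s%-30s%s' "John R. Jones" "JJones@xyz.com" "2732501235")
parse_query "$record"
```

The Bath field holds 25 for 2.5 baths, and the AskingPrice field holds 01235 for $123,500.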
Here is a response the AS/400 might send back:
John R. Jones       JJones@xyz.com                271234 Smith Street             213469800042501220
John R. Jones       JJones@xyz.com                272345 Juniper Drive            299921415532501230
John R. Jones       JJones@xyz.com                271234 Cypress Way              320814706132501199
This says that the database search engine found three homes matching the desired criteria. The first is a four-bedroom, 2.5 bath, offered for $122,000. The next two are three-bedroom, 2.5 bath homes being offered for $123,000 and $119,900, respectively. The MLS numbers are given. If the Web server has the photo on file, it could display it.
Listing 12.6 shows mailout.pl, which is used to e-mail the results of the batch job to the user who requested that the query be run.
Listing 12.6 mailout.pl-Reads the Batch Results from STDIN and Sends Them Out by E-Mail
#!/usr/bin/perl
# mailout.pl
$siteOwner = "Christopher Kepilino";
$ownerEmail = "kepilino\@dse.com";
$oldUserNameAndEmail = "";

# pipe the UXRESP file into mailout.pl
while (<STDIN>)
{
  # Break apart the record into its fields
  ($userRealName, $userEMail, $section, $address, $MLSNumber,
   $bedrooms, $bath, $askingPriceInHundreds) =
     unpack("A20 A30 A2 A30 A10 A1 A2 A5", $_);

  # Is this a new user?
  if ($oldUserNameAndEmail ne $userRealName . $userEMail)
  {
    # let this open force the old MAIL closed
    open (MAIL, "| sendmail -t") || die ("Could not run sendmail. Stopped: $!");
    print MAIL "To: $userRealName <$userEMail>\n";
    print MAIL "From: $siteOwner <$ownerEmail>\n";
    # the blank line after Subject ends the message headers
    print MAIL "Subject: Your Real Estate Query\n\n";
    $sectionName = &section($section);
    print MAIL "Thank you for your inquiry about real estate in $sectionName.\n";
    print MAIL "Here are some listings for your review.\n\n";
    print MAIL "Please call our office at 555-1212 or reply to this e-mail ";
    print MAIL "for more information on any of these properties.\n\n";
    print MAIL "Warmest regards,\n";
    print MAIL "Chris Kepilino\n\n";
    print MAIL "MLS Number Address                        Bedrooms Baths Asking Price\n";
    print MAIL "__________ ______________________________ _        ___   _________\n";
    $oldUserNameAndEmail = $userRealName . $userEMail;
  }

  # Bath is stored in tenths; the asking price is stored in hundreds
  $bathFormatted = $bath / 10;
  $askingPriceFormatted = &commas($askingPriceInHundreds * 100);
  printf MAIL "%10s %-30s %-8s %-5s \$%8s\n",
    $MLSNumber, $address, $bedrooms, $bathFormatted, $askingPriceFormatted;
}
close MAIL;
exit;

# Translate a section number into the name of the neighborhood
sub section
{
  local($theSection) = @_;
  if ($theSection == 1)
  {
    $result = "Bayview";
  }
  elsif ($theSection == 12)
  {
    $result = "Ocean View";
  }
  elsif ($theSection == 27)
  {
    $result = "Glenwood";
  }
  else
  {
    $result = "an Unknown Section";
  }
  $result;
}

# Insert commas into a number, three digits at a time
sub commas
{
  local($_) = @_;
  1 while s/(.*\d)(\d\d\d)/$1,$2/;
  $_;
}
A typical message from mailout.pl looks like this:
From mikem Sun May 12 12:46:28 1996
Date: Sun, 12 May 1996 12:46:28 -0500
To: John R. Jones <root>
From: Christopher Kepilino <kepilino@dse.com>
Subject: Your Real Estate Query

Thank you for your inquiry about real estate in Glenwood.
Here are some listings for your review.

Please call our office at 555-1212 or reply to this e-mail for more information on any of these properties.

Warmest regards,
Chris Kepilino

MLS Number Address                        Bedrooms Baths Asking Price
__________ ______________________________ _        ___   _________
2134698000 1234 Smith Street              4        2.5   $ 122,000
2134698000 2345 Juniper Drive             4        2.5   $ 123,000
In an ideal world, computers would be blindingly fast, and all our queries would be answered instantly. In the real world, most queries are answered quickly, but some searches through large databases can take longer than the typical Web user will wait. One solution is to process the user's request with a batch job, then send the user the results by e-mail or make the finished report available on an FTP server.