CHAPTER 27

gawk Programming

by David B. Horvath, CCP

gawk, or GNU awk, is one of the newer versions of the awk programming language created for UNIX by Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan in 1977. The name awk comes from the initials of the creators' last names. Kernighan was also involved with the creation of the C programming language and UNIX; Aho and Weinberger were involved with the development of UNIX. Because of their backgrounds, you will see many similarities between awk and C.

There are several versions of awk: the original awk, nawk, POSIX awk, and of course, gawk. nawk was created in 1985 and is the version described in The awk Programming Language (see the complete reference to this book later in the chapter in the section "Summary"). POSIX awk is defined in the IEEE Standard for Information Technology, Portable Operating System Interface, Part 2: Shell and Utilities Volume 2, ANSI-approved April 5, 1993 (IEEE is the Institute of Electrical and Electronics Engineers, Inc.). GNU awk is based on POSIX awk.

The awk language (in all of its versions) is a pattern-matching and processing language with a lot of power. It searches one or more files for records that match a specified pattern. When a match is found, a specified action is performed. As a programmer, you do not have to worry about opening the file, looping through it to read each record, handling end-of-file, or closing it when done. These details are handled automatically for you.
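As a small illustration of the pattern/action idea, the sketch below prints the second field of every record that contains the text alpha. The sample file and its contents are made up for this example, and the command is written as awk so that it runs under any POSIX awk (gawk can be substituted directly).

```shell
# Build a throwaway input file (hypothetical sample data).
sample=$(mktemp)
printf 'alpha 1\nbeta 2\nalpha 3\n' > "$sample"

# Pattern: /alpha/   Action: print the second field.
# awk opens the file, reads each record, and closes it for you.
awk '/alpha/ { print $2 }' "$sample"

rm -f "$sample"
```

Only the first and third records match, so the command prints 1 and 3 on separate lines.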

Because the language handles these details automatically, short awk programs are easy to write. Many built-in functions and features also take care of the common tasks of processing files.

Applications

There are many possible uses for awk, including extracting data from a file, counting occurrences of a pattern within a file, and creating reports.
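To make the counting and reporting uses concrete, here is a minimal sketch that tallies how often each distinct first field appears and prints a short report. The input lines and the array name count are invented for illustration; the command is written as awk, and gawk behaves the same way here.

```shell
# Hypothetical log lines: the first field is a status word.
printf 'ok a\nfail b\nok c\nok d\n' |
awk '
    { count[$1]++ }               # tally each distinct first field
    END {
        for (word in count)       # array traversal order is unspecified
            print word, count[word]
    }' | sort                     # sort so the report order is stable
```

With this input the report has two lines: fail 1 and ok 3.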

The basic syntax of the awk language matches the C programming language; if you already know C, you know most of awk. In many ways, awk is an easier version of C because of the way it handles strings and arrays (dynamically). If you do not know C yet, learning awk will make learning C a little easier.

awk is also very useful for rapid prototyping or trying out an idea that will be implemented in another language like C. Instead of your having to worry about some of the minute details, the built-in automation takes care of them. You worry about the basic functionality.

TIP
awk works with text files, not binary. Because binary data can contain values that look like record terminators (newline characters)—or not have any at all—awk will get confused. If you need to process binary files, look into Perl or use a traditional programming language like C.

Features

Like the UNIX environment itself, awk is flexible: it contains predefined variables, automates many programming tasks, provides conventional variables, supports C-style formatted output, and is easy to use. awk lets you combine the best of shell scripts and C programming.

There are usually many different ways to perform the same task within awk. Programmers get to decide which method is best suited to their applications. With the built-in variables and functions, many of the normal programming tasks are automatically performed. awk will automatically read each record, split it up into fields, and perform type conversions whenever needed. The way a variable is used determines its type—there is no need (or method) to declare variables of any type.
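The automatic field splitting and type conversion can be seen in a short sketch. The variable total is never declared; it starts out empty, is treated as 0 in arithmetic, and the text in the second field is converted to a number when it is added. The input data and the variable name are assumptions made for this example (written as awk; gawk works identically).

```shell
# Two hypothetical records: a name and a quantity.
printf 'widget 3\ngadget 4.5\n' |
awk '{ total += $2 }                 # $2 is text, used here as a number
     END { print "total: " total }   # total is converted back to text for output'
```

This prints total: 7.5.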

Of course, the "normal" C programming constructs like if/else, do/while, for, and while are supported. The awk language as originally defined doesn't support the switch/case construct, although newer gawk releases provide a switch statement as an extension. awk supports C's printf() for formatted output and also has a print command for simpler output.
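The following sketch combines an if/else with both output commands. The threshold of 10, the label variable, and the sample data are all made up for illustration; the command is written as awk and should behave the same under gawk.

```shell
printf 'apple 1.5\nkiwi 12\n' |
awk '{
    if ($2 > 10)
        label = "high"
    else
        label = "low"
    printf "%-6s %6.2f %s\n", $1, $2, label   # C-style formatted output
    print $1, label                           # simpler output with print
}'
```

printf gives you control over field widths and decimal places but needs an explicit newline; print separates its arguments with a space and adds the newline for you.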

awk Fundamentals

Unlike some of the other UNIX tools (shell, grep, and so on), awk requires a program (known as an "awk script"). This program can be as simple as one line or as complex as several thousand lines. (I once developed an awk program that summarizes data at several levels with multiple control breaks; it was just short of 1000 lines.)

The awk program can be entered in either of two ways: on the command line or in a program file. awk can accept input from a file, from another program through a pipe, or even directly from the keyboard. Output normally goes to the standard output device, but the shell can redirect it to a file or pipe it into another program. An awk program can also write directly to a file from within the script instead of using standard output.
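Writing to a file from inside the program, rather than through shell redirection, might look like the sketch below. It assumes the POSIX -v option to pass the file name into an awk variable; the variable name out and the sample data are hypothetical (written as awk; gawk can be substituted).

```shell
out=$(mktemp)                              # temporary output file

printf 'one 1\ntwo 2\n' |
awk -v out="$out" '{ print $1 > out }'     # awk itself writes to the file named in out

cat "$out"                                 # show what was written
rm -f "$out"
```

Nothing appears on standard output from awk itself; the file ends up containing one and two on separate lines.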

Using awk from the Command Line

The simplest way to use awk is to code the program on the command line, accept input from the standard input device (keyboard), and send output to the standard output device (screen). Listing 27.1 shows this in its simplest form; it prints the number of fields in the input record along with that record.

Listing 27.1. Simplest use of awk.

$ gawk '{print NF ": " $0}'

Now is the time for all

Good Americans to come to the Aid

of Their Country.

Ask not what you can do for awk, but rather what awk can do for you.

Ctrl+D


6: Now is the time for all

7: Good Americans to come to the Aid

3: of Their Country.

16: Ask not what you can do for awk, but rather what awk can do for you.

$ _

NOTE
Ctrl+D is one way of showing that you should press (and hold) the Ctrl (or Control) key and then press the D key. This is the default end-of-file key for UNIX. If this doesn't work on your system, use stty -a to determine which key to press. Another way this action or key is shown on the screen is ^d.
The entire awk script is contained within single quotes (') to prevent the shell from interpreting its contents. This is a requirement of the operating system or shell, not the awk language.

NF is a predefined variable that is set to the number of fields in each record. $0 is that record. The individual fields can be referenced as $1, $2, and so on.
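A quick way to see these variables at work is to print them for a single record. The sketch also uses $NF, which is not covered above: because NF holds the field count, $NF refers to the last field. The command is written as awk; gawk gives the same result.

```shell
echo 'Now is the time for all' |
awk '{
    print NF      # number of fields in the record
    print $1      # first field
    print $NF     # last field (field number NF)
    print $0      # the whole record
}'
```

For this record the four lines printed are 6, Now, all, and the original line.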

You can also store your awk script in a file and specify that filename on the command line by using the -f flag. If you do that, you don't have to enclose the program in single quotes.
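Here is a minimal sketch of the -f form, using the same one-line program as Listing 27.1. The program file is created in a temporary location for the example; in practice you would give it a name of your own choosing. The command is written as awk, and gawk accepts the same option.

```shell
prog=$(mktemp)                       # hypothetical program file

# Store the script in the file; no shell quoting is needed inside it.
cat > "$prog" <<'EOF'
# Print the field count, then the record.
{ print NF ": " $0 }
EOF

printf 'Now is the time\n' | awk -f "$prog"

rm -f "$prog"
```

This prints 4: Now is the time.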

NOTE
gawk and other versions of awk that meet the POSIX standard accept multiple -f options. The named files are read in the order given and combined into a single awk program, which allows you to apply the rules from several program files to the same input. Personally, I tend to avoid this just because it gets a bit confusing.
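POSIX specifies that when several -f options are given, the named files are concatenated in order to form one awk program. The sketch below relies on that: one file holds a helper function and the other holds the rule that calls it. The file contents and the function name twice are invented for this example (written as awk; gawk can be substituted).

```shell
lib=$(mktemp)                        # hypothetical file of shared functions
main=$(mktemp)                       # hypothetical file with the main rule

printf '%s\n' 'function twice(x) { return 2 * x }' > "$lib"
printf '%s\n' '{ print twice($1) }' > "$main"

# Both files together make up the program that is run.
echo 21 | awk -f "$lib" -f "$main"

rm -f "$lib" "$main"
```

The function defined in the first file is visible to the rule in the second, so the command prints 42.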

You can use the normal UNIX shell redirection or just specify the filename on the command line to accept the input from a file instead of the keyboard:

gawk '{print NF ": " $0}' < inputs

gawk '{print NF ": " $0}' inputs

Multiple files can be specified by just listing them on the command line as shown in the second form above—they will be processed in the order specified. Output can be redirected through the normal UNIX shell facilities to send it to a file or pipe it into another program:

gawk '{print NF ": " $0}' > outputs

gawk '{print NF ": " $0}' | more

Of course, both input and output can be redirected at the same time.
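The combinations described above can be tried out with a pair of small files. The file names and contents in this sketch are placeholders created in a temporary directory, and wc -l simply stands in for whatever program you want to pipe into. The commands are written as awk; gawk works the same way.

```shell
dir=$(mktemp -d)
printf 'Now is the time\n' > "$dir/inputs1"
printf 'for all\n'         > "$dir/inputs2"

# Input redirected from a file and output redirected to a file.
awk '{ print NF ": " $0 }' < "$dir/inputs1" > "$dir/outputs"
cat "$dir/outputs"

# Two input files named on the command line, processed in order,
# with the output piped into another program.
awk '{ print NF ": " $0 }' "$dir/inputs1" "$dir/inputs2" | wc -l

rm -rf "$dir"
```

The first command leaves 4: Now is the time in the output file; the second produces two lines, one per input record, which wc -l counts.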
