

Part V
Linux for Programmers

In This Part
•   gawk
•   Programming in C
•   Programming in C++
•   Perl
•   Introduction to Tcl and Tk
•   Other Compilers
•   Smalltalk/X

Chapter 25
gawk

by Tim Parker

In This Chapter
•  What is the gawk language?
•  Files, records, and fields
•  Pattern-action pairs
•  Calling gawk programs
•  Control structures

The awk programming language was created by the three people who gave their last-name initials to the language: Alfred Aho, Peter Weinberger, and Brian Kernighan. The gawk program included with Linux is the GNU implementation of that programming language.

The gawk language is more than just a programming language; it is an almost indispensable tool for many system administrators and UNIX programmers. The language itself is easy to learn, easy to master, and amazingly flexible. After you get the hang of using gawk, you’ll be surprised how often you can use it for routine tasks on your system.

To help you understand gawk, we will introduce the elements of the programming language in a simple order and show good examples along the way. You are encouraged, of course, to experiment as the chapter progresses. It’s not possible to cover all the different aspects and features of gawk in this chapter, but we will look at the basics of the language and show you enough, hopefully, to get your curiosity working.

What Is the gawk Language?

gawk is designed to be an easy-to-use programming language that lets you work with information either stored in files or piped to it. The main strengths of gawk are its capabilities to do the following:

•  Display some or all of the contents of a file, selecting rows, columns, or fields as necessary.
•  Analyze text for frequency of words, occurrences, and so on.
•  Prepare formatted output reports based on information in a file.
•  Filter text in a very powerful manner.
•  Perform calculations with numeric information from a file.

gawk isn’t difficult to learn. In many ways, gawk is the ideal first programming language because of its simple rules, basic formatting, and standard usage. Experienced programmers will find gawk refreshingly easy to use.

Files, Records, and Fields

Usually, gawk works with data stored in files. Often this is numeric data, but gawk can work with character information, too. If data is not stored in a file, it is supplied to gawk through a pipe or other form of redirection. Only ASCII files (text files) can be properly handled with gawk. Although it does have the capability to work with binary files, the results are often unpredictable. Because most information on a Linux system is stored in ASCII, this isn’t a problem.
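
The ways of supplying data described above can be sketched as follows. This is only an illustration, not part of the original text: the numbers are made-up sample data, and the small program (which adds up the first field of every line and prints the total at the end) uses constructs that are explained later in the chapter.

```shell
# Use the system awk if it is not installed under the name gawk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Made-up sample data: one number per line, saved in a temporary file.
datafile=$(mktemp)
printf '%s\n' 10 20 30 > "$datafile"

# 1. Name the data file on the command line.
gawk '{ total += $1 } END { print total }' "$datafile"

# 2. Send the same data through a pipe.
cat "$datafile" | gawk '{ total += $1 } END { print total }'

# 3. Redirect the file to gawk's standard input.
gawk '{ total += $1 } END { print total }' < "$datafile"

rm -f "$datafile"
```

Each of the three commands should print 60, because gawk processes the data the same way no matter how it arrives.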

As a simple example of a file that gawk works with, consider a telephone directory. It is composed of many entries, all with the same format: last name, first name, address, telephone number. The entire telephone directory is a database of sorts, although without a sophisticated search routine. Indeed, the telephone directory relies on a pure alphabetical order to enable users to search for the data they need.

Each line in the telephone directory is a complete set of data on its own and is called a record. For example, the entry in the telephone directory for “Smith, John,” which includes his address and telephone number, is a record.

Each piece of information in the record—the last name, the first name, the address, and the telephone number—is called a field. For the gawk language, the field is a single piece of information. A record, then, is a number of fields that pertain to a single item. A set of records makes up a file.
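
To make the terms concrete, here is a small sketch of how gawk itself reports records and fields. The two directory entries are hypothetical and deliberately simple (one word per field). NR and NF are gawk's built-in variables for the current record number and the number of fields in that record, and $1 refers to the first field.

```shell
# Use the system awk if it is not installed under the name gawk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Hypothetical two-line "directory": last name, first name, telephone number.
directory='Smith John 555-1283
Jones Mary 555-9921'

# For every record, report its number, its field count, and its first field.
printf '%s\n' "$directory" |
  gawk '{ print "record " NR " has " NF " fields; field 1 is " $1 }'
```

With this input, gawk should report that record 1 and record 2 each have 3 fields, with Smith and Jones as the respective first fields.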

In most cases, fields are separated (delimited) by a character that is used only to separate fields, such as a space, a tab, a colon, or some other special symbol. This character is called a field separator. A good example is the file /etc/passwd, which looks like this:


tparker:t36s62hsh:501:101:Tim Parker:/home/tparker:/bin/bash
etreijs:2ys639dj3h:502:101:Ed Treijs:/home/etreijs:/bin/tcsh
ychow:1h27sj:503:101:Yvonne Chow:/home/ychow:/bin/bash

If you look carefully at the file, you can see that it uses a colon as the field separator. Each line in the /etc/passwd file has seven fields: the username, the password, the user ID, the group ID, a comment field, the home directory, and the startup shell. Colons exist only to separate one field from the next. A program looking for the sixth field in any line need only count five colons across (because the first field doesn’t have a colon before it).
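
As a rough sketch of that colon counting in practice, the following feeds the three sample lines to gawk with the -F option, which sets the field separator. The sample is held in a shell variable (a name chosen here for illustration) so that the real /etc/passwd is not touched.

```shell
# Use the system awk if it is not installed under the name gawk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# The three sample /etc/passwd lines from the text.
passwd_sample='tparker:t36s62hsh:501:101:Tim Parker:/home/tparker:/bin/bash
etreijs:2ys639dj3h:502:101:Ed Treijs:/home/etreijs:/bin/tcsh
ychow:1h27sj:503:101:Yvonne Chow:/home/ychow:/bin/bash'

# -F: makes the colon the field separator.
# Print the username ($1), the home directory ($6), and the field count (NF).
printf '%s\n' "$passwd_sample" | gawk -F: '{ print $1, $6, NF }'
```

The first line of output should be "tparker /home/tparker 7". The comment field "Tim Parker" stays in one piece because, with a colon as the separator, the space inside it is just another character.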

That’s where we find a problem with the gawk definition of fields as they pertain to the telephone directory example. Consider the following lines from a telephone directory:


Smith, John      13 Wilson St.                 555-1283
Smith, John      2736 Artside Dr, Apt 123      555-2736
Smith, John      125 Westmount Cr              555-1728

We “know” there are four fields here: the last name, the first name, the address, and the telephone number. But gawk doesn’t see it that way. By default, gawk uses whitespace as the field separator, so on the first line it sees “Smith,” (comma included) as the first field, “John” as the second, “13” as the third, “Wilson” as the fourth, and so on. As far as gawk is concerned, the first line has six fields and the second line has eight. A run of several spaces or tabs counts as a single separator, so the wide gaps between the columns do not create empty fields, but neither do they tell gawk where the address ends and the telephone number begins. Unless you change the field separator to something that appears only between the real fields, gawk cannot recover the four fields we had in mind.
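
You can check the field counts yourself. The sketch below runs the three directory lines through gawk twice. The first command uses gawk's default behavior, which splits on runs of spaces and tabs. The second is one possible workaround, offered as an assumption about this particular layout rather than a general rule: it treats only two or more consecutive spaces as a separator, which happens to work because the columns in this sample are padded with several spaces.

```shell
# Use the system awk if it is not installed under the name gawk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# The three telephone directory lines from the text.
phone_sample='Smith, John      13 Wilson St.                 555-1283
Smith, John      2736 Artside Dr, Apt 123      555-2736
Smith, John      125 Westmount Cr              555-1728'

# Default separator: print the field count, the first field, and the third.
printf '%s\n' "$phone_sample" | gawk '{ print NF, $1, $3 }'

# Separator of two or more spaces: print the field count and the address.
printf '%s\n' "$phone_sample" | gawk -F'  +' '{ print NF, $2 }'
```

The first command should report 6, 8, and 6 fields, with "Smith," as the first field of every line. The second should report 3 fields per line (name, address, telephone number), with the whole address in $2. Splitting the name into last and first would take a further step.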


Tip:  
When working with a programming language, you must consider data the way the language will see it. Remember that programming languages take things literally.

