by Tim Parker
The awk programming language was created by the three people who gave their last-name initials to the language: Alfred Aho, Peter Weinberger, and Brian Kernighan. The gawk program included with Linux is the GNU implementation of that programming language.
The gawk language is more than just a programming language; it is an almost indispensable tool for many system administrators and UNIX programmers. The language itself is easy to learn, easy to master, and amazingly flexible. After you get the hang of using gawk, you'll be surprised how often you can use it for routine tasks on your system.
To help you understand gawk, we will introduce the elements of the programming language in a simple order and show good examples along the way. You are encouraged, of course, to experiment as the chapter progresses. It's not possible to cover all the different aspects and features of gawk in this chapter, but we will look at the basics of the language and show you enough, hopefully, to get your curiosity working.
gawk is designed to be an easy-to-use programming language that lets you work with information either stored in files or piped to it. The main strengths of gawk are its capabilities to do the following:
gawk isn't difficult to learn. In many ways, gawk is the ideal first programming language because of its simple rules, basic formatting, and standard usage. Experienced programmers will find gawk refreshingly easy to use.
Usually, gawk works with data stored in files. Often this is numeric data, but gawk can work with character information, too. If data is not stored in a file, it is supplied to gawk through a pipe or other form of redirection. Only ASCII files (text files) can be properly handled with gawk. Although it does have the capability to work with binary files, the results are often unpredictable. Because most information on a Linux system is stored in ASCII, this isn't a problem.
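As a minimal sketch of data arriving through a pipe rather than a file, the following command feeds three numbers to the program on standard input and adds them up. The numbers and the variable names total and sum are made up for illustration. The command is written as awk so that it runs on any POSIX system; gawk can be substituted directly.

```shell
# Three made-up numbers, one per line, are piped to awk on standard input.
# The main block adds the first field of each line to sum; END prints the total.
total=$(printf '%s\n' 10 20 12 | awk '{ sum += $1 } END { print sum }')
printf '%s\n' "$total"
```

The same program works unchanged on a file: name the file after the program text instead of using a pipe.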
As a simple example of a file that gawk works with, consider a telephone directory. It is composed of many entries, all with the same format: last name, first name, address, telephone number. The entire telephone directory is a database of sorts, although without a sophisticated search routine. Indeed, the telephone directory relies on a pure alphabetical order to enable users to search for the data they need.
Each line in the telephone directory is a complete set of data on its own and is called a record. For example, the entry in the telephone directory for Smith, John, which includes his address and telephone number, is a record.
Each piece of information in the record (the last name, the first name, the address, and the telephone number) is called a field. For the gawk language, the field is a single piece of information. A record, then, is a number of fields that pertain to a single item. A set of records makes up a file.
In most cases, fields are separated (delimited) by a character that is used only to separate fields, such as a space, a tab, a colon, or some other special symbol. This character is called a field separator. A good example is the file /etc/passwd, which looks like this:
tparker:t36s62hsh:501:101:Tim Parker:/home/tparker:/bin/bash
etreijs:2ys639dj3h:502:101:Ed Treijs:/home/etreijs:/bin/tcsh
ychow:1h27sj:503:101:Yvonne Chow:/home/ychow:/bin/bash
If you look carefully at the file, you can see that it uses a colon as the field separator. Each line in the /etc/passwd file has seven fields: the username, the password, the user ID, the group ID, a comment field, the home directory, and the startup shell. Each field is separated by a colon. Colons exist only to separate fields. A program looking for the sixth field in any line needs only to count five colons across (because the first field doesn't have a colon before it).
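The field counting described above can be sketched with a one-line program. This is an illustration only: the sample lines are copied from the text into a shell variable (the names passwd_sample and home_dirs are hypothetical) instead of being read from the real /etc/passwd. The -F: option sets the field separator to a colon, NF holds the number of fields in the current record, and $6 is the sixth field. The command is written as awk; gawk can be substituted directly.

```shell
# Sample records in /etc/passwd format, copied from the text.
passwd_sample='tparker:t36s62hsh:501:101:Tim Parker:/home/tparker:/bin/bash
etreijs:2ys639dj3h:502:101:Ed Treijs:/home/etreijs:/bin/tcsh
ychow:1h27sj:503:101:Yvonne Chow:/home/ychow:/bin/bash'

# For each record, print the field count and the sixth field (the home directory).
home_dirs=$(printf '%s\n' "$passwd_sample" | awk -F: '{ print NF, $6 }')
printf '%s\n' "$home_dirs"
# Prints:
# 7 /home/tparker
# 7 /home/etreijs
# 7 /home/ychow
```

Note that the comment field (Tim Parker) contains a space but still counts as one field, because only the colon separates fields here.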
That's where we find a problem with the gawk definition of fields as they pertain to the telephone directory example. Consider the following lines from a telephone directory:
Smith, John 13 Wilson St. 555-1283
Smith, John 2736 Artside Dr, Apt 123 555-2736
Smith, John 125 Westmount Cr 555-1728
We know there are four fields here: the last name, the first name, the address, and the telephone number. But gawk doesn't see it that way. The telephone book uses the space character as a field separator, so on the first line gawk sees Smith, (with its trailing comma) as the first field, John as the second, 13 as the third, Wilson as the fourth, and so on. As far as gawk is concerned, the first line when using a space character as a field separator has six fields. The second line has eight fields. By default, gawk treats any run of spaces or tabs as a single field separator; the whitespace has no other meaning. If you change the field separator to another character, such as a colon, spaces and tabs become ordinary characters inside a field.
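The field counts for the telephone directory lines can be checked with a short sketch. It assumes the default behavior, in which fields are split on runs of spaces and tabs when no -F option is given. The sample lines are copied from the text into a shell variable, and the names directory_sample and field_counts are hypothetical. The command is written as awk; gawk can be substituted directly.

```shell
# Telephone directory records, copied from the text.
directory_sample='Smith, John 13 Wilson St. 555-1283
Smith, John 2736 Artside Dr, Apt 123 555-2736
Smith, John 125 Westmount Cr 555-1728'

# No -F option: fields are split on whitespace.
# Print the field count, the first field, and the last field ($NF) of each record.
field_counts=$(printf '%s\n' "$directory_sample" | awk '{ print NF, $1, $NF }')
printf '%s\n' "$field_counts"
# Prints:
# 6 Smith, 555-1283
# 8 Smith, 555-2736
# 6 Smith, 555-1728
```

The telephone number is always the last field, so $NF finds it reliably even though the address occupies a different number of fields on each line.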
Tip:
When working with a programming language, you must consider data the way the language will see it. Remember that programming languages take things literally.