-->
The match(string, reg) function determines whether string contains a match for the regular expression reg. If there is a match, its starting position is returned, and the variables RSTART and RLENGTH are set to the position and length of the match. If there is no match, 0 is returned.
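As a minimal sketch of match, the following locates the first run of digits in a sample string. The string and the variable names (s, pos, match_out) are made up for illustration.

```shell
# Find the first run of digits; match() sets RSTART and RLENGTH as a side effect.
match_out=$(awk 'BEGIN {
    s = "order 4521 shipped"
    pos = match(s, /[0-9]+/)      # 1-based start of the matched text
    print pos, RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}')
echo "$match_out"
```

RSTART and RLENGTH are in the form that substr expects, so the matched text can be extracted directly.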
The printf(format, variables) function writes formatted output, converting variables according to the format string. This function is very similar to the C printf() function. More information about this function and the formatting strings is provided in the section "printf" later in this chapter.
The split(string, store, delim) function splits string into elements of the array store based on the delim string. The number of elements in store is returned. If you omit the delim string, FS is used. To split a slash (/) delimited date into its component parts, code the following:
split("08/12/1962", results, "/");
After the function call, results[1] contains 08, results[2] contains 12, and results[3] contains 1962. When used with the split function, the array begins with element 1. This also works with strings that contain text.
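The date example can be sketched as a complete command, together with a second call that omits delim so that FS (whitespace by default) is used. The colors array and the shell variable split_out are illustrative names only.

```shell
# split() returns the element count and fills the array starting at index 1.
split_out=$(awk 'BEGIN {
    n = split("08/12/1962", results, "/")
    print n, results[1], results[2], results[3]
    m = split("red green blue", colors)    # delim omitted, so FS is used
    print m, colors[2]
}')
echo "$split_out"
```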
The sprintf(format, variables) function behaves like the printf function except that it returns the result string instead of writing output. It converts variables according to the format string. This function is very similar to the C sprintf() function. More information about this function and the formatting strings is provided in the "printf" section of this chapter.
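A minimal sketch of sprintf follows: the formatted text is stored in a variable (line, a made-up name) and printed later, here between brackets so the padding is visible. The sample values are assumptions.

```shell
# sprintf() returns the formatted string instead of printing it.
sprintf_out=$(awk 'BEGIN {
    line = sprintf("%-12s %8.2f", "Pennsylvania", 10.15)
    print "[" line "]"
}')
echo "$sprintf_out"
```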
The strftime(format, timestamp) function returns a formatted date or time based on the format string; timestamp is the number of seconds since midnight UTC on January 1, 1970. The systime function returns a value in this form. The format is the same as that of the C strftime() function.
The sub(reg, string, target) function substitutes string for the first match of the regular expression reg within target. The number of substitutions made (0 or 1) is returned by the function. If target is omitted, the input record, $0, is the target. This is patterned after the substitute command in the ed text editor.
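As a sketch of sub, the following replaces only the first match in a sample string and prints the returned count. The string and the names s, n, and sub_out are illustrative.

```shell
# sub() changes its target in place and returns the number of replacements.
sub_out=$(awk 'BEGIN {
    s = "one fish two fish"
    n = sub(/fish/, "cat", s)     # the second "fish" is left alone
    print n, s
}')
echo "$sub_out"
```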
The substr(string, position, len) function extracts a substring of string, starting at position (the first character is position 1) and running for len characters. If you omit the len parameter, the remainder of the string is returned.
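A short sketch of substr, reusing the date string from the split example; the variable names d and substr_out are assumptions.

```shell
# Positions are counted from 1.
substr_out=$(awk 'BEGIN {
    d = "08/12/1962"
    print substr(d, 4, 2)    # two characters starting at position 4
    print substr(d, 7)       # len omitted: the rest of the string
}')
echo "$substr_out"
```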
The tolower(string) function returns a copy of string with its uppercase alphabetic characters converted to lowercase. All other characters are returned unchanged.
The toupper(string) function returns a copy of string with its lowercase alphabetic characters converted to uppercase. All other characters are returned unchanged.
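The two case-conversion functions can be sketched together. The sample string is made up; note that the digits, comma, and space pass through untouched.

```shell
# tolower() and toupper() return converted copies; s itself is not modified.
case_out=$(awk 'BEGIN {
    s = "Route 66, PA"
    print tolower(s)
    print toupper(s)
}')
echo "$case_out"
```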
awk supports special string constants that cannot be entered from the keyboard or that have special meaning. If you wanted to have a double quote (") character as a string constant (x = """), how would you prevent awk from thinking the second one (the one you really want) is the end of the string? The answer is by escaping, or telling awk that the next character has special meaning. This is done through the backslash (\) character, as in the rest of UNIX, so the assignment becomes x = "\"".
Table 27.6 shows most of the constants that gawk supports.
Table 27.6. gawk special string constants.
Expression | Meaning |
\\ | A literal backslash |
\a | The alert or bell character |
\b | Backspace |
\f | Formfeed |
\n | Newline |
\r | Carriage return |
\t | Tab |
\v | Vertical tab |
\" | Double quote |
\xNN | The character whose hexadecimal code is NN |
\NNN | The character whose octal code is NNN (one to three octal digits) |
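A few of these escapes can be tried in a short sketch. The awk program is enclosed in single quotes so that the shell passes the backslashes through unchanged; the strings and the name esc_out are illustrative.

```shell
# Escape sequences are interpreted inside awk string constants.
esc_out=$(awk 'BEGIN {
    x = "\""                 # a string holding one double quote
    print x "quoted" x
    print "a\tb"             # a tab between a and b
    print "C:\\temp"         # one literal backslash
}')
echo "$esc_out"
```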
When you have more than one related piece of data, you have two choices: you can create multiple variables, or you can use an array. An array enables you to keep a collection of related data together.
You access individual elements within an array by enclosing the subscript within square brackets ([]). In general, you can use an array element any place you can use a regular variable.
Arrays in awk have special capabilities that are lacking in most other languages: They are dynamic, they are sparse, and the subscript is actually a string. You don't have to declare a variable to be an array, and you don't have to define the maximum number of elements; when you use an element for the first time, it is created dynamically. Because of this, a block of memory is not initially allocated. In most other languages, if you want to accumulate sales for each month in a year, 12 elements are allocated, even if you are only processing December at the moment. awk arrays are sparse; if you are working with December, only that element will exist, not the other 11 (empty) months.
In my experience, the last capability, the subscript being a string, is the most useful. In most programming languages, if you want to accumulate data based on a string (like totaling sales by state or country), you need to have two arrays: the state or country name (a string) and the numeric sales array. You search the state or country name array for a match and then use the same element of the sales array. awk performs this for you. You create an element in the sales array with the state or country name as the subscript and address it directly like the following:
total_sales["Pennsylvania"] = 10.15
This takes much less programming and is much easier to read (and maintain) than the method of searching one array and changing another. This is known as an associative array.
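As a minimal sketch of totaling by a string subscript, the following reads three made-up input lines (a state name and an amount) and accumulates the amounts in total_sales. The input data and the name totals_out are assumptions.

```shell
# Each state name becomes a subscript the first time it is seen.
totals_out=$(printf 'Pennsylvania 10.15\nDelaware 4.00\nPennsylvania 2.35\n' |
awk '{ total_sales[$1] += $2 }
     END { printf "%.2f %.2f\n", total_sales["Pennsylvania"], total_sales["Delaware"] }')
echo "$totals_out"
```

No search loop is needed: the state name in the first field addresses the running total directly.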
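However, awk does not directly support multidimensional arrays; it simulates them by joining the subscripts into a single string, separated by the value of the SUBSEP variable.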
gawk provides two constructs specifically for use with arrays: the in operator and the delete statement. in tests for membership in an array; delete removes elements from an array.
If you have an array subscripted by state name and want to determine if a specific state is in the list, you would put the following within a conditional test (more about conditional tests in the "Conditional Flow" section):
"Delaware" in total_sales
You can also use the in operator within a loop to step through the elements in an array (especially if the array is sparse or associative). This is a special case of the for loop and is described in the section "The for statement," later in the chapter.
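Both uses of in can be sketched as follows. The sample values and the names count, state, and in_out are illustrative. The loop only counts the elements because the order in which a for...in loop visits them is not specified.

```shell
# A membership test with in does not create the element being tested.
in_out=$(awk 'BEGIN {
    total_sales["Pennsylvania"] = 10.15
    total_sales["Delaware"] = 4.00
    if ("Delaware" in total_sales) print "Delaware present"
    if (!("Ohio" in total_sales)) print "Ohio absent"
    count = 0
    for (state in total_sales) count++    # visits every subscript once
    print count
}')
echo "$in_out"
```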
To delete an array element (the state of Delaware, for example), you code the following:
delete total_sales["Delaware"]
CAUTION |
When an array element is deleted, it has been removed from memory. The data is no longer available. |
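A short sketch of the effect: after the delete, a membership test no longer finds the subscript. The value assigned and the name delete_out are assumptions.

```shell
# After delete, the element no longer exists in the array.
delete_out=$(awk 'BEGIN {
    total_sales["Delaware"] = 4.00
    delete total_sales["Delaware"]
    if ("Delaware" in total_sales)
        print "still there"
    else
        print "gone"
}')
echo "$delete_out"
```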
It is always good practice to delete elements in an array, or entire arrays, when you are done with them. Although memory is cheap and large quantities are available (especially with virtual memory), you will eventually run out if you don't clean up.
NOTE |
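In older versions of awk, you must loop through all the array elements and delete each one, because an entire array cannot be deleted directly with: delete total_sales. Current versions of gawk do accept that statement and remove every element at once. |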
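The element-by-element approach can be sketched as a for...in loop that deletes each subscript in turn, followed by a count to confirm that the array is empty. The sample data and the names state, n, and clear_out are illustrative.

```shell
# Empty an array one element at a time; this works in any awk.
clear_out=$(awk 'BEGIN {
    total_sales["Pennsylvania"] = 10.15
    total_sales["Delaware"] = 4.00
    for (state in total_sales)
        delete total_sales[state]
    n = 0
    for (state in total_sales) n++    # nothing is left to visit
    print n
}')
echo "$clear_out"
```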