You can control the case sensitivity of gawk regular expressions with the IGNORECASE variable. When it is zero (the default), pattern matching is case sensitive. If you set it to a nonzero value, case is ignored, so the letter A matches the letter a.
The variable NF is set after each record is read and contains the number of fields. The fields are determined by the FS or FIELDWIDTHS variables.
The variable NR contains the total number of records read so far. It is never less than FNR, which counts records in the current file only and is reset to zero at the start of each file.
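As a minimal sketch of how NF, NR, and FNR behave together, the following prints all three for every record of two small input files. The file names one.txt and two.txt and their contents are made up for this illustration.

```shell
# Create two small, hypothetical input files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/one.txt"
printf 'f\n' > "$tmpdir/two.txt"

# For every record, print FNR (record number within the current file),
# NR (record number overall), and NF (number of fields in the record).
result=$(awk '{ print FNR, NR, NF }' "$tmpdir/one.txt" "$tmpdir/two.txt")
echo "$result"

rm -rf "$tmpdir"
```

The third line of output shows FNR back at 1 for the first record of the second file, while NR has continued to 3.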
The default output format for numbers is stored in OFMT and defaults to the format string "%.6g". See the section "printf" for more information on the meaning of the format string.
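The effect of OFMT can be seen with a short sketch that prints the same value twice, once with the default format and once with a two-decimal format. The value 3.14159265 and the shell variable names are chosen only for illustration.

```shell
# Print a number with the default OFMT ("%.6g").
default_out=$(awk 'BEGIN { print 3.14159265 }')

# Print the same number after changing OFMT to two decimal places.
custom_out=$(awk 'BEGIN { OFMT = "%.2f"; print 3.14159265 }')

echo "$default_out"
echo "$custom_out"
```

With the default format the value prints with six significant digits; with "%.2f" it is rounded to two decimal places.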
The output field separator is contained in OFS with a default of space. This is the character
or string that is output whenever you use a comma with the print statement, such as the
following:
{print $1, $2, $3;}
This statement prints the first three fields of each input record separated by spaces. If you want to separate them by colons (like the /etc/passwd file), you simply set OFS to a new value: OFS=":".
You can change the output record separator by setting ORS to a new value. ORS defaults to the newline character (\n).
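The following sketch sets both output separators in a BEGIN block, so they are in place before the first record is printed. The input line and the "---" record separator are invented for the example.

```shell
# OFS replaces each comma in the print statement; ORS ends each printed record.
result=$(printf 'root x 0\n' |
    awk 'BEGIN { OFS = ":"; ORS = "\n---\n" } { print $1, $2, $3 }')
echo "$result"
```

The fields come out joined by colons, and the record is followed by a line of dashes instead of a plain newline.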
The length of any string matched by the match() function call is stored in RLENGTH. This is used in conjunction with the RSTART predefined variable to extract the matched string.
You can change the input record separator by setting RS to a new value. RS defaults to the newline character (\n).
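As a small illustration of RS, the sketch below reads a single line of semicolon-separated text as three records. The input text is an assumption made for the example.

```shell
# Treat the semicolon, not the newline, as the end of each input record.
result=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$result"
```

Each semicolon-delimited piece is numbered by NR and printed on its own line, because ORS is still the newline.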
The starting position of any string matched by the match() function call is stored in RSTART. This is used in conjunction with the RLENGTH predefined variable to extract the matched string.
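RSTART and RLENGTH are typically handed straight to substr() to pull out the matched text. Here is a minimal sketch of that idiom; the sample string and the pattern are made up for illustration.

```shell
result=$(awk 'BEGIN {
    s = "call 555-1212 now"
    # match() returns nonzero on success and sets RSTART and RLENGTH.
    if (match(s, /[0-9]+-[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}')
echo "$result"
```

The match begins at character 6 and is 8 characters long, so substr() returns 555-1212.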
The SUBSEP variable contains the value used to separate subscripts for multidimension arrays. The default value is "\034" (octal 034, the ASCII file separator control character), a nonprinting character that is unlikely to appear in your data.
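To see SUBSEP at work, the sketch below stores one element under a two-part subscript and then splits the stored key back into its parts. The array name grid and the subscripts are hypothetical.

```shell
result=$(awk 'BEGIN {
    grid[1, 2] = "x"                 # stored under the key 1 SUBSEP 2
    for (key in grid) {
        split(key, part, SUBSEP)     # recover the two subscripts
        print part[1], part[2], grid[key]
    }
    if (SUBSEP == "\034")
        print "default SUBSEP is octal 034"
}')
echo "$result"
```

Because awk joins the subscripts into a single string key, split() with SUBSEP as the delimiter is the usual way to get them back.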
NOTE
If you change a field ($1, $2, and so on) or the input record ($0), you will cause other predefined variables to change. If your original input record had two fields and you set $3="third one", then NF would be changed from 2 to 3.
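The situation described in the note can be reproduced with a short sketch. The two-field input line is made up for the example.

```shell
# Assign to a field beyond the current NF, then inspect NF and $0.
result=$(printf 'one two\n' | awk '{ $3 = "third one"; print NF; print $0 }')
echo "$result"
```

NF becomes 3, and $0 is rebuilt from the fields, joined with the output field separator.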
awk supports two general types of variables: numeric (which can consist of the characters 0 through 9, + or -, and the decimal point [.]) and character (which can contain any character). Variables that contain characters are generally referred to as strings. A character string can contain a valid number, text like words, or even a formatted phone number. If the string contains a valid number, awk can automatically convert and use it as if it were a numeric variable. If you attempt to use a string that contains a formatted phone number as a numeric variable, awk converts only the leading numeric portion of the string; when the string does not begin with a number, as in "(555) 123-4567", the result is zero.
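A quick way to see how awk converts strings to numbers is to add zero to them. The sketch below does this for three made-up strings: a plain number, a phone number that starts with digits, and a phone number that starts with a parenthesis.

```shell
result=$(awk 'BEGIN {
    a = "42"                # a valid number
    b = "555-1212"          # begins with digits
    c = "(555) 123-4567"    # does not begin with a number
    # Adding zero forces each string to be used as a number.
    print a + 0, b + 0, c + 0
}')
echo "$result"
```

Only the leading numeric part of each string is used, so the second string yields 555 and the third yields 0.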
A string constant is always enclosed within the double quotes ("") and can be from zero (an empty string) to many characters long. The exact maximum varies by version of UNIX; personally, I have never hit the maximum. The double quotes aren't stored in memory. A typical string constant might look like the following:
"UNIX Unleashed, Second Edition"
You have already seen string constants used earlier in this chapter with comparisons and the print statement.
There is really only one string operator and that is concatenation. You can combine multiple strings (constants or variables in any combination) by just putting them together. Listing 27.1 does this with the print statement where the string ": " is prepended to the input record ($0).
Listing 27.3 shows a couple ways to concatenate strings.
Listing 27.3. Concatenating strings example.
gawk 'BEGIN{x="abc""def"; y="ghi"; z=x y; z2 = "A"x"B"y"C"; print x, y, z, z2}'
abcdef ghi abcdefghi AabcdefBghiC
Variable x is set to two concatenated strings; it prints as abcdef. Variable y is set to one string for use with the variable z. Variable z is the concatenation of two string variables printing as abcdefghi. Finally, the variable z2 shows the concatenation of string constants and string variables printing as AabcdefBghiC.
If you leave the commas out of the print statement, all the strings are concatenated together and the output looks like the following:
abcdefghiabcdefghiAabcdefBghiC
In addition to the one string operation (concatenation), gawk provides a number of functions for processing strings.
Table 27.5 summarizes the built-in string functions in gawk. Earlier versions of awk don't support all these functions.
Table 27.5. gawk built-in string functions.
Function | Purpose |
gsub(reg, string, target) | Substitutes string in target string every time the regular expression reg is matched |
index(string, search) | Returns the position of the search string in string |
length(string) | The number of characters in string |
match(string, reg) | Returns the position in string that matches the regular expression reg |
printf(format, variables) | Writes formatted data based on format; variables is the data you want printed |
split(string, store, delim) | Splits string into array elements of store based on the delimiter delim |
sprintf(format, variables) | Returns a string containing formatted data based on format; variables is the data you want placed in the string |
strftime(format, timestamp) | Returns a formatted date or time string based on format; timestamp is the time returned by the systime() function |
sub(reg, string, target) | Substitutes string in target string the first time the regular expression reg is matched |
substr(string, position, len) | Returns a substring beginning at position for len number of characters |
tolower(string) | Returns the characters in string as their lowercase equivalent |
toupper(string) | Returns the characters in string as their uppercase equivalent |
The gsub(reg, string, target) function globally substitutes string for every part of target that matches the regular expression reg. The function returns the number of substitutions made. If target is omitted, the input record, $0, is the target. This is patterned after the substitute command in the ed text editor.
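Here is a minimal sketch of gsub() working on the input record, with target omitted. The input line and the replacement text are invented for the example.

```shell
# Replace every "aaa" in $0 with "X" and report how many replacements were made.
result=$(printf 'aaa bbb aaa\n' | awk '{ n = gsub(/aaa/, "X"); print n, $0 }')
echo "$result"
```

The return value, saved here in n, is 2, and $0 itself has been changed in place.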
The index(string, search) function returns the first position (counting from the left, starting at 1) of the search string within string. If the search string is not found, 0 is returned.
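The following sketch calls index() twice, once with a search string that is present and once with one that is not. Note that the string being searched comes first. Both sample strings are made up.

```shell
# The first call finds "bar" starting at character 4; the second finds nothing.
result=$(awk 'BEGIN { print index("foobar", "bar"), index("foobar", "z") }')
echo "$result"
```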
The length(string) function returns a count of the number of characters in string. awk keeps track of the length of strings internally.
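As a brief illustration, length() can be applied to a string constant or to the input record. The sample strings here are assumptions made for the example.

```shell
# Length of a string constant.
const_len=$(awk 'BEGIN { print length("UNIX Unleashed") }')

# Length of each input record.
record_len=$(printf 'hello\n' | awk '{ print length($0) }')

echo "$const_len"
echo "$record_len"
```

The space in the string constant counts as a character; the newline that ends the input record does not, because it is the record separator and is not part of $0.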