

You can control the case sensitivity of gawk regular expressions with the IGNORECASE variable. When it is set to the default, zero, pattern matching is case sensitive. If you set it to a nonzero value, case is ignored. (The letter A will match the letter a.)

The variable NF is set after each record is read and contains the number of fields. The fields are determined by the FS or FIELDWIDTHS variables.

The variable NR contains the total number of records read. It is never less than FNR, which is reset to zero for each file.

The default output format for numbers is stored in OFMT and defaults to the format string "%.6g". See the section "printf" for more information on the meaning of the format string.

The output field separator is contained in OFS with a default of space. This is the character or string that is output whenever you use a comma with the print statement, such as the following:

{print $1, $2, $3;}

This statement prints the first three fields of each input record separated by spaces. If you want to separate them by colons (like the /etc/passwd file), you simply set OFS to a new value: OFS=":".

You can change the output record separator by setting ORS to a new value. ORS defaults to the newline character (\n).

The length of any string matched by the match() function call is stored in RLENGTH. This is used in conjunction with the RSTART predefined variable to extract the matched string.

You can change the input record separator by setting RS to a new value. RS defaults to the newline character (\n).

The starting position of any string matched by the match() function call is stored in RSTART. This is used in conjunction with the RLENGTH predefined variable to extract the matched string.

The SUBSEP variable contains the value used to separate subscripts for multidimensional arrays. The default value is "\034", a nonprinting control character (the ASCII file separator) that is unlikely to appear in your data.

NOTE
If you change a field ($1, $2, and so on) or the input record ($0), you will cause other predefined variables to change. If your original input record had two fields and you set $3="third one", then NF would be changed from 2 to 3.

Strings

awk supports two general types of variables: numeric (which can consist of the characters 0 through 9, + or -, and the decimal [.]) and character (which can contain any character). Variables that contain characters are generally referred to as strings. A character string can contain a valid number, text like words, or even a formatted phone number. If the string contains a valid number, awk can automatically convert and use it as if it were a numeric variable. If you attempt to use a string that does not begin with a number, such as the formatted phone number "(555) 123-4567", as a numeric variable, awk still converts it, but the result is the value zero.

String Constants

A string constant is always enclosed within the double quotes ("") and can be from zero (an empty string) to many characters long. The exact maximum varies by version of UNIX; personally, I have never hit the maximum. The double quotes aren't stored in memory. A typical string constant might look like the following:

"UNIX Unleashed, Second Edition"

You have already seen string constants used earlier in this chapter—with comparisons and the print statement.

String Operators

There is really only one string operator and that is concatenation. You can combine multiple strings (constants or variables in any combination) by just putting them together. Listing 27.1 does this with the print statement where the string ": " is prepended to the input record ($0).

Listing 27.3 shows a couple of ways to concatenate strings.

Listing 27.3. Concatenating strings example.

gawk 'BEGIN{x="abc""def"; y="ghi"; z=x y; z2 = "A"x"B"y"C"; print x, y, z, z2}'

abcdef ghi abcdefghi AabcdefBghiC

Variable x is set to two concatenated strings; it prints as abcdef. Variable y is set to one string for use with the variable z. Variable z is the concatenation of two string variables printing as abcdefghi. Finally, the variable z2 shows the concatenation of string constants and string variables printing as AabcdefBghiC.

If you leave the commas out of the print statement, all the strings will be concatenated together and will look like the following:

abcdefghiabcdefghiAabcdefBghiC

Built-in String Functions

In addition to the one string operation (concatenation), gawk provides a number of functions for processing strings.

Table 27.5 summarizes the built-in string functions in gawk. Earlier versions of awk don't support all these functions.

Table 27.5. gawk built-in string functions.

Function                        Purpose
gsub(reg, string, target)       Substitutes string in target every time the regular expression reg is matched
index(string, search)           Returns the position of the search string in string
length(string)                  Returns the number of characters in string
match(string, reg)              Returns the position in string that matches the regular expression reg
printf(format, variables)       Writes formatted data based on format; variables is the data you want printed
split(string, store, delim)     Splits string into array elements of store based on the delimiter delim
sprintf(format, variables)      Returns a string containing formatted data based on format; variables is the data you want placed in the string
strftime(format, timestamp)     Returns a formatted date or time string based on format; timestamp is the time returned by the systime() function
sub(reg, string, target)        Substitutes string in target the first time the regular expression reg is matched
substr(string, position, len)   Returns a substring beginning at position for len number of characters
tolower(string)                 Returns the characters in string as their lowercase equivalent
toupper(string)                 Returns the characters in string as their uppercase equivalent

The gsub(reg, string, target) function globally substitutes string for every match of the regular expression reg within target. The number of substitutions is returned by the function. If target is omitted, the input record, $0, is the target. This is patterned after the substitute command in the ed text editor.

The index(string, search) function returns the first position (counting from the left) of the search string within string. If the search string is not found, 0 is returned.

The length(string) function returns a count of the number of characters in string. awk keeps track of the length of strings internally.
