Awk

1. Introduction to AWK

AWK is an interpreted programming language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems.

AWK was created at Bell Labs in the 1970s, and its name is derived from the family names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. The name is pronounced like that of the bird, the auk, which serves as an emblem of the language (for example, on the cover of The AWK Programming Language).

Reference:
AWK Wikipedia: http://en.wikipedia.org/wiki/AWK
The GNU Awk User’s Guide: https://www.gnu.org/software/gawk/manual/
The AWK Programming Language

2. Basic Usage

awk [-Fs] 'program' optional list of filenames
awk [-Fs] -f programfile optional list of filenames

The option -Fs sets the field separator variable FS to s. If there are no filenames, the standard input is read.
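
As a small illustration of -Fs, the sketch below splits colon-separated records and reads them from standard input because no filename is given. The sample records are made up for this example.

```shell
# Use ":" as the field separator and print the first and third fields.
printf 'root:x:0\nbin:x:1\n' | awk -F: '{ print $1, $3 }'
# root 0
# bin 1
```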

3. AWK programs

An awk program is a sequence of pattern-action statements and function definitions.
A pattern-action statement has the form:

pattern { action }

An omitted pattern matches all input lines; an omitted action prints a matched line.

3.1. Basic Processing Flow

Every input line is tested against each of the patterns in turn. For each pattern that matches, the corresponding action (which may involve multiple steps) is performed. Then the next line is read and the matching starts over. This continues until all the input has been read.
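
The flow described above can be observed with a program that has two patterns. In this sketch (the input numbers are arbitrary), the line "2" satisfies both patterns, so both actions run for it, in program order, before the next line is read.

```shell
printf '1\n2\n3\n' | awk '
$1 > 1 { print "greater than 1:", $1 }
$1 < 3 { print "less than 3:", $1 }
'
# less than 3: 1
# greater than 1: 2
# less than 3: 2
# greater than 1: 3
```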

3.2. Patterns

Table 1 summarizes the patterns supported by awk.

Table 1: Summary of patterns in AWK
Pattern                  Description
BEGIN                    The actions are executed once before any input has been read.
END                      The actions are executed once after all input has been read.
expression               The actions are executed at each input line where the expression is true, that is, nonzero or nonnull.
/regexpr/                Matches when the current input line contains a substring matched by regexpr. It is the same as $0 ~ /regexpr/.
expression ~ /regexpr/   Matches if the string value of expression contains a substring matched by regexpr.
expression !~ /regexpr/  Matches if the string value of expression does not contain a substring matched by regexpr.
compound pattern         A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and parentheses; the actions are executed at each input line where the compound pattern is true.
pattern1, pattern2       A range pattern matches each input line from a line matched by pattern1 to the next line matched by pattern2, inclusive.
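
A minimal sketch combining three of these pattern kinds: BEGIN, a range pattern, and END. The words start and stop are arbitrary markers chosen for this example; the range includes both the line that opens it and the line that closes it.

```shell
printf 'a\nstart\nb\nc\nstop\nd\n' | awk '
BEGIN           { print "begin" }
/start/, /stop/ { print NR ": " $0 }
END             { print "lines read:", NR }
'
# begin
# 2: start
# 3: b
# 4: c
# 5: stop
# lines read: 6
```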

Example: print the whole line when a column matches a given pattern (here, when column 1 is exactly abc).

$ cat file1
abc 123
abcd 456
abc 789
$ awk '$1 ~ /^abc$/ {print $0}' file1
abc 123
abc 789

By default, awk splits each line on spaces and tabs. $1, $2, $3, and so on denote the 1st, 2nd, 3rd, and subsequent fields of a line, while $0 denotes the whole line.
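
A short sketch of the default splitting, using a made-up input line: runs of spaces and tabs count as a single separator, and leading whitespace does not produce an empty first field.

```shell
# The input has leading spaces, a tab, and repeated spaces between words.
printf '  alpha \t beta   gamma\n' | awk '{ print NF; print $2 }'
# 3
# beta
```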

3.3. Actions

An action is a sequence of statements of the following kinds:

break
continue
delete array-element
do statement while (expression)
exit [expression]
expression
if (expression) statement [else statement]
input-output statement
for (expression; expression; expression) statement
for (variable in array) statement
next
return [expression]
while (expression) statement
{ statements }

3.3.1. Variable naming rules

The names of user-defined variables are sequences of letters, digits, and underscores that do not begin with a digit; all built-in variables have upper-case names.

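Note: variable names may contain only letters, digits, and underscores; hyphens and other characters are not allowed.

Variables are referenced by name alone, without a leading $. In the example below it would be wrong to write print $var1; print $var2, because $var1 denotes the field whose number is the value of var1.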

$ echo "hello world" | awk '{var1=$2; var2=$1; print var1; print var2}'
world
hello

3.3.2. break, continue, next

break
    immediately leave the innermost enclosing while, for, or do
continue
    start the next iteration of the innermost enclosing while, for, or do
next
    start the next iteration of the main input loop
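
As a sketch of next, the program below skips comment lines and blank lines; for those lines the remaining pattern-action statements are never tried. The input is invented for the example.

```shell
printf '# header\nfoo\n\nbar\n' | awk '
/^#/    { next }   # skip comment lines
NF == 0 { next }   # skip blank lines
        { print NR, $0 }
'
# 2 foo
# 4 bar
```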

3.3.3. print statement

The print statement has two forms:

print expr1, expr2, ... , exprn
print(expr1, expr2, ... , exprn)

Both forms print the string value of each expression separated by the output field separator (OFS) followed by the output record separator (ORS).

The statement print is an abbreviation for print $0.
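
The following sketch shows the role of OFS. Expressions separated by commas are joined with OFS, whereas expressions written side by side are simply concatenated. Assigning to a field (here the no-op $1 = $1) makes awk rebuild $0 with OFS, which then shows up in a plain print.

```shell
printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { print $1, $2; print $1 $2; $1 = $1; print }'
# a-b
# ab
# a-b-c
```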

3.4. Comments

Comments may be inserted at the end of any line. A comment starts with the character # and finishes at the end of the line.

4. Array in AWK

In Awk, arrays are associative, i.e. an array contains multiple index/value pairs.

Arrays and array elements need not be declared, nor is there any need to specify how many elements an array has.

Syntax:
arrayname[string]=value
• arrayname is the name of the array.
• string is the index of the element.
• value is the value assigned to that element.
$ cat array-assign.awk
BEGIN {
    item[101]="HD Camcorder";
    item[102]="Refrigerator";
    item[103]="MP3 Player";
    item["1001"]="Tennis Ball";
    item[55]="Laptop";
    item["na"]="Not Available";
    print item["101"];
    print item[102];
    print item["103"];
    print item[1001];
    print item[55];
    print item["na"];
}

$ awk -f array-assign.awk
HD Camcorder
Refrigerator
MP3 Player
Tennis Ball
Laptop
Not Available

Note 1: From awk's point of view, the index of an array is always a string. Even when you pass a number as the index, awk treats it as a string index. The following two statements are equivalent.

item[101]="HD Camcorder"
item["101"]="HD Camcorder"

Note 2: Array indexes need not be sequential, and they need not start from 0 or 1. In the example above, the indexes run from 101 to 103, jump to 1001, drop to 55, and finally include the string index "na".

Reference: Sed and Awk 101 Hacks

4.1. Test element

You can check whether a particular array index "indexname" exists by using the following if condition syntax. This will return true, if the index "indexname" exists in the array.

if ( "indexname" in arrayname ) ...

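Note: do not use the statement below to test whether the element arrayname["indexname"] exists in the array arrayname, because it has a side effect: if the element does not exist, the reference creates arrayname["indexname"].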

if ( arrayname["indexname"] != "" ) ...
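
The side effect can be made visible by counting elements after each kind of test. In this sketch the array name a and the index "k" are arbitrary.

```shell
awk 'BEGIN {
    if ("k" in a) print "present"        # membership test, no side effect
    n = 0; for (i in a) n++
    print "after in-test:", n
    if (a["k"] != "") print "present"    # the reference creates a["k"]
    n = 0; for (i in a) n++
    print "after reference:", n
}'
# after in-test: 0
# after reference: 1
```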

4.2. Scanning all elements

If you want to access all the array elements, you can use a special instance of the for loop to go through all the indexes of an array:

for (var in arrayname)
    print arrayname[var]

var can be any variable, which holds the index of array.
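
A common use of this loop is counting occurrences, sketched below with made-up input. The order in which for (var in arrayname) visits the indexes is not specified, so the output is piped through sort to make it stable.

```shell
printf 'a b a\nb a\n' | awk '
{ for (i = 1; i <= NF; i++) count[$i]++ }       # one counter per distinct word
END { for (w in count) print w, count[w] }
' | sort
# a 3
# b 2
```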

4.3. Delete element

An array element may be deleted with delete arrayname[subscript]

For example, this loop removes all the elements from the array arrayname:

for (i in arrayname)
    delete arrayname[i]

There is also a simpler way to delete all the elements of an array, although very old awk implementations may not support it: leave off the subscript in the delete statement, as follows:

delete arrayname

For many years, using delete without a subscript was a common extension. In September 2012, it was accepted for inclusion into the POSIX standard. See the Austin Group website.

A third way to clear out an array is portable but nonobvious:

split("", arrayname)
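
A quick check of the split idiom, with an arbitrary two-element array: splitting the empty string yields zero fields, and split first removes whatever the target array held.

```shell
awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    split("", a)                  # clear the array portably
    n = 0; for (k in a) n++
    print n
}'
# 0
```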

4.4. Sort array

GNU awk contains two array sorting functions:

asort(source [, dest [, how ] ])
asorti(source [, dest [, how ] ])

Both functions return the number of elements in the array source. For asort(), gawk sorts the values of source and replaces the indices of the sorted values of source with sequential integers starting with one. If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the indices of source unchanged.

For example, if the contents of array a are as follows:

a["last"] = "de"
a["first"] = "sac"
a["middle"] = "cul"

Calling 'asort(a)' would yield:

a[1] = "cul"
a[2] = "de"
a[3] = "sac"

The asorti() function works similarly to asort(); however, the indices are sorted, instead of the values. Thus, in the previous example, starting with the same initial set of indices and values in a, calling 'asorti(a)' would yield:

a[1] = "first"
a[2] = "last"
a[3] = "middle"

References:
https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
https://www.gnu.org/software/gawk/manual/gawk.html#Array-Sorting-Functions

4.5. Multidimensional array

A multidimensional array is an array in which an element is identified by a sequence of indices instead of a single index.

$ cat array-multi3.awk
BEGIN {
  item["1,1"]=10;
  item["1,2"]=20;
  item[2,1]=30;
  item[2,2]=40;
  for (x in item)
    print "Index",x,"contains",item[x];
}

$ awk -f array-multi3.awk
Index 1,1 contains 10
Index 1,2 contains 20
Index 2X1 contains 30  # X is non-printable character "\034"
Index 2X2 contains 40  # X is non-printable character "\034"

In the above example:
Indexes "1,1" and "1,2" are enclosed in quotes, so each is treated as a one-dimensional array index. Awk uses no subscript separator, and the index is printed as is.
Indexes 2,1 and 2,2 are not enclosed in quotes, so each is treated as a multi-dimensional array index and awk joins the subscripts with the subscript separator. The indexes are therefore "2\0341" and "2\0342", which print with the non-printable character "\034" between the subscripts.

Reference: Sed and Awk 101 Hacks

4.5.1. SUBSEP - Subscript Separator

The default value of SUBSEP is the string "\034", which contains a nonprinting character that is unlikely to appear in an awk program or in most input data. The usefulness of choosing an unlikely character ("\034") comes from the fact that index values that contain a string matching SUBSEP can lead to combined strings that are ambiguous. Suppose that SUBSEP is "@"; then 'foo["a@b", "c"]' and 'foo["a", "b@c"]' are indistinguishable because both are actually stored as 'foo["a@b@c"]'.

You can change the default subscript separator to anything you like using the SUBSEP variable.

$ cat array-multi5.awk
BEGIN {
  SUBSEP=":";            # change subscript separator to ":".
  item[1,1]=10;
  item[1,2]=20;
  item[2,1]=30;
  item[2,2]=40;
  for (x in item)
    print "Index",x,"contains",item[x];
}

$ awk -f array-multi5.awk
Index 1:1 contains 10
Index 1:2 contains 20
Index 2:1 contains 30
Index 2:2 contains 40
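
Whatever value SUBSEP has, a combined index can be taken apart again with split, and membership can be tested with a parenthesized subscript list. This is a minimal sketch; the array name part is a helper introduced only for this example.

```shell
awk 'BEGIN {
    item[2,1] = 30
    for (x in item) {
        split(x, part, SUBSEP)           # recover the individual subscripts
        print part[1], part[2], item[x]
    }
    if ((2,1) in item) print "found"     # test without creating the element
}'
# 2 1 30
# found
```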

5. Built-in variables

Table 2 lists the built-in variables supported by awk.

Table 2: Built-in variables in AWK
Variable  Description                                              Default
ARGC      number of command-line arguments                         -
ARGV      array of command-line arguments (ARGV[0 .. ARGC-1])      -
FILENAME  name of current input file                               -
FNR       input record number in current file                      -
FS        input field separator                                    " "
NF        number of fields in current input record                 -
NR        input record number since beginning                      -
OFMT      output format for numbers                                "%.6g"
OFS       output field separator                                   " "
ORS       output record separator                                  "\n"
RLENGTH   length of string matched by regular expression in match  -
RS        input record separator                                   "\n"
RSTART    beginning position of string matched by match            -
SUBSEP    separator for array subscripts of form [i, j, ...]       "\034" (non-printing character)

The current input record is named $0. The fields in the current input record are named $1, $2, ... , $NF.
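
A small sketch of NR, NF, and $NF on two made-up input lines:

```shell
# Print the record number, the field count, and the last field of each line.
printf 'a b\nc d e\n' | awk '{ print NR, NF, $NF }'
# 1 2 b
# 2 3 e
```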

5.1. ARGC, ARGV

ARGC and ARGV include the name of the invoking program (usually awk) but not the awk program itself (the program text or -f progfile) or the command-line options.

For example, with the command line

awk -f progfile a v=1 b

ARGC has the value 4, ARGV[0] contains awk, ARGV[1] contains a, ARGV[2] contains v=1, and ARGV[3] contains b.

Another example, with the command line

awk -F'\t' '$3 > 100' countries

ARGC is 2, ARGV[0] is awk, ARGV[1] is countries.
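
The first example can be checked with a program that prints its own arguments. Because this sketch contains only a BEGIN action, awk does not try to open a or b, so the files need not exist. ARGV[0] is left out of the loop since its exact value depends on the implementation that is installed.

```shell
awk 'BEGIN {
    print ARGC
    for (i = 1; i < ARGC; i++) print i, ARGV[i]
}' a v=1 b
# 4
# 1 a
# 2 v=1
# 3 b
```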

6. Functions

For the built-in functions of awk, see: GNU Awk Built-in Functions

6.1. Built-in String Functions

Table 3 lists the string functions supported by awk.

Table 3: Built-in string functions
Function                                     GNU ext.  Description
asort(source [, dest [, how]])               Y         Array sorting. Sorts the values of source and replaces the indices of the sorted values of source with sequential integers starting with one. If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the indices of source unchanged.
asorti(source [, dest [, how]])              Y         Array sorting. The asorti() function works similarly to asort(); however, the indices are sorted, instead of the values.
gensub(regexp, replacement, how [, target])  Y         General substitution. Search the target string target for matches of the regular expression regexp. If how is a string beginning with 'g' or 'G' (short for "global"), then replace all matches of regexp with replacement. Otherwise, how is treated as a number indicating which match of regexp to replace. If no target is supplied, use $0. It returns the modified string as the result of the function and the original target string is not changed.
gsub(regexp, replacement [, target])         -         Global substitution. Replace every match of regexp in target with replacement. If no target is supplied, use $0. Return the number of substitutions made.
index(s, t)                                  -         Return first position of string t in s, or 0 if t is not present.
length(s)                                    -         Return number of characters in s.
match(s, r)                                  -         Test whether s contains a substring matched by r; return its index, or 0 if there is no match. Sets RSTART and RLENGTH.
split(s, a [, fs])                           -         Split s into array a on field separator fs; return the number of fields. If there is no third argument, FS is used.
sprintf(fmt, expr-list)                      -         Return expr-list formatted according to the format string fmt.
strtonum(string)                             Y         Examine string and return its numeric value.
sub(r, s [, t])                              -         Substitute s for the leftmost longest substring of t matched by r; return the number of substitutions made. If no t is supplied, use $0.
substr(s, p [, n])                           -         Return substring of s of length n starting at position p. If length n is not present, returns the whole suffix of s that begins at position p.
tolower(string)                              -         Return a copy of string, with each uppercase character in the string replaced with its corresponding lowercase character.
toupper(string)                              -         Return a copy of string, with each lowercase character in the string replaced with its corresponding uppercase character.

Note that positions in awk start at 1 (not 0). For example:

$ echo "abc" | awk '{print index($0, "a")}'
1
$ echo "abc" | awk '{print index($0, "b")}'
2
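
The sketch below exercises several of the portable functions from Table 3 on a made-up string. Note that gsub modifies its target in place and returns a count, while match reports its result through RSTART and RLENGTH.

```shell
awk 'BEGIN {
    s = "hello world"
    print length(s)                 # 11
    print index(s, "o")             # 5 (positions start at 1)
    print substr(s, 7)              # world
    n = gsub(/o/, "0", s)           # replaces both matches in s
    print n, s                      # 2 hell0 w0rld
    if (match(s, /w[a-z0-9]+/))
        print RSTART, RLENGTH       # 7 5
    n = split("a:b:c", part, ":")
    print n, part[3]                # 3 c
}'
```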

6.2. getline

The built-in function getline gives explicit control over how awk reads its input. getline returns 1 on success, 0 at end of file, and -1 on error.
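
Besides reading the next input record, getline can read one line from the output of a command into a variable. The sketch below uses that form to show the return value being checked; the command echo hello and the variable name greeting are chosen only for illustration. The parentheses around the getline expression matter, because without them the comparison would not apply to getline's return value.

```shell
awk 'BEGIN {
    cmd = "echo hello"
    if ((cmd | getline greeting) > 0)   # 1 means a line was read
        print "read:", greeting
    close(cmd)
}'
# read: hello
```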

6.2.1. Example: print only the lines between two patterns

Suppose we have the following data:

abc
123

==>
hello
world
<==


xyz
==>
awk is good
bad
awk is fantastic
<==

We want to print the content between "==>" and "<==", but skip any line whose content is "bad". That is, we want to print:

hello
world
awk is good
awk is fantastic

The following awk script does the job:

{
    if ($0 == "==>") {
       while ((getline tmp) > 0 ) {
            if (tmp == "<==") {
                break;
            }
            if (tmp != "bad") {
               print tmp
            }
        }
    }
}
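
For comparison, the same task can be done without getline by keeping a state flag. This is an alternative sketch, not the script above; the variable name inside is an assumption of this example, and the sample input is a shortened version of the data.

```shell
printf 'abc\n==>\nhello\nbad\nworld\n<==\nxyz\n' | awk '
$0 == "==>" { inside = 1; next }   # opening marker: start printing
$0 == "<==" { inside = 0; next }   # closing marker: stop printing
inside && $0 != "bad"              # no action, so the line is printed
'
# hello
# world
```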

6.2.2. Example: remove C-style comments

The following awk script removes C-style comments:

# Remove text between /* and */, inclusive
{
    if ((i = index($0, "/*")) != 0) {
        out = substr($0, 1, i - 1)  # leading part of the string
        rest = substr($0, i + 2)    # ... */ ...
        j = index(rest, "*/")       # is */ in trailing part?
        if (j > 0) {
            rest = substr(rest, j + 2)  # remove comment
        } else {
            while (j == 0) {
                # get more text
                if (getline <= 0) {
                    print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
                    exit
                }
                # build up the line using string concatenation
                rest = rest $0
                j = index(rest, "*/")   # is */ in trailing part?
                if (j != 0) {
                    rest = substr(rest, j + 2)
                    break
                }
            }
        }
        # build up the output line using string concatenation
        $0 = out rest
    }
    print $0
}

6.3. User-defined Functions

User-defined functions are declared with the keyword function. For example:

# Returns minimum number
function min(num1, num2){
    return num1 < num2 ? num1 : num2
}

BEGIN {
    print(min(10, 20))
}

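6.3.1. Variables in functions are global (except parameters)

In awk, variables used inside a function are global; function parameters, however, are local. For example: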

function foo(j)              # Not recommended! A fix follows below.
{
    j = j + 1                # j is a parameter, so it is local; the global j is unaffected
    i = j                    # careful: this modifies the global variable i
}

BEGIN {
    i = 10
    j = 20
    print "top's i=" i
    print "top's j=" j
    foo(100)
    print "top's i=" i
    print "top's j=" j
}

The code above prints the following (the global variable i has been modified by foo):

top's i=10
top's j=20
top's i=101
top's j=20

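How can a variable used inside a function (such as i in foo above) be made local to that function? The mechanism awk provides is to declare local variables in the parameter list. For example: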

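function foo(j,      i)      # Local variables go in the parameter list, separated from the real parameters by extra spaces (a convention suggested by the awk authors)
{
    j = j + 1                # j is in the parameter list, so it is local
    i = j                    # i is in the parameter list, so it is local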
}

BEGIN {
    i = 10
    j = 20
    print "top's i=" i
    print "top's j=" j
    foo(100)
    print "top's i=" i
    print "top's j=" j
}

The code above prints the following (the global variable i is no longer modified by foo):

top's i=10
top's j=20
top's i=10
top's j=20
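
The same convention works for loop counters and accumulators. In this sketch the function sum_to is a hypothetical helper written for this example: i and total are extra parameters that callers never pass, so they start out empty on every call and do not touch the global i.

```shell
awk '
function sum_to(n,    i, total) {      # i and total are locals
    for (i = 1; i <= n; i++) total += i
    return total
}
BEGIN { i = "outer"; print sum_to(4), i }
'
# 10 outer
```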

7. Regular Expressions in AWK

The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs).

7.1. Dynamic Regexp (using a variable on the right-hand side of the match operator ~)

First, a simple example to review the use of a regexp constant (i.e., a string of characters between slashes):

$ cat file1
123
abcd
xyz
$ awk '$0 ~ /^[[:digit:]]+$/ { print }' file1         # here /^[[:digit:]]+$/ is a regexp constant (enclosed in slashes)
123

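Sometimes we want to use a variable on the right-hand side of the match operator ~. This is a dynamic regexp (also called a computed regexp). For example, the example above can also be written with a dynamic regexp: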

$ awk 'BEGIN { digits_regexp = "^[[:digit:]]+$" } $0 ~ digits_regexp { print }' file1   # dynamic regexp example
123
$ awk -v regexp='^[[:digit:]]+$' '$0 ~ regexp { print }' file1                          # dynamic regexp example (same as above)
123

Note that a dynamic regexp must not be enclosed in slashes (slashes are the notation for a regexp constant). For example, the following usage is wrong:

$ awk -v regexp='/^[[:digit:]]+$/' '$0 ~ regexp { print }' file1             # Wrong! The correct form is regexp='^[[:digit:]]+$'
awk: syntax error in regular expression /^[[:digit:]]+$/ at [[:digit:]]+$/
 input record number 1, file file1
 source line number 1

Reference: Using Dynamic Regexps
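
One more point to watch with dynamic regexps is backslashes. A string constant is processed once as a string before it is used as a regexp, so a backslash intended for the regexp has to be doubled. In this sketch (made-up input, hypothetical variable name re), the string "\\." yields the regexp \. which matches a literal dot.

```shell
printf '1.5\n15\n1x5\n' | awk 'BEGIN { re = "^[0-9]+\\.[0-9]+$" } $0 ~ re'
# 1.5
```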

8. Useful "One-liners"

The following one-liners come from the book The AWK Programming Language.

Print the total number of input lines:

END { print NR }

Print the tenth input line:

NR == 10

Print the last field of every input line:

{ print $NF }

Print the last field of the last input line:

{ field = $NF }
END { print field }

Print every input line with more than four fields:

NF > 4

Print every input line in which the last field is more than 4:

$NF > 4

Print the total number of fields in all input lines:

{ nf = nf + NF }
END { print nf }

Print the total number of lines that contain Beth:

/Beth/ { nlines = nlines + 1 }
END { print nlines }

Print the largest first field and the line that contains it (assumes some $1 is positive):

$1 > max { max = $1; maxline = $0 }
END { print max, maxline }

Print every line that has at least one field:

NF > 0

Print every line longer than 80 characters:

length($0) > 80

Print the number of fields in every line followed by the line itself:

{ print NF, $0 }

Print the first two fields, in opposite order, of every line:

{ print $2, $1 }

Exchange the first two fields of every line and then print the line:

{ temp = $1; $1 = $2; $2 = temp; print }

Print every line with the first field replaced by the line number:

{ $1 = NR; print }

Print every line after erasing the second field:

{ $2 = ""; print }

Print in reverse order the fields of every line:

{ for (i = NF; i > 0; i = i - 1) printf("%s ", $i)
  printf ("\n")
}

Print the sums of the fields of every line:

{ sum = 0
  for (i = 1; i <= NF; i = i + 1) sum = sum + $i
  print sum
}

Add up all fields in all lines and print the sum:

{ for (i = 1; i <= NF; i = i + 1) sum = sum + $i }
END { print sum }

Print every line after replacing each field by its absolute value:

{ for (i = 1; i <= NF; i = i + 1) if ($i < 0) $i = -$i
  print
}
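
Each program above is given as bare awk text. To try one from the shell, wrap it in single quotes and supply input, as in this sketch with made-up data:

```shell
# Print the total number of input lines.
printf 'x y\nBeth 1\nz\n' | awk 'END { print NR }'
# 3

# Print the last field of every input line.
printf 'x y\nBeth 1\nz\n' | awk '{ print $NF }'
# y
# 1
# z
```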

Author: cig01

Created: <2012-12-08 Sat>

Last updated: <2017-12-13 Wed>
