Awk - Description, flags and examples


Awk logical operators

Flags

Operator Usage
&& expr1 && expr2
  • logical AND : true if both expr1 and expr2 are true
  • expr2 is evaluated only if expr1 is true
|| expr1 || expr2
  • logical OR : true if at least one of expr1 or expr2 is true
  • expr2 is evaluated only if expr1 is false
! ! expr
  • logical NOT : true if expr is false
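
A minimal sketch of the short-circuit behaviour described above. touch() is a hypothetical helper written for this example only : it counts how many times it is actually called.

```shell
# Short-circuit demo: touch() counts how often it is actually called.
result=$(awk '
function touch() { calls++; return 1 }
BEGIN {
	if (0 && touch()) print "never printed"   # touch() skipped: expr1 is false
	if (1 || touch()) print "OR is true"      # touch() skipped: expr1 is true
	if (!0)           print "NOT is true"
	print "touch() calls: " (calls + 0)
}')
echo "$result"
```

Since touch() is never evaluated, the counter should still be 0 at the end.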

Error : awk: not an option: -i

Situation

I'm trying to execute an Awk command with the -i flag but I get the error :
awk: not an option: -i

Details

The -i flag belongs to GNU Awk and is not available for the "regular" awk program.

Solution

Install GNU Awk :
apt install gawk
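
To check which implementation answers to awk before (or after) installing gawk, something like the sketch below may help. It only assumes that gawk mentions "GNU Awk" in its --version output; other implementations may reject the option, hence the stderr redirection.

```shell
# "awk --version" is understood by gawk; other implementations may reject it.
if awk --version 2>/dev/null </dev/null | grep -q 'GNU Awk'; then
	flavor='GNU Awk'
else
	flavor='not GNU Awk'
fi
echo "awk is: $flavor"
```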

Awk examples

An example is worth a thousand words, so enjoy your reading !

Tutorials & basic examples :

convert sleep durations into human-readable durations (related: numbers with trailing unit letter)

for unit in '' s m; do
	for duration in 0 1 2 10 100; do
		value="$duration$unit"
		echo -n "'$value'."
		awk '
			/^[0-9]+$/	{ if($0 < 2) print $0 " second"; else print $0 " seconds"; }
			/^[0-9]+[ms]$/	{
				number=strtonum($0)
				unit=gensub(/^[0-9]+([ms])/, "\\1", "g");

				switch (unit) {
					case /m/:
						longUnit="minute"
						break
					case /s/:
						longUnit="second"
						break
					}
				 if(number > 1) print number" "longUnit"s"; else print number" "longUnit   ;
				 }
			' <<< "$value"
	done
done | column -s '.' -t
'0'     0 second
'1'     1 second
'2'     2 seconds
'10'    10 seconds
'100'   100 seconds
'0s'    0 second
'1s'    1 second
'2s'    2 seconds
'10s'   10 seconds
'100s'  100 seconds
'0m'    0 minute
'1m'    1 minute
'2m'    2 minutes
'10m'   10 minutes
'100m'  100 minutes

a dummy script with gensub, if else, length and logical operators

cat << EOF | awk '
BEGIN	{ followingWord=""; myVariable=""; }
/FOO/	{ followingWord=gensub(/.*FOO ([^ ]+).*/, "\\1", "g") }
/BAR/	{ myVariable="A" }
/BAZ/	{ myVariable="B" }
		{ if((length(followingWord) > 0) && (length(myVariable) > 0)) {
			print followingWord" ("length(followingWord)") "myVariable" ("length(myVariable)")."
			followingWord=""; myVariable="";
			}
		}
'
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut FOO labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex BAR ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore BAZ eu fugiat nulla pariatur.
EOF
labore (6) A (1).

List "enabled" repositories :

The URL linked below lists repositories :
[docker-ce-stable]
name=Docker CE Stable - $basearch
baseurl=https://download.docker.com/linux/centos/$releasever/$basearch/stable
enabled=1
gpgcheck=1
gpgkey=https://download.docker.com/linux/centos/gpg

[docker-ce-stable-source]
name=Docker CE Stable - Sources
baseurl=https://download.docker.com/linux/centos/$releasever/source/stable
enabled=0
gpgcheck=1
gpgkey=https://download.docker.com/linux/centos/gpg

I want to list the IDs of those having enabled=1 :
curl -s https://download.docker.com/linux/centos/docker-ce.repo | awk '
	BEGIN		{ repoId=""; }
	/^ *$/		{ repoId=""; }
	/^\[.*\]/	{ repoId=$0; }
	/^enabled=1/	{ if (repoId != "") { result=gensub(/[\[\]]/, "", "g", repoId); print result; } }'
docker-ce-stable
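
An alternative sketch (not the method used above) relying on Awk's "paragraph mode" : with RS set to the empty string, each blank-line-separated block becomes one record, and with FS set to a newline each line of the block becomes one field. The repo-a / repo-b data is made up for the example, and it avoids gensub so that it should also run on a non-GNU awk.

```shell
# Made-up sample data in the same format as the .repo file above.
result=$(printf '%s\n' \
	'[repo-a]' 'name=Repo A' 'enabled=1' '' \
	'[repo-b]' 'name=Repo B' 'enabled=0' |
awk '
BEGIN { RS = ""; FS = "\n" }    # 1 record = 1 block, 1 field = 1 line
{
	for (i = 2; i <= NF; i++)
		if ($i == "enabled=1")
			print substr($1, 2, length($1) - 2)   # strip the surrounding [ ]
}')
echo "$result"
```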

Find which is the link and which is the target of a symlink :

The snippet below serves just to play with Awk since it can be replaced by the much more efficient command readlink.
while read line; do
	echo -e "\t$line"
	echo "$line" | awk '{ link=$1; target=""; for(i=2; i<=NF; i++) { if($i!="->") link=link" "$i; else break; } for(j=i+1; j<NF; j++) target=target""$j" "; target=target""$NF; print " LINK: \047"link"\047, TARGET: \047"target"\047"; }'
	echo
done < <(find -type l -exec ls -l {} + | awk '{ for (i=1; i<9; i++) $i=""; print $0; }')
Explanations :
find -type l -exec ls -l {} + | awk '{ for (i=1; i<9; i++) $i=""; print $0; }'
turn this :
lrwxrwxrwx 1 kevin developers 57 Apr 5 11:03 'path/to/link' -> 'path/to/target'
into this :
path/to/link -> path/to/target
i.e. remove the metadata shown by ls -l, which is made of the first 8 fields of the line.
\047
code to let Awk's print display single quotes ' (see comments of this answer)
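
A small sketch comparing two ways of printing single quotes from a single-quoted Awk program : the \047 octal escape used above, and a variable holding the quote (q is a name picked for this example).

```shell
# 1) octal escape \047   2) pass the quote in through a variable
result1=$(awk 'BEGIN { print "\047quoted\047" }')
result2=$(awk -v q="'" 'BEGIN { print q "quoted" q }')
echo "$result1"
echo "$result2"
```

Both commands should print the word surrounded by single quotes.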

Make music with awk

Run this script :
#!/usr/bin/env bash

value1=120
# initial value : 160
# tested values : 200, 140, 100
#	decreasing from the initial value : more bass sounds

value2=0.5678
# initial value : 0.87055
# tested values : 0.747, 0.777, 0.789
#	increasing values above 3.xxx : extreme bass sounds (?), hardly audible
#	around 0.5xxxxx : nice chime sounds

value3=13
# initial value : 10
# tested values : 13, 17, 26
#	increasing values : more high-pitched sounds
#	26 makes some 'D2-R2' blips

value4=128	# no effect so far :-(
# initial value : 128

awk "function wl() {
		rate=64000;
		return (rate/$value1)*($value2^(int(rand()*$value3)))};
	BEGIN {
		srand();
		wla=wl();
		while(1) {
			wlb=wla;
			wla=wl();
			if (wla==wlb)
				{wla*=2;};
			d=(rand()*10+5)*rate/4;
			a=b=0; c=$value4;
			ca=40/wla; cb=20/wlb;
			de=rate/10; di=0;
			for (i=0;i<d;i++) {
				a++; b++; di++; c+=ca+cb;
				if (a>wla)
					{a=0; ca*=-1};
				if (b>wlb)
					{b=0; cb*=-1};
				if (di>de)
					{di=0; ca*=0.9; cb*=0.9};
				printf(\"%c\",c)};
			c=int(c);
			while(c!=$value4) {
				c<$value4?c++:c--;
				printf(\"%c\",c)};};}" | aplay -r 64000

Awk internal keywords

BEGIN
a BEGIN rule is executed once only, before the first input record is read (example)
BEGINFILE
see ENDFILE
END
an END rule is executed once only, after all the input is read (example)
ENDFILE
This is a gawk extension. The ENDFILE rule :
  • is called when gawk has finished processing the last record in an input file. For the last input file, it will be called before any END rules
  • is executed even for empty input files
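  • makes it possible to catch I/O errors : when an ENDFILE rule is present, a read error is no longer fatal and ERRNO is set instead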
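
A minimal sketch of the BEGIN / main / END sequence, restricted to the portable keywords (BEGINFILE and ENDFILE are left out since they need gawk). The variable n is just a counter introduced for the example.

```shell
result=$(printf 'a\nb\nc\n' | awk '
BEGIN { print "start" }          # runs before the first record is read
      { n++ }                    # runs once per input record
END   { print n " records" }     # runs after the last record
')
echo "$result"
```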

awk vs gawk

The standard Debian setup comes with /usr/bin/awk (usually provided by mawk rather than by the original awk), which has basic / limited functionality. Once gawk is installed :
ls -l $(which awk)
lrwxrwxrwx 1 root root 21 Oct 11 2016 /usr/bin/awk -> /etc/alternatives/awk*
md5sum /etc/alternatives/awk $(which gawk)
23a5b5a3d9ba0d2c6277dbdaf2557033	/etc/alternatives/awk
23a5b5a3d9ba0d2c6277dbdaf2557033	/usr/bin/gawk

Once gawk is installed, it can be invoked with awk.


Awk internal variables

$n
the nth element of the current line ($0 being the whole line itself) :
for i in {1..4}; do echo 'a b c d' | awk '{print "Item '$i' of line \""$0"\" is "$'$i'"."}'; done
FILENAME
  • name of the current input file
  • - when reading from standard input
  • empty string inside a BEGIN rule
FS
Field Separator. Can be set with -F
NF
  • number of fields in the current line :
    echo 'a b c' | awk '{print NF}'; echo 'joe jack william averell' | awk '{print NF}';
    3
    4
  • It is often used to refer to the last field of a line :
    echo 'a b c' | awk '{print $NF}'; echo 'joe jack william averell' | awk '{print $NF}';
    c
    averell
NR
number of records processed so far (which can be approximated to the number of the current row, starting at 1) :
for i in {a..e}; do echo $i; done | awk '{ print "line "NR":\t" $0}'
line 1: a
line 2: b
line 3: c
line 4: d
line 5: e
OFS
Output Field Separator. It is automatically inserted between fields by print. Defaults to a single space.
There is no dedicated CLI flag for it : it is set within the program (or with -v OFS=...) :
  • echo {a..z} | awk '{OFS="."; print $1,$3,$5,$7}'
    a.c.e.g
  • No need to repeat the definition for every line of input :
    echo {a..z} | awk 'BEGIN{OFS="PLOP"} {print $1,$3,$5,$7}'
    aPLOPcPLOPePLOPg
RS
Record Separator
  • defaults to \n (NEWLINE) : by default Awk considers 1 record == 1 line of input
  • gawk also accepts a regular expression
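
To see RS, FS, OFS, NR and NF working together, here is a small sketch on made-up input where records are separated by ";" and fields by ",". Single-character separators are used so that no gawk-specific behaviour is needed.

```shell
result=$(printf 'a,b;c,d' | awk 'BEGIN { RS = ";"; FS = ","; OFS = "-" } { print NR, NF, $1, $2 }')
echo "$result"
```

Each output line holds the record number, the number of fields, then both fields, all joined with OFS.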

Tailor files / strings / substrings with Awk / Bash / PERL / sed

awk

Extract specific fields from log files :

  • awk '$9 == "searchedKeyword" {print $7}' file.log | sort | uniq -c | sort -nr | head -n 10
  • awk '$6 ~ "30." {print $5" "$6}' file | ...

~ is the Awk operator to match a regular expression.

Bash (source)

Replace a substring :

  • ${string/substring/replacement} : replace 1st occurrence
  • ${string//substring/replacement} : replace all occurrences
  • myString='Hello World'; echo ${myString//[eo]/ab} : outputs Habllab Wabrld

Test whether a string matches a RegExp (source) :

testString='Hello World'; if [[ $testString =~ ^.*o.*o.*$ ]]; then echo "MATCHES"; else echo "DOESN'T MATCH"; fi

PERL

Apply a regExp to a string :

  • perl -e '$ARGV[0]=~ m/..(.)/; print $1' abcdef
  • echo AZERqsdfWXCV | xargs perl -e '$ARGV[0]=~ m/.{4}(.{4}).*(.)$/; print "$1 $2"'

sed

Extract (in CSV format) URL + hit/miss + generation time from a Varnish log :

sed -r 's/.*GET ([^ ]*).*(hit|miss) ([0-9.]*).*/\1;\2;\3/' access.log > result.log

Extract (in CSV format) URL + HTTP error code from Lighttpd log :

sed -r 's/^.*GET ([^ ]*).*HTTP\/1\.1" ([0-9]*).*$/\1;\2;/' /var/log/lighttpd/www.example.com.log > result.log

Same as above with HTTP 500 errors only + sorting results by descending number of occurrences :

logFile='/var/log/lighttpd/www.example.com.log'; resultFile='./result.csv'; tmpFile=$(mktemp --tmpdir tmp.result.XXXXXXXX); grep '" 500 ' $logFile | sed -r 's/^.*GET ([^ ]*).*HTTP\/1\.." ([0-9]*).*$/\1;\2;/' > $tmpFile; cat $tmpFile | sort | uniq -c | sort -nr > $resultFile; rm $tmpFile

grep runs first because sed prints non-matching lines unchanged, whereas we only want to report HTTP 500 errors.
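
As a possible alternative (a sketch, not the command used above), sed -n combined with the p flag of the s command only prints lines on which the substitution happened, which may make the preliminary grep unnecessary. The two log lines are made up for the example, and -E is used instead of -r on the assumption that the local sed accepts it.

```shell
# Two made-up access-log lines, only the first one is an HTTP 500.
result=$(printf '%s\n' \
	'1.2.3.4 - - [d] "GET /a HTTP/1.1" 500 12' \
	'1.2.3.4 - - [d] "GET /b HTTP/1.1" 200 34' |
sed -n -E 's/^.*GET ([^ ]*).*HTTP\/1\.." (500) .*$/\1;\2;/p')
echo "$result"
```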

Extract (in CSV format) several fields from Apache logs stored in a year/month/day directory tree :

resultFile="$HOME/result.csv"; tmpFile=$(mktemp --tmpdir tmp.XXXXXXXX); csvHeader='web server;IP;HTTP method;URL used by method;full URL;'; echo $csvHeader > $tmpFile; logFilePath='/path/to/logfiles/'; startYear='2013'; endYear='2013'; startMonth='04'; endMonth='04'; startDay='01'; endDay='18'; for year in $(seq $startYear $endYear); do for month in $(seq $startMonth $endMonth); do for day in $(seq $startDay $endDay); do [ ${#month} -eq 1 ] && month='0'$month; [ ${#day} -eq 1 ] && day='0'$day; logFile=$logFilePath/$year/$month/$day/$year$month$day'-access.log'; echo "PROCESSING $logFile ..."; grep 'example.com' $logFile | grep -v 'GET' | sed -r 's/^.*(webServer(1|2)).* ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) .*\] "([A-Z]*) (.*) HTTP.*" [0-9]+ [0-9]+ "([^"]+)".*/;\1;\3;\4;\5;\6/' | sort | uniq >> $tmpFile; done; done; done; cat $tmpFile | sort | uniq -c | sort -nr >> "$resultFile"; rm $tmpFile

awk

Usage

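Awk is a programmable text filter. Its input can be one or more files given as arguments, or the standard input (typically data piped from another command).

Output : the standard output, unless redirected.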

An Awk script is made of 3 blocks :

  1. pre-process : BEGIN
  2. process
  3. post-process : END

Awk reads the input line by line, then applies the specified filter(s) to decide whether or not to process the current line. Before processing a line, Awk splits it into fields and stores the field values in $1 (1st field), $2, ..., $NF (last field). $0 is the whole input line. The field separator (specified with FS) defaults to any run of [SPACE] and/or [TAB] characters (details).

There is no need to use grep together with Awk, as Awk can itself select the lines to process with a regular expression.
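
For instance, where one might be tempted to write grep '^error' | awk '{print $2}', the selection and the field extraction can be done in a single Awk command. The three input lines are made up for the example.

```shell
result=$(printf 'error disk full\ninfo all good\nerror cpu hot\n' | awk '/^error/ { print $2 }')
echo "$result"
```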

Filters :

Criteria select matching lines select not matching lines
line number within input
awk 'NR==n {doSomething}'
echo -e 'a\nb\nc\nd' | awk 'NR==3'
c
echo -e 'a\nb\nc\nd' | awk 'NR==3 {next}; {print}'
a
b
d
line vs regular expression
awk '/regEx/ {doSomething}'
  • echo -e 'foo\nbar\nbaz' | awk '/bar/ {print $0}'
    bar
  • echo -e 'foo\nPool ID : 1234\nbar\nID du pool : 4321\nbaz' | awk '/(Pool ID|ID du pool)/ {print $NF}'
    1234
    4321
awk '!/regEx/ {doSomething}'
echo -e 'foo\nbar\nbaz' | awk '!/a/ {print $0}'
foo
(source, example)
line vs number of fields
echo -e 'field1\nfield1\tfield2\nfield1\tfield2\tfield3' | awk 'NF == 2 {print $0}'
field1	field2
field vs number
echo -e 'foo\t12\nbar\t34\nbaz\t56' | awk '$2 > 25 {print $0}'
bar	34
baz	56
Awk is smart enough to strip leading zeroes :
echo {01..10} | awk '$3 > 2 { print "ok" }'
ok
echo {01..10} | awk '$3 > 3 { print "ok" }'
(void)
echo {01..10} | awk '$3 >= 3 { print "ok" }'
ok
echo {0001..10} | awk '$3 >= 3 { print "ok" }'
ok
Trying to filter data based on line numbers returned by grep -n with a construct like :
grep -n --color=always [options] | awk -F ':' '$n > x {doSomething}'
may fail because of the returned color codes.
  • echo -e 'FOO\nBAR\nBAZ' | grep -n --color=always '.A' | awk -F ':' '$1>2 {print $0}'
  • echo -e 'FOO\nBAR\nBAZ' | grep -n '.A' | awk -F ':' '$1>2 {print $0}'
field vs string
awk '$n == "value" {doSomething}'
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 == "bar2" {print $0}'
foo2 bar2 baz2
awk '$n != "value" {doSomething}'
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 != "bar2" {print $0}'
foo1 bar1 baz1
foo3 bar3 baz3
field vs regular expression
awk '$n ~ /regEx/ {doSomething}'
  • for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 ~ /a.1/ {print $0}'
    foo1 bar1 baz1
  • find the shortest path :
    echo -e "bla dir1/\nbla dir1/dir2/\nbla dir1/dir2/dir3/" | awk '$NF ~ /^[^/]*\/$/ {print $NF}'
    dir1/
awk '$n !~ /regEx/ {doSomething}'
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 !~ /a.1/ {print $0}'
foo2 bar2 baz2
foo3 bar3 baz3
field vs regular expression with if / else construct (source)
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '{ if($2 ~ "a.2") {print "MATCH : "$2 } else {print "NO MATCH"} }'
NO MATCH
MATCH : bar2
NO MATCH
several conditions
awk 'condition1 logicalOperator condition2 logicalOperator ... conditionN {doSomething}'
logicalOperator can be (source) :
  • && : logical AND
  • || : logical OR
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 ~ "^ba.." && $3 == "baz3" {print $0}'
foo3 bar3 baz3
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$1 ~ "1$" || $3 ~ "3$" {print $0}'
foo1 bar1 baz1
foo3 bar3 baz3
for i in {6..22}; do echo "a b c d e f g h $i"; done | awk '$NF==7 || $NF==21 {print $0}'
a b c d e f g h 7
a b c d e f g h 21
echo | awk '1==1 && (2==1 || 3==3) { print "ok" }'
ok
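
The filters above can also be combined with each other. As a small sketch, the line-number criterion and the logical operators give a way to select (or exclude) a range of lines; with no action given, Awk prints the matching lines.

```shell
# lines 2 to 3 only
selected=$(printf 'a\nb\nc\nd\n' | awk 'NR >= 2 && NR <= 3')
# everything except lines 2 to 3
excluded=$(printf 'a\nb\nc\nd\n' | awk '!(NR >= 2 && NR <= 3)')
echo "$selected"
echo "$excluded"
```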

Numerical field with trailing unit letter or text (related: convert sleep durations into human-readable durations) :

If the numerical value has a unit letter, it doesn't work anymore :

echo -e "foo\t8U\nbar\t34U\nbaz\t56U" | awk '$2 > 25 {print $0}'
foo	8U	ooops !
bar	34U
baz	56U
solution :
echo -e "foo\t8U\nbar\t34U\nbaz\t56U" | awk 'strtonum($2) > 25 {print $0}'
bar	34U
baz	56U
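
strtonum() is a gawk function. Where only a "regular" awk is available, a possible workaround (a different technique from the one above) is to force a numeric conversion by adding 0 to the field : Awk then keeps the leading numeric part of the string and ignores the rest.

```shell
result=$(printf 'foo\t8U\nbar\t34U\nbaz\t56U\n' | awk '$2 + 0 > 25 { print $0 }')
echo "$result"
```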

Try it :

  • df -h | awk 'strtonum($5) > 75 {print $0}'
  • df -h | awk '{ usage=$5; gsub(/%/, "", usage); if (strtonum(usage) > 50) print $0 }'
strtonum() looks smart enough to handle trailing units (source) :
awk 'BEGIN {
	print "trailing unit (single letter) : " strtonum("123U")
	print "trailing unit (word) : " strtonum("123potatoes")
	print "leading unit (single letter) : " strtonum("Y123")
	print "leading unit (word) : " strtonum("banana123")
	}'
trailing unit (single letter) : 123	OK
trailing unit (word) : 123		OK
leading unit (single letter) : 0	KO
leading unit (word) : 0			KO

Flags

Flag Usage
-F x use x as the input Field separator
  • x can be several characters long : echo 'GAABCDBUABCDZOABCDMEU' | awk -F 'ABCD' '{print $1,$2,$3,$4}'
  • default field separator : any run of spaces and/or tabs and/or newlines (excluding leading and trailing runs) (details)
-i awkLibrary
--include awkLibrary
load the specified library awkLibrary (example)
-v variable=value
  • declare a variable (example)
  • use multiple -v to declare several variables : -v variable1=value1 -v variable2=value2
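
A short sketch combining -F and -v on made-up /etc/passwd-like lines. minUid is a variable name chosen for the example, not something Awk defines.

```shell
result=$(printf '%s\n' 'root:x:0:0' 'alice:x:1000:1000' 'bob:x:1001:1001' |
	awk -F ':' -v minUid=1000 '$3 >= minUid { print $1 }')
echo "$result"
```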

Exit Status

Example

Process log files :

Count occurrences of an error message in a log file :

This code removes the [10-Oct-2012 18:15:46 UTC] fields from every logfile line. This is why Awk is taught to display all fields starting from the 4th :
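awk '/^\[/ { for (i=4;i<=NF;i++) printf "%s ", $i; print "" }' logFile

printf adds no newline after printing. print does. Each field is passed as an argument to a "%s" format rather than used as the format string itself, so that a % character found in the log is not interpreted by printf.

Then count occurrences :
awk '/^\[/ { for (i=4;i<=NF;i++) printf "%s ", $i; print "" }' logFile | sort | uniq -c | sort -nr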

From a multiple-fields line, display fields starting from the 4th :

In a log file such as :
[13-Nov-2013 03:03:35 Europe/Paris] PHP Warning: Memcached::touch(): ... in ....php on line 45
[13-Nov-2013 03:04:42 Europe/Paris] PHP Warning: file_get_contents(http://...): HTTP/1.0 404 Not Found in ...php on line 202
...
let's say you'd like to remove the date/time field to group and count similar errors. To do so :

awk '{ for (i=1;i<=3;i++) $i="";print }' file.log | awk '{sub(/^[ \t]+/, ""); print}' | sort | uniq -c | sort -nr

  • the 1st Awk command replaces the first 3 fields with an empty string, so that the line only contains the remaining fields, starting from the 4th as required
  • the 2nd Awk command just removes leading whitespace (source)
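
Both steps can probably be merged into a single Awk command, since sub() works on the rebuilt $0 by default. This is only a sketch on three made-up log lines; the result can then be piped into sort | uniq -c | sort -nr as above.

```shell
result=$(printf '%s\n' \
	'[13-Nov-2013 03:03:35 Europe/Paris] PHP Warning: foo' \
	'[13-Nov-2013 03:04:42 Europe/Paris] PHP Warning: foo' \
	'[13-Nov-2013 03:05:00 Europe/Paris] PHP Notice: bar' |
awk '{ $1 = $2 = $3 = ""; sub(/^ +/, ""); print }')   # blank fields 1-3, then trim
echo "$result" | sort | uniq -c | sort -nr
```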

Match a keyword from a variable :

You can't use a variable name within the // operator to select the matching line :
  • DON'T : echo -e 'apple\nbanana\ncarrot' | awk -v letterToMatch='b' '/letterToMatch/'
  • DO : echo -e 'apple\nbanana\ncarrot' | awk -v letterToMatch='b' '$1 ~ letterToMatch'
for httpCode in 301 302 304; do echo -n "Code $httpCode : "; awk -v needle="$httpCode" '$6 ~ needle {print " "}' logFile | wc -l; done

Examples to illustrate the above :

  • echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '/stuffToMatch/'
    (nothing)
    echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$0 ~ /stuffToMatch/'
    (nothing)
    no match found, as said above
  • echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' 'stuffToMatch'
    fruit: apple
    fruit: banana
    vegetable: carrot
    matches everything
  • echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$0 ~ stuffToMatch'
    fruit: banana
    vegetable: carrot
    echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$1 ~ stuffToMatch'
    vegetable: carrot
    echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$2 ~ stuffToMatch'
    fruit: banana
    All examples work as expected

Quotes or not around the value to match ?

  • match the string "a"
    echo -e '1a\n2b\n3a\n4b' | awk '$0 ~ "a"'
    1a
    3a
  • match the unset variable a
    echo -e '1a\n2b\n3a\n4b' | awk '$0 ~ a'
    1a
    2b
    3a
    4b
  • same as above, but explicit :
    echo -e '1a\n2b\n3a\n4b' | awk -v a='' '$0 ~ a'
    1a
    2b
    3a
    4b
  • now setting the variable :
    echo -e '1a\n2b\n3a\n4b' | awk -v a='a' '$0 ~ a'
    1a
    3a

When regex-matching a string with a variable :

  • run this pseudo-code :
    if input =~ "foo.=A"
    	continue
    else
    	print input
    echo -e 'foo1=A\nfoo2=B\nfoo3=C' | awk -v value='A' '$0 ~ "foo.="value {next} {print}'
  • other example :
    for i in {A..C}; do for j in {A..C}; do echo "$i$j"; done; done | awk -v value='A' '$0 ~ "("value"|B)C"'
    AC
    BC
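
One thing to keep in mind with the ~ operator : the variable is interpreted as a regular expression, so characters such as . keep their special meaning. When a literal substring match is wanted, index() is a possible alternative (it returns 0 when the substring is not found). A sketch with a made-up needle :

```shell
# "a.c" as a regular expression also matches "abc"
resultRegex=$(printf 'a.c\nabc\n' | awk -v needle='a.c' '$0 ~ needle')
# "a.c" as a literal substring only matches "a.c"
resultLiteral=$(printf 'a.c\nabc\n' | awk -v needle='a.c' 'index($0, needle)')
echo "$resultRegex"
echo "$resultLiteral"
```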

Selecting PIDs :

ps --ppid 1 | awk '/d$/ {print $1}'
Lists processes whose parent's PID is 1, then selects processes whose name ends in 'd', and prints the corresponding PID, which is the line field #1.

Specifying the field separator :

awk -F ':' '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd

List all ports and PIDs on which a MongoDB instance is listening :

netstat -laputen | awk '/mongo/ {print "IP:port = "$4"\tPID = "$9}' | sort | uniq

Select non-empty lines :

echo -e 'A\tB\tC\tD\nE\tF\tG\tH\n\nI\tJ\tK\tL' | awk '!/^$/ {print $3}'

C
G
K
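
A variant worth knowing (a sketch, not a replacement for the command above) : using NF alone as the filter selects lines having at least one field, which also skips lines made of spaces / tabs only. The sample input below contains one empty line and one whitespace-only line.

```shell
result=$(printf 'A\tB\tC\n\n \t \nD\tE\tF\n' | awk 'NF { print $3 }')
echo "$result"
```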

Dark wizardry ?

This awk command made me scratch my head quite a bit : it returns fields from 2 distinct lines, which are not even contiguous.
To figure this out, I simplified it, and let the magic happen :
echo -e 'key1\tvalue1\nkey2\tvalue2' | awk '/key1|key2/ { printf $2 " " }'
value1 value2

Explanation :

  • the input (either an echo, a line "piped" in, or a whole file) is perfectly "normal" : there is no hack regarding field separators or end of line markers.
  • the /key1|key2/ part of the awk command is a "normal" regular expression alternation
  • the printf $2 " " part simply prints the 2nd field of each matching line, followed by a space
So what's the trick ?
Let's have a deeper look at how awk works and what we're instructing it to do with the echo | awk command above :
  1. no pre-process, so let's start eating lines and doing things
  2. awk splits the input into distinct lines
  3. awk reads the 1st line : key1\tvalue1
  4. does it match the regular expression ? Yes, so print the 2nd field and a space character : value1 
  5. done with this line, continue with the next line
  6. read the 2nd line : key2\tvalue2
  7. does it match the regular expression ? Yes again, so print the 2nd field and a space character : value2 
  8. The trick is that printf, unlike print, does not automatically add a newline character after printing. So the output of any step is printed right after the output of the previous step. This is why, at this step of the procedure, the output looks like : value1 value2 
  9. done with this line, no next line
  10. no post-process
  11. the end !
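
The steps above can be sketched as follows, with two small changes that are choices made for this example : the field is passed to an explicit "%s " format, and an END rule prints the final newline that printf never emits.

```shell
result=$(printf 'key1\tvalue1\nkey2\tvalue2\n' |
	awk '/key1|key2/ { printf "%s ", $2 } END { print "" }')
echo "[$result]"
```

The brackets in the echo are only there to make the trailing space visible.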