Awk - Description, flags and examples


next

immediately stop processing the current record and go on to the next one
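A minimal sketch : records matching the pattern are skipped, the others are printed :

```shell
# skip any record matching /b/, print the rest
echo -e 'a\nb\nc' | awk '/b/ {next} {print}'
# output :
# a
# c
```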

exit

stop reading input immediately ; the END rule, if any, is still executed
see examples : 1, 2
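A minimal sketch : processing stops at the first record matching the pattern :

```shell
# stop reading input at the first record matching /b/
echo -e 'a\nb\nc' | awk '/b/ {exit} {print}'
# output :
# a
```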

break

jump out of the innermost enclosing for, while or do-while loop
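A minimal sketch :

```shell
# leave the for loop as soon as i reaches 3
echo | awk '{ for (i=1; i<=5; i++) { if (i==3) break; print i } }'
# output :
# 1
# 2
```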

if else

checkPageExists() {
	local page=$1
	curl -sI "$page" | awk '/^HTTP\/1\.1/ {
		if ($2=="200")
			exit 0
		else
			exit 1
		}'
	}

main() {
	
	for page in $pageList; do
		checkPageExists "$page" || continue
		
	done
	}

Awk examples

An example is worth a thousand words, so enjoy your reading !

Tutorials & basic examples :

List "enabled" repositories :

The URL used below lists repositories such as :
[docker-ce-stable]
name=Docker CE Stable - $basearch
baseurl=https://download.docker.com/linux/centos/$releasever/$basearch/stable
enabled=1
gpgcheck=1
gpgkey=https://download.docker.com/linux/centos/gpg

[docker-ce-stable-source]
name=Docker CE Stable - Sources
baseurl=https://download.docker.com/linux/centos/$releasever/source/stable
enabled=0
gpgcheck=1
gpgkey=https://download.docker.com/linux/centos/gpg

I want to list the IDs of those having enabled=1 :
curl -s https://download.docker.com/linux/centos/docker-ce.repo | awk '
	BEGIN		{ repoId=""; }
	/^ *$/		{ repoId=""; }
	/^\[.*\]/	{ repoId=$0; }
	/^enabled=1/	{ if (repoId!="") { result=gensub(/[\[\]]/, "", "g", repoId); print result; } }'
docker-ce-stable

Find which is the link and which is the target in a symlink :

The snippet below is just to play with Awk, since it can be replaced by the much more efficient readlink command.
while read line; do
	echo -e "\t$line"
	echo "$line" | awk '{ link=$1; target=""; for(i=2; i<=NF; i++) { if($i!="->") link=link" "$i; else break; } for(j=i+1; j<NF; j++) target=target""$j" "; target=target""$NF; print " LINK: \047"link"\047, TARGET: \047"target"\047"; }'
	echo
done < <(find -type l -exec ls -l {} + | awk '{ for (i=1; i<9; i++) $i=""; print $0; }')
Explanations :
find -type l -exec ls -l {} + | awk '{ for (i=1; i<9; i++) $i=""; print $0; }'
turn this :
lrwxrwxrwx 1 kevin developers 57 Apr 5 11:03 'path/to/link' -> 'path/to/target'
into this :
path/to/link -> path/to/target
i.e. remove the metadata shown by ls -l, which is made of the first 8 fields of the line.
\047
the octal escape code letting Awk's print display single quotes ' (see comments of this answer)


print, printf, sprintf

print :

  • with no argument, print the whole input line :
    echo -e 'line 1\nline 2\nline 3' | awk '{print}'
    line 1
    line 2
    line 3
  • with 1 argument, print it :
    echo -e 'line 1\nline 2\nline 3' | awk '{print $2}'
    1
    2
    3
  • with more than 1 argument :
    • when the arguments are separated by commas : print arguments separated by SPACE (default) or the specified OFS
    • when the arguments are separated by spaces : print arguments concatenated
    echo -e 'line 1\nline 2\nline 3' | awk '{print $2,$1}'; echo -e 'line 1\nline 2\nline 3' | awk '{print $2 $1}'
    1 line
    2 line
    3 line
    1line
    2line
    3line
  • Appends a newline \n to the output (printf doesn't) :
    echo | awk '{print "Hello world"}'; echo | awk '{printf "Hello world"}'

printf :

  • Doesn't add a trailing \n
  • Supports the C-style printf(string, expression list) syntax :
    echo | awk '{printf("%d is The Answer to The Great Question.", 42)}'

sprintf :

Returns, without printing it, what printf would have printed with the same arguments.
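For instance, to build a formatted string and reuse it :

```shell
# sprintf returns the formatted string instead of printing it
echo | awk '{ s = sprintf("%05.2f", 3.14159); print "[" s "]" }'
# output :
# [03.14]
```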

switch case

Here's a very basic example (switch is gawk-specific ; not-so-perfect, but you'll get the idea) :
echo 'abc' | awk '{
	switch ($0) {
		case /[[:lower:]]+/:
			print "lowercase"
			break
		case /[[:upper:]]+/:
			print "uppercase"
			break
		}
	}'
As a one-liner that can be pasted into the shell :
echo 'abc' | awk '{ switch ($0) { case /[[:lower:]]+/: print "lowercase"; break; case /[[:upper:]]+/: print "uppercase"; break; } }'

Make music with awk

Run this script :
#!/usr/bin/env bash

value1=120
# initial value : 160
# tested values : 200, 140, 100
#	decreasing from the initial value : more bass sounds

value2=0.5678
# initial value : 0.87055
# tested values : 0.747, 0.777, 0.789
#	increasing values above 3.xxx : extreme bass sounds (?), hardly audible
#	around 0.5xxxxx : nice chime sounds

value3=13
# initial value : 10
# tested values : 13, 17, 26
#	increasing values : more high-pitched sounds
#	26 makes some 'D2-R2' blips

value4=128	# no effect so far :-(
# initial value : 128

awk "function wl() {
		rate=64000;
		return (rate/$value1)*($value2^(int(rand()*$value3)))};
	BEGIN {
		srand();
		wla=wl();
		while(1) {
			wlb=wla;
			wla=wl();
			if (wla==wlb)
				{wla*=2;};
			d=(rand()*10+5)*rate/4;
			a=b=0; c=$value4;
			ca=40/wla; cb=20/wlb;
			de=rate/10; di=0;
			for (i=0;i<d;i++) {
				a++; b++; di++; c+=ca+cb;
				if (a>wla)
					{a=0; ca*=-1};
				if (b>wlb)
					{b=0; cb*=-1};
				if (di>de)
					{di=0; ca*=0.9; cb*=0.9};
				printf(\"%c\",c)};
			c=int(c);
			while(c!=$value4) {
				c<$value4?c++:c--;
				printf(\"%c\",c)};};}" | aplay -r 64000

Awk internal keywords

BEGIN
a BEGIN rule is executed once only, before the first input record is read (example)
BEGINFILE
see ENDFILE
END
an END rule is executed once only, after all the input is read (example)
ENDFILE
This is a gawk extension. The ENDFILE rule :
  • is called when gawk has finished processing the last record in an input file. For the last input file, it will be called before any END rules
  • is executed even for empty input files
  • makes it possible to catch errors

system

A basic example

echo -e 'line A\nline B\nline C\nD' | awk '/^line/ { system("echo "$NF) }'
  • this command is absolutely useless : I just needed a dummy working example
  • don't forget that Awk variables must stay outside of quotes

How to store the result of a system command into a variable ? (source)

There are several methods to run a shell command from Awk :

with system(myCommand)

myVariable=system(myCommand) stores the return code of myCommand into myVariable
  • awk 'BEGIN { result1=system("true"); result2=system("false"); print result1" "result2; }'
    0 1

with myCommand | getline

awk 'BEGIN {
	myCommand = "date --iso-8601=seconds"
	myCommand | getline myResultVariable
	close(myCommand)
	print "Current date = "myResultVariable
	}'

Mixing Awk and Bash tests :

[ -e someFile ] && rm someFile; for i in 1 2; do
	echo -e 'hello someFile world' | awk '{ print "\ninput : "$0; if(system("[ -f "$2" ]")) { print $2" exists" } }'
	ls someFile; touch someFile
done; rm someFile
  1. this begins by making sure that the file someFile does not exist
  2. then there's a for loop that runs twice
  3. each run echoes some text to awk via a pipe :
    1. display the input value as-is
    2. make a shell-based test using system
    3. display some text accordingly
  4. ls to confirm someFile exists or not, then it is touched
  5. loop
  6. remove someFile at the very end (Boy scout rule)
input : hello someFile world
someFile exists							mmmkay
ls: cannot access 'someFile': No such file or directory		make up your mind !

input : hello someFile world
someFile							
This is because Awk and Bash disagree on what makes a "success" :
	success		failure
Awk	anything else	0
Bash	0		anything else
  1. when the file exists, [ -f "$2" ] is a Bash success : 0
  2. system returns this code as-is
  3. but for Awk : if(system(Bash success)) is false
    echo | awk '{ if(system("true")) print "ok" }'
    (nothing)
    
    echo | awk '{ if(system("false")) print "ok" }'
    ok						
To make this work, the solution is to negate the result of the Bash test in the original command :
if(system("[ ! -f "$2" ]")) {
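Putting it together, a sketch of the fixed command :

```shell
touch someFile
# with the negated test, "exists" is printed exactly when the file does exist
echo 'hello someFile world' | awk '{ if (system("[ ! -f " $2 " ]")) print $2 " exists" }'
rm someFile
# output :
# someFile exists
```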

awk vs gawk

A standard Debian setup comes with /usr/bin/awk, a symlink managed by the alternatives system, which has basic / limited functionality. Once gawk is installed :
ls -l $(which awk)
lrwxrwxrwx 1 root root 21 Oct 11 2016 /usr/bin/awk -> /etc/alternatives/awk*
md5sum /etc/alternatives/awk $(which gawk)
23a5b5a3d9ba0d2c6277dbdaf2557033	/etc/alternatives/awk
23a5b5a3d9ba0d2c6277dbdaf2557033	/usr/bin/gawk

Once gawk is installed, it can be invoked with awk.


Awk internal variables

$n
the nth element of the current line ($0 being the whole line itself) :
for i in {1..4}; do echo 'a b c d' | awk '{print "Item '$i' of line \""$0"\" is "$'$i'"."}'; done
FILENAME
  • name of the current input file
  • - when reading from standard input
  • empty string inside a BEGIN rule
FS
Field Separator. Can be set with -F
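Both forms below are equivalent :

```shell
# -F on the command line, or assigning FS in a BEGIN rule
echo 'a:b:c' | awk -F ':' '{print $2}'
echo 'a:b:c' | awk 'BEGIN { FS=":" } { print $2 }'
# output (twice) :
# b
```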
NF
  • number of fields in the current line :
    echo 'a b c' | awk '{print NF}'; echo 'joe jack william averell' | awk '{print NF}';
    3
    4
  • It is often used to refer to the last field of a line :
    echo 'a b c' | awk '{print $NF}'; echo 'joe jack william averell' | awk '{print $NF}';
    c
    averell
NR
number of records processed so far (usually this is simply the number of the current line, starting at 1) :
for i in {a..e}; do echo $i; done | awk '{ print "line "NR":\t" $0}'
line 1: a
line 2: b
line 3: c
line 4: d
line 5: e
OFS
Output Field Separator. It is automatically inserted between fields by print. Defaults to a single space.
This is not a CLI flag, it goes into the "action" part :
  • echo {a..z} | awk '{OFS="."; print $1,$3,$5,$7}'
    a.c.e.g
  • No need to repeat the definition for every line of input :
    echo {a..z} | awk 'BEGIN{OFS="PLOP"} {print $1,$3,$5,$7}'
    aPLOPcPLOPePLOPg
RS
Records Separator
  • defaults to \n (NEWLINE) : by default Awk considers 1 record == 1 line of input
  • gawk also accepts a regular expression

Tailor files / strings / substrings with Awk / Bash / PERL / sed

awk

Extract specific fields from log files :

  • awk '$9 == "searchedKeyword" {print $7}' file.log | sort | uniq -c | sort -nr | head -n 10
  • awk '$6 ~ "30." {print $5" "$6}' file | ...

~ is the Awk operator to match a regular expression.

Bash (source)

Replace a substring :

  • ${string/substring/replacement} : replace 1st occurrence
  • ${string//substring/replacement} : replace all occurrences
  • myString='Hello World'; echo ${myString//[eo]/ab} : outputs Habllab Wabrld

Test whether a string matches a RegExp (source) :

testString='Hello World'; if [[ $testString =~ ^.*o.*o.*$ ]]; then echo "MATCHES"; else echo "DOESN'T MATCH"; fi

PERL

Apply a regExp to a string :

  • perl -e '$ARGV[0]=~ m/..(.)/; print $1' abcdef
  • echo AZERqsdfWXCV | xargs perl -e '$ARGV[0]=~ m/.{4}(.{4}).*(.)$/; print "$1 $2"'

sed

Extract (in CSV format) URL + hit/miss + generation time from a Varnish log :

sed -r 's/.*GET ([^ ]*).*(hit|miss) ([0-9.]*).*/\1;\2;\3/' access.log > result.log

Extract (in CSV format) URL + HTTP error code from Lighttpd log :

sed -r 's/^.*GET ([^ ]*).*HTTP\/1\.1" ([0-9]*).*$/\1;\2;/' /var/log/lighttpd/www.example.com.log > result.log

Same as above with HTTP 500 errors only + sorting results by descending number of occurrences :

logFile='/var/log/lighttpd/www.example.com.log'; resultFile='./result.csv'; tmpFile=$(mktemp --tmpdir tmp.result.XXXXXXXX); grep '" 500 ' $logFile | sed -r 's/^.*GET ([^ ]*).*HTTP\/1\.." ([0-9]*).*$/\1;\2;/' > $tmpFile; cat $tmpFile | sort | uniq -c | sort -nr > $resultFile; rm $tmpFile

Using grep 1st because sed can't find a match on every line, as we're reporting only on HTTP 500 errors.

Extract (in CSV format) several fields from Apache logs stored in a year/month/day directory tree :

resultFile='~/result.csv'; tmpFile=$(mktemp --tmpdir tmp.XXXXXXXX); csvHeader='web server;IP;HTTP method;URL used by method;full URL;'; echo $csvHeader > $tmpFile; logFilePath='/path/to/logfiles/'; startYear='2013'; endYear='2013'; startMonth='04'; endMonth='04'; startDay='01'; endDay='18'; for year in $(seq $startYear $endYear); do for month in $(seq $startMonth $endMonth); do for day in $(seq $startDay $endDay); do [ ${#month} -eq 1 ] && month='0'$month; [ ${#day} -eq 1 ] && day='0'$day; logFile=$logFilePath/$year/$month/$day/$year$month$day'-access.log'; echo "PROCESSING $logFile ..."; grep 'example.com' $logFile | grep -v 'GET' | sed -r 's/^.*(webServer(1|2)).* ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) .*\] "([A-Z]*) (.*) HTTP.*" [0-9]+ [0-9]+ "([^"]+)".*/;\1;\3;\4;\5;\6/' | sort | uniq >> $tmpFile; done; done; done; cat $tmpFile | sort | uniq -c | sort -nr >> $resultFile; rm $tmpFile

awk

Usage

Awk is a programmable text filter. Its input can be one or more files, or the standard input (typically fed via a pipe).

Output goes to the standard output.

An Awk script is made of 3 blocks :

  1. pre-process : BEGIN
  2. process
  3. post-process : END

Awk reads the input line by line, then applies the specified filter(s) to detect whether or not to process the current line. Before starting processing a line, Awk splits it into fields and stores fields values in $1 (1st field), $2, ..., $NF (last field). $0 is the whole input line. The fields separator (specified with FS) defaults either to [SPACE] or [TAB] (details).

There is no need to use grep together with Awk, since Awk itself can select the lines to process with a regular expression.
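For instance, both commands below give the same result, so the grep can be dropped :

```shell
echo -e 'foo\nbar\nbaz' | grep 'ba' | awk '{print $1}'
echo -e 'foo\nbar\nbaz' | awk '/ba/ {print $1}'
# output (twice) :
# bar
# baz
```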

Filters :

For each criterion below : how to select matching lines, then how to select non-matching lines.
line number within input
awk 'NR==n {doSomething}'
echo -e 'a\nb\nc\nd' | awk 'NR==3'
c
echo -e 'a\nb\nc\nd' | awk 'NR==3 {next}; {print}'
a
b
d
line vs regular expression
awk '/regEx/ {doSomething}'
  • echo -e 'foo\nbar\nbaz' | awk '/bar/ {print $0}'
    bar
  • echo -e 'foo\nPool ID : 1234\nbar\nID du pool : 4321\nbaz' | awk '/(Pool ID|ID du pool)/ {print $NF}'
    1234
    4321
awk '!/regEx/ {doSomething}'
echo -e 'foo\nbar\nbaz' | awk '!/a/ {print $0}'
foo
(source, example)
line vs number of fields echo -e 'field1\nfield1\tfield2\nfield1\tfield2\tfield3' | awk 'NF == 2 {print $0}'
field1	field2
field vs number echo -e 'foo\t12\nbar\t34\nbaz\t56' | awk '$2 > 25 {print $0}'
bar	34
baz	56
Awk is smart enough to strip leading zeroes :
echo {01..10} | awk '$3 > 2 { print "ok" }'
ok
echo {01..10} | awk '$3 > 3 { print "ok" }'
(void)
echo {01..10} | awk '$3 >= 3 { print "ok" }'
ok
echo {0001..10} | awk '$3 >= 3 { print "ok" }'
ok
Trying to filter data based on line numbers returned by grep -n with a construct like :
grep -n --color=always [options] | awk -F ':' '$n > x {doSomething}'
may fail because of the returned color codes.
  • echo -e 'FOO\nBAR\nBAZ' | grep -n --color=always '.A' | awk -F ':' '$1>2 {print $0}'
    (nothing : $1 starts with a color escape code, so the numeric comparison fails)
  • echo -e 'FOO\nBAR\nBAZ' | grep -n '.A' | awk -F ':' '$1>2 {print $0}'
    3:BAZ
field vs string
awk '$n == "value" {doSomething}'
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 == "bar2" {print $0}'
foo2 bar2 baz2
awk '$n != "value" {doSomething}'
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 != "bar2" {print $0}'
foo1 bar1 baz1
foo3 bar3 baz3
field vs regular expression
awk '$n ~ /regEx/ {doSomething}'
  • for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 ~ /a.1/ {print $0}'
    foo1 bar1 baz1
  • find the shortest path :
    echo -e "bla dir1/\nbla dir1/dir2/\nbla dir1/dir2/dir3/" | awk '$NF ~ /^[^/]*\/$/ {print $NF}'
    dir1/
awk '$n !~ /regEx/ {doSomething}'
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 !~ /a.1/ {print $0}'
foo2 bar2 baz2
foo3 bar3 baz3
field vs regular expression with if / else construct (source) for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '{ if($2 ~ "a.2") {print "MATCH : "$2 } else {print "NO MATCH"} }'
NO MATCH
MATCH : bar2
NO MATCH
several conditions
awk 'condition1 logicalOperator condition2 logicalOperator ... conditionN {doSomething}'
logicalOperator can be (source) :
  • && : logical AND
  • || : logical OR
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$2 ~ "^ba.." && $3 == "baz3" {print $0}'
foo3 bar3 baz3
for i in {1..3}; do echo "foo$i bar$i baz$i"; done | awk '$1 ~ "1$" || $3 ~ "3$" {print $0}'
foo1 bar1 baz1
foo3 bar3 baz3
for i in {6..22}; do echo "a b c d e f g h $i"; done | awk '$NF==7 || $NF==21 {print $0}'
a b c d e f g h 7
a b c d e f g h 21
echo | awk '1==1 && (2==1 || 3==3) { print "ok" }'
ok

Numerical field with trailing unit letter or text :

If the numerical value has a trailing unit letter, the comparison silently becomes a string comparison and gives wrong results :

echo -e "foo\t8U\nbar\t34U\nbaz\t56U" | awk '$2 > 25 {print $0}'
foo	8U	ooops !
bar	34U
baz	56U
solution (strtonum() is gawk-specific) :
echo -e "foo\t8U\nbar\t34U\nbaz\t56U" | awk 'strtonum($2) > 25 {print $0}'
bar	34U
baz	56U

Try it :

strtonum() looks smart enough to handle trailing units (source) :
awk 'BEGIN {
	print "trailing unit (single letter) : " strtonum("123U")
	print "trailing unit (word) : " strtonum("123potatoes")
	print "leading unit (single letter) : " strtonum("Y123")
	print "leading unit (word) : " strtonum("banana123")
	}'
trailing unit (single letter) : 123	OK
trailing unit (word) : 123		OK
leading unit (single letter) : 0	KO
leading unit (word) : 0			KO
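A portable alternative to the gawk-only strtonum() : adding 0 forces a numeric conversion, which also stops at the first non-numeric character :

```shell
echo -e "foo\t8U\nbar\t34U\nbaz\t56U" | awk '$2+0 > 25 {print $0}'
# output :
# bar	34U
# baz	56U
```
Like strtonum(), this converts leading-unit strings such as "Y123" to 0.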

Flags

Flag Usage
-F x use x as the input Field separator
  • x can be several characters long : echo 'GAABCDBUABCDZOABCDMEU' | awk -F 'ABCD' '{print $1,$2,$3,$4}'
  • default field separator : any run of spaces and/or tabs and/or newlines (excluding leading and trailing runs) (details)
-i awkLibrary
--include awkLibrary
load the specified library awkLibrary (example)
-v variable=value
  • declare a variable (example)
  • use multiple -v to declare several variables : -v variable1=value1 -v variable2=value2
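For instance, to pass a shell variable into Awk :

```shell
greeting='Hello'
# the shell variable is passed into the Awk program via -v
echo 'world' | awk -v prefix="$greeting" '{ print prefix " " $1 }'
# output :
# Hello world
```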

Exit Status

Awk's exit status is 0 unless an exit statement supplied another value ; gawk uses 2 to report a fatal error.

Example

Process log files :

Count occurrences of an error message in a log file :

This code removes the leading [10-Oct-2012 18:15:46 UTC] timestamp (the first 3 fields) from every logfile line. This is why Awk is told to display all fields starting from the 4th :
awk '/^\[/ { for (i=4;i<=NF;i++) printf $i " ";print ""}' logFile

printf adds no newline after printing. print does.

Then count occurrences :
awk '/^\[/ { for (i=4;i<=NF;i++) printf $i " ";print ""}' logFile | sort | uniq -c | sort -nr

From a multiple-field line, display fields starting from the 4th :

In a log file such as :
[13-Nov-2013 03:03:35 Europe/Paris] PHP Warning: Memcached::touch(): ... in ....php on line 45
[13-Nov-2013 03:04:42 Europe/Paris] PHP Warning: file_get_contents(http://...): HTTP/1.0 404 Not Found in ...php on line 202
...
let's say you'd like to remove the date/time field to group and count similar errors. To do so :

awk '{ for (i=1;i<=3;i++) $i="";print }' file.log | awk '{sub(/^[ \t]+/, ""); print}' | sort | uniq -c | sort -nr

  • the 1st Awk command replaces the first 3 fields with an empty string, so that the line only contains the remaining fields, starting from the 4th as required
  • the 2nd Awk command just removes leading whitespaces (source)
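As a side note, both steps can be merged into a single Awk invocation (a sketch on a single sample line) :

```shell
# blank the first 3 fields, then strip the leading whitespace left behind
echo '[13-Nov-2013 03:03:35 Europe/Paris] PHP Warning: some error' \
	| awk '{ for (i=1;i<=3;i++) $i=""; sub(/^[ \t]+/, ""); print }'
# output :
# PHP Warning: some error
```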

Match a keyword from a variable :

Looks like you can't use a variable name within the // operator to select the matching lines (the variable name is taken as a literal string) :

  • DON'T : echo -e 'apple\nbanana\ncarrot' | awk -v letterToMatch='b' '/letterToMatch/'
  • DO : echo -e 'apple\nbanana\ncarrot' | awk -v letterToMatch='b' '$1 ~ letterToMatch'
extra examples to illustrate the above :

echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '/stuffToMatch/'
(nothing)
echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$0 ~ /stuffToMatch/'
(nothing)
⇒ no match found, as said above


echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' 'stuffToMatch'
fruit: apple
fruit: banana
vegetable: carrot
⇒ matches everything


echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$0 ~ stuffToMatch'
fruit: banana
vegetable: carrot

echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$1 ~ stuffToMatch'
vegetable: carrot

echo -e 'fruit: apple\nfruit: banana\nvegetable: carrot' | awk -v stuffToMatch='b' '$2 ~ stuffToMatch'
fruit: banana
⇒ these 3 examples work as expected

for httpCode in 301 302 304; do echo -n "Code $httpCode : "; awk -v needle="$httpCode" '$6 ~ needle {print " "}' logFile | wc -l; done

Selecting PID's :

ps --ppid 1 | awk '/d$/ {print $1}'
Lists processes whose parent's PID is 1, then selects processes whose name ends in 'd', and prints the corresponding PID, which is the line field #1.

Specifying the field separator :

awk -F ':' '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd

List all ports and PIDs on which a Mongodb instance is listening :

netstat -laputen | awk '/mongo/ {print "IP:port = "$4"\tPID = "$9}' | sort | uniq

Select non-empty lines :

echo -e 'A\tB\tC\tD\nE\tF\tG\tH\n\nI\tJ\tK\tL' | awk '!/^$/ {print $3}'

C
G
K

Dark wizardry ?

This awk command made me scratch my head quite a bit : it returns, on a single output line, fields from 2 distinct input lines, which need not even be contiguous.
To figure this out, I simplified it and let the magic happen :
echo -e 'key1\tvalue1\nkey2\tvalue2' | awk '/key1|key2/ { printf $2 " " }'
value1 value2

Explanation :

  • the input (either an echo, a line "piped" in, or a whole file) is perfectly "normal" : there is no hack regarding field separators or end of line markers.
  • the /key1|key2/ part of the awk command is a "normal" regular expression alternation
  • the printf $2 " " part simply prints the 2nd field of each matching line, followed by a space
So what's the trick ?
Let's have a deeper look at how awk works and what we're instructing it to do with the echo | awk command above :
  1. no pre-process, so let's start eating lines and doing things
  2. awk splits the input into distinct lines
  3. awk reads the 1st line : key1\tvalue1
  4. does it match the regular expression ? Yes, so print the 2nd field and a space character : value1 
  5. done with this line, continue with the next line
  6. read the 2nd line : key2\tvalue2
  7. does it match the regular expression ? Yes again, so print the 2nd field and a space character : value2 
  8. The trick is that awk does not automatically add a newline character after printing. So the output of any step is printed right after the output of the previous step. This is why, at this step of the procedure, the output looks like : value1 value2 
  9. done with this line, no next line
  10. no post-process
  11. the end !
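To get a proper trailing newline after the last value, an END rule can print it once all records are processed :

```shell
# the END rule prints the final newline exactly once
echo -e 'key1\tvalue1\nkey2\tvalue2' | awk '/key1|key2/ { printf $2 " " } END { print "" }'
# output : "value1 value2 " followed by a newline
```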