Heiner's SHELLdorado
Good Shell Coding Practices
SHELLdorado - your UNIX shell scripting resource


Calling AWK from a shell script

  Command line arguments
  Interpreter Lines
  Bibliography


3. Invoking AWK programs

AWK (named after its creators Al Aho, Peter Weinberger and Brian Kernighan) is a very powerful text processing language. It features automatic splitting of each input line in fields, associative arrays (arrays indexed by strings), and built-in string oriented functions.

Brian Kernighan said about AWK:

It was originally for writing these one and two line programs. It really was. I think it's very seductive because it does so many things automatically. It handles strings and numbers smoothly. It is an interpreter and there's no baggage, no derived object files. People start to write a one and two line program that just grows and grows; some of them grow unbelievably large: tens of thousands of lines -- which is nonsense.
Brian Kernighan, cited by Peter H. Salus (who in turn cites Peter Collinson from the ".EXE" magazine). From A Quarter Century of UNIX, pp. 103-104.

This article describes three ways to interface AWK programs with shell scripts and how to import shell variables into AWK programs.

This text assumes a good understanding of AWK and shell scripting. If you want to learn how to program in AWK, you should read an AWK introduction first, e.g. one of the documents in the bibliography.

Where appropriate we will differentiate between oawk, nawk, and awk.

oawk (old AWK) is the first AWK version, and is still around on many UNIX systems. If you have an oawk on your system, you probably have nawk, too. There is no need to prefer oawk to nawk, except for older AWK scripts that require the older AWK version.

nawk (new AWK) is an extension of oawk that is now the standard AWK version. Any reference made here to nawk applies to GNU AWK (gawk) as well.

If we refer to any AWK version we will just write AWK. If you are interested in other AWK programs for different operating systems, you should have a look at the AWK FAQ.


Calling AWK from a shell script

Have a look at the following script, which searches for a text string in files (like grep) using AWK:

:
# textsearch - search text in files
# This example does not work!

if [ $# -gt 0 ]
then
    SearchString="$1"
    shift
else
    echo >&2 "usage: $0 searchstring [file ...]"
    exit 1
fi

awk '/SearchString/ { print }' "$@"

This script may be called the following way:

    $ textsearch main *.c

For now we'll ignore the fact that the script is not working correctly and describe how it should have worked.

The script assigns the first command line parameter ("main" in the command above) to the script variable SearchString and then calls awk to search for this string in all C files ("*.c") specified on the command line. The special shell variable $@ will be expanded to the file name list.

As written, however, it searches only for the constant text SearchString instead of the value of the script variable SearchString. The script will find all occurrences of the string "SearchString" in all files specified, no matter what search string we specify on the command line.
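The effect can be reproduced without the complete script. The following is only a sketch: the function name search_literal and the two sample input lines are made up for illustration. It shows that the single quotes keep the shell from touching the word SearchString, so AWK receives it as a literal pattern:

```shell
# search_literal is a hypothetical stand-in for the broken script.
search_literal() {
    SearchString="$1"               # set by the shell, but never seen by AWK
    awk '/SearchString/ { print }'  # the pattern is the literal text
}

# Made-up sample input: one line contains "main", the other one
# contains the literal word "SearchString".
printf 'int main(void)\nSearchString was here\n' | search_literal main
```

Although "main" was given as the search string, only the second input line is printed.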

But how do we get the contents of the shell script variable into the AWK program?

There are three major ways to achieve this:

  1. Embedding the program into a shell script.
  2. Using the awk -v command line option.
  3. Using "pseudo files"; specifying variable=value pairs on the command line.

The second method has the disadvantage of not being portable to older versions of awk (and even different versions of nawk). The third method has some disadvantages we will describe later. Therefore we will explain the first, preferred method in detail.


Shell script embedding

If we call AWK the following way:

:
# textsearch - search text
# This example does not work!

if [ $# -gt 0 ]
then
    SearchString="$1"
    shift
else
    echo >&2 "usage: $0 searchstring [file ...]"
    exit 1
fi

awk '/SearchString/ {print}' "$@"

The awk program is called with one or more arguments: the first argument (the text in single quotes) is the complete AWK program, followed by the files specified on the command line.

If we want to use the shell variable SearchString from within the AWK program, why don't we let the shell expand the shell variable before AWK sees the program?

This could work the following way:

:
# textsearch - search text

if [ $# -gt 0 ]
then
    SearchString="$1"
    shift
else
    echo >&2 "usage: $0 searchstring [file ...]"
    exit 1
fi

awk '/'$SearchString'/ {print}' "$@"

In this example the AWK program consists of three parts:

  • The first part consists only of the character "/" that introduces a search pattern.
  • The second part consists of the contents of the shell variable SearchString.
  • The third part consists of the character "/" that ends the search pattern, and the AWK action "{print}" that prints the line matching the pattern (we could omit the "{print}", because it is the default action).

It is essential that all three parts are written together without any whitespace, because the shell must pass them to AWK as one single argument: AWK takes only one program on the command line and treats every further argument as a file name.

What happens if we call this script textsearch with "hello" as an argument?

    $ textsearch hello *.doc

Inside of the script the first argument "hello" will be assigned to the shell variable SearchString, and AWK will be called the following way:

awk '/hello/ {print}' file1.doc file2.doc

We now have exactly the solution for our problem: this is a way to import a shell variable into AWK.

There's still one problem left. Consider the following invocation of our script:

    $ textsearch "our house" *.doc

Now the variable SearchString gets the value "our house", which results in the following AWK invocation:

awk '/our' 'house/ {print}' file1.doc file2.doc

Now our AWK program is split into two arguments, resulting in AWK error messages. The first part '/our' is taken to be the (invalid) program code, and 'house/ {print}' to be an (invalid) file name.

The solution to this problem is simple: the shell variable must be enclosed in double quotes:

awk '/'"$SearchString"'/ {print}' "$@"
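The difference between the two forms can be made visible by counting the arguments the shell produces. This is a minimal sketch; count_args is a hypothetical helper that is not part of the original script, and a Bourne-style shell with the default IFS is assumed:

```shell
# count_args (hypothetical helper) prints how many arguments it
# received after the shell has done its word splitting.
count_args() {
    echo $#
}

SearchString="our house"

# Unquoted: the value is split at the blank, AWK would get two arguments.
count_args '/'$SearchString'/ {print}'

# Quoted: the program stays one single argument.
count_args '/'"$SearchString"'/ {print}'
```

The first call reports 2 arguments, the second one reports 1.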

Now you are able to write large AWK programs that may use shell script variables. The embedding of AWK programs in shell scripts is easy to use, portable, and allows the usage of arbitrarily complex shell script commands for input pre- or post-processing.
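As an illustration of pre- and post-processing around an embedded AWK program, consider the following sketch. The function frequent_lines and its parameter MinCount are invented for this example, and the input is assumed to consist of single words, one per line:

```shell
# frequent_lines (hypothetical) prints every input word that occurs at
# least MinCount times, together with its count, most frequent first.
frequent_lines() {
    MinCount="$1"
    sort |                  # pre-processing: bring equal lines together
    uniq -c |               # ... and count them: "count word"
    awk '$1 >= '"$MinCount"' { print $2, $1 }' |
    sort -k2,2nr            # post-processing: order by count
}

printf 'b\na\nb\nc\na\nb\n' | frequent_lines 2
```

For the sample input this prints "b 3" followed by "a 2". The shell variable MinCount is pasted into the AWK program in the same way as SearchString above.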

The following example uses the technique described above to transfer the name of a file into the AWK script (see the assignment to the AWK variable substfile).

The script substitute substitutes arbitrary words in the input with other words specified in the file substitute.tab in the current directory. The file contains lines in the format

oldword newword

Each oldword in the input is substituted with newword in the output.

:
# substitute - substitute words with other words

SubstFile=substitute.tab
if [ -r "$SubstFile" ]
then
    echo >&2 \
    "reading substitution table $SubstFile"
else
    echo >&2 \
    "cannot read substitution table $SubstFile"
    exit 1
fi

# We need a newer version of AWK, because oawk
# does not support the "getline < FILE" statement
nawk '
    BEGIN {
        # Read the whole substitution file
        # into the array tab[].
        # Format of the substitution file:
        #     oldword newword
        substfile = "'"$SubstFile"'"
        # "> 0" ends the loop at end of file and on read errors
        while ( (getline < substfile) > 0 ) {
            tab [$1] = $2  # fill conversion table
        }
        close (substfile)
    }
    {
        for ( i=1; i<=NF; i++ ) {
            if ( tab [$i] != "" ) {
                # substitute old word
                $i = tab [$i]
            }
        }
        print
    }
' "$@"

Some comments on the script:

  • We call nawk instead of awk because only the newer AWK versions support "getline < FILE".
  • The assignment "$i = tab [$i]" has a side effect on the whole line $0. After the assignment all fields of the line are separated by exactly one blank. If the line e.g. was
        one           two         three
    
    and we substituted "two" with "TWO" the resulting line would be
        one TWO three
    
    This "feature" is almost never desired. AWK rebuilds the whole line after an assignment to a field using OFS as delimiter (the default is one blank).

    If the line "$i = tab [$i]" were replaced with

        gsub ($i, tab [$i])
    
    the whitespace between the words would be preserved, but since gsub() does not honour word delimiters, parts of words could be changed accidentally (see footnote 1).

Portability:
AWK script embedding within shell scripts works with all shells derived from the Bourne Shell, e.g. sh, ksh, ksh93, bash, pdksh, zsh.
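
The two effects mentioned in the comments on the substitute script (the rebuilt line after a field assignment, and gsub() changing parts of words) can be observed with two one-liners. The sample input is made up for illustration:

```shell
# Assigning to a field makes AWK rebuild $0 using OFS
# (one blank by default), so the original spacing is lost.
echo 'one           two         three' | awk '{ $2 = "TWO"; print }'

# gsub() leaves the spacing alone, but it also replaces matches
# inside longer words.
echo 'cat     concatenate' | awk '{ gsub("cat", "dog"); print }'
```

The first command prints "one TWO three". The second command keeps the five blanks, but turns "concatenate" into "condogenate".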

Using the awk -v command line option

The following code fragment rewrites the first example of this text to use the -v command line option to awk:

:
# textsearch - search text in files

if [ $# -gt 0 ]
then
    SearchString="$1"
    shift
else
    echo >&2 "usage: $0 searchstring [file ...]"
    exit 1
fi

awk -v Search="$SearchString" '$0 ~ Search' "$@"

The option -v Search="$SearchString" assigns the contents of the shell script variable SearchString to the AWK variable Search. This variable is then used inside of the AWK script ($0 ~ Search) to match a line.

Note that we changed the search command from "/SearchString/" to "$0 ~ Search", because AWK does not expand variables between the slashes of a /.../ pattern: the text there is always taken literally as the regular expression.
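
The difference between the two pattern forms can be tried out directly, assuming an awk that supports -v. The helper function sample and its two input lines are invented for this sketch:

```shell
# sample (hypothetical) prints two made-up input lines.
sample() {
    printf 'Search is a word\nour house is here\n'
}

# /Search/ is a constant pattern: it matches the literal text "Search",
# the variable of the same name is ignored.
sample | awk -v Search="our house" '/Search/'

# $0 ~ Search uses the value of the variable as the pattern.
sample | awk -v Search="our house" '$0 ~ Search'
```

The first call prints "Search is a word", the second one prints "our house is here".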

Portability:
The -v option is available with POSIX compliant awk implementations, and gawk supports it. The major disadvantage of this method is that it is not portable to older implementations: oawk does not support it, and only some of the nawk programs do (the one in SunOS 4.1.3, for example, does not).


Pseudo-files

AWK knows another way to assign values to AWK variables, like in the following example:

$ awk '{ print "var is", var }' var=TEST file1 file2

This statement assigns the value "TEST" to the AWK variable "var", and then reads the files "file1" and "file2". The assignment works because AWK interprets each argument of the form variable=value that appears in place of a file name as an assignment.

This example is very portable (even oawk understands this syntax), and easy to use. So why don't we use this syntax exclusively?

This syntax has two drawbacks. First, each variable assignment is carried out at the moment the file in its place would have been read. Since the BEGIN action is performed before the first file is read, the variable is not yet available in the BEGIN action.
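
The first drawback can be shown with a short sketch. The function show_var is a hypothetical helper; it passes its arguments on to awk, and "-" stands for standard input:

```shell
# show_var (hypothetical) prints the value of the AWK variable var
# as seen by the BEGIN action and by the main action.
show_var() {
    echo 'some input' |
    awk 'BEGIN { print "BEGIN sees [" var "]" }
               { print "main rule sees [" var "]" }' "$@"
}

show_var var=TEST -
```

The BEGIN action prints an empty value ("BEGIN sees []"), while the main action prints "main rule sees [TEST]".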

The second problem is that the order of the variable assignments and the files is important. In the following example

$ awk '{ print "var is", var }' file1 var=TEST file2

the variable var is not defined while file1 is read, but only while file2 is read. This may cause bugs that are hard to track down.
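
A small sketch makes the ordering effect visible. The function order_demo and the two scratch files are made up for this example, and mktemp -d is assumed to be available (it is common, but not part of every UNIX system):

```shell
# order_demo (hypothetical) creates two one-line scratch files and
# places the assignment between them on the awk command line.
order_demo() {
    dir=$(mktemp -d) || return 1
    echo 'line from file1' > "$dir/file1"
    echo 'line from file2' > "$dir/file2"

    # var is still unset while file1 is read,
    # and set to TEST before file2 is read.
    awk '{ print $0 ": var is [" var "]" }' "$dir/file1" var=TEST "$dir/file2"

    rm -r "$dir"
}

order_demo
```

The line from file1 is printed with an empty value, the line from file2 with the value TEST.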

An equally portable way to achieve the same result is Shell script embedding, the preferred method.

Portability:
Assigning variables on the command line is very portable, because even the first versions of AWK support it. The way AWK processes these assignments may cause subtle bugs, however, and other methods should be preferred.


AWK command line arguments

One way to start AWK scripts is to invoke awk with the command line flag -f and the AWK script name, e.g.

    $ awk -f scriptname.awk

This usage has some disadvantages:

  • A user always has to specify the script name, if necessary with script path.
  • There is no pre- or post-processing (e.g. sorting) of the AWK script output possible, except if specified on the command line.
  • Often a shell script is needed for pre- or post processing, leading to two separate files (the AWK script and the shell script) that have to be maintained.

Since there is a better way to invoke the AWK script, we will not explain this syntax further.

Portability:
All awk versions support the -f flag.


Using an interpreter line

An interpreter line is the first line of an executable text (non-binary) file. If the first two characters of the file are "#!", the remainder of the line is taken to be the name of an interpreter (a binary executable file), optionally followed by an argument. This program is then started with the path name of the script file appended to its arguments; the interpreter opens and reads the file itself (the text is not passed on stdin).

This way any script may call its own interpreter, e.g.

    #! /bin/awk -f
    BEGIN {
        print "this script is read by AWK"
    }

This is a comfortable way to call AWK scripts, because in contrast to the "awk -f" solution the user does not have to remember the whole path of the script (provided the PATH environment variable is set correctly).

The programmer, however, still does not have a way to pre- or post-process the input/output of the AWK script.
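
The following sketch shows an interpreter line at work. It rests on several assumptions: awk is found through PATH, mktemp is available, and files in the temporary directory may be executed. The function name run_awk_script is invented for the example. Since the location of awk differs between systems (/bin/awk, /usr/bin/awk, ...), it is looked up instead of being hard-coded:

```shell
# run_awk_script (hypothetical) writes a small AWK script with an
# interpreter line, makes it executable, and runs it both ways.
run_awk_script() {
    awk_path=$(command -v awk) || return 1
    script=$(mktemp) || return 1

    printf '#!%s -f\nBEGIN { print "this script is read by AWK" }\n' \
        "$awk_path" > "$script"
    chmod +x "$script"

    "$script"                   # the system starts: $awk_path -f $script
    "$awk_path" -f "$script"    # the same call, written out by hand

    rm -f "$script"
}

run_awk_script
```

Both calls print the same line, because the second one is what the system runs on behalf of the first.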

Portability:
Interpreter lines are a relatively new UNIX feature that is now widely available. It's available on System V Release 4 based systems (e.g. Solaris), but not on older System V UNIXes.


Bibliography


Footnotes

1 Thanks to Stefan Lagotzki <lago20@gmx.de> for suggesting this.

Copyright © 1998-2022 Heiner Steven (heiner.steven@shelldorado.com)