9. Debugging Shell Programs

Contents:
Basic Debugging Aids
A Korn Shell Debugger

We hope that we have convinced you that the Korn shell can be used as a serious UNIX programming environment. It certainly has enough features, control structures, etc. But another essential part of a programming environment is a set of powerful, integrated support tools. For example, there is a wide assortment of screen editors, compilers, debuggers, profilers, cross-referencers, etc., for languages like C and C++. If you program in one of these languages, you probably take such tools for granted, and you would undoubtedly cringe at the thought of having to develop code with, say, the ed editor and the adb machine-language debugger.

But what about programming support tools for the Korn shell? Of course, you can use any editor you like, including vi and emacs. And because the shell is an interpreted language, you don't need a compiler. [1] But there are no other tools available. The most serious problem is the lack of a debugger.

[1] Actually, if you are really concerned about efficiency, there are shell code compilers on the market; they convert shell scripts to C code that often runs quite a bit faster.

This chapter addresses that lack. The shell does have a few features that help in debugging shell scripts; we'll see these in the first part of the chapter. The Korn shell also has a couple of new features, not present in most Bourne shells, that make it possible to implement a full-blown debugging tool. We'll show these features; more importantly, we will present kshdb, a Korn shell debugger that uses them. kshdb is basic yet quite usable, and its implementation serves as an extended example of various shell programming techniques from throughout this book.

9.1 Basic Debugging Aids

What sort of functionality do you need to debug a program? At the most empirical level, you need a way of determining what is causing your program to behave badly, and where the problem is in the code. You usually start with an obvious what (such as an error message, inappropriate output, infinite loop, etc.), try to work backwards until you find a what that is closer to the actual problem (e.g., a variable with a bad value, a bad option to a command), and eventually arrive at the exact where in your program. Then you can worry about how to fix it.

Notice that these steps represent a process of starting with obvious information and ending up with often obscure facts gleaned through deduction and intuition. Debugging aids make it easier to deduce and intuit by providing relevant information easily or even automatically, preferably without modifying your code.

The simplest debugging aid (for any language) is the output statement, print in the shell's case. Indeed, old-timer programmers debugged their FORTRAN code by inserting WRITE cards into their decks. You can debug by putting lots of print statements in your code (and removing them later), but you will have to spend lots of time narrowing down not only what exact information you want but also where you need to see it. You will also probably have to wade through lots and lots of output to find the information you really want.

9.1.1 Set Options

Luckily, the shell has a few basic features that give you debugging functionality beyond that of print. The most basic of these are options to the set -o command (as covered in Chapter 3, Customizing Your Environment). These options can also be used on the command line when running a script, as Table 9.1 shows.

Table 9.1: Debugging Options
set -o Option	Command-line Option	Action
noexec	-n	Don't run commands; check for syntax errors only
verbose	-v	Echo commands before running them
xtrace	-x	Echo commands after command-line processing

The verbose option simply echoes (to standard error) whatever input the shell gets. It is useful for finding the exact point at which a script is bombing. For example, assume your script looks like this:

fred
bob
dave
pete
ed
ralph

None of these commands are standard UNIX programs, and they all do their work silently. Say the script crashes with a cryptic message like "segmentation violation." This tells you nothing about which command caused the error. If you type ksh -v scriptname, you might see this:

fred
bob 
dave
segmentation violation
pete
ed
ralph

Now you know that dave is the probable culprit, though it is also possible that dave bombed because of something it expected fred or bob to do (e.g., create an input file) that they did incorrectly.

The xtrace option is more powerful: it echoes command lines after they have been through parameter substitution, command substitution, and the other steps of command-line processing (as listed in Chapter 7, Input/Output and Command-line Processing). For example:

$ set -o xtrace 
$ fred=bob 
+ fred=bob
$ print "$fred" 
+ print bob
bob
$ ls -l $(whence emacs) 
+ whence emacs
+ ls -l /usr/share/bin/emacs
-rwxr-xr-x  1 root      1593344 Apr  8  1991 /usr/share/bin/emacs
$

As you can see, xtrace starts each line it prints with +. This is actually customizable: it's the value of the built-in shell variable PS4. So if you set PS4 to "xtrace-> " (e.g., in your .profile or environment file), then you'll get xtrace listings that look like this:

$ ls -l $(whence emacs) 
xtrace-> whence emacs
xtrace-> ls -l /usr/share/bin/emacs
-rwxr-xr-x  1 root      1593344 Apr  8  1991 /usr/share/bin/emacs
$

An even better way of customizing PS4 is to use a built-in variable we haven't seen yet: LINENO, which holds the number of the currently running line in a shell script. Put this line in your .profile or environment file:

PS4='line $LINENO: '

We use the same technique as we did with PS1 in Chapter 3: using single quotes to postpone the evaluation of the string until each time the shell prints the prompt. This will print messages of the form line N: in your trace output. You could even include the name of the shell script you're debugging in this prompt by using the positional parameter $0:

PS4='$0 line $LINENO: '
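As a rough sketch of what this looks like in practice, the following fragment turns on xtrace inside a subshell and saves the trace, which goes to standard error, in a scratch file. The file name and the variables tracefile and trace are our own inventions for this illustration, and we use echo and printf rather than print so that the fragment also runs in shells that lack print.

```shell
# Scratch file for the trace output (hypothetical name, for illustration only).
tracefile=${TMPDIR:-/tmp}/ps4demo.$$

# Run the traced commands in a subshell so that xtrace and PS4
# do not stay in effect for the rest of the script.
(
    PS4='line $LINENO: '
    set -o xtrace
    fred=bob
    echo "$fred"
) 2> "$tracefile"

# Read the trace back, remove the scratch file, and display the trace.
trace=$(cat "$tracefile")
rm -f "$tracefile"
printf '%s\n' "$trace"
```

Each traced command should appear with a prefix of the form line N: in front of it. The exact numbers depend on where the fragment sits in your script.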

As another example, say you are trying to track down a bug in a script called fred that contains this code:

dbfmq=$1.fmq
...
fndrs=$(cut -f3 -d' ' $dfbmq)

You type fred bob to run it in the normal way, and it hangs. Then you type ksh -x fred bob, and you see this:

+ dbfmq=bob.fmq
...
+ + cut -f3 -d

It hangs again at this point. You notice that cut doesn't have a filename argument, which means that there must be something wrong with the variable dbfmq. But it has executed the assignment statement dbfmq=bob.fmq properly... ah-hah! You made a typo in the variable name inside the command substitution construct. [2] You fix it, and the script works properly.

[2] We should admit that if you turned on the nounset option at the top of this script, the shell would have flagged this error.
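To see what that would look like, here is a minimal sketch that repeats the typo with nounset turned on. It runs in a subshell so that the error ends only the subshell; the variable status is our own addition, and the rest of the fred script is left out.

```shell
status=0
(
    set -o nounset
    dbfmq=bob.fmq
    : "$dfbmq"              # deliberate typo: this name was never set
    echo "not reached"
) || status=$?

# The shell complains about the unset name on standard error and the
# subshell ends with a non-zero status instead of carrying on silently.
echo "subshell exit status: $status"
```

The shell's complaint names the misspelled variable, which points straight at the typo without any tracing at all.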

If the code you are trying to debug calls functions that are defined elsewhere (e.g., in your .profile or environment file), you can trace through these in the same way with an option to the typeset command. Just enter the command typeset -ft functname, and the named function will be traced whenever it runs. Type typeset +ft functname to turn tracing off.

The last option is noexec, which reads in the shell script, checks for syntax errors, but doesn't execute anything. It's worth using if your script is syntactically complex (lots of loops, code blocks, string operators, etc.) and the bug has side effects (like creating a large file or hanging up the system).
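The following sketch shows a syntax check of this kind. The script it checks is a made-up one with an unclosed if, and the names bad, status, and output are ours. We invoke sh so that the example runs on systems without ksh installed; substitute ksh -n as appropriate.

```shell
# Hypothetical script with a deliberate syntax error (the "if" is never closed).
bad=${TMPDIR:-/tmp}/noexec_demo.$$
cat > "$bad" <<'EOF'
echo "this line must not run"
if true; then
    echo "missing fi"
EOF

# -n reads the whole script and reports syntax errors without running anything.
status=0
output=$(sh -n "$bad") || status=$?
rm -f "$bad"

echo "exit status: $status"
echo "standard output was: '$output'"
```

The shell reports the syntax error on standard error and returns a non-zero status, while the echo on the first line of the script never runs, so output stays empty.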

You can turn on these options with set -o in your shell scripts, and, as explained in Chapter 3, turn them off with set +o option. For example, if you're debugging a script with a nasty side effect, and you have localized it to a certain chunk of code, you can precede that chunk with set -o noexec to avoid the side effect. Be aware that once noexec is on, the shell stops executing commands altogether, so a later set +o noexec never gets a chance to run; everything after that point is only checked for syntax.

9.1.2 Fake Signals

A more sophisticated set of debugging aids is the shell's three "fake signals," which can be used in trap statements to get the shell to act under certain conditions. Recall from the previous chapter that trap allows you to install some code that runs when a particular signal is sent to your script.

Fake signals act like real ones, but they are generated by the shell (as opposed to real signals, which the underlying operating system generates). They represent runtime events that are likely to be interesting to debuggers (both human ones and software tools) and can be treated just like real signals within shell scripts. The three fake signals and their meanings are listed in Table 9.2.

Table 9.2: Fake Signals
Fake Signal	When Sent
EXIT	The shell exits from a function or script
ERR	A command returns a non-0 exit status
DEBUG	After every statement

9.1.2.1 EXIT

The EXIT trap, when set, will run its code when the function or script within which it was set exits. Here's a simple example:

function func {
    print 'start of the function'
    trap "print 'exiting from the function'" EXIT
}

print 'start of the script'
trap "print 'exiting from the script'" EXIT
func

If you run this script, you will see this output:

start of the script
start of the function
exiting from the function
exiting from the script

In other words, the script starts by printing a message. Then it sets the trap for its own exit, then calls the function. The function does the same: it prints a message and sets a trap for its exit. (Remember that functions can have their own local traps that supersede any traps set by the surrounding script.)

The function then exits, which causes the shell to send it the fake signal EXIT, which in turn runs the code print 'exiting from the function'. Then the script exits, and its own EXIT trap code is run.

An EXIT trap occurs no matter how the script or function exits, whether normally (by finishing the last statement), by an explicit exit or return statement, or by receiving a "real" signal such as INT or TERM. Consider the following inane number-guessing program:

trap "print 'Thank you for playing!'" EXIT

magicnum=$(($RANDOM%10+1))
print 'Guess a number between 1 and 10:'
while read guess'?number> '; do
    sleep 10
    if (( $guess == $magicnum )); then
        print 'Right!'
        exit
    fi
    print 'Wrong!'
done

This program picks a number between 1 and 10 by getting a random number (the built-in variable RANDOM), extracting the last digit (the remainder when divided by 10), and adding 1. Then it prompts you for a guess, and after 10 seconds, it will tell you if you guessed right.

If you did, the program will exit with the message, "Thank you for playing!", i.e., it will run the EXIT trap code. If you were wrong, it will prompt you again and repeat the process until you get it right. If you get bored with this little game and hit [CTRL-C] while waiting for it to tell you whether you were right, you will also see the message.
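The explicit-exit case can be checked without playing the game. The sketch below is a cut-down stand-in for the program above, not the program itself: it sets the same kind of EXIT trap inside a subshell, leaves with an explicit exit, and collects whatever was printed in a variable we have called out. We use echo rather than print so that it also runs in shells without print.

```shell
out=$(
    trap 'echo "Thank you for playing!"' EXIT
    echo "Right!"
    exit 0
    echo "never printed"    # skipped: the exit above ends the subshell
)

# The trap message follows "Right!" even though the subshell
# left through an explicit exit rather than by running off the end.
printf '%s\n' "$out"
```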

9.1.2.2 ERR

The fake signal ERR enables you to run code whenever a command in the surrounding script or function exits with non-zero status. Trap code for ERR can take advantage of the built-in variable ?, which holds the exit status of the previous command. It "survives" the trap and is accessible at the beginning of the trap-handling code.

A simple but effective use of this is to put the following code into a script you want to debug:

function errtrap {
    es=$?
    print "ERROR: Command exited with status $es."
}

trap errtrap ERR

The first line saves the non-zero exit status in the variable es. This code enables you to see which command in your script exits with error status and what the status is.

For example, if the shell can't find a command, it returns status 1. If you put the code in a script with a line of gibberish (like "lskdjfafd"), the shell will respond with:

scriptname[N]: lskdjfafd:  not found
ERROR: Command exited with status 1.

N is the number of the line in the script that contains the bad command. In this case, the shell prints the line number as part of its own error-reporting mechanism, since the error was a command that the shell could not find. But if the non-0 exit status comes from another program, the shell won't report the line number. For example:

function errtrap {
    es=$?
    print "ERROR: Command exited with status $es."
}

trap errtrap ERR

function bad {
    return 17
}

bad

This will only print, ERROR: Command exited with status 17.

It would obviously be an improvement to include the line number in this error message. The built-in variable LINENO exists, but if you use it inside a function, it evaluates to the line number in the function, not in the overall file. In other words, if you used $LINENO in the print statement in the errtrap routine, it would always evaluate to 2.

To get around this problem, we simply pass $LINENO as an argument to the trap handler, surrounding it in single quotes so that it doesn't get evaluated until the fake signal actually comes in:

function errtrap {
    es=$?
    print "ERROR line $1: Command exited with status $es."
}
trap 'errtrap $LINENO' ERR
...

If you use this with the above example, the result is the message, ERROR line 12: Command exited with status 17. This is much more useful. We'll see a variation on this technique shortly.

This simple code is actually not a bad all-purpose debugging mechanism. It takes into account that a non-0 exit status does not necessarily indicate an undesirable condition or event: remember that every control construct with a conditional (if, while, etc.) uses a non-0 exit status to mean "false". Accordingly, the shell doesn't generate ERR traps when statements or expressions in the "condition" parts of control structures produce non-0 exit statuses.

But a disadvantage is that exit statuses are not as uniform (or even as meaningful) as they should be, as we explained in Chapter 5, Flow Control. A particular exit status need not say anything about the nature of the error or even that there was an error.
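The point about conditions is easy to confirm with a small experiment. This sketch assumes a shell that supports the ERR fake signal; the variable out and the messages are our own, and we use echo in place of print so that it runs in other shells that support ERR as well.

```shell
out=$(
    set +e      # make sure errexit is off, so the subshell keeps going after a failure
    trap 'echo "ERR trap: status $?"' ERR

    # Non-zero statuses in the "condition" parts: no trap is generated.
    if false; then echo "not reached"; fi
    while false; do echo "not reached"; done
    echo "conditions done"

    # An ordinary failing command: the trap runs.
    false
    echo "end"
)
printf '%s\n' "$out"
```

Only the stand-alone false should produce an ERR trap line; the two uses of false as a condition pass without comment.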

9.1.2.3 DEBUG

The final fake signal, DEBUG, causes the trap code to be run after every statement in the surrounding function or script. This has two possible uses. First is the use for humans, as a sort of a "brute force" method of tracking a certain element of a program's state that you notice is going awry.

For example, you notice that the value of a particular variable is running amok. The naive approach would be to put in lots of print statements to check the variable's value at several points. The DEBUG trap makes this easier by letting you do this:

function dbgtrap {
    print "badvar is $badvar"
}

trap dbgtrap DEBUG

...section of code in which problem occurs... 

trap - DEBUG		# turn off DEBUG trap

This code will print the value of the wayward variable after every statement between the two traps.

The second and far more important use of the DEBUG trap is as a primitive for implementing Korn shell debuggers. In fact, it would be fair to say that the DEBUG trap reduces the task of implementing a useful shell debugger from a large-scale software development project to a manageable exercise. Read on.

