Code Complete: A Practical Handbook of Software Construction. Redmond, WA: Microsoft Press, 1993. 880 pages. Retail price: $35. ISBN 1-55615-484-4.



26.2 Finding an Error   

Debugging consists of two parts: finding the error and fixing it. Finding the error (and understanding it) is usually 90 percent of the work.

Fortunately, you don't have to make a pact with Satan in order to find an approach to debugging that's better than random guessing. Contrary to what the devil wants you to believe, debugging by thinking about the problem is much more effective and interesting than debugging with an eye of newt and the dust of a frog's ear.

Suppose you were asked to solve a murder mystery. Which would be more interesting: going door to door throughout the county, checking every person's alibi for the night of October 17? Or finding a few clues and deducing the murderer's identity? Most people would rather deduce the person's identity, and most programmers find the intellectual approach to debugging more satisfying. Even better, the effective programmers who debug in 1/20 of the time of the ineffective programmers aren't randomly guessing about how to fix the program. They're using the scientific method.

The Scientific Method of Debugging

Here are the elements of the scientific method:

1. Gather data through repeatable experiments.

2. Form a hypothesis using as much of the relevant data as possible.

3. Design an experiment to prove or disprove the hypothesis.

4. Prove or disprove the hypothesis.

5. Repeat as needed.

This process has many parallels in debugging. Here's an effective approach for finding an error:

1. Stabilize the error.

2. Locate the source of the error.

3. Fix the error.

4. Test the fix.

5. Look for similar errors.

The first step is similar to the scientific method in that it relies on repeatability. The defect is easier to diagnose if you can make it occur reliably. The second step uses all the steps of the scientific method. Gather the test data that divulged the error. Analyze the data that has been produced, and form a hypothesis about the source of the error. Design a test case or inspection to evaluate the hypothesis, then declare success or renew your efforts, as appropriate.

Let's look at each of the steps in more detail.

An Example

For purposes of the following discussion, assume that you have an employee database program with an intermittent error. The program is supposed to print a list of employees and income tax withholdings in alphabetical order. Here's part of the output:

Formatting, Fred Freeform      $5,877
Goto, Gary                     $1,666
Modula, Mildred               $10,788
Many-Loop, Mavis               $8,889
Statement, Sue Switch          $4,000
Whileloop, Wendy               $7,860

Stabilize the Error

If you can't make a defect occur reliably, it's almost impossible to diagnose. Making an intermittent defect occur predictably is one of the most challenging tasks in debugging.

An error that doesn't occur predictably is usually an initialization error or a dangling-pointer problem. If the problem is that sometimes the sum of a calculation is right and sometimes it isn't, a variable involved in the calculation probably isn't being initialized properly; most of the time it just happens to start at 0. If the problem is a strange and unpredictable phenomenon and you're using pointers, you almost certainly have an uninitialized pointer or are using a pointer after the memory that it points to has been deallocated. Suggestions for minimizing pointer problems are given in Section 11.9.
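
To make the symptoms concrete, here is a minimal C sketch (not part of the employee program; the names are invented for illustration) showing both kinds of error: a sum that is usually right only because an uninitialized variable happens to start at 0, and a pointer used after its memory has been freed.

#include <stdio.h>
#include <stdlib.h>

/* Returns the total of the first count withholdings. */
double SumWithholdings( const double withholdings[], int count )
{
   double total;   /* BUG: never initialized; the result is right only when
                      this memory happens to start at 0 */
   int i;
   for ( i = 0; i < count; i++ )
   {
      total = total + withholdings[ i ];
   }
   return total;
}

int main( void )
{
   double amounts[] = { 5877.0, 1666.0, 10788.0 };
   char *name;

   printf( "total = %.2f\n", SumWithholdings( amounts, 3 ) );

   name = malloc( 32 );
   free( name );
   /* BUG: name is now a dangling pointer; uncommenting the next line
      would use memory after it has been deallocated */
   /* printf( "%s\n", name ); */

   return 0;
}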

Stabilizing the error usually requires more than finding a test case that produces the error. It includes narrowing the test case to the simplest one that still produces the error. If you work in an organization that has an independent test team, sometimes it's their job to make the test cases simple. Most of the time, it's your job.

In simplifying the test case, the scientific method again comes into play. Suppose you have ten factors that, used in combination, produced the error. Form a hypothesis about the factors that were irrelevant in producing the error. Change the supposedly irrelevant factors, and rerun the test case. If you still get the error, you can eliminate those factors and you've simplified the test. You can then try to simplify it further. If you don't get the error, you've disproven that specific hypothesis, and you know more than you did before. It might be that some subtly different change would still produce the error, but you know at least one specific change that does not.

For example, in the employee-withholdings example given above, Many-Loop, Mavis is listed after Modula, Mildred, which is out of alphabetical order. When the program is run a second time, however, the listing changes:

Formatting, Fred Freeform      $5,877
Goto, Gary                     $1,666
Many-Loop, Mavis               $8,889
Modula, Mildred               $10,788
Statement, Sue Switch          $4,000
Whileloop, Wendy               $7,860

The list is now correct. It isn't until Fruit-Loop, Frita is entered and shows up in an incorrect position that you remember that Modula, Mildred had also been entered singly right before she showed up in the wrong spot. What's odd about both cases is that the employees were entered singly. Usually employees are entered in groups.

Hypothesis: The problem has something to do with entering a single new employee.

If this is true, running the program again should put Fruit-Loop, Frita in the right order. Here's the result of a second run:

Formatting, Fred Freeform      $5,877
Fruit-Loop, Frita              $5,771
Goto, Gary                     $1,666
Many-Loop, Mavis               $8,889
Modula, Mildred               $10,788
Statement, Sue Switch          $4,000
Whileloop, Wendy               $7,860

This supports the hypothesis. To confirm it, you want to try adding a few new employees, one at a time, to see whether they show up in the wrong order the first time around.

Locate the Source of the Error

The goal of simplifying the test case is to make it so simple that changing any aspect of it changes the behavior of the error. Then, by changing the test case carefully and watching the program's behavior under controlled conditions, you can diagnose the problem.

Locating the source of the error also uses the scientific method. You might suspect that the defect is a result of a specific problem, say an off-by-one error. You could then vary the parameter you suspect is causing the problem (one below the boundary, on the boundary, and one above the boundary) and determine whether your hypothesis is correct.
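
For instance, a hypothetical off-by-one error and the three boundary tests might look like this in C (the routine and constant names are invented for illustration, not taken from the withholding program):

#include <stdio.h>

#define MAX_EMPLOYEES 100

/* Intended to return 1 if there is room to add another employee. */
int RoomToAddEmployee( int employeeCount )
{
   return ( employeeCount <= MAX_EMPLOYEES );   /* BUG: should be "<" */
}

int main( void )
{
   /* Test one below the boundary, on the boundary, and one above it. */
   printf( "%d\n", RoomToAddEmployee( MAX_EMPLOYEES - 1 ) );  /* expect 1, get 1 */
   printf( "%d\n", RoomToAddEmployee( MAX_EMPLOYEES ) );      /* expect 0, get 1: the error */
   printf( "%d\n", RoomToAddEmployee( MAX_EMPLOYEES + 1 ) );  /* expect 0, get 0 */
   return 0;
}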

In the running example, the source of the problem could be an off-by-one error that occurs when you add one new employee, but not when you add two or more. Examining the code, you don't find an obvious off-by-one error. Resorting to Plan B, you run a test case with a single new employee to see if that's the problem. You add Hardcase, Henry as a single employee and hypothesize that his record will be out of order. Here's what you find:

Formatting, Fred Freeform      $5,877
Fruit-Loop, Frita              $5,771
Goto, Gary                     $1,666
Hardcase, Henry                  $493
Many-Loop, Mavis               $8,889
Modula, Mildred               $10,788
Statement, Sue Switch          $4,000
Whileloop, Wendy               $7,860

The line for Hardcase, Henry is exactly where it should be, showing that the first hypothesis is false. The problem isn't caused simply by adding one employee at a time. It's either more complicated or something completely different.

Examining the test-run output again, you notice that Fruit-Loop, Frita and Many-Loop, Mavis are the only names with hyphens. Fruit-Loop was out of order when she was first entered, but Many-Loop wasn't, was she? You don't have a printout from her original entry. In the original error, Modula, Mildred appeared to be out of order, but she was next to Many-Loop. Maybe Many-Loop was out of order and Modula was all right.

Hypothesis: The problem arises from names with hyphens, not names that are entered singly.

But how does that account for the fact that the problem shows up only the first time an employee is entered? You look at the code and find that two different sorting routines are used. One is used when an employee is entered, and another is used when the data is saved. A closer look at the routine used when an employee is first entered shows that it isn't supposed to sort the data completely. It only puts the data in approximate order to speed up the save routine's sorting. Thus, the problem is that the data is printed before it has been fully sorted. The problem with hyphenated names arises because the rough-sort routine doesn't handle niceties like punctuation characters. Now you can refine the hypothesis even further:

Hypothesis: Names with punctuation characters aren't sorted correctly until they're saved.

You later confirm this hypothesis with additional test cases.
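
The structure that produces this kind of symptom might look something like the following C sketch. The routine names and data layout are invented for illustration, not taken from the program; the point is only that the report is printed before the full save-time sort has run, so recently entered names can still be in rough order.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 20
#define NAME_LEN  40

static char names[ MAX_NAMES ][ NAME_LEN ];
static int  nameCount = 0;

static int CompareNames( const void *a, const void *b )
{
   return strcmp( (const char *)a, (const char *)b );
}

/* Entry-time routine: puts the list in approximate order only (here it
   does nothing at all), leaving the complete sort for save time. */
static void RoughSort( void )
{
}

/* Save-time routine: performs the complete, correct sort. */
static void SaveList( void )
{
   qsort( names, nameCount, NAME_LEN, CompareNames );
}

static void AddEmployee( const char *name )
{
   strncpy( names[ nameCount ], name, NAME_LEN - 1 );
   names[ nameCount ][ NAME_LEN - 1 ] = '\0';
   nameCount++;
   RoughSort();
}

static void PrintList( void )
{
   int i;
   for ( i = 0; i < nameCount; i++ )
   {
      printf( "%s\n", names[ i ] );
   }
}

int main( void )
{
   AddEmployee( "Modula, Mildred" );
   AddEmployee( "Many-Loop, Mavis" );

   PrintList();   /* BUG: printed before the complete sort has run */
   SaveList();
   PrintList();   /* correct order once the save-time sort has run */
   return 0;
}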

Tips for Finding Errors

Once you've stabilized the error and refined the test case that produces it, finding its source can be either trivial or challenging, depending on how well you've written your code. If you're having a hard time finding an error, it's probably because the code isn't well written. You might not want to hear that, but it's true. If you're having trouble, consider the following tips.

Use all the data available to make your hypothesis. When creating a hypothesis about the source of a defect, account for as much of the data as you can. In the earlier example, you might have noticed that Fruit-Loop, Frita was out of order and made a hypothesis that names beginning with an "F" are sorted incorrectly. That's a poor hypothesis because it doesn't account for the fact that Modula, Mildred was out of order or that names are sorted correctly the second time around. If data doesn't fit the hypothesis, don't discard the data; ask why it doesn't fit, and create a new hypothesis.

On the other hand, the second hypothesis in the example, that the problem arises from names with hyphens, not names that are entered singly, didn't seem initially to account for the fact that names were sorted correctly the second time around, either. In this case, however, the hypothesis led to a more refined hypothesis which proved to be correct. It's all right not to account for all of the data at first as long as the hypothesis is refined so that it does eventually.

Refine the test cases that produce the error. If you can't find the source of an error, try to refine the test cases further than you already have. You might be able to vary one parameter more than you had assumed, and focusing on one of the parameters may provide the crucial breakthrough.

Reproduce the error several different ways. Sometimes trying things that are similar to the error-producing case, but not exactly the same, is instructive. Think of it in terms of triangulating the error. If you can get a fix on it from one point, and a fix on it from another, then you can determine exactly where it is.

[Figure: Triangulating an error. Try to reproduce an error several different ways to determine the exact cause of the error.]

Reproducing the error several different ways helps diagnose the cause of the error. Once you think you've identified the error, run a case that's close to the cases that produce errors but which should not produce an error itself. If it does produce an error, you don't completely understand the problem yet. Errors often arise from combinations of factors, and diagnosing the problem with one test case sometimes doesn't diagnose the root problem.

Generate more data to generate more hypotheses. Choose test cases that are different from the test cases you already know to be erroneous or correct. Run them to generate more data, and use the new data to add to your list of possible hypotheses.

Use results of negative tests. Suppose you make a hypothesis and run a test case to prove it. Suppose the test case disproves it, so that you still don't know where the error is. You still know something you didn't before, namely that the error is not in the area you thought it was. That narrows your search field and the set of possible hypotheses.

Brainstorm for possible hypotheses. Rather than limiting yourself to the first hypothesis you think of, try to come up with several. Don't analyze them at first, just come up with as many as you can in a few minutes. Then look at each of them and think about test cases that would prove or disprove them. This mental exercise is helpful in breaking the debugging logjam that results from concentrating on a single line of reasoning.

Narrow the suspicious region of the code. If you're testing the whole program, or a whole module or routine, test a smaller part instead. Systematically remove parts of the program, and see if the error still occurs. If it does, you know it's in the part you took away. If it doesn't, you know it's in the part you've kept.

Rather than removing regions haphazardly, divide and conquer. Use a binary search algorithm to focus your search. Try to remove about half the code the first time. Determine the half the error is in, then chop that section in half. Again, determine the half the error is in, and again, chop that section in half. Continue until you find the error.

If you use many small routines, you'll be able to chop out sections of code simply by commenting out calls to the routines. Otherwise, you can use comments or preprocessor commands to remove code.
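
For example, in C you can temporarily remove a block of code with a preprocessor command and see whether the error persists (the routines here are stand-ins, not code from any particular program):

#include <stdio.h>

/* Hypothetical routines standing in for halves of the suspect code. */
void FirstHalf( void )  { printf( "first half ran\n" ); }
void SecondHalf( void ) { printf( "second half ran\n" ); }

int main( void )
{
   FirstHalf();

#if 0   /* temporarily removed: does the error still occur without this half? */
   SecondHalf();
#endif

   return 0;
}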

If you're using a debugger, you don't necessarily have to remove pieces of code. Set a breakpoint partway through the program and check for the error that way instead. If your debugger allows you to skip calls to routines, eliminate suspects by not executing certain routines and seeing if the error still occurs. The process with a debugger is otherwise similar to the one in which pieces of a program are physically removed.

Check code that's changed recently. If you have a new error that's hard to find, it's usually related to code that's changed recently. It could be in completely new code or in changes to old code. If you can't find an error, run an old version of the program and see if the error occurs. If it doesn't, you know the error's in the new version or is caused by an interaction with the new version. Compare the old and new versions, and scrutinize the differences.

Expand the suspicious region of the code. It's easy to focus too much on a small section of code, saying "the error must be in this section." If you don't find it in the section, consider the possibility that the error isn't in the section. Expand the area of code you view with suspicion, then focus on pieces of it using the binary search technique described above.

Integrate incrementally. As described in Chapter 27 on integration, debugging is easy if you add pieces to a system one at a time. If you add a piece to a system and encounter a new error, remove the piece and test it separately. Strap on a test harness and exercise the routine by itself to determine what's wrong.
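
A test harness can be as simple as a throwaway main() that feeds the routine a handful of cases with known answers. The routine below is hypothetical; substitute whatever routine you're integrating.

#include <stdio.h>
#include <string.h>

/* The routine being integrated (the name is hypothetical). */
int CompareEmployeeNames( const char *nameA, const char *nameB )
{
   return strcmp( nameA, nameB );
}

/* Throwaway harness: exercise the routine by itself with known cases. */
int main( void )
{
   struct TestCase { const char *a; const char *b; int expectedSign; };
   struct TestCase cases[] = {
      { "Goto, Gary",       "Modula, Mildred",  -1 },
      { "Many-Loop, Mavis", "Modula, Mildred",  -1 },
      { "Whileloop, Wendy", "Whileloop, Wendy",  0 }
   };
   int i;

   for ( i = 0; i < 3; i++ )
   {
      int result = CompareEmployeeNames( cases[ i ].a, cases[ i ].b );
      int sign   = ( result > 0 ) - ( result < 0 );
      printf( "case %d: %s\n", i,
              sign == cases[ i ].expectedSign ? "pass" : "FAIL" );
   }
   return 0;
}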

Be suspicious of routines that have had errors before. Contrary to intuition, routines that have had errors in the past tend to continue having errors. A routine that has been troublesome in the past is more likely to contain a new error than a routine that has been error-free. Re-examine error-prone routines.

Use brute force. If you've used incremental integration and a new error raises its ugly head, you'll have a small section of code in which to check for the error. It's sometimes tempting to run the integrated code to find the error rather than dis-integrating the code and checking the new routine by itself. Running a test case through the integrated system, however, may take a few minutes, whereas running one through the code you're integrating takes only a few seconds. If you don't find the error by running the whole system once or twice, bite the bullet, dis-integrate the code, and debug the new code separately.

Set a maximum time for quick and dirty debugging. This is related to the previous point, but it's more general. It's always tempting to try a quick guess rather than systematically instrumenting the code and giving the error no place to hide. The gambler in each of us would rather use a risky approach that might find the error in five minutes than the sure-fire approach that will find it in half an hour.

The risk is that if the five-minute approach doesn't work you get stubborn. Finding the error the "easy" way becomes a matter of principle, and hours pass unproductively.

When you decide to go for the quick victory, set a maximum time limit for trying the quick way. If you exceed the time limit, resign yourself to the idea that the error is harder to find than you originally thought, and flush it out the hard way. This approach lets you get the easy errors right away and the hard errors after only a little longer. Admittedly, you'll have a few errors you "would have" found in a few more minutes of quick-and-dirty debugging, but you'll never go home at the end of the day disgusted because you spent the whole day guessing rather than 30 minutes working sensibly.

Check for common errors. Use code-quality checklists to stimulate your thinking about possible errors. If you're following the inspection practices described in Section 24.2, you'll have your own fine-tuned checklist of the common problems in your environment. You can also use the checklists presented throughout this book. For a list of the checklists in this book, see the "List of Checklists" following the table of contents.

Talk to someone else about the problem. Some people call this "confessional debugging." You often discover your own error in the act of explaining it to another person. For example, if you were explaining the problem in the salary example, you might sound like this:

"Hey Jennifer. Have you got a minute; I'm having a problem. I've got this list of employee salaries that's supposed to be sorted but some names are out of order. They're sorted all right the second time I print them out but not the first. I checked to see if it was new names, but it didn't seem like it was because I tried some that worked. I know they should be sorted the first time I print them because the program sorts all the names as they're entered and again when they're saved ... wait a minute ... no, it doesn't sort them when they're entered. That's right. It only orders them roughly. Thanks Jennifer. You've been a big help."

Jennifer didn't say a word, and you solved your problem. This is typical, and talking through the problem is perhaps your most potent tool for solving the most difficult errors.

Take a break from the problem. Sometimes it's possible to concentrate so hard that you can't think. How many times have you paused for a cup of coffee and figured out the problem on your way to the coffee machine? Or in the middle of lunch? Or on the way home? If you're debugging and making no progress, once you've tried all the options, let it rest. Go for a walk. Work on something else. Let your subconscious mind tease a solution out of the problem.

The auxiliary benefit of giving up temporarily is that it reduces the anxiety associated with debugging. The onset of anxiety is a clear sign that it's time to take a break.

This material is Copyright 1993 by Steven C. McConnell. All Rights Reserved.

 

Email me at stevemcc@construx.com.