Debugging consists of two parts: finding the error and fixing it. Finding the error
(and understanding it) is usually 90 percent of the work.
Fortunately, you don't have to make a pact with Satan in order to find an approach to
debugging that's better than random guessing. Contrary to what the devil wants you to
believe, debugging by thinking about the problem is much more effective and
interesting than debugging with an eye of newt and the dust of a frog's ear.
Suppose you were asked to solve a murder mystery. Which would be more interesting:
going door to door throughout the county, checking every person's alibi for the night of
October 17? Or finding a few clues and deducing the murderer's identity? Most people would
rather deduce the person's identity, and most programmers find the intellectual approach
to debugging more satisfying. Even better, the effective programmers who debug in 1/20 of
the time of the ineffective programmers aren't randomly guessing about how to fix the
program. They're using the scientific method.
The Scientific Method of Debugging
Here are the elements of the scientific method:
1. Gather data through repeatable experiments.
2. Form a hypothesis using as much of the relevant data as possible.
3. Design an experiment to prove or disprove the hypothesis.
4. Prove or disprove the hypothesis.
5. Repeat as needed.
This process has many parallels in debugging. Here's an effective approach for finding
an error:
1. Stabilize the error.
2. Locate the source of the error.
3. Fix the error.
4. Test the fix.
5. Look for similar errors.
The first step is similar to the scientific method in that it relies on repeatability.
The defect is easier to diagnose if you can make it occur reliably. The second step uses
all the steps of the scientific method. Gather the test data that divulged the error.
Analyze the data that has been produced, and form a hypothesis about the source of the
error. Design a test case or inspection to evaluate the hypothesis, then declare success
or renew your efforts, as appropriate.
Let's look at each of the steps in more detail.
An Example
For purposes of the following discussion, assume that you have an employee database
program with an intermittent error. The program is supposed to print a list of employees
and income tax withholdings in alphabetical order. Here's part of the output:
Formatting, Fred Freeform $5,877
Goto, Gary $1,666
Modula, Mildred $10,788
Many-Loop, Mavis $8,889
Statement, Sue Switch $4,000
Whileloop, Wendy $7,860
Stabilize the Error
If you can't make a defect occur reliably, it's almost impossible to diagnose. Making
an intermittent defect occur predictably is one of the most challenging tasks in
debugging.
An error that doesn't occur predictably is usually an initialization error or a
dangling pointer problem. If the problem is that sometimes the sum of a calculation is
right and sometimes it isn't, probably a variable involved in the calculation isn't being
initialized properly; most of the time it just happens to start at 0. If the problem is a
strange and unpredictable phenomenon and you're using pointers, you almost certainly have
an uninitialized pointer or are using a pointer after the memory that it points to has been
deallocated. Suggestions for minimizing pointer problems are given in Section 11.9.
Stabilizing the error usually requires more than finding a test case that produces the
error. It includes narrowing the test case to the simplest one that still produces the
error. If you work in an organization that has an independent test team, sometimes it's
their job to make the test cases simple. Most of the time, it's your job.
In simplifying the test case, the scientific method again comes into play. Suppose you
have ten factors that, used in combination, produced the error. Form a hypothesis about
the factors that were irrelevant in producing the error. Change the supposedly irrelevant
factors, and rerun the test case. If you still get the error, you can eliminate those
factors and you've simplified the test. You can then try to simplify it further. If you
don't get the error, you've disproven that specific hypothesis, and you know more than you
did before. It might be that some subtly different change would still produce the error,
but you know at least one specific change that does not.
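The elimination loop described above can be sketched in a few lines of Python. This is only an illustration: the factor names and the still_fails function are hypothetical stand-ins for rerunning your real test case, and the sketch removes factors one at a time rather than in groups.

```python
def simplify(factors, still_fails):
    """Drop factors one at a time, keeping only those needed to reproduce the error."""
    needed = list(factors)
    for factor in list(factors):
        trial = [f for f in needed if f != factor]
        if still_fails(trial):
            # The hypothesis "this factor is irrelevant" survived: drop the factor.
            needed = trial
        # Otherwise the hypothesis was disproven, and the factor stays.
    return needed


# Hypothetical stand-in for rerunning the test case. Here the error is assumed
# to appear only when both of these factors are present.
def still_fails(factors):
    return "single_entry" in factors and "hyphenated_name" in factors


factors = ["single_entry", "hyphenated_name", "large_salary",
           "long_first_name", "printed_before_save"]
print(simplify(factors, still_fails))  # ['single_entry', 'hyphenated_name']
```

Each pass through the loop is one small experiment. Whether the error survives or disappears, you know more than you did before.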
For example, in the employee-withholdings example given above, Many-Loop, Mavis
is listed after Modula, Mildred, which is out of alphabetical order. When the program
is run a second time, however, the listing changes:
Formatting, Fred Freeform $5,877
Goto, Gary $1,666
Many-Loop, Mavis $8,889
Modula, Mildred $10,788
Statement, Sue Switch $4,000
Whileloop, Wendy $7,860
The list is now correct. It isn't until Fruit-Loop, Frita is entered and shows
up in an incorrect position that you remember that Modula, Mildred had been entered
right before she showed up in the wrong spot, too. What's odd about both cases is that
they were entered singly. Usually employees are entered in groups.
Hypothesis: The problem has something to do with entering a single new employee.
If this is true, running the program again should put Fruit-Loop, Frita in
the right order. Here's the result of a second run:
Formatting, Fred Freeform $5,877
Fruit-Loop, Frita $5,771
Goto, Gary $1,666
Many-Loop, Mavis $8,889
Modula, Mildred $10,788
Statement, Sue Switch $4,000
Whileloop, Wendy $7,860
This supports the hypothesis. To confirm it, you want to try adding a few new
employees, one at a time, to see if they show up in the right order.
Locate the Source of the Error
The goal of simplifying the test case is to make it so simple that changing any aspect
of it changes the behavior of the error. Then, by changing the test case carefully and
watching its behavior under controlled conditions, you can diagnose the problem by
watching the program.
Locating the source of the error also uses the scientific method. You might suspect
that the defect is a result of a specific problem, say an off-by-one error. You could then
vary the parameter you suspect is causing the problem (one below the boundary, on the
boundary, and one above the boundary) and determine whether your hypothesis is correct.
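As a rough sketch of that kind of boundary probe in Python, suppose the routine under suspicion is supposed to count values less than or equal to a limit. The routine count_up_to and its planted defect are hypothetical, invented only to show the three probes.

```python
def count_up_to(values, limit):
    """Hypothetical routine under suspicion: should count values <= limit."""
    return sum(1 for v in values if v < limit)  # planted defect: < instead of <=


def probe_boundary(routine, boundary):
    """Run the routine with the limit one below, on, and one above the boundary."""
    values = [boundary]
    return {offset: routine(values, boundary + offset) for offset in (-1, 0, 1)}


results = probe_boundary(count_up_to, 100)
expected = {-1: 0, 0: 1, 1: 1}
for offset in (-1, 0, 1):
    status = "ok" if results[offset] == expected[offset] else "MISMATCH"
    print(offset, results[offset], expected[offset], status)
```

Only the probe on the boundary itself disagrees with the expected count, which is the signature of an off-by-one error in the comparison.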
In the running example, the source of the problem could be an off-by-one error that
occurs when you add one new employee, but not when you add two or more. Examining the
code, you don't find an obvious off-by-one error. Resorting to Plan B, you run a test case
with a single new employee to see if that's the problem. You add Hardcase, Henry as
a single employee and hypothesize that his record will be out of order. Here's what you
find:
Formatting, Fred Freeform $5,877
Fruit-Loop, Frita $5,771
Goto, Gary $1,666
Hardcase, Henry $493
Many-Loop, Mavis $8,889
Modula, Mildred $10,788
Statement, Sue Switch $4,000
Whileloop, Wendy $7,860
The line for Hardcase, Henry is exactly where it should be, showing that the
first hypothesis is false. The problem isn't caused simply by adding one employee at a time.
It's either more complicated or something completely different.
Examining the test-run output again, you notice that Fruit-Loop, Frita and Many-Loop,
Mavis are the only names with hyphens. Fruit-Loop was out of order when she was
first entered, but Many-Loop wasn't, was she? You don't have a printout from her
original entry. But in the original error Modula, Mildred appeared to be out of
order, but she was next to Many-Loop. Maybe Many-Loop was out of order and Modula
was all right.
Hypothesis: The problem arises from names with hyphens, not names that are entered
singly.
But how does that account for the fact that the problem shows up only the first time an
employee is entered? You look at the code and find that two different sorting routines are
used. One is used when an employee is entered, and another is used when the data is saved.
A closer look at the routine used when an employee is first entered shows that it isn't
supposed to sort the data completely. It only puts the data in approximate order to speed
up the save-routine's sorting. Thus, the problem is that the data is printed before it has
been fully sorted. The problem with hyphenated names arises because the rough-sort routine doesn't
handle niceties like punctuation characters. Thus, you can refine the hypothesis even
further:
Hypothesis: Names with punctuation characters aren't sorted correctly until they're
saved.
You later confirm this hypothesis with additional test cases.
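Those additional test cases might look something like the following Python sketch. The rough_insert routine is entirely hypothetical: the example never shows the real rough-sort code, so this version simply assumes that surnames containing punctuation are dropped at the end of their first-letter group. The names "Oz, Otto" and "O'Brien, Olive" are also invented for the test.

```python
def rough_insert(records, name):
    """Hypothetical entry-time routine that keeps only approximate order."""
    surname = name.split(",")[0]
    if surname.isalpha():
        # Plain surnames are placed by full comparison.
        position = next((i for i, r in enumerate(records) if r > name), len(records))
    else:
        # Assumed shortcut: surnames with punctuation go to the end of their
        # first-letter group and are left for the save-time sort.
        position = next((i for i, r in enumerate(records) if r[0] > name[0]), len(records))
    records.insert(position, name)


def is_sorted(records):
    return records == sorted(records)


for name in ["Hardcase, Henry", "Many-Loop, Mavis", "O'Brien, Olive"]:
    records = ["Goto, Gary", "Modula, Mildred", "Oz, Otto", "Statement, Sue Switch"]
    rough_insert(records, name)
    print(name, "in order" if is_sorted(records) else "OUT OF ORDER until saved")
```

Under these assumptions the plain name lands in order and both punctuated names do not, which is the pattern the refined hypothesis predicts. The apostrophe case matters because it checks that the problem is punctuation in general and not hyphens alone.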
Tips for Finding Errors
Once you've stabilized the error and refined the test case that produces it, finding
its source can be either trivial or challenging, depending on how well you've written your
code. If you're having a hard time finding an error, it's probably because the code isn't
well written. You might not want to hear that, but it's true. If you're having trouble,
consider the following tips.
Use all the data available to make your hypothesis. When creating a hypothesis
about the source of a defect, account for as much of the data as you can in your
hypothesis. In the earlier example, you might have noticed that Fruit-Loop, Frita
was out of order and made a hypothesis that names beginning with an "F" are
sorted incorrectly. That's a poor hypothesis because it doesn't account for the fact that Modula,
Mildred was out of order or that names are sorted correctly the second time around. If
data doesn't fit the hypothesis, don't discard the data; ask why it doesn't fit, and
create a new hypothesis.
On the other hand, the second hypothesis in the example, that the problem arises from
names with hyphens, not names that are entered singly, didn't seem initially to account
for the fact that names were sorted correctly the second time around, either. In this
case, however, the hypothesis led to a more refined hypothesis which proved to be correct.
It's all right not to account for all of the data at first as long as the hypothesis is
refined so that it does eventually.
Refine the test cases that produce the error. If you can't find the source of an
error, try to refine the test cases further than you already have. You might be able to
vary one parameter more than you had assumed, and focusing on one of the parameters may
provide the crucial breakthrough.
Reproduce the error several different ways. Sometimes trying things that are
similar to the error-producing case, but not exactly the same, is instructive. Think of it
in terms of triangulating the error. If you can get a fix on it from one point, and a fix
on it from another, then you can determine exactly where it is.
Once you think you've identified the error, run a case that's close to the cases that
produce errors but which should not produce an error itself. If it does produce an error,
you don't completely understand the problem yet. Errors often arise from combinations of
factors, and diagnosing the problem with one test case sometimes doesn't diagnose the root
problem.
Generate more data to generate more hypotheses. Choose test cases that are
different from the test cases you already know to be erroneous or correct. Run them to
generate more data, and use the new data to add to your list of possible hypotheses.
Use results of negative tests. Suppose you make a hypothesis and run a test case
to prove it. Suppose the test case disproves it, so that you still don't know where
the error is. You still know something you didn't before, namely that the error is not
in the area you thought it was. That narrows your search field and the set of possible
hypotheses.
Brainstorm for possible hypotheses. Rather than limiting yourself to the first
hypothesis you think of, try to come up with several. Don't analyze them at first, just
come up with as many as you can in a few minutes. Then look at each of them and think
about test cases that would prove or disprove them. This mental exercise is helpful in
breaking the debugging logjam that results from concentrating on a single line of
reasoning.
Narrow the suspicious region of the code. If you're testing the whole program,
or a whole module or routine, test a smaller part instead. Systematically remove parts of
the program, and see if the error still occurs. If it doesn't, you know it's in the part you
took away. If it does, you know it's in the part you've kept.
Rather than removing regions haphazardly, divide and conquer. Use a binary search
algorithm to focus your search. Try to remove about half the code the first time.
Determine the half the error is in, then chop that section in half. Again, determine the
half the error is in, and again, chop that section in half. Continue until you find the
error.
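Here is one way the halving might look in Python when the program can be treated as a sequence of steps. Everything in it is an assumption made for illustration: the four step functions, the planted defect, and the looks_ok check are hypothetical, and the search works only if the data stays bad once it has gone bad.

```python
def first_bad_step(steps, data, looks_ok):
    """Binary-search for the first step whose output fails the check.

    Assumes the input passes the check, the final output fails it, and
    that once the data goes bad it stays bad.
    """
    def run_through(count):
        result = data
        for step in steps[:count]:
            result = step(result)
        return result

    low, high = 0, len(steps)  # output is good after `low` steps, bad after `high`
    while high - low > 1:
        mid = (low + high) // 2
        if looks_ok(run_through(mid)):
            low = mid
        else:
            high = mid
    return high - 1  # index of the step that introduced the error


# Hypothetical processing steps, one of which has a planted defect.
def strip_blanks(names):
    return [n.strip() for n in names]

def capitalize(names):
    return [n.capitalize() for n in names]

def drop_blank_records(names):
    return names[:-1]  # planted defect: always drops the last record

def sort_names(names):
    return sorted(names)


steps = [strip_blanks, capitalize, drop_blank_records, sort_names]
data = [" goto", "modula ", "hardcase"]
looks_ok = lambda names: len(names) == len(data)

print(steps[first_bad_step(steps, data, looks_ok)].__name__)  # drop_blank_records
```

With four steps the search needs only two probes. The savings grow quickly as the number of suspect regions increases.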
If you use many small routines, you'll be able to chop out sections of code simply
by commenting out calls to the routines. Otherwise, you can use comments or preprocessor
commands to remove code.
If you're using a debugger, you don't necessarily have to remove pieces of code. Set a
breakpoint partway through the program and check for the error that way instead. If your
debugger allows you to skip calls to routines, eliminate suspects by not executing certain
routines and seeing if the error still occurs. The process with a debugger is otherwise
similar to the one in which pieces of a program are physically removed.
Check code that's changed recently. If you have a new error that's hard to find,
it's usually related to code that's changed recently. It could be in completely new code
or in changes to old code. If you can't find an error, run an old version of the program
and see if the error occurs. If it doesn't, you know the error's in the new version or
caused by an interaction with the new version. Compare the differences between the old and
new versions and scrutinize the differences.
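If the old and new versions are available as text, a few lines of Python can list the differences for you. The sketch below uses the standard difflib module. The two versions of add_employee are made up for the illustration and are not the code from the running example.

```python
import difflib

# Hypothetical old and new versions of the same routine.
old_version = """\
def add_employee(records, name):
    records.append(name)
    records.sort()
"""

new_version = """\
def add_employee(records, name):
    records.append(name)
    rough_sort(records)
"""

diff = difflib.unified_diff(
    old_version.splitlines(), new_version.splitlines(),
    fromfile="old", tofile="new", lineterm="")

# Keep only added and removed lines, skipping the "---" and "+++" file headers.
changed = [line for line in diff
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
print("\n".join(changed))
```

The lines that survive the filter are the first places to look. In practice a version-control system or a file-comparison tool does the same job on whole source trees.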
Expand the suspicious region of the code. It's easy to focus too much on a small
section of code, saying "the error must be in this section." If you don't
find it in the section, consider the possibility that the error isn't in the section.
Expand the area of code you view with suspicion, then focus on pieces of it using the
binary search technique described above.
Integrate incrementally. As described in Chapter 27 on integration, debugging is
easy if you add pieces to a system one at a time. If you add a piece to a system and
encounter a new error, remove the piece and test it separately. Strap on a test harness
and exercise the routine by itself to determine what's wrong.
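A test harness doesn't need to be elaborate. The Python sketch below calls a routine directly with known inputs and reports any mismatches. The routine format_withholding is a hypothetical stand-in for the piece you just added, and the expected strings simply follow the listing format used in the earlier example.

```python
def format_withholding(name, amount):
    """Hypothetical newly added routine, pulled out of the system for testing."""
    return f"{name} ${amount:,}"


def run_harness(routine, cases):
    """Call the routine directly on each case and collect any mismatches."""
    failures = []
    for args, expected in cases:
        actual = routine(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


cases = [
    (("Goto, Gary", 1666), "Goto, Gary $1,666"),
    (("Modula, Mildred", 10788), "Modula, Mildred $10,788"),
    (("Hardcase, Henry", 493), "Hardcase, Henry $493"),
]
failures = run_harness(format_withholding, cases)
print("all cases pass" if not failures else failures)
```

Because the harness runs the routine by itself, each case takes a fraction of a second, and a failure points at the routine rather than at the rest of the system.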
Be suspicious of routines that have had errors before. Contrary to common
intuition, routines that have had errors before are likely to continue to have errors. A routine
that has been troublesome in the past is more likely to contain a new error than a routine
that has been error-free. Re-examine error-prone routines.
Use brute force. If you've used incremental integration and a new error rears
its ugly head, you'll have a small section of code in which to check for the error. It's
sometimes tempting to run the integrated code to find the error, rather than
dis-integrating the code and checking the new routine by itself. Running a test case
through the integrated system, however, may require a few minutes whereas running one
through the code you're integrating takes only a few seconds. If you don't find the error
by running the whole system on the first or second time, bite the bullet, dis-integrate
the code, and debug the new code separately.
Set a maximum time for quick and dirty debugging. This is related to the
previous point, but it's more general. It's always tempting to try for a quick guess
rather than systematically instrumenting the code and giving the error no place to hide.
The gambler in each of us would rather use a risky approach that might find the
error in five minutes than the sure-fire approach that will find the error in half
an hour.
The risk is that if the five-minute approach doesn't work you get stubborn. Finding the
error the "easy" way becomes a matter of principle, and hours pass
unproductively.
When you decide to go for the quick victory, set a maximum time limit for trying the
quick way. If you surpass the time limit, resign yourself to the idea that the error is
harder than you originally thought, and flush it out the hard way. This approach allows
you to get the easy errors right away and the hard errors after a bit longer. Admittedly,
you'll have a few errors you "would have" found in a few more minutes of
quick-and-dirty debugging, but you'll never go home at the end of the day disgusted
because you spent the whole day guessing rather than 30 minutes working sensibly.
Check for common errors. Use code-quality checklists to stimulate your thinking
about possible errors. If you're following the inspection practices described in Section
24.2, you'll have your own fine-tuned checklist of the common problems in your
environment. You can also use the checklists presented throughout this book. For a list of
the checklists in this book, see the "List of Checklists" following the table of
contents.
Talk to someone else about the problem. Some people call this "confessional
debugging." You often discover your own error in the act of explaining it to another
person. For example, if you were explaining the problem in the salary example, you might
sound like this:
"Hey Jennifer. Have you got a minute; I'm having a problem. I've got this list of
employee salaries that's supposed to be sorted but some names are out of order. They're
sorted all right the second time I print them out but not the first. I checked to see if
it was new names, but it didn't seem like it was because I tried some that worked. I know
they should be sorted the first time I print them because the program sorts all the names
as they're entered and again when they're saved ... wait a minute ... no, it doesn't sort
them when they're entered. That's right. It only orders them roughly. Thanks
Jennifer. You've been a big help."
Jennifer didn't say a word, and you solved your problem. This result is typical, and this
approach is perhaps your most potent tool for solving the most difficult errors.
Take a break from the problem. Sometimes it's possible to concentrate so hard
that you can't think. How many times have you paused for a cup of coffee and figured out
the problem on your way to the coffee machine? Or in the middle of lunch? Or on the way
home? If you're debugging and making no progress, once you've tried all the options, let
it rest. Go for a walk. Work on something else. Let your subconscious mind tease a
solution out of the problem.
The auxiliary benefit of giving up temporarily is that it reduces the anxiety
associated with debugging. The onset of anxiety is a clear sign that it's time to take a
break.
This material is Copyright © 1993 by Steven C. McConnell. All
Rights Reserved.