 |
Code Complete: A Practical Handbook of Software
Construction. Redmond, Wa.: Microsoft Press, 880 pages,
1993. Retail price: $35. ISBN: 1-55615-484-4.
Buy
Code Complete from Amazon.com. |
28.2 Introduction to Code Tuning
What is the appeal of code tuning? It isn't the most effective way to improve
performance. Program design, data structure selection, and algorithm selection usually
produce more dramatic improvements. It isn't the easiest way to improve performance.
Buying new hardware or a compiler with a better optimizer is easier. It isn't the cheapest
way to improve performance. It takes more time to hand-tune code initially, and
code-tuning increases the readability burden later.
Code tuning is appealing for several reasons. One is that it seems to defy the laws of
nature. It's incredibly satisfying to take a routine that executes in 500 milliseconds,
tweak a few lines, and reduce the execution speed to 100 milliseconds.
It's also appealing because mastering the art of writing efficient code is a rite of
passage to becoming a serious programmer. Every specialized activity has similar abilities
that you master as a rite of passage. In tennis, for example, you don't get any points for
the way you pick up a tennis ball, but you still need to learn the right way to do it. You
can't just lean over and pick it up with your hand. If you're good, you whack it with the
head of your racket until it bounces waist high, then you catch it. Whacking it more than
three times or not bouncing it the first time are all serious failings.
It doesn't really matter how you pick up a tennis ball. Within the tennis culture,
however, the way you pick it up carries a certain cachet. Similarly, no one but you and
other programmers usually care how tight your code is. Nonetheless, within the programming
culture, writing micro-efficient code proves you're cool.
A final reason for code tuning's appeal is that programmers see the software world from
code's point of view. From that point of view, "better" code means better
software.
The problem with code tuning is that efficient code isn't necessarily
"better" code, no matter how appealing it is. The irresistable appeal of
efficient coding's siren song and the number of programming sailors who crash their
programming ships on the siren's rocky shores call for a code-tuning safety-awareness
program. That's the subject of the next few subsections.
Old Wives' Tales
The first safety lesson in code tuning is that much of what you've heard about code
tuning is false; specifically, the following:
Reducing the lines of code in a high-level language improves the speed or size of
the resulting machine code--False! Many programmers cling tenaciously to the belief
that if they can write code in one or two lines, it will be the most efficient possible.
Consider the following line that initializes a five-element array:
for i = 1 to 5 do a[ i ] = i
Would you guess that the line is faster or slower than the following five lines that do
the same job?
a[ 1 ] = 1
a[ 2 ] = 2
a[ 3 ] = 3
a[ 4 ] = 4
a[ 5 ] = 5
If you follow the old "fewer lines are faster" dogma, you'll guess that the
first code is faster because it has four fewer lines. Hah! Tests in Pascal and Basic
showed that the second fragment is over 80 percent faster than the first. Here are the
numbers:
for-loop Straight-code Time Performance
Language time time savings ratio
Pascal 0.660 0.110 83% 6:1
Basic 0.379 0.051 87% 7:1
Notes: Unless otherwise indicated, (1) Times in these tables are given in seconds and
should be interpreted only as relative times; (2) Performance ratios are rounded; (3)
Specific brands and versions of languages aren't given because performance characteristics
vary from brand to brand, version to version, and with different compiler settings; (4) In
some cases, comparisons between two languages aren't meaningful because the code has been
benchmarked with incommensurate compiler settings or floating-point variables.
In addition to the dramatic speed improvement, in Pascal, the first fragment consumes
47 bytes whereas the second takes only 30, a size savings of 36 percent. This certainly
doesn't imply the absurd conclusion that increasing the number of lines of high-level
language code always improves speed or reduces size. It does imply that, in spite of the
aesthetic appeal of writing something with the fewest lines of code, there's no
predictable relationship between lines of code in a high-level language and a program's
ultimate size and speed.
Certain operations are probably faster or smaller than others--False! There's no
room for "probably" when you're talking about performance. You must always
measure performance to know whether your changes helped or hurt your program. The rules of
the game change every time you change languages, compilers, or even versions of compilers.
What was true on one machine with one compiler is probably false on another machine with
another compiler.
This phenomenon suggests several reasons not to improve performance by code tuning. If
you want your program to be portable, techniques that improve performance in one
environment can degrade it in others. If you change or upgrade compilers, the new compiler
may automatically optimize code the way you were hand tuning it-your work will have been
wasted. Even worse, your code tuning may defeat compiler optimizations that are designed
to work with well-structured code.
"We should forget about small efficiencies, say about 97% of the time: premature
optimization is the root of all evil." -- Donald Knuth
Optimize as you go--False! One theory is that if you strive to write the fastest
and smallest possible code as you write each routine, your program will be fast and small.
This approach creates a forest-for-the-trees situation in which programmers ignore
significant global optimizations because they're so busy with micro-optimizations. Here
are the main problems with optimizing as you go along:
- It's almost impossible to identify performance bottlenecks before the program is
completely working. Programmers almost never target the right area ahead of time.
This leads to time spent optimizing the wrong area and a program with inferior performance
because the time needed for effective optimization has been wasted.
- In the rare case in which developers identify the bottlenecks correctly, they overkill
the ones they've identified and allow other ones to become critical. Again, the ultimate
effect is a reduction in performance. In contrast, optimizations done after the system is
complete can identify each problem area and its relative importance so that time is used
effectively.
- Focusing on optimization during initial development detracts from achieving other
program characteristics. Developers immerse themselves in algorithm analysis and arcane
debates which in the end don't contribute much value to the user. Competing concerns like
correctness, modularity, information hiding, and readability become secondary goals, even
though performance is easier to retrofit than any of them are. Post-hoc performance work
typically affects only about 5 percent of the code. Would you rather go back and do
performance work on 5 percent of the code or readability work on 100 percent?
If the development time saved by implementing the simplest program is applied to
optimizing the running program, the result will always be a faster running program than
one in which optimization efforts are applied indiscriminately as the program is developed
(Stevens 1981).
In short, premature optimization's primary problem is a lack of perspective, and final
performance, the program and its users usually end up suffering.
You'll find a few projects in which post hoc optimization isn't sufficient to meet
performance goals, and you have to make major changes in the completed code. In those
cases, small, localized optimizations wouldn't have provided the needed gains anyway. The
problem isn't inadequate code quality-it's inadequate software architecture.
If you need to optimize before the program is complete, minimize the risks by building
perspective into your process. One way of doing so is to spec performance and speed goals
for routines or subsystems, then optimize to meet the goals as you go along. Setting such
goals in a spec is way to keep one eye on the forest while you figure out how big your
particular tree is.
A fast program is just as important as a correct one--False! It's hardly ever
true that programs need to be fast or small before they need to be correct. Gerald
Weinberg tells the story of a programmer who was flown to Detroit to help debug a program
that was in trouble. The programmer worked with the team of programmers who had developed
the program, and after several days he concluded that the situation was hopeless.
On the flight home, he mulled over the last few days and realized the true problem. By
the end of the flight, he had outlined the necessary code. He tested it for several days
and was about to return to Detroit when he got a telegram saying the project had been
cancelled because the program was impossible to write. He headed back to Detroit anyway
and convinced the executives that the project could be completed.
He then had to convince the project's original programmers. They listened to his
presentation, and when he'd finished, the creator of the old system asked,
"And how long does your program take?"
"That varies, but about ten seconds per input."
"Aha! But my program takes only one second per input." The veteran
leaned back, satisfied that he'd stumped the upstart programmer. The other programmers
seemed to agree, but the new programmer wasn't intimidated.
"Yes, but your program doesn't work. If mine doesn't have to work, I can
make it run instantly and take up no memory. "
To summarize, for a certain class of projects, size or speed are a major risk. This
class is the minority, and is much smaller than most people think. For these projects,
up-front design must be done to properly address the performance risks. For other
projects, premature optimization poses a significant threat to overall software quality,
including performance.
The Pareto Principal
The Pareto principle, often known as the 80/20 rule, states that you can get 80 percent
of the result with 20 percent of the effort. It applies to a lot of things other than
programming, but it definitely applies to program optimization.
Barry Boehm reports that he has measured that 20 percent of the routines consume 80
percent of the execution time (Boehm 1987b). In the classic paper, "An Empirical
Study of FORTRAN programs," Donald Knuth found that less than 4 percent of a program
usually accounts for more than 50 percent of its run time (Knuth 1971).
Knuth used a line-count profiler to discover this surprising relationship, and the
implications for optimization are clear. Measure the code to find the hot spots, then put
your resources into optimizing the few percent that are used the most. Knuth profiled his
line-count program and found that it was spending half the execution time in two loops. He
changed a few lines of code and doubled the speed of the profiler in less than an hour.
Jon Bentley describes a case in which a thousand-line program spent 80 percent of its
time in a five-line square-root routine. By tripling the speed of the square-root routine
he doubled the speed of the program (Bentley 1988).
Bentley also reports the case of a team that discovered that half an operating system's
time was spent in a small loop. They rewrote the loop in microcode and made the loop 10
times faster, but it didn't change the system's performance-they had rewritten the
system's idle loop.
The team that designed the Algol language-the precursor to Pascal, C, and Ada and one
of the most influential languages ever-received the following advice: The best is the
enemy of the good. Working toward perfection may prevent completion. Complete it first,
then perfect it. The part that needs to be perfect is usually small.
Measurement
Since small parts of a program tend to take a disproportionate share of the run time,
measure the code to find the hot spots.
Once you've found the hot spots and optimized them, measure the code again to assess
how much you've improved it. Many aspects of performance are counter intuitive. The case
given earlier in which five lines of code were significantly faster and smaller than one
line is an example of a way in which code can surprise you.
Experience doesn't help much with optimization. A person's experience might have come
from an old machine, language, or compiler, and when any of those things changes, all bets
are off. You can never be sure about the effect of an optimization until you measure it.
A few years ago I wrote a program which summed the elements in a matrix. The original
code looked like this:
C Example of Straightforward Code to Sum the Elements in a Matrix
Sum = 0;
for ( Row = 0; Row < RowCount; Row++ )
{
for ( Column = 0; Column < ColumnCount; Column++ )
{
Sum = Sum + Matrix[ Row ][ Column ];
}
}
This code was straightforward. Performance of the matrix-summation routine was
critical, however, and I knew all the array accesses and loop tests had to be
expensive. I knew from computer-science classes that every time the code accessed a
two-dimensional array, it performed expensive multiplications and additions. For a
10-by-10 matrix, that totaled 100 multiplications and additions plus the loop overhead. By
converting to pointer notation, I reasoned, I could increment a pointer and exchange 100
relatively cheap increment operations for 100 expensive multiplications. Consequently, I
carefully converted the code to pointer notation and got this:
C Example of an Attempt to Tune Code to Sum the Elements in a Matrix
Sum = 0;
ElementPtr = Matrix;
LastElementPtr = Matrix[ RowCount-1 ][ ColumnCount-1 ] + 1;
while ( ElementPtr < LastElementPtr )
{
Sum += *ElementPtr++;
}
Even though the code isn't as readable as the first code, especially to programmers who
aren't C experts, I was magnificently pleased with myself. For a 10-by-10 matrix, I
calculated that I had saved 100 multiplications and a lot of loop overhead. I was so
pleased, that I decided to measure the speed improvement, something I didn't always do
back then, so that I could pat myself on the back more quantitatively.
Do you know what I found?
No improvement whatsoever. Not with a 10-by-10 matrix. Not with a 3-by-3 matrix. Not
with a 25-by-25 matrix. The compiler's optimizer had optimized the first code well enough
that my optimization didn't help at all. I was powerfully disappointed, and I learned that
that only thing you can usually be sure of without measuring performance is that you've
made your code harder to read.
In short, if it's not worth measuring to prove it's more efficient, then it's not worth
sacrificing clarity for a performance gamble.
Measurements Need to be Precise
In a similar vein, performance measurements need to be precise. Timing your program
with a stopwatch or by counting "one elephant, two elephant, three elephant"
isn't precise enough.
Profiling tools are useful, or you can use your system's clock and routines that record
the elapsed time for competing operations.
Whether you use someone else's tool or write code to make the measurements yourself,
make sure that you're only measuring the execution time of the code you're tuning. If
you're working on a multi-user or multi-tasking system, use the number of CPU clock ticks
allocated to your program rather than the time of day. Otherwise, when the system swaps
your program out and another one in, one of your routines is penalized for the time spent
executing another program. Likewise, try to factor out testing overhead so that neither
the original code nor the tuning attempt is unfairly penalized.
Compiler Optimizations
Modern compiler optimizations might be more powerful than you expect. As I described
earlier, in one case my compiler did as good a job of optimizing a nested loop as I was
able to do by rewriting the code in a supposedly more efficient style.
When shopping for a compiler, compare the performance of each compiler on your program.
Each compiler has different strengths and weaknesses, and some are better suited to your
program than others.
Optimizing compilers are better at optimizing straightforward code than they are at
tricky code. If you do "clever" things like fooling around with loop indexes,
your compiler can't do its job, and your code suffers. For an example, see the case in
"Use Only One Statement per Line" in Section 18.5 in which a straightforward
approach resulted in code that was 15 percent faster and 45 percent smaller than
comparable "tricky" code.
With a good optimizing compiler, your code speed can improve 30 percent or more across
the board. Many of the techniques described in the next chapter produce gains of only
about 20 percent. Why not just write clear code and let the compiler do the work? Here are
the results of a few tests to check how much an optimizer speeded up an insertion sort
routine:
Time without Time with
compiler compiler Time Performance
Language optimizations optimizations savings ratio
Ada 2.80 1.60 43% 2:1
C compiler 1 2.25 1.65 27% 4:3
C compiler 2 2.58 2.58 0% 1:1
C compiler 3 2.31 0.99 57% 2:1
The only difference between versions was that compiler optimization was turned off for
the first compile, on for the second. Clearly, some compilers optimize better than others
and you'll have to check your own compiler to see what effect it has. The range of
differences in the three C compilers shows that you can't generalize about any particular language
being more efficient than another. The Ada compiler produced faster code than two of the
three C compilers, slower code than the third.
When to Tune
The germ of the guideline for when to optimize has already been given. Use a
high-quality design. Make the program right. Make it modular and easily modifiable so that
it's easy to work on later. When it's complete and correct, check the performance. If it's
lacking, then make it fast and small. In short, don't optimize until you know you need to.
Iteration
Once you've identified a performance bottleneck, you may be amazed at how much you can
improve performance by code tuning. You'll rarely get a 10-fold improvement from one
technique, but you can effectively combine techniques, so keep trying, even after you find
one that works.
Once I wrote a software implementation of the Data Encryption Standard, or DES.
Actually, I didn't write it once, I wrote it about 30 times. DES is designed to encode
digital data so that it can't be unscrambled without a password. The encryption algorithm
is so convoluted that it seems like it might have been used on itself. The performance
goal for my DES implementation was to encrypt an 18K file in 37 seconds on an original IBM
PC. My first implementation executed in 21 minutes and 40 seconds, so I had a long row to
hoe. Here's a history of the optimizations:
Optimization Benchmark Time Improvement
Benchmark
Optimization Time Improvement
Initial, straightforward implementation 21:40 -
Convert from bit fields to arrays 7:30 65%
Unroll innermost for loop 6:00 20%
Remove final permutation 5:24 10%
Combine two variables 5:06 5%
Use a logical identity to combine the first 4:30 12%
two steps of the DES algorithm
Make two variables share the same memory to 3:36 20%
reduce data shuttling in inner loop
Make two variables share the same memory to 3:09 13%
reduce data shuttling in outer loop
Unfold all loops and use literal array subscripts 1:36 49%
Remove routine calls and put all the code inline 0:45 53%
Rewrite the whole routine in assembler 0:22 51%
FINAL 0:22 98%
Note: The steady progress of optimizations in this table doesn't imply that all
optimizations work. I haven't shown all the things I tried that doubled the run time.
Even though most individual optimizations were small, cumulatively they were
significant. Based on the percentage improvements, no three or even four optimizations
would have met my performance goal. But the final combination was effective. The morale of
the story is that if you dig deep enough, you can make some surprising gains.
The code tuning I did in this case is the most aggressive code tuning I've ever done.
At the same time, the final code is the most unreadable, unmaintainable code I've ever
written. The initial algorithm is complicated. The code resulting from the high-level
language transformation was barely readable. The translation to assembler produced a
single 500-line routine that I'm afraid to look at. In general, this relationship between
code tuning and code quality holds true.
This material is Copyright © 1993 by Steven C. McConnell. All
Rights Reserved.
|