Code Complete: A Practical Handbook of Software Construction. Redmond, WA: Microsoft Press, 880 pages, 1993. Retail price: $35. ISBN: 1-55615-484-4.

28.2 Introduction to Code Tuning

What is the appeal of code tuning? It isn't the most effective way to improve performance. Program design, data structure selection, and algorithm selection usually produce more dramatic improvements. It isn't the easiest way to improve performance. Buying new hardware or a compiler with a better optimizer is easier. It isn't the cheapest way to improve performance. It takes more time to hand-tune code initially, and code-tuning increases the readability burden later.

Code tuning is appealing for several reasons. One is that it seems to defy the laws of nature. It's incredibly satisfying to take a routine that executes in 500 milliseconds, tweak a few lines, and reduce the execution time to 100 milliseconds.

It's also appealing because mastering the art of writing efficient code is a rite of passage to becoming a serious programmer. Every specialized activity has similar skills that you master as a rite of passage. In tennis, for example, you don't get any points for the way you pick up a tennis ball, but you still need to learn the right way to do it. You can't just lean over and pick it up with your hand. If you're good, you whack it with the head of your racket until it bounces waist high, and then you catch it. Whacking it more than three times or failing to bounce it on the first try are both serious failings.

It doesn't really matter how you pick up a tennis ball. Within the tennis culture, however, the way you pick it up carries a certain cachet. Similarly, no one but you and other programmers usually care how tight your code is. Nonetheless, within the programming culture, writing micro-efficient code proves you're cool.

A final reason for code tuning's appeal is that programmers see the software world from code's point of view. From that point of view, "better" code means better software.

The problem with code tuning is that efficient code isn't necessarily "better" code, no matter how appealing it is. The irresistible appeal of efficient coding's siren song and the number of programming sailors who crash their programming ships on the siren's rocky shores call for a code-tuning safety-awareness program. That's the subject of the next few subsections.

Old Wives' Tales

The first safety lesson in code tuning is that much of what you've heard about code tuning is false; specifically, the following:

Reducing the lines of code in a high-level language improves the speed or size of the resulting machine code--False! Many programmers cling tenaciously to the belief that if they can write code in one or two lines, it will be the most efficient possible. Consider the following line that initializes a five-element array:

for i = 1 to 5 do a[ i ] = i

Would you guess that the line is faster or slower than the following five lines that do the same job?

a[ 1 ] = 1
a[ 2 ] = 2
a[ 3 ] = 3
a[ 4 ] = 4
a[ 5 ] = 5

If you follow the old "fewer lines are faster" dogma, you'll guess that the first code is faster because it has four fewer lines. Hah! Tests in Pascal and Basic showed that the second fragment is over 80 percent faster than the first. Here are the numbers:


              for-loop   Straight-code   Time      Performance
Language      time       time            savings   ratio
Pascal        0.660      0.110           83%       6:1
Basic         0.379      0.051           87%       7:1

Notes: Unless otherwise indicated, (1) Times in these tables are given in seconds and should be interpreted only as relative times; (2) Performance ratios are rounded; (3) Specific brands and versions of languages aren't given because performance characteristics vary from brand to brand, version to version, and with different compiler settings; (4) In some cases, comparisons between two languages aren't meaningful because the code has been benchmarked with incommensurate compiler settings or floating-point variables.

In addition to the dramatic speed improvement, in Pascal, the first fragment consumes 47 bytes whereas the second takes only 30, a size savings of 36 percent. This certainly doesn't imply the absurd conclusion that increasing the number of lines of high-level language code always improves speed or reduces size. It does imply that, in spite of the aesthetic appeal of writing something with the fewest lines of code, there's no predictable relationship between lines of code in a high-level language and a program's ultimate size and speed.
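
If you want to test this claim in your own environment, here is a minimal C rendition of the two fragments--not the Pascal or Basic code that was actually benchmarked--that you can time with your own compiler and optimizer settings:

C Example of the Two Initialization Fragments

int a[ 6 ];   /* element 0 unused, to mirror the 1-based pseudocode above */
int i;

/* the for-loop version */
for ( i = 1; i <= 5; i++ )
   {
   a[ i ] = i;
   }

/* the straight-line version */
a[ 1 ] = 1;
a[ 2 ] = 2;
a[ 3 ] = 3;
a[ 4 ] = 4;
a[ 5 ] = 5;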

Certain operations are probably faster or smaller than others--False! There's no room for "probably" when you're talking about performance. You must always measure performance to know whether your changes helped or hurt your program. The rules of the game change every time you change languages, compilers, or even versions of compilers. What was true on one machine with one compiler is probably false on another machine with another compiler.

This phenomenon suggests several reasons not to improve performance by code tuning. If you want your program to be portable, techniques that improve performance in one environment can degrade it in others. If you change or upgrade compilers, the new compiler may automatically optimize code the way you were hand tuning it--your work will have been wasted. Even worse, your code tuning may defeat compiler optimizations that are designed to work with well-structured code.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

Optimize as you go--False! One theory is that if you strive to write the fastest and smallest possible code as you write each routine, your program will be fast and small. This approach creates a forest-for-the-trees situation in which programmers ignore significant global optimizations because they're so busy with micro-optimizations. Here are the main problems with optimizing as you go along:

  • It's almost impossible to identify performance bottlenecks before the program is completely working. Programmers almost never target the right area ahead of time. This leads to time spent optimizing the wrong area and a program with inferior performance because the time needed for effective optimization has been wasted.
  • In the rare case in which developers identify the bottlenecks correctly, they overkill the ones they've identified and allow other ones to become critical. Again, the ultimate effect is a reduction in performance. In contrast, optimizations done after the system is complete can identify each problem area and its relative importance so that time is used effectively.
  • Focusing on optimization during initial development detracts from achieving other program characteristics. Developers immerse themselves in algorithm analysis and arcane debates which in the end don't contribute much value to the user. Competing concerns like correctness, modularity, information hiding, and readability become secondary goals, even though performance is easier to retrofit than any of them are. Post-hoc performance work typically affects only about 5 percent of the code. Would you rather go back and do performance work on 5 percent of the code or readability work on 100 percent?

If the development time saved by implementing the simplest program is applied to optimizing the running program, the result will always be a faster running program than one in which optimization efforts are applied indiscriminately as the program is developed (Stevens 1981).

In short, premature optimization's primary problem is a lack of perspective; final performance, the program, and its users usually end up suffering.

You'll find a few projects in which post hoc optimization isn't sufficient to meet performance goals, and you have to make major changes in the completed code. In those cases, small, localized optimizations wouldn't have provided the needed gains anyway. The problem isn't inadequate code quality--it's inadequate software architecture.

If you need to optimize before the program is complete, minimize the risks by building perspective into your process. One way of doing so is to spec performance and speed goals for routines or subsystems, then optimize to meet the goals as you go along. Setting such goals in a spec is a way to keep one eye on the forest while you figure out how big your particular tree is.
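
One way to make such a goal concrete is to wrap the routine in a check that fails when its time budget is exceeded. The sketch below is only illustrative--ProcessTransaction and the 50-millisecond budget are hypothetical, and clock() measures processor time rather than time of day:

C Example of Checking a Routine Against a Spec'd Time Budget

#include <stdio.h>
#include <time.h>

#define MAX_PROCESS_MILLISECONDS 50.0    /* hypothetical budget from the spec */

extern void ProcessTransaction( void );  /* hypothetical routine under test */

/* returns 1 if the routine meets its budget, 0 if it doesn't;
   a routine much faster than the clock's resolution should be
   timed over many repetitions instead of a single call */
int PerformanceGoalIsMet( void )
   {
   clock_t start, stop;
   double  elapsedMilliseconds;

   start = clock();
   ProcessTransaction();
   stop = clock();

   elapsedMilliseconds = 1000.0 * ( stop - start ) / CLOCKS_PER_SEC;
   printf( "ProcessTransaction: %.1f ms (budget %.1f ms)\n",
      elapsedMilliseconds, MAX_PROCESS_MILLISECONDS );
   return ( elapsedMilliseconds <= MAX_PROCESS_MILLISECONDS );
   }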

A fast program is just as important as a correct one--False! It's hardly ever true that programs need to be fast or small before they need to be correct. Gerald Weinberg tells the story of a programmer who was flown to Detroit to help debug a program that was in trouble. The programmer worked with the team of programmers who had developed the program, and after several days he concluded that the situation was hopeless.

On the flight home, he mulled over the last few days and realized the true problem. By the end of the flight, he had outlined the necessary code. He tested it for several days and was about to return to Detroit when he got a telegram saying the project had been cancelled because the program was impossible to write. He headed back to Detroit anyway and convinced the executives that the project could be completed.

He then had to convince the project's original programmers. They listened to his presentation, and when he'd finished, the creator of the old system asked,

"And how long does your program take?"

"That varies, but about ten seconds per input."

"Aha! But my program takes only one second per input." The veteran leaned back, satisfied that he'd stumped the upstart programmer. The other programmers seemed to agree, but the new programmer wasn't intimidated.

"Yes, but your program doesn't work. If mine doesn't have to work, I can make it run instantly and take up no memory. "

To summarize, for a certain class of projects, size or speed is a major risk. This class is the minority and is much smaller than most people think. For these projects, up-front design must be done to properly address the performance risks. For other projects, premature optimization poses a significant threat to overall software quality, including performance.

The Pareto Principle

The Pareto principle, often known as the 80/20 rule, states that you can get 80 percent of the result with 20 percent of the effort. It applies to a lot of things other than programming, but it definitely applies to program optimization.

Barry Boehm reports that he has measured that 20 percent of the routines consume 80 percent of the execution time (Boehm 1987b). In the classic paper "An Empirical Study of FORTRAN Programs," Donald Knuth found that less than 4 percent of a program usually accounts for more than 50 percent of its run time (Knuth 1971).

Knuth used a line-count profiler to discover this surprising relationship, and the implications for optimization are clear. Measure the code to find the hot spots, then put your resources into optimizing the few percent that are used the most. Knuth profiled his line-count program and found that it was spending half the execution time in two loops. He changed a few lines of code and doubled the speed of the profiler in less than an hour.
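
If you don't have a profiler at hand, you can approximate the line-count approach by instrumenting the suspect routines with counters and dumping the counts at the end of a run. The sketch below is only an illustration--the routine names are hypothetical, and a real profiler is far less intrusive:

C Example of Crude Hand-Rolled Hit Counting

#include <stdio.h>

static long SearchHits = 0;    /* hit counters for the suspect routines */
static long FormatHits = 0;

void SearchTable( void )       /* hypothetical routine */
   {
   SearchHits++;
   /* ... the routine's real work goes here ... */
   }

void FormatOutput( void )      /* hypothetical routine */
   {
   FormatHits++;
   /* ... the routine's real work goes here ... */
   }

void DumpHitCounts( void )     /* call once at the end of the run */
   {
   printf( "SearchTable:  %ld calls\n", SearchHits );
   printf( "FormatOutput: %ld calls\n", FormatHits );
   }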

Jon Bentley describes a case in which a thousand-line program spent 80 percent of its time in a five-line square-root routine. By tripling the speed of the square-root routine he doubled the speed of the program (Bentley 1988).

Bentley also reports the case of a team that discovered that half an operating system's time was spent in a small loop. They rewrote the loop in microcode and made the loop 10 times faster, but it didn't change the system's performance--they had rewritten the system's idle loop.

The team that designed the Algol language--the precursor to Pascal, C, and Ada and one of the most influential languages ever--received the following advice: The best is the enemy of the good. Working toward perfection may prevent completion. Complete it first, then perfect it. The part that needs to be perfect is usually small.

Measurement

Since small parts of a program tend to take a disproportionate share of the run time, measure the code to find the hot spots.

Once you've found the hot spots and optimized them, measure the code again to assess how much you've improved it. Many aspects of performance are counterintuitive. The case given earlier, in which five lines of code were significantly faster and smaller than one line, is an example of a way in which code can surprise you.

Experience doesn't help much with optimization. A person's experience might have come from an old machine, language, or compiler, and when any of those things changes, all bets are off. You can never be sure about the effect of an optimization until you measure it.

A few years ago I wrote a program which summed the elements in a matrix. The original code looked like this:

C Example of Straightforward Code to Sum the Elements in a Matrix

Sum = 0;
for ( Row = 0; Row < RowCount; Row++ )
   {
   for ( Column = 0; Column < ColumnCount; Column++ )
      {
      Sum = Sum + Matrix[ Row ][ Column ];
      }
   }

This code was straightforward. Performance of the matrix-summation routine was critical, however, and I knew all the array accesses and loop tests had to be expensive. I knew from computer-science classes that every time the code accessed a two-dimensional array, it performed expensive multiplications and additions. For a 10-by-10 matrix, that totaled 100 multiplications and additions plus the loop overhead. By converting to pointer notation, I reasoned, I could increment a pointer and trade 100 expensive multiplications for 100 relatively cheap increment operations. Consequently, I carefully converted the code to pointer notation and got this:

C Example of an Attempt to Tune Code to Sum the Elements in a Matrix

Sum            = 0;
ElementPtr     = &Matrix[ 0 ][ 0 ];
LastElementPtr = &Matrix[ RowCount-1 ][ ColumnCount-1 ] + 1;
while ( ElementPtr < LastElementPtr )
   {
   Sum += *ElementPtr++;
   }

Even though the code isn't as readable as the first code, especially to programmers who aren't C experts, I was magnificently pleased with myself. For a 10-by-10 matrix, I calculated that I had saved 100 multiplications and a lot of loop overhead. I was so pleased that I decided to measure the speed improvement, something I didn't always do back then, so that I could pat myself on the back more quantitatively.

Do you know what I found?

No improvement whatsoever. Not with a 10-by-10 matrix. Not with a 3-by-3 matrix. Not with a 25-by-25 matrix. The compiler's optimizer had optimized the first code well enough that my optimization didn't help at all. I was powerfully disappointed, and I learned that the only thing you can usually be sure of without measuring performance is that you've made your code harder to read.

In short, if it's not worth measuring to prove it's more efficient, then it's not worth sacrificing clarity for a performance gamble.

Measurements Need to be Precise

In a similar vein, performance measurements need to be precise. Timing your program with a stopwatch or by counting "one elephant, two elephant, three elephant" isn't precise enough.

Profiling tools are useful, or you can use your system's clock and routines that record the elapsed time for competing operations.

Whether you use someone else's tool or write code to make the measurements yourself, make sure that you're only measuring the execution time of the code you're tuning. If you're working on a multi-user or multi-tasking system, use the number of CPU clock ticks allocated to your program rather than the time of day. Otherwise, when the system swaps your program out and another one in, one of your routines is penalized for the time spent executing another program. Likewise, try to factor out testing overhead so that neither the original code nor the tuning attempt is unfairly penalized.
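
Here's a sketch of that kind of measurement in C. clock() reports processor time rather than time of day on most systems, the code under test is repeated enough times to swamp the clock's resolution, and the cost of an empty loop is measured separately so it can be subtracted. RoutineUnderTest and the repetition count are placeholders, and an aggressive optimizer may remove the empty loop entirely, so check what your compiler actually generates:

C Example of a Simple Timing Harness

#include <stdio.h>
#include <time.h>

#define REPETITIONS 10000L    /* placeholder; use enough repetitions to get a measurable interval */

extern void RoutineUnderTest( void );   /* placeholder for the code being tuned */

void MeasureRoutine( void )
   {
   clock_t start;
   long    i;
   double  loopOverhead, totalTime;

   /* measure the overhead of the timing loop itself so it can be factored out */
   start = clock();
   for ( i = 0; i < REPETITIONS; i++ )
      {
      ;
      }
   loopOverhead = (double) ( clock() - start ) / CLOCKS_PER_SEC;

   /* measure the routine, then subtract the loop overhead */
   start = clock();
   for ( i = 0; i < REPETITIONS; i++ )
      {
      RoutineUnderTest();
      }
   totalTime = (double) ( clock() - start ) / CLOCKS_PER_SEC - loopOverhead;

   printf( "Average time per call: %g seconds\n", totalTime / REPETITIONS );
   }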

Compiler Optimizations

Modern compiler optimizations might be more powerful than you expect. As I described earlier, in one case my compiler did as good a job of optimizing a nested loop as I was able to do by rewriting the code in a supposedly more efficient style.

When shopping for a compiler, compare the performance of each compiler on your program. Each compiler has different strengths and weaknesses, and some are better suited to your program than others.

Optimizing compilers are better at optimizing straightforward code than they are at tricky code. If you do "clever" things like fooling around with loop indexes, your compiler can't do its job, and your code suffers. For an example, see the case in "Use Only One Statement per Line" in Section 18.5 in which a straightforward approach resulted in code that was 15 percent faster and 45 percent smaller than comparable "tricky" code.

With a good optimizing compiler, your code speed can improve 30 percent or more across the board. Many of the techniques described in the next chapter produce gains of only about 20 percent. Why not just write clear code and let the compiler do the work? Here are the results of a few tests to check how much an optimizer speeded up an insertion sort routine:


                  Time without     Time with 
                  compiler         compiler        Time      Performance
Language          optimizations    optimizations   savings   ratio
Ada               2.80             1.60            43%       2:1
C compiler 1      2.25             1.65            27%       4:3
C compiler 2      2.58             2.58            0%        1:1
C compiler 3      2.31             0.99            57%       2:1

The only difference between versions was that compiler optimization was turned off for the first compile, on for the second. Clearly, some compilers optimize better than others and you'll have to check your own compiler to see what effect it has. The range of differences in the three C compilers shows that you can't generalize about any particular language being more efficient than another. The Ada compiler produced faster code than two of the three C compilers, slower code than the third.
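
The sort routine itself isn't shown here; what was compiled--once with optimization off and once with it on--was a straightforward insertion sort along the general lines of the following C sketch (the identifiers are illustrative, not the benchmarked code):

C Example of a Straightforward Insertion Sort

/* sorts Data[ 0 .. Count-1 ] into ascending order */
void InsertionSort( int Data[], int Count )
   {
   int i, j, InsertValue;

   for ( i = 1; i < Count; i++ )
      {
      /* slide earlier elements right until the insertion point is found */
      InsertValue = Data[ i ];
      j = i - 1;
      while ( j >= 0 && Data[ j ] > InsertValue )
         {
         Data[ j + 1 ] = Data[ j ];
         j = j - 1;
         }
      Data[ j + 1 ] = InsertValue;
      }
   }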

When to Tune

The germ of the guideline for when to optimize has already been given. Use a high-quality design. Make the program right. Make it modular and easily modifiable so that it's easy to work on later. When it's complete and correct, check the performance. If it's lacking, then make it fast and small. In short, don't optimize until you know you need to.

Iteration

Once you've identified a performance bottleneck, you may be amazed at how much you can improve performance by code tuning. You'll rarely get a 10-fold improvement from one technique, but you can effectively combine techniques, so keep trying, even after you find one that works.

Once I wrote a software implementation of the Data Encryption Standard, or DES. Actually, I didn't write it once, I wrote it about 30 times. DES is designed to encode digital data so that it can't be unscrambled without a password. The encryption algorithm is so convoluted that it seems like it might have been used on itself. The performance goal for my DES implementation was to encrypt an 18K file in 37 seconds on an original IBM PC. My first implementation executed in 21 minutes and 40 seconds, so I had a long row to hoe. Here's a history of the optimizations:


                                                     Benchmark   
Optimization                                         Time        Improvement
Initial, straightforward implementation              21:40       -
Convert from bit fields to arrays                    7:30        65%
Unroll innermost for loop                            6:00        20%
Remove final permutation                             5:24        10%
Combine two variables                                5:06        5%
Use a logical identity to combine the first          4:30        12%
two steps of the DES algorithm 
Make two variables share the same memory to          3:36        20%
reduce data shuttling in inner loop
Make two variables share the same memory to          3:09        13%
reduce data shuttling in outer loop
Unfold all loops and use literal array subscripts    1:36        49%
Remove routine calls and put all the code inline     0:45        53%
Rewrite the whole routine in assembler               0:22        51%
FINAL                                                0:22        98%

Note: The steady progress of optimizations in this table doesn't imply that all optimizations work. I haven't shown all the things I tried that doubled the run time.
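
To give a feel for one of the entries, "Unroll innermost for loop" refers to the kind of transformation sketched below in C--an illustration only, with hypothetical names, not the actual DES code, which isn't shown here:

C Example of Unrolling an Inner Loop

unsigned char Output[ 8 ], Input[ 8 ], Table[ 256 ];   /* hypothetical data */
int i;

/* before unrolling: increment and test overhead on every iteration */
for ( i = 0; i < 8; i++ )
   {
   Output[ i ] = Table[ Input[ i ] ];
   }

/* after unrolling: the loop overhead disappears */
Output[ 0 ] = Table[ Input[ 0 ] ];
Output[ 1 ] = Table[ Input[ 1 ] ];
Output[ 2 ] = Table[ Input[ 2 ] ];
Output[ 3 ] = Table[ Input[ 3 ] ];
Output[ 4 ] = Table[ Input[ 4 ] ];
Output[ 5 ] = Table[ Input[ 5 ] ];
Output[ 6 ] = Table[ Input[ 6 ] ];
Output[ 7 ] = Table[ Input[ 7 ] ];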

Even though most individual optimizations were small, cumulatively they were significant. Based on the percentage improvements, no three or even four optimizations would have met my performance goal. But the final combination was effective. The moral of the story is that if you dig deep enough, you can make some surprising gains.

The code tuning I did in this case is the most aggressive code tuning I've ever done. At the same time, the final code is the most unreadable, unmaintainable code I've ever written. The initial algorithm is complicated. The code resulting from the high-level language transformation was barely readable. The translation to assembler produced a single 500-line routine that I'm afraid to look at. In general, this relationship between code tuning and code quality holds true.

This material is Copyright 1993 by Steven C. McConnell. All Rights Reserved.

 

Email me at stevemcc@construx.com.