CLR Inside Out
Measure Early and Often for Performance, Part 1
Code download available at: CLRInsideOut2008_04.exe (205 KB)
Have a Plan
Profile Data Processing Example
Measure Early and Measure Often
The Logistics of Measuring
Validating Your Performance Results
Validating Micro-Benchmark Data with a Debugger
As a Performance Architect on the Microsoft® .NET Framework Common Language Runtime team, it is my job to help people best utilize the runtime to write high-performance applications. The truth of the matter is that there is no mystery to this, .NET or otherwise: you just have to design applications for performance from the start. Too many applications are written with almost no thought given to performance at all. Often that's not a problem because most programs do relatively little computation, and they are much faster than the humans with whom they interact. Unfortunately, when the need for high performance does present itself, we often don't have the knowledge, skills, and tools to do a good job.
Here I'll discuss what you need to write high-performance applications. While the concepts are universal, I'll focus here on programs written for .NET. Because .NET abstracts the underlying machine more than a typical C++ compiler, and because .NET provides powerful but expensive features including reflection, custom attributes, regular expressions, and so forth, it is much easier to unwittingly inject expensive operations into a performance-critical code path. To help you avoid that expense, I'll show how to quantify the expense of various .NET features so you know when it's appropriate to use them.
Have a Plan
As I mentioned, most programs are written without much thought given to performance, but every project should have a performance plan. You must consider various user scenarios and articulate what excellent, good, and bad performance actually mean. Then, based on data volume, algorithmic complexity, and any previous experience building similar applications, you must decide if you can easily meet whatever performance goals you've defined. For many GUI applications, the performance goals are modest, so it's easy to achieve at least good performance without any special design. If this is the case, your performance plan is done.
If you don't know if you can easily meet your performance goals, you'll need to begin writing a plan by listing the areas likely to be bottlenecks. Typical problem areas include startup time, bulk data operations, and graphics animations.
Profile Data Processing Example
An example will make this more concrete. I'm currently designing the .NET infrastructure for processing profile data. I need to present a list of events (page faults, disk I/O, context switches, and so on) generated by the OS in a meaningful way. The data files involved tend to be large; small profiles are in the neighborhood of 10MB and file sizes well over 1GB are not unusual.
While working on forming my performance plan, I concluded that the display of the data would not be problematic if I computed only the parts of the dataset that were needed to paint the display; in other words, if display was "lazy." Unfortunately, it takes extra work to make GUI objects like tree controls, list controls, and textboxes lazy. This is why most text editors have unacceptable performance when file sizes get too large (say, for instance, 100MB). If I had designed the GUI without thinking about performance, the result would have almost certainly been unacceptable.
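The idea of a "lazy" display can be sketched in a few lines of C#. This is only an illustration, not the design used in the profile viewer: LazyRows, GetVisible, and FormatCalls are hypothetical names, and the formatting delegate stands in for whatever expensive work turns a raw event into display text. Only the rows that are currently on screen are ever formatted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: formats rows only when the display asks for them.
class LazyRows
{
    private readonly int _count;
    private readonly Func<int, string> _format; // expensive per-row work
    private readonly Dictionary<int, string> _cache = new Dictionary<int, string>();

    public int FormatCalls; // how many rows were actually computed

    public LazyRows(int count, Func<int, string> format)
    {
        _count = count;
        _format = format;
    }

    // Returns display text for the rows currently on screen.
    public List<string> GetVisible(int first, int visibleCount)
    {
        List<string> rows = new List<string>();
        int end = Math.Min(first + visibleCount, _count);
        for (int i = Math.Max(first, 0); i < end; i++)
        {
            string text;
            if (!_cache.TryGetValue(i, out text))
            {
                text = _format(i); // computed on first use only
                FormatCalls++;
                _cache[i] = text;
            }
            rows.Add(text);
        }
        return rows;
    }
}

static class LazyDemo
{
    static void Main()
    {
        // Fifty million logical rows, but only three are ever formatted.
        LazyRows rows = new LazyRows(50000000, i => "event #" + i);
        foreach (string line in rows.GetVisible(1000, 3))
            Console.WriteLine(line);
        Console.WriteLine("rows formatted: " + rows.FormatCalls);
    }
}
```

The cache here grows without bound, which a real control would need to cap. The point is only that the cost of painting depends on the size of the window, not the size of the file.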
Laziness, however, does not help for operations that need to use all the data in the file (when computing summaries, for example). Given the dataset size, the data dispatch and processing methods are "hot" code paths that must be designed carefully. Most of the rest of the program is unlikely to be performance-critical and needs no special attention.
This experience is typical. Even in high-performance scenarios, 95 percent of the application does not need any performance planning, but you need to carefully identify the last 5 percent that does. Also, as in my case, it is usually pretty easy to determine which parts of the program are likely to be that 5 percent that matters.
Measure Early and Measure Often
The next step in high-performance design is to measure—before writing a line of code, you need to know whether your performance goals are even possible, and if so, what constraints they place on the design. In my case, I need to know the costs of basic operations being considered in my design, such as raw file I/O and database access. To proceed, I need some numbers. This is the most critical time in the design of the project.
Sadly, most performance is lost very early in the development process. By the time you have chosen the data structures for the heart of your program, the application's performance profile has been set in stone. Choosing your algorithms further limits performance. Selecting interface contracts between various sub-components constrains performance still further. It is critical that you understand the costs of each of these early design decisions and make wise ones.
Design is an iterative process. It is best to start with the cleanest, simplest, most obvious choice and work up a sketch of the design (I actually recommend a prototype of the hot code) and evaluate the performance of that. You should also think about what the design would look like if you were to make performance the only factor, and then estimate how fast that application would be. Now the fun engineering begins! You start tinkering with the design and thinking about alternatives between these two extremes, looking for designs that give you the best result.
Again, my experience with the profile data processor is instructive. Like most projects, my choice of data representation was critical. Should the data be in-memory? Should it be streamed over in a file? Should it be in a database? The standard solution is that any large dataset should be stored in a database; however, databases are optimized for relatively slow change, not for having large volumes of data changing frequently. My application would be dumping many gigabytes of data into the database routinely. Could the database handle this? With just a bit of measurement and analysis of database operations, it was easy to confirm that databases did not have the performance profile I needed.
After some more measurements on how much memory an application can use before inducing excessive page faulting, I ruled out the in-memory solution as well. That left streaming data from a file for my basic data representation.
There were still many other design decisions to be made, however. The basic form of the profile data is a list of heterogeneous events. But what should events look like? Are they strings (which are nicely uniform)? Are they C# structs or objects?
If they are objects, the obvious solution is to make one allocation per event, which is a lot of allocations. Is that acceptable? How exactly does dispatch work as I iterate over the events? Is it a callback model or an iteration model? Does dispatch work through interfaces, delegates, or reflection? There were dozens of design decisions to be made, and they all would have impact on the ultimate performance of the program, so I needed to take measurements to understand the tradeoffs.
The Logistics of Measuring
Clearly you will be doing a lot of measuring during design. So exactly how do you do that? There are many profiling tools that can help, but one general-purpose technique that's also the simplest and most available is micro-benchmarking. The technique is simple: when you want to know how much a particular operation costs, you simply set up an example of its use and directly measure how much time the operation takes.
The .NET Framework has a high-resolution timer called System.Diagnostics.Stopwatch that was designed specifically for this purpose. The resolution varies with your hardware, but it is typically less than 1 microsecond, which is more than adequate. Since this comes with the .NET Framework, you already have the functionality you need.
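To see what resolution Stopwatch offers on a particular machine, you can query its static Frequency and IsHighResolution members. The following is a small sketch; the StopwatchInfo class and NanosecondsPerTick helper are names made up for this example.

```csharp
using System;
using System.Diagnostics;

static class StopwatchInfo
{
    // Length of one Stopwatch tick on this machine, in nanoseconds.
    public static double NanosecondsPerTick()
    {
        return 1e9 / Stopwatch.Frequency;
    }

    static void Main()
    {
        Console.WriteLine("High resolution:  " + Stopwatch.IsHighResolution);
        Console.WriteLine("Ticks per second: " + Stopwatch.Frequency);
        Console.WriteLine("Tick length (ns): " + NanosecondsPerTick());

        Stopwatch sw = Stopwatch.StartNew();
        // ... the operation being measured would go here ...
        sw.Stop();
        Console.WriteLine("Elapsed (us):     " + sw.Elapsed.TotalMilliseconds * 1000.0);
    }
}
```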
While Stopwatch is a great start, a good benchmark harness should do more. Small operations should be placed in loops to make the interval long enough to measure accurately. The benchmark should be run once before taking a measurement to ensure that any just-in-time (JIT) compilation and other one-time initialization has completed (unless, of course, the goal is to measure that initialization). Since measurements are noisy, the benchmark should be run several times and statistics should be gathered to determine the stability of the measurement. It should also be easy to run many benchmarks (design variations) in bulk and get a report that displays all the results for comparison.
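The requirements above (loop small operations, warm up first, take several samples) can be sketched as a very small harness. This is not MeasureIt's implementation; MiniBench and Measure are hypothetical names, and the iteration and sample counts are arbitrary choices for illustration.

```csharp
using System;
using System.Diagnostics;

static class MiniBench
{
    // Runs 'body' 'iterations' times per sample and returns the elapsed
    // time of each sample in microseconds.
    public static double[] Measure(Action body, int iterations, int samples)
    {
        body(); // warm-up call: triggers JIT compilation and one-time initialization

        double[] results = new double[samples];
        for (int s = 0; s < samples; s++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                body();
            sw.Stop();
            results[s] = sw.Elapsed.TotalMilliseconds * 1000.0;
        }
        return results;
    }

    static void Main()
    {
        int[] data = new int[16];
        double[] times = Measure(() => { data[3] = 1; }, 10000, 10);
        foreach (double t in times)
            Console.WriteLine(t.ToString("F3") + " us per 10,000 iterations");
    }
}
```

Note that each iteration here also pays for a delegate invocation and the loop itself, so very small operations are overstated. Repeating the operation several times inside the delegate body reduces that distortion.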
I have written a benchmark harness called MeasureIt.exe which builds upon the Stopwatch class and addresses these goals. It is available with the code download for this column on the MSDN® Magazine Web site. After unpacking, simply type the following in order to run it:
Within seconds it will run a set of more than 50 standard benchmarks and display the results as a Web page. An excerpt of the data is shown in Figure 1. In these results, each measurement performs an operation 10,000 times (the operation is cloned 10 times in a loop executed 1000 times). Each measurement is then performed 10 times and standard statistics (min, max, median, mean, standard deviation) are computed.
Figure 1 Measuring Various Operations with MeasureIt.exe
| Name | Median | Mean | StdDev | Min | Max | Samples |
| --- | --- | --- | --- | --- | --- | --- |
| MethodCalls: EmptyStaticFunction() [count=1000 scale=10.0] | 1.000 | 1.005 | 0.084 | 0.922 | 1.136 | 10 |
| MethodCalls: aClass.Interface() [count=1000 scale=10.0] | 1.699 | 1.769 | 0.090 | 1.696 | 1.943 | 10 |
| ObjectOps: new Class() [count=1000 scale=10.0] | 6.248 | 8.040 | 3.556 | 5.087 | 16.296 | 10 |
| Arrays: aIntArray[i] = 1 [count=1000 scale=10.0] | 0.616 | 0.638 | 0.071 | 0.612 | 0.850 | 10 |
| Delegates: aInstanceDelegate() [count=1000 scale=10.0] | 1.233 | 1.244 | 0.088 | 1.160 | 1.398 | 10 |
| PInvoke: FullTrustCall() [count=1000] | 7.452 | 6.946 | 0.804 | 5.878 | 7.913 | 10 |
| Locks: Monitor lock [count=1000] | 11.487 | 12.129 | 0.901 | 11.322 | 13.843 | 10 |
To make the time measurements more meaningful, they are normalized so that the median time for calling (and returning from) an empty static function is one unit. It is not uncommon for benchmarks to have widely varying times, which is why all the statistical information is important. The new Class() row in Figure 1 is an example: its samples range from 5.087 to 16.296. Variation like this needs to be explained before you can trust the data from that benchmark. In that particular case, it is the result of the runtime periodically executing slower code paths to allocate bookkeeping data structures in bulk. Already, having these statistics available is proving useful in validating the data.
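The statistics and the normalization step are simple to compute yourself. The sketch below is not taken from MeasureIt: SampleStats and its methods are hypothetical names, the sample values in Main are made up, and the use of the sample (n - 1) standard deviation is an assumption, since the column does not say which form MeasureIt reports.

```csharp
using System;
using System.Linq;

static class SampleStats
{
    public static double Median(double[] xs)
    {
        double[] sorted = xs.OrderBy(x => x).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Sample standard deviation (assumption); needs at least two samples.
    public static double StdDev(double[] xs)
    {
        double mean = xs.Average();
        double sumSq = xs.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSq / (xs.Length - 1));
    }

    // Expresses each sample in units of the baseline's median, so that the
    // baseline benchmark itself has a median of 1.0.
    public static double[] Normalize(double[] samples, double[] baseline)
    {
        double unit = Median(baseline);
        return samples.Select(x => x / unit).ToArray();
    }

    static void Main()
    {
        // Made-up raw timings in microseconds, for illustration only.
        double[] emptyCall = { 20.1, 19.8, 20.0, 22.5, 19.9 };
        double[] allocation = { 101.0, 125.0, 99.0, 320.0, 104.0 };

        double[] scaled = Normalize(allocation, emptyCall);
        Console.WriteLine("median: " + Median(scaled).ToString("F3"));
        Console.WriteLine("mean:   " + scaled.Average().ToString("F3"));
        Console.WriteLine("stddev: " + StdDev(scaled).ToString("F3"));
        Console.WriteLine("min:    " + scaled.Min().ToString("F3"));
        Console.WriteLine("max:    " + scaled.Max().ToString("F3"));
    }
}
```

A large gap between median and mean, or a maximum far above the minimum, is the signal to look for an explanation before trusting the number.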
This table is a gold mine of useful performance data, detailing the costs of most of the primitive operations used by .NET-targeted code. I will go into detail in my next installment of this column, but here I want to explain an important feature of MeasureIt: it comes with its own source code. To unpack MeasureIt's source code and launch Visual Studio® to browse it (if Visual Studio is available), type this:
Having the source means that you can quickly understand exactly what the benchmark is measuring. It also means that you can easily add a new benchmark to the suite.
Again, my experience with the profile data processor is instructive. At one point in the design, I could do a certain common operation with either C# events, delegates, virtual methods, or interfaces. To make a decision, I needed to understand the performance tradeoff among these choices. Within minutes I had written a micro-benchmark to measure the performance of each of the alternatives. Figure 2 displays the relevant rows, and you can see that there is no substantial difference between the alternatives. This knowledge allowed me to choose the most natural alternative, knowing I was not sacrificing performance to do so.
Figure 2 Measuring .NET Events, Delegates, Interfaces, and Virtual Methods
| Name | Median | Mean | StdDev | Min | Max | Samples |
| --- | --- | --- | --- | --- | --- | --- |
| MethodCalls: aClass.Interface() [count=1000 scale=10.0] | 1.651 | 1.660 | 0.084 | 1.579 | 1.814 | 10 |
| MethodCalls: aClass.VirtualMethod() [count=1000 scale=10.0] | 1.228 | 1.175 | 0.077 | 1.083 | 1.277 | 10 |
| Delegates: aInstanceDelegate() [count=1000 scale=10.0] | 1.151 | 1.159 | 0.085 | 1.075 | 1.314 | 10 |
| Events: Fire Events [count=1000 scale=10.0] | 1.228 | 1.195 | 0.070 | 1.088 | 1.291 | 10 |
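A comparison like the one in Figure 2 might be set up along the following lines. This is a sketch rather than the benchmark code that ships with MeasureIt: IHandler, Handler, and DispatchBench are hypothetical names, and the timing loop is deliberately minimal. The target methods are marked NoInlining so that the call itself is what gets measured.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

interface IHandler { void Handle(); }

class Handler : IHandler
{
    public int Calls;
    public event Action Fired;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Handle() { Calls++; }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public virtual void HandleVirtual() { Calls++; }

    // Raises the event if anyone has subscribed.
    public void Fire() { Action f = Fired; if (f != null) f(); }
}

static class DispatchBench
{
    // Times 'iterations' invocations of 'body' after one warm-up call.
    public static double TimeMicroseconds(Action body, int iterations)
    {
        body();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) body();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds * 1000.0;
    }

    static void Main()
    {
        Handler h = new Handler();
        IHandler viaInterface = h;
        Action viaDelegate = h.Handle;
        h.Fired += h.Handle;

        const int N = 1000000;
        Console.WriteLine("interface: " + TimeMicroseconds(() => viaInterface.Handle(), N));
        Console.WriteLine("virtual:   " + TimeMicroseconds(() => h.HandleVirtual(), N));
        Console.WriteLine("delegate:  " + TimeMicroseconds(() => viaDelegate(), N));
        Console.WriteLine("event:     " + TimeMicroseconds(() => h.Fire(), N));
    }
}
```

Every variant pays the same overhead for the outer lambda, so the differences between the rows are more meaningful than the absolute values. Run it as an optimized (release) build, outside the debugger.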
Validating Your Performance Results
The MeasureIt application makes collecting data for a broad variety of benchmarks very easy. Unfortunately, MeasureIt does not address an important aspect of using benchmark data: validation. It is extremely easy to measure something other than what you thought you were measuring. The result is data that is simply wrong, and worse than useless. The old adage "if it sounds too good (or bad) to be true, it probably is" definitely applies to performance data. It is imperative that you validate data that you use in any important design decision.
Validating Micro-Benchmark Data with a Debugger
What does it mean to validate performance results? It means collecting other information that also will predict the performance result and seeing if the two methodologies agree. For very small micro-benchmarks, inspecting machine instructions and making an estimate based on the number of instructions executed is an excellent check. In a debugger like Visual Studio, it should be as easy as setting a breakpoint in your benchmark code and switching to the disassembly window (Debug -> Windows -> Disassembly). Unfortunately, the default options for Visual Studio are designed to simplify debugging, not to do performance investigations, so you need to change two options to make this work.
First, go to Tools | Options... | Debugging | General and clear the Suppress JIT Optimization checkbox. This box is checked by default, which means that even when debugging code that should be optimized, the debugger tells the runtime not to do so. The debugger does this so that optimizations don't interfere with the inspection of local variables, but it also means that you are not looking at the code that is actually run. I always uncheck this option because I strongly believe that debuggers should strive to only inspect, and not to change the program being debugged. Note that unsetting this option has no effect on code that was compiled for debugging since the runtime would not have optimized that code anyway.
Next, clear the Enable Just My Code checkbox in the Tools | Options | Debugging | General dialog. The Just My Code feature instructs the debugger not to show you code that you did not write. Generally, this feature removes the clutter of call frames that are often not of interest to the application developer. However, this feature assumes that any code that is optimized can't be yours (it assumes that your code is compiled using the debug configuration or that Suppress JIT Optimization is turned on). If you allow JIT optimizations but don't turn off Just My Code, you will find that you never hit any breakpoints because the debugger does not believe your code is yours.
Once you have unchecked these options, they remain unchecked for ALL projects. Generally this works out well, but it does mean that you don't get the Just My Code feature. You may find yourself switching Just My Code on and off as you go from debugging to performance evaluation and back.
As an example of using a debugger to validate performance results, you can investigate an anomaly in the data shown in the excerpt in Figure 3. This data shows that calls to an interface method of a C# structure are many times faster than calls to a static method. This certainly seems odd, given that you would expect a static method call to be the most efficient type of call. To investigate this, set a breakpoint in this benchmark and run the application. Switch to the disassembly window (Debug -> Windows -> Disassembly) and see that the whole benchmark consists of just the following code:
Figure 3 Using a Debugger to Validate Performance Results
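| Name | Median | Mean | StdDev | Min | Max | Samples |
| --- | --- | --- | --- | --- | --- | --- |
| MethodCalls: EmptyStaticFunction() [count=1000 scale=10.0] | 1.000 | 0.964 | 0.102 | 0.857 | 1.196 | 10 |
| MethodCalls: aStructWithInterface.Interface() [count=1000 scale=10.0] | 0.031 | 0.029 | 0.012 | 0.021 | 0.039 | 10 |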
    aStructWithInterface.InterfaceMethod();
    00000000  ret
What this shows is that the benchmark (which is 10 calls to an interface method) has been inlined away to nothing. The ret instruction is actually the end of the delegate body that defines the whole benchmark. It is not surprising that doing nothing is faster than doing method calls, and that explains the anomaly.
The only mystery is why static methods don't get inlined, too. This is because for static methods, I specifically went out of my way to suppress inlining by marking them with the MethodImpl attribute and its MethodImplOptions.NoInlining option. I intentionally "forgot" to put this on the interface call benchmark to demonstrate that the JIT compiler can make certain interface calls as efficient as non-virtual calls (there is a comment mentioning this above the benchmark).
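The difference between the two benchmarks comes down to one attribute. The sketch below shows the pattern and uses reflection to confirm which methods carry the flag. It is not the MeasureIt source: IOp, EmptyStructOp, and InliningDemo are hypothetical names chosen to mirror the rows in Figure 3.

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

interface IOp { void Run(); }

struct EmptyStructOp : IOp
{
    // No NoInlining here: when this is called directly on the struct,
    // the JIT compiler is free to inline it and eliminate the call.
    public void Run() { }
}

static class InliningDemo
{
    // Inlining suppressed, so a benchmark that calls this really
    // measures a call and a return.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void EmptyStaticFunction() { }

    // Reports whether a method was marked with NoInlining.
    public static bool IsNoInlining(MethodInfo method)
    {
        return (method.GetMethodImplementationFlags() & MethodImplAttributes.NoInlining) != 0;
    }

    static void Main()
    {
        EmptyStaticFunction();
        EmptyStructOp op = new EmptyStructOp();
        op.Run(); // candidate for inlining in an optimized build

        Console.WriteLine("static function NoInlining: " +
            IsNoInlining(typeof(InliningDemo).GetMethod("EmptyStaticFunction")));
        Console.WriteLine("struct method NoInlining:   " +
            IsNoInlining(typeof(EmptyStructOp).GetMethod("Run")));
    }
}
```

Reflection only tells you what you asked the JIT compiler to do. The disassembly window remains the way to confirm what it actually did.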
To reiterate, it is very easy to measure something other than what you intended, especially when measuring small things that are subject to JIT compiler optimizations. It is also very easy to accidentally measure non-optimized code, or measure the cost of JIT compilation of a method rather than the method itself. The MeasureIt /usersGuide command will bring up a user's guide that discusses many of the pitfalls you might encounter when creating benchmarks. I strongly recommend that you read these details when you are ready to write your own benchmarks.
The point that I want to stress is this concept of validation. If you can't explain your data, you should not use it for making design decisions. If you have unusual data, ideally you should collect more data, debug the benchmarks, or collaborate with others who have more expertise until you can explain your data. You should be highly suspicious of unexplainable data, and should not use it in making any important decisions.
This discussion is about the basics of writing high-performance applications. Like any other attribute of software, good performance needs to be designed into the product from the beginning. To do this, you need measurements that quantify the tradeoffs of making various design decisions. This means doing performance experiments. MeasureIt makes it easy to generate good-quality micro-benchmarks quickly, and as such it should become an indispensable part of your design process. MeasureIt is also useful out of the box because it comes with a set of benchmarks that cover most of the primitive operations in the .NET Framework.
You can also easily add your own benchmarks for the part of the .NET Framework that most interests you. With this data you can form a model of application costs and thus make reasonable (rough) guesses about the performance of design alternatives even before you have written application code.
There is a lot more to say about the performance of applications in .NET. There are potential pitfalls associated with building micro-benchmarks, so please do read the MeasureIt user's guide before writing any. I have also deferred discussion about situations where disk I/O, memory, or lock contention is the important bottleneck. I have not even discussed how to use various profiling tools to validate and monitor the performance health of your application after it has been designed.
There is a lot to know, and the sheer volume of information often discourages developers from doing any such testing at all. However, since most performance is lost in the design of an application, if you do nothing else, you should think about performance at this initial stage. I hope this column will encourage you to make performance an explicit part of the design of your next .NET software project.
Send your questions and comments to firstname.lastname@example.org.
Vance Morrison is the Compiler Architect for the CLR team at Microsoft, where he has been involved in the design of .NET since its inception. He drove the design for the .NET Intermediate Language (IL) and was lead for the just-in-time (JIT) compiler team.