Tips for Improving Time-Critical Code
Writing fast code requires understanding all aspects of your application and how it interacts with the system. This article suggests alternatives to some of the more obvious coding techniques to help you ensure that the time-critical portions of your code perform satisfactorily.
To summarize, improving time-critical code requires that you:
Know which parts of your program have to be fast.
Know the size and speed of your code.
Know the cost of new features.
Know the minimum work needed to accomplish the job.
To gather information on the performance of your code, you can use the performance monitor (perfmon.exe).
Sections in this Article
Cache Misses and Page Faults
Missed cache hits, on both the internal and external cache, as well as page faults (going to secondary storage for program instructions and data) slow the performance of a program.
A CPU cache hit can cost your program 10-20 clock cycles. An external cache hit can cost 20-40 clock cycles. A page fault can cost a million clock cycles (assuming a processor that handles 500 million instructions/second and a time of 2 millisecond for a page fault). Therefore, it is in the best interest of program execution to write code that will reduce the number of missed cache hits and page faults.
One reason for slow programs is that they take more page faults or miss the cache more often than necessary. To avoid this problem, it's important to use data structures with good locality of reference, which means keeping related things together. Sometimes a data structure that looks great turns out to be horrible because of poor locality of reference, and sometimes the reverse is true. Here are two examples:
Dynamically allocated linked lists can reduce program performance. When you search for an item, or when you traverse a list to the end, each skipped link could miss the cache or cause a page fault. A list implementation based on simple arrays might be faster because of better caching and fewer page faults. Even if you allow for the fact that the array would be harder to grow, it still might be faster.
Hash tables that use dynamically allocated linked lists can degrade performance. By extension, hash tables that use dynamically allocated linked lists to store their contents might perform substantially worse. In fact, in the final analysis, a simple linear search through an array might actually be faster (depending on the circumstances). Use of an array-based hash table (so-called "closed hashing") is an often-overlooked implementation that frequently has superior performance.
Sorting and Searching
Sorting is inherently time consuming compared to many typical operations. The best way to avoid unnecessary slowdown is to avoid sorting at critical times. You may be able to:
Defer sorting until a non-performance-critical time.
Sort the data at an earlier, non-performance-critical time.
Sort only the part of the data that truly needs sorting.
Sometimes, you can build the list in sorted order. Be careful, because if you need to insert data in sorted order, you may require a more complicated data structure with poor locality of reference, leading to cache misses and page faults. There's no approach that works in all cases. Try several approaches and measure the differences.
Here are some general tips for sorting:
Use a stock sort to minimize bugs.
Any work you can do beforehand to reduce the complexity of the sort is worthwhile. If a one-time pass over your data simplifies the comparisons and reduces the sort from O(n log n) to O(n), you'll almost certainly come out ahead.
Think about the locality of reference of the sort algorithm and the data you expect it to run on.
There are fewer alternatives for searches than for sorting. If the search is time-critical, a binary search or hash table lookup is almost always best, but as with sorting, you must keep locality in mind. A linear search through a small array can be faster than a binary search through a data structure with many pointers that causes page faults or cache misses.
MFC and Class Libraries
The Microsoft Foundation Classes (MFC) can greatly simplify writing code. When writing time-critical code, you should be aware of the overhead inherent in some of the classes. Examine the MFC code that your time-critical code uses to see if it meets your performance requirements. The following list identifies MFC classes and functions you should be aware of:
CString
MFC calls the C run-time library to allocate memory for aCString
dynamically. Generally speaking,CString
is as efficient as any other dynamically allocated string. As with any dynamically allocated string, it has the overhead of dynamic allocation and release. Often, a simplechar
array on the stack can serve the same purpose and is faster. Don't use aCString
to store a constant string. Useconst char *
instead. Any operation you perform with aCString
object has some overhead. Using the run-time library string functions may be faster.CArray
ACArray
provides flexibility that a regular array doesn't, but your program may not need that. If you know the specific limits for the array, you can use a global fixed array instead. If you useCArray
, useCArray::SetSize
to establish its size and specify the number of elements by which it grows when a reallocation is necessary. Otherwise, adding elements can cause your array to be frequently reallocated and copied, which is inefficient and can fragment memory. Also, if you insert an item into an array,CArray
moves subsequent items in memory and may need to grow the array. These actions can cause cache misses and page faults. If you look through the code that MFC uses, you may see that you can write something more specific to your scenario to improve performance. SinceCArray
is a template, for example, you might provideCArray
specializations for specific types.CList
CList
is a doubly linked list, so element insertion is fast at the head, tail, and at a known position (POSITION
) in the list. Looking up an element by value or index requires a sequential search, however, which can be slow if the list is long. If your code doesn't require a doubly linked list, you may want to reconsider usingCList
. Using a singly linked list saves the overhead of updating another pointer for all operations and the memory for that pointer. The extra memory isn't large, but it's another opportunity for cache misses or page faults.IsKindOf
This function can generate many calls and may access memory in different data areas, leading to bad locality of reference. It's useful for a debug build (in an ASSERT call, for example), but try to avoid using it in a release build.PreTranslateMessage
UsePreTranslateMessage
when a particular tree of windows needs different keyboard accelerators or when you must insert message handling into the message pump.PreTranslateMessage
alters MFC dispatch messages. If you overridePreTranslateMessage
, do so only at the level needed. For example, it isn't necessary to overrideCMainFrame::PreTranslateMessage
if you're interested only in messages going to children of a particular view. OverridePreTranslateMessage
for the view class instead.Don't circumvent the normal dispatch path by using
PreTranslateMessage
to handle any message sent to any window. Use window procedures and MFC message maps for that purpose.OnIdle
Idle events can occur at times you don't expect, such as betweenWM_KEYDOWN
andWM_KEYUP
events. Timers may be a more efficient way to trigger your code. Don't forceOnIdle
to be called repeatedly by generating false messages or by always returningTRUE
from an override ofOnIdle
, which would never allow your thread to sleep. Again, a timer or a separate thread might be more appropriate.
Shared libraries
Code reuse is desirable. However, if you're going to use someone else's code, you should make sure you know exactly what it does in those cases where performance is critical to you. The best way to understand it is by stepping through the source code or by measuring with tools such as PView or Performance Monitor.
Heaps
Use multiple heaps with discretion. Additional heaps created with HeapCreate
and HeapAlloc
let you manage and then dispose of a related set of allocations. Don't commit too much memory. If you're using multiple heaps, pay special attention to the amount of memory that is initially committed.
Instead of multiple heaps, you can use helper functions to interface between your code and the default heap. Helper functions facilitate custom allocation strategies that can improve the performance of your application. For example, if you frequently perform small allocations, you may want to localize these allocations to one part of the default heap. You can allocate a large block of memory and then use a helper function to suballocate from that block. Then you won't have multiple heaps with unused memory, because the allocation is coming out of the default heap.
In some cases, however, using the default heap can reduce locality of reference. Use Process Viewer, Spy++, or Performance Monitor to measure the effects of moving objects from heap to heap.
Measure your heaps so you can account for every allocation on the heap. Use the C run-time debug heap routines to checkpoint and dump your heap. You can read the output into a spreadsheet program like Microsoft Excel and use pivot tables to view the results. Note the total number, size, and distribution of allocations. Compare these results with the size of working sets. Also look at the clustering of related-sized objects.
You can also use the performance counters to monitor memory usage.
Threads
For background tasks, effective idle handling of events may be faster than using threads. It's easier to understand locality of reference in a single-threaded program.
A good rule of thumb is to use a thread only if an operating system notification that you block on is at the root of the background work. Threads are the best solution in such a case because it's impractical to block a main thread on an event.
Threads also present communication problems. You must manage the communication link between your threads, with a list of messages or by allocating and using shared memory. Managing the communication link usually requires synchronization to avoid race conditions and deadlock problems. This complexity can easily turn into bugs and performance problems.
For more information, see Idle Loop Processing and Multithreading.
Small Working Set
Smaller working sets mean better locality of reference, fewer page faults, and more cache hits. The process working set is the closest metric the operating system directly provides for measuring locality of reference.
To set the upper and lower limits of the working set, use
SetProcessWorkingSetSize
.To get the upper and lower limits of the working set, use
GetProcessWorkingSetSize
.To view the size of the working set, use Spy++.