
MSDN Magazine

Improving Runtime Performance with the Smooth Working Set Tool, Part 2

John Robbins

Download the code for this article: Bugslayer1200.exe (1,933KB)
Browse the code for this article at Code Center: Smooth Working Set 2

In my October column, I began a discussion of a tool I was writing called Smooth Working Set (SWS), which is intended to replace the Working Set Tuner (WST) that used to be part of the Platform SDK, but is no longer available. This month I will conclude my discussion of SWS, so you no longer have an excuse for a working set that isn't small, smooth, and svelte. Just in case you didn't read my October column, or you're reading this column in the restroom (where you may not have Internet access to my previous column), I will present a quick overview of SWS.
      As everyone in software development knows, smaller is better: the less memory that your application uses, the faster it will run. The working set represents the memory your application takes up. SWS's job is to help you determine which functions are called most frequently so that it can build an order file that the linker can use to place the most frequently called functions together. This means that fewer memory pages are needed to make your application turn over. The fewer memory pages you have, the fewer page faults you have, and the faster your application will run. Read my October column to understand the SWS design philosophy.
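      To make the page arithmetic concrete, here is a minimal sketch that counts how many memory pages a set of frequently called functions occupies. It is only an illustration: the CodeRange type and CountPagesTouched function are hypothetical names that are not part of SWS, and the 4KB page size used in the note afterward is the usual x86 value, assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical description of one function's placement in the code section.
struct CodeRange
{
    std::uint32_t start;  // address of the first byte
    std::uint32_t size;   // length in bytes
};

// Counts the distinct memory pages that the given functions occupy.
std::size_t CountPagesTouched(const std::vector<CodeRange>& functions,
                              std::uint32_t pageSize)
{
    assert(pageSize > 0);
    std::set<std::uint64_t> pages;
    for (const CodeRange& fn : functions)
    {
        if (fn.size == 0)
            continue;
        // Use 64-bit math so that start + size cannot wrap around.
        const std::uint64_t first = fn.start / pageSize;
        const std::uint64_t last =
            (static_cast<std::uint64_t>(fn.start) + fn.size - 1) / pageSize;
        for (std::uint64_t page = first; page <= last; ++page)
            pages.insert(page);
    }
    return pages.size();
}
```

      With a 4096-byte page, three 256-byte functions at 0x00401000, 0x00405000, and 0x00409000 occupy three pages, while the same three functions placed back to back starting at 0x00401000 occupy one.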

Using SWS

      Using SWS is a three-stage process. The first stage involves recompiling your application to get SWS hooked in so you can collect the function execution data. The second stage involves running the most common user scenarios using the special compiled version. To use SWS correctly, you must spend some time determining exactly what those user scenarios are so that you can duplicate them precisely. Just running your application randomly under SWS won't help reduce your working set much at all. As I mentioned last time, what you do with your application is not necessarily the same thing a typical user does with it.
      The third stage involves generating the order file for the linker (which is very simple to do), and integrating that order file into your final build. The whole SWS system consists of three DLLs and a single executable. Figure 1 gives a brief overview of each file.
      Getting your application compiled for SWS is generally straightforward; you simply follow these steps:

  1. Clone the configuration that you want to tune by using the Configuration dialog box accessed through the Configurations menu item on the Build menu. In general, you will want to clone your release build, but you can also create a debug build. I like to append "SWS" to the name of the cloned configuration (Foo Win32 Release becomes Foo Win32 Release SWS) so I can identify it as the clone. Once you have cloned the configuration, make it the active one.
  2. In the Project Options dialog Visual C++® tab, select Program Database in the Debug Info combo box. This turns on the /Zi switch so PDB files are generated.
  3. In the Project Options dialog Visual C++ tab, select the Customize category and check the "Enable function-level linking" checkbox. This turns on the /Gy switch to package individual functions.
  4. In the Project Options dialog Visual C++ tab, move down to the Project Options edit box. In the edit box, type "/Gh" to enable hook function calls. The /Gh switch will tell the compiler to insert a call to the function called _penter in each function's prolog. In essence, this will instrument each function in the compiled module and have it automatically call into SWS. Figure 2 shows where the /Gh needs to be entered.
    Figure 2 The /Gh Switch in Project Settings
  5. In the Project Options dialog Link tab Debug category, check Debug Info and the Microsoft format radio button. This adds the /DEBUG switch to the linker command line.
  6. If you are going to generate SWS data on release builds, in the Project Options dialog Link tab move down to the Project Options edit box. In the edit box, type /OPT:REF. The /DEBUG switch implicitly turns on the /OPT:NOREF flag, which brings all functions into the final binary whether they are ever referenced or not, thus drastically increasing the size of your binary. By adding the /OPT:REF flag, you ensure the linker will only bring in those functions that are actually referenced by your application. Figure 3 shows where to enter the /OPT:REF option.
    Figure 3 The /OPT:REF Option in Project Settings
  7. In the Project Options dialog Input category, add SWSDLL.LIB to the Object/library modules edit box. You may need to add the path to SWSDLL.LIB, depending on where you place it on your system and which directories you have designated as those Visual C++ should automatically search through when looking for libraries. SWSDLL.LIB contains the reference to the _penter function that is automatically placed by the /Gh switch.

      After you have your application compiled, it's time to start running it to generate the SWS data. For any size application, you should pregenerate the base SWS data files by running SWS.EXE with the -g command-line option followed by each of the individual modules you compiled. Since the symbol lookup in DBGHELP.DLL is very slow, I opted to cache the symbol information into a format that I can quickly load and look up when your functions call the _penter function in SWSDLL.DLL. You don't have to pregenerate the core data files, because if SWSDLL.DLL does not find them on the first call from your module, it will generate them.
      The two core data files generated are modulename.SWS and modulename.SDW. The .SWS file contains nothing but the addresses in your module and room for the execution counts. The .SDW file holds the addresses and names of all your functions. I broke out the names into a separate file because there was no need to drag them into each specific run's file, and it helped reduce memory overhead. When you dump a .SWS file using the -d command-line option to SWS.EXE, SWS.EXE does all the work to match up the SWS and SDW files. Figure 4 shows all the command-line options to SWS.EXE.
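      As a rough illustration of why keeping only addresses and counts in the per-run data makes the runtime lookup cheap, consider the sketch below. It is not the actual SWSFILE.DLL layout: the SwsEntry record, the FindEntry and RecordCall functions, and the choice of a sorted array with a binary search are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record: one per function, address plus room for a count.
struct SwsEntry
{
    std::uint32_t address;    // function start address
    std::uint32_t execCount;  // number of times the function was entered
};

// Finds an address in a table sorted by address; returns nullptr if absent.
SwsEntry* FindEntry(std::vector<SwsEntry>& table, std::uint32_t address)
{
    assert(std::is_sorted(table.begin(), table.end(),
        [](const SwsEntry& a, const SwsEntry& b) { return a.address < b.address; }));
    auto it = std::lower_bound(table.begin(), table.end(), address,
        [](const SwsEntry& entry, std::uint32_t value) { return entry.address < value; });
    return (it != table.end() && it->address == address) ? &*it : nullptr;
}

// What a _penter-style hook could do on each call: bump the count if known.
bool RecordCall(std::vector<SwsEntry>& table, std::uint32_t address)
{
    SwsEntry* entry = FindEntry(table, address);
    if (entry == nullptr)
        return false;  // address not in the table
    ++entry->execCount;
    return true;
}
```

      The function names never enter this path, which is the point of splitting them into a separate file; they are needed only when a human wants to read a dump.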
      If you run your application a couple of times, you will notice that a few more .SWS files appear in the same directory where your special compiled binary resides. Each run's data is stored in modulename.#.SWS, where # is the number of the run. The base .SWS file does not contain execution counts, so you can dump a specific run to see what you executed by using the -d command-line option to SWS.EXE and passing the specific run file on the command line.
      Once you've run all of your common user scenarios, it's time to tune your application and generate the order file. SWS.EXE provides the front end to the tuning with the -t command-line option followed by just the module name of the binary to tune. Tuning produces two files, a .TWS file and the actual order file. A .TWS file contains the summed execution counts sorted in highest to lowest order. You can dump the .TWS file with SWS.EXE just like a regular .SWS file. The order file has a .PRF extension, mainly because that's what the old Working Set Tuner tool produced. The .PRF file is just a text file that you pass to the linker.
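      The summing step behind a .TWS file can be sketched as follows. This is a simplified illustration under stated assumptions, not the SWS source: FunctionTotal and SumAndSortRuns are hypothetical names, and each run is assumed to list counts for the same functions in the same order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical summed record for one function across all runs.
struct FunctionTotal
{
    std::string name;
    std::uint64_t execCount;
};

// Adds up the per-run counts and sorts the totals from highest to lowest.
std::vector<FunctionTotal> SumAndSortRuns(
    const std::vector<std::string>& names,
    const std::vector<std::vector<std::uint32_t>>& runs)
{
    std::vector<FunctionTotal> totals;
    totals.reserve(names.size());
    for (const std::string& name : names)
        totals.push_back(FunctionTotal{name, 0});

    for (const std::vector<std::uint32_t>& run : runs)
    {
        // Every run must cover exactly the functions in the base file.
        assert(run.size() == names.size());
        for (std::size_t i = 0; i < run.size(); ++i)
            totals[i].execCount += run[i];
    }

    // stable_sort keeps the original order for functions with equal counts.
    std::stable_sort(totals.begin(), totals.end(),
        [](const FunctionTotal& a, const FunctionTotal& b)
        { return a.execCount > b.execCount; });
    return totals;
}
```

      A 64-bit total is used so that summing many 32-bit run counts cannot overflow.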
      Once you have the .PRF order file, it's time to apply the reordering to the linker. You can apply the following steps to your regular build, but I opt to create yet another configuration to indicate the build is special. The following steps show you how to get the order file integrated.

  1. Clone the same configuration that you used to create the special SWS build. I like to append the word "Tuned" to the name of the configuration so I know what it is.
  2. In the Project Options dialog Visual C++ tab, select the Customize category and check the "Enable function-level linking" checkbox. This turns on the /Gy switch to package individual functions.
  3. In the Project Options dialog Link tab, move down to the Project Options edit box. In the edit box, type "/ORDER:@<orderfile>.PRF", with no space after the @. Make sure you include the full path to the order file after the @.

      Included in this month's code distribution are several test programs so you can see how to apply all of the settings and get an idea how SWS works. The SimplePEEnterTest program is, as the name implies, the simplest example program and a good place to start if you are interested in stepping through to see how SWS does its magic. Two of the test programs, MultiDLLs and MultiThreads, have multiple DLLs to test the multiple module processing and multithread processing, respectively. The final test program is WordPad from the Visual C++ samples. WordPad is the largest common source code sample on the MSDN Web site and can give you an idea about the overhead associated with running SWS. In general, I feel the overhead is not that bad considering the excellent benefits you can get from SWS. Anytime you can get your program smooth and svelte is a good thing!

Implementation Highlights

      Now that you know how to run SWS, I want to turn to some of the implementation highlights so you can get an idea how SWS works under the covers. SWS is not exactly rocket science, but I found it quite fun to implement. The most interesting part of SWS is the _penter function, which the compiler automatically calls at the start of every function when you use the /Gh switch.
      Figure 5 shows the code for my _penter. As you can see from the code, it's naked and I generate my own prolog and epilog to get the start of the function. If you remember past Bugslayer columns, the reason I go naked is to make it easy to get the return address of the function. Fortunately, when the compiler says it will generate _penter before anything else, it means it! The following disassembly shows the effects of the /Gh switch. As you can see, the call to _penter comes even before the PUSH EBP standard function prolog.

00401050: E8B7000000 call _penter
00401055: 55 push ebp
00401056: 8BEC mov ebp,esp
00401058: E8A8FFFFFF call ILT+0(?Foo
0040105D: 3BEC cmp ebp,esp
0040105F: E8AE000000 call _chkesp
00401064: 5D pop ebp
00401065: C3 ret
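      The address arithmetic implied by this disassembly can be worked through in a few lines. The sketch below assumes the inserted call is always the 5-byte near relative form (opcode E8 plus a 32-bit displacement), as it is in the listing above; the function names are hypothetical, and subtracting the call size from the return address is an assumption about how a hook could find the function start, not a quote from Figure 5.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of an x86 near relative CALL: opcode E8 + 32-bit displacement.
const std::uint32_t kNearCallSize = 5;

// Given the return address pushed by the inserted call to _penter, recover
// the address where the instrumented function starts.
std::uint32_t FunctionStartFromReturnAddress(std::uint32_t returnAddress)
{
    assert(returnAddress >= kNearCallSize);
    return returnAddress - kNearCallSize;
}

// Decodes where an E8 call lands: next instruction address + displacement.
std::uint32_t NearCallTarget(std::uint32_t callAddress, std::int32_t displacement)
{
    // Unsigned arithmetic wraps modulo 2^32, which handles negative offsets.
    return callAddress + kNearCallSize + static_cast<std::uint32_t>(displacement);
}
```

      For the listing above, the call at 00401050 pushes the return address 00401055, and subtracting 5 gives back 00401050, the first byte of the function. The displacement bytes B7000000 decode to 0xB7, so that call lands at 0x00401055 + 0xB7 = 0x0040110C.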

      If you daydream a little bit you can see that the /Gh switch might allow some other interesting utilities. The first one that pops into my mind is a performance tool. Unfortunately, since the compiler does not offer an epilog exit, you will have to do a little more work to keep everything straight. Maybe if we all ask Microsoft nicely, they will implement the epilog exit switch.
      In the October issue, I discussed the design of the file DLL, SWSFILE.DLL, and how I approached the issue of making the individual runs fast. I thought I was all done with the file handling, but the day after I submitted that column, I realized I forgot something very important.
      When generating the initial .SWS file, I was using the addresses as they came out of the module. The problem is: what would happen if the module is relocated in memory? The SWSDLL.DLL runtime would be called with one address, but I would not have any record of that address in any of the module's SWS files that are loaded. While everyone should always be rebasing their DLLs, sometimes people forget, and I wanted to make sure SWS didn't crater when that happened. Consequently, I had to go back and add the original load address into the SWSFILE.DLL. In the runtime itself, I had to add code to check if a module was relocated as well, to keep everything kosher.
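      The relocation fix boils down to one subtraction and one addition. Here is a minimal sketch, assuming 32-bit addresses; ToRecordedAddress and its parameter names are hypothetical and are not taken from the SWS source.

```cpp
#include <cassert>
#include <cstdint>

// Maps an address seen at run time back to the address that was recorded
// when the module was assumed to sit at its original load address.
std::uint32_t ToRecordedAddress(std::uint32_t runtimeAddress,
                                std::uint32_t actualLoadBase,
                                std::uint32_t originalLoadBase)
{
    assert(runtimeAddress >= actualLoadBase);
    // The offset from the start of the module does not change on relocation.
    const std::uint32_t offset = runtimeAddress - actualLoadBase;
    return originalLoadBase + offset;
}
```

      For example, if a module built to load at 0x00400000 is relocated to 0x00A30000, a call reported at 0x00A31050 maps back to 0x00401050. When the module is not relocated, the two bases are equal and the address comes back unchanged.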
      One area that did give me a little trouble was generating the symbols for the initial SWS module. Because of the way programs are linked and symbols are generated, many of the symbols reported in a module are not those that have _penter calls inserted in them. For example, if you link against the static C runtime, your module will have all sorts of C runtime functions added. Since the address lookup would be faster in the SWS runtime if there were fewer symbols, I looked at a few ways to minimize the numbers.
      Figure 6 shows the symbol enumeration callback and how I started limiting the number of symbols. The first step I took was to check if the symbol had corresponding line information with it. Because I assume that functions that have _penter calls were properly compiled using the steps I specified earlier, I safely got rid of many extraneous symbols. The next test to eliminate symbols was to check if specific strings are part of the symbols. For example, any symbols that start with "_imp__" are imported functions from other DLLs. There are two other checks that I did not implement, but left as exercises for you dear readers. The first is that you should be able to flag symbols from specific files, which SWS should ignore. The main reason for implementing this feature is so that you can add all the C runtime source files to that list. The last symbol elimination trick ensures that the address in question only comes from a code section in the module. You might not need this last check, but it would ensure that only true code symbols are used.
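      The first two filtering tests can be sketched independently of the symbol engine. In the example below, SymbolInfo, StartsWith, and ShouldKeepSymbol are hypothetical names, and the hasLineInfo flag stands in for whatever the real callback learns from the symbol engine; this is an illustration of the idea, not the code in Figure 6.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of what the enumeration callback knows about a symbol.
struct SymbolInfo
{
    std::string name;
    bool hasLineInfo;  // true if source line information exists for it
};

// Returns true if text begins with prefix.
bool StartsWith(const std::string& text, const std::string& prefix)
{
    return text.compare(0, prefix.size(), prefix) == 0;
}

// Decides whether a symbol is worth keeping in the address table.
bool ShouldKeepSymbol(const SymbolInfo& symbol,
                      const std::vector<std::string>& ignoredPrefixes)
{
    assert(!symbol.name.empty());
    // Functions compiled with the steps given earlier have line information.
    if (!symbol.hasLineInfo)
        return false;
    // Drop symbols such as imports, which never contain a _penter call.
    for (const std::string& prefix : ignoredPrefixes)
    {
        if (StartsWith(symbol.name, prefix))
            return false;
    }
    return true;
}
```

      Taking the prefixes as a list makes it easy to add more strings later without touching the test itself.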
      One symbol problem that I had at runtime happened because the symbol engine does not return static functions. Being Mr. Contentious, if I did not find an address that came out of a module, I popped my usual six or seven assertion message boxes. At first I was a little confused that I was seeing the assertions, because one of my test programs did not have anything declared as static. When I popped up the stack in the debugger, I found I was looking at a symbol named something like $E127. There was a call to _penter in the function and everything looked good. It finally dawned on me that I was looking at a compiler-generated function, such as a copy constructor. While I would have really liked to keep the error checking in the code, I noticed that there were quite a few of those static/compiler-generated functions in WordPad, so all I could do was report the problem with a TRACE statement in debug builds.
      The last interesting part of SWS is the tuning of a module. The code for the TuneModule function is large, so Figure 7 shows the algorithm. As you can see, I work to ensure that I pack each code page with as many functions as possible to eliminate padding. The interesting part is where I hunt down the best fitting function. I decided to try to pack as many functions with execution counts into the pages as possible. If I can't find a function with an execution count that fits, I will use a function that has no execution counts. My initial algorithm for fitting everything together worked great. However, it started crashing when tuning certain modules.
      A little exploration revealed that I was getting into a situation where I had a page almost filled, but only had a function whose size was bigger than the page. That's right, a function size reported by the symbol engine was bigger than a memory page. When I looked more closely, I noticed that those huge functions only appeared when they were the last symbols in the code section. Evidently, the symbol engine treats everything after certain symbols as part of the symbol, so the size is wrong. In the tuning algorithm, you can see that if I get a symbol larger than the page size, the only thing I can do is punt and drop the symbol into the order file. That might not be the best solution, but it's a boundary condition that you shouldn't run into too often.
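      The packing idea, including the oversized-symbol boundary case, can be sketched as a simple greedy loop. This is one possible reading of the description above and not the TuneModule code in Figure 7: TunedFunction and BuildLinkOrder are hypothetical names, and "best fitting" is simplified here to "the most frequently called function that still fits in the current page."

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical input record for the tuner.
struct TunedFunction
{
    std::string name;
    std::uint32_t size;       // size in bytes as reported by the symbol engine
    std::uint64_t execCount;  // summed execution count
};

// Returns function names in the order they should be handed to the linker.
std::vector<std::string> BuildLinkOrder(std::vector<TunedFunction> functions,
                                        std::uint32_t pageSize)
{
    assert(pageSize > 0);
    // Hottest first; functions with no execution counts end up at the back,
    // so they are chosen only when nothing with a count fits.
    std::stable_sort(functions.begin(), functions.end(),
        [](const TunedFunction& a, const TunedFunction& b)
        { return a.execCount > b.execCount; });

    std::vector<std::string> order;
    std::vector<bool> placed(functions.size(), false);
    std::uint32_t freeInPage = pageSize;

    while (order.size() < functions.size())
    {
        bool placedOne = false;
        for (std::size_t i = 0; i < functions.size() && !placedOne; ++i)
        {
            if (placed[i])
                continue;
            const TunedFunction& fn = functions[i];
            const bool fits = fn.size <= freeInPage;
            // A size larger than a page can never fit; punt and emit it
            // once a fresh page has been started.
            const bool oversized = fn.size > pageSize && freeInPage == pageSize;
            if (!fits && !oversized)
                continue;
            if (fits)
                freeInPage -= fn.size;
            order.push_back(fn.name);
            placed[i] = true;
            placedOne = true;
        }
        if (!placedOne)
            freeInPage = pageSize;  // nothing fits: start a new page
    }
    return order;
}
```

      The loop always terminates because on a fresh page every remaining function either fits or is treated as oversized and emitted. Writing the returned names one per line would produce the text of an order file.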

What's Next for SWS?

      As it stands, SWS is good enough for your module slimming and svelting needs. If you are interested in SWS, here are a few cool things you might do in future versions:

  • Implement a start and stop program. I have the code hooked up in _penter to check if an event is signaled. You can create a separate program that toggles the event so you can control SWS's data collection. Just create an event named "SWS_Start_Stop_Event" and set it when you want to stop data collection.
  • Implement the symbol exclusion features I discussed earlier so that you can have the fewest symbols possible in the .SWS files.
  • If you are really ambitious, you can write a GUI tool to make viewing data and tuning much easier than it is using a command-line utility.


      Even though the Working Set Tuner disappeared, with SWS there's no excuse for having extra fat in your working set. If you use SWS on your application, I'd be curious to know how much space you actually save in the end. Even though SWS was an ambitious utility and took two columns to cover, I think it was well worth it.
      In my next column, I will start tackling debugging in Microsoft® .NET. While you might think that .NET is supposed to make all your bugs and problems go away, I've been playing with it and all it means is that there will be new and different debugging challenges for us to tackle. .NET will provide enough material for many cool debugging columns.
      In this month's source code distribution, I have included an updated version of my BugslayerUtil.DLL. Included is a new option to write your assertions to the event log under Windows NT® 4.0 and Windows® 2000. If you are working on server applications that don't have UIs, getting the assertion output in a common place can make all the difference in the world. Also included is a bug fix in HookImportedFunctionsByName reported by Attila Szepesváry and Tim Tabor. Finally, thanks to Craig Ball for reporting that Crash Handler didn't report that an application crashed on a Visual C++ exception.

Da Tips!

      Guess what? The holidays are almost upon us. If you don't send your tips to me, you might not be getting any presents for Hanukkah or Christmas!
Tip 39: Microsoft has released an interesting utility called PageHeap to help track memory corruption problems. You can read more about PageHeap in Knowledge Base article Q264471.
Tip 40: Don't you just hate it when you turn on memory leak detection in the C runtime and all of the leaks allocated by the new operator are reported as coming from CRTDBG.H and not from where you allocated the memory? That drives me nuts! The problem is that there is a bug in CRTDBG.H in that the new operator is declared as an inline function. Since debug builds turn all inlining off, the new operator becomes another function and the __FILE__ macro expands to CRTDBG.H. Fortunately, I found a workaround. Make all of your precompiled headers look like the following:

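#ifndef _STDAFX_H
#define _STDAFX_H

// This define must occur before any headers are included.
#define _CRTDBG_MAP_ALLOC

// Include all other headers here!

// Include CRTDBG.H after all other headers
#include <crtdbg.h>

// Route every use of new through the debug version of operator new so
// that __FILE__ and __LINE__ expand where the allocation is written.
#define NEW_INLINE_WORKAROUND new ( _NORMAL_BLOCK , \
                                    __FILE__ , __LINE__ )
#define new NEW_INLINE_WORKAROUND

#endif // _STDAFX_H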

      The one drawback to this approach is that you need to ensure that all STL headers in particular are only included in your precompiled header file. If they are included after the precompiled header, you will get compilation errors. Additionally, if you have custom new operators for a class, you will also get errors. You will need to undefine new before declaring your class and perform the defines I just mentioned after your class. Also, include a placement operator version of new in your class that matches the one in CRTDBG.H so you can get the source and line information.
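      The undefine-and-redefine dance around a class with its own new operator might look like the following sketch. To keep it self-contained it does not include CRTDBG.H: the macro is defined locally, the block type value 1 is an assumed stand-in for _NORMAL_BLOCK, and the Tracked class and AllocateTracked function are hypothetical. With the real header, the extra operator new overload would need to match the signature declared there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Stand-in for what the precompiled header sets up. The value 1 is assumed
// to play the role of _NORMAL_BLOCK from CRTDBG.H.
#define NEW_INLINE_WORKAROUND new (1, __FILE__, __LINE__)
#define new NEW_INLINE_WORKAROUND

// Turn the macro off before declaring a class that has its own operator new.
#undef new

class Tracked  // hypothetical class with custom allocation
{
public:
    // The class's ordinary custom allocator.
    static void* operator new(std::size_t size) { return Allocate(size); }
    static void operator delete(void* memory) { std::free(memory); }

    // Extra overload taking (block type, file, line) so that the macro
    // form of new still compiles and the source location is available.
    static void* operator new(std::size_t size, int /*blockType*/,
                              const char* file, int line)
    {
        assert(file != nullptr);
        lastFile = file;
        lastLine = line;
        return Allocate(size);
    }
    // Matching delete, used only if a constructor throws.
    static void operator delete(void* memory, int, const char*, int)
    {
        std::free(memory);
    }

    static const char* lastFile;  // where the last allocation was written
    static int lastLine;

private:
    static void* Allocate(std::size_t size)
    {
        void* memory = std::malloc(size);
        if (memory == nullptr)
            throw std::bad_alloc();
        return memory;
    }
};

const char* Tracked::lastFile = nullptr;
int Tracked::lastLine = 0;

// Turn the macro back on for the code that follows the class.
#define new NEW_INLINE_WORKAROUND

// Allocates one object and returns the line the allocation was written on.
int AllocateTracked()
{
    const int allocationLine = __LINE__ + 1;
    Tracked* object = new Tracked;  // expands to new (1, __FILE__, __LINE__) Tracked
    delete object;
    return allocationLine;
}

#undef new  // sketch only: keep the macro from leaking into later code
```

      After a call to AllocateTracked, Tracked::lastLine holds the line of the new expression itself, which is the source and line information the workaround is meant to preserve.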

John Robbins is a cofounder of Wintellect, a software consulting, education, and development firm that specializes in programming in Windows and COM. He is the author of Debugging Applications (Microsoft Press, 2000). You can contact John at

From the December 2000 issue of MSDN Magazine