The Case of the System Process CPU Spikes
As you’ve probably surmised by my blog posts and other writings, I like knowing exactly what my systems are doing. I want to know if a process is running away with the CPU, causing memory pressure, or hitting the disk. Besides keeping my computers running smoothly, my vigilance sometimes helps me spot performance and reliability problems in Windows and third-party code.
The main way I keep tabs on things is to configure Process Explorer to run automatically when I log in. Whenever I configure a new computer, I add a shortcut to Process Explorer to my profile’s Start directory that includes the /t (minimize) switch. Process Explorer runs otherwise hidden with tray icon that shows a small historical view of CPU activity level. Because I want access to detailed information about system processes, as well as my own, I also specify the /e option on Vista, which causes Windows to present a UAC prompt on logon that allows me to grant Process Explorer administrative rights.
Because I keep an eye out for CPU spikes in Process Explorer’s tray icon, which show up as green or red for user-mode (application) and kernel-mode (operating system and drivers) CPU usage, respectively, I’ve identified several application bugs over the last few months. In this post, I’ll share how I used both Process Explorer and another tool, Kernrate, to identify a problem with a third-party driver and followed the problem through to a fix by the vendor.
Not long after I got a new laptop several months ago, I noticed that the system sometimes felt sluggish. Process Explorer’s tray icon corroborated my perception by displaying a mini-graph of red CPU activity. The icon opens a tooltip that reports the name of the process consuming the most CPU when you move the mouse over it, and in this case the tooltip showed the System process as being responsible:
The first few times I noticed the problem, it resolved itself shortly after and I didn’t have a chance to troubleshoot. However, I could see by opening Process Explorer’s System Information dialog that the CPU spikes were significant:
The System process is special because it doesn’t host an executable image like other processes. It exists solely to host operating system threads for the memory manager, cache manager, and other subsystems, as well as device driver threads. These threads execute entirely in kernel mode, which is why System process CPU usage shows up as red in Process Explorer’s graphs.
I suspected that a third-party device driver was the cause of the problem, so the first step in my investigation was to figure out which thread was using CPU, which would hopefully point me at the guilty party. I watched vigilantly for signs of trouble every time I switched networks and jumped the first time I saw one. Process Explorer shows the threads running in a process on the Threads page of the Process Properties dialog, so I double-clicked on the System process and switched to the Threads page the next time I noticed the CPU spike:
The “ntkrnlpa.exe” prefix on each thread’s start address identified the ones I saw at the top of the CPU usage sort order as operating system threads (Ntkrnlpa.exe is the version of the kernel loaded on 32-bit client systems that have no execute memory protection or server systems that need to address more than 4GB of memory). Because I had previously configured Process Explorer to retrieve symbols for operating system images from the Microsoft public symbol server, the thread list also showed the names of the thread start functions. The most active threads began in the ExpWorkerThread function, which means that they were worker threads that perform work on behalf of the system and device drivers. Instead of creating dedicated threads that consume memory resources, the system and drivers can throw work at the shared pool of operating system worker threads.
Unfortunately, knowing that worker threads were causing the CPU usage didn’t get me any closer to solving identifying a root cause. I really needed to know what functions the worker threads were calling, because the functions would be inside the device driver or operating system component on whose behalf the threads were running. One way to look inside a thread’s execution is to look at the thread’s stack with Process Explorer. The stack is a memory region that stores function invocations and Process Explorer will show you a thread’s stack when you select the thread press the Stack button or double-click on the thread. On Vista, however, you get this error when you try and look at the stack for threads in the System process:
The System process is a special type of process on Vista called a “protected process” that doesn’t allow any access to its threads or memory. Protected processes were introduced to support Digital Rights Management (DRM) so that hi-definition content providers can store content encryption keys with a reduced risk of an administrative user using DRM-stripping tools to reach into the process and read the keys.
That approach foiled, I had to find another way to see what the worker threads were doing. For that, I turned to KernRate, a command-line profiling tool that’s a free download from Microsoft. KernRate can profile user-mode processes and kernel-mode threads. It uses the sample-based profiling facility that was introduced in the first release of Windows NT, which records the unique addresses at which the CPU is executing when the profiling interval timer fires. When you stop a profile capture, Kernrate retrieves the information from the kernel, maps the addresses to the loaded device drivers into which the fall, and can even use the symbol engine to report the names of functions.
I wouldn’t need symbols if the trace identified a device driver, so I ran Kernrate without passing it any arguments. Despite the fact that there’s no officially supported version of Kernrate for Vista, the version for Windows XP, Kernrate_i386_XP.exe, works on Vista 32-bit (you can also use the recently-released xperf tool to perform similar profiling - xperf requires Vista or Server 2008, but works on 64-bit versions). I let the profile run through heavy bursts of CPU and then hit Ctrl+C to print the results to the console window:
In first place were hits in the kernel, but in second was a driver that I didn’t recognize, b57nd60x. Most driver files are located in the %systemroot%\system32\drivers directory, so I could have opened that folder and viewed the file’s properties in Explorer, but I had Process Explorer open so a quicker way to check the driver’s vendor and version was to open the DLL view for the System process. The DLL view shows the DLLs and files mapped into the address space of user-mode processes, but for the System process it shows the kernel modules, including drivers, loaded on the system. The DLL view revealed that the driver was for my laptop’s NIC, was from Broadcom, and was version 10.10:
Now that I knew that the Broadcom driver was causing the CPU usage, the next step was to see if there was a newer version available. I went to Dell’s download page for my system, but didn’t find anything. Suspecting that what I noticed might not be a known issue, I decided to notify Broadcom. I used contacts on the hardware ecosystem team here at Microsoft to find the Broadcom driver representative and email him a detailed description of the symptoms and my investigation. He forwarded my email to the driver developer, who acknowledged that they didn’t know the cause and within a few days sent me a debug version of the driver with symbols so that I could capture a Kernrate profile that would tell them what functions in the driver were active during the spikes. The problem reoccurred a few days later and I sent back the kernrate output with function information.
The developer explained that my trace revealed that the driver didn’t efficiently interact with the PCIe bus when processing specific queries and the problem seemed to be exacerbated by my particular hardware configuration. He gave me new driver for me to try and after a few weeks of monitoring my laptop closely for issues, I confirmed that the problem appeared to be resolved. The updated driver has not yet been posted to Dell’s support site, but I expect it to show up there in the near future. Another case closed, this time with Process Explorer, Kernrate, and a helpful Broadcom driver developer.
If you like these troubleshooting blog posts, you’ll enjoy the webcast of my “Case of the Unexplained…” session from TechEd/ITforum. Its 75 minutes are packed with real-world troubleshooting examples, including the one written up in this post and others, as well as some that I haven’t documented. At the end of the session I ask the audience to send me screenshots, log files and descriptions of their own troubleshooting success stories, in return for which I’ll send back a signed copy of Windows Internals. The offer stands, so remember to document your investigation and you can get a free book. I’ve gotten a number of great examples and my next blog post will be a guest post by someone that watched the webcast and used Process Monitor to solve a problem with their web server.
Finally, if you want to see me speak live, come to TechEd US/IT Pro in June in Orlando where I’ll be delivering “The Case of the Unexplained…”, “Windows Server 2008 Kernel Advances”, and “Windows Security Boundaries”. Hope to see you there!