Measuring Processor Utilization and Queuing Delays in Windows applications
Continuing my answer to the mail I received recently from Uriel Carrasquilla…
Uri’s note, reprinted in the previous post, refers to an "issue" associated with the current technique for measuring processor utilization in Windows. As my reply mentioned, these are documented and well-understood issues. At the core is the methodology used to calculate processor utilization that was originally designed 20 years ago for Windows NT. Since one of the original goals of Windows NT was to be hardware independent, the measurement methodology was also designed so that it was not dependent on any specific set of hardware measurement features. And therein lies a tale.
This methodology was amply documented in my Windows 2003 Server Performance Guide, published in the Windows 2003 Server Resource Kit. As you know, I wrote that book before I came to work at Microsoft as a full-time employee, but the Windows Server performance team that helped me at the time was quite open to documenting the existing facility, warts and all.
Briefly summarizing what I wrote there (and which still applies today), the % Processor Time counters in Windows are measurements derived using a sampling technique. The OS Scheduler samples the state of the CPU once per system clock tick, driven by a high priority timer-based interrupt. Currently, the clock tick interval the OS Scheduler uses is usually 15.6 ms. (The precise value that the OS uses between timer interrupts is available by calling the GetSystemTimeAdjustment() function.) If the processor is running the Idle loop when the quantum interrupt occurs, it is recorded as an Idle Time sample. If the processor is running some application thread, that is recorded as a busy sample. Busy samples are accumulated continuously at both the thread and process level. When a clock interrupt occurs, the Scheduler performs a number of other tasks, including adjusting the dispatching priority of threads that are currently executing, stopping the progress of any thread that has exceeded its time slice, as well as performing its CPU accounting.
The System\Processor Queue Length counter in Perfmon is an instantaneous counter that reflects the current number of Ready threads waiting in the OS Scheduler queue. When a performance monitoring application such as Perfmon requests the Processor Queue counter, there is a measurement function that traverses the Scheduler Ready Queue and counts the number of threads waiting for an available processor. Thus, the System\Processor Queue Length counter represents one sampled observation, and needs to be interpreted with that in mind. (If memory serves, the data collection process that Charles’ analysis relies upon gathers samples of this measurement several times per minute, & his servers are not idle by design. Which basically means I think what he is doing is just fine.)
What I believe Uri is referring to with regard to this processor Queue Length metric not being “correct” is anomalies in this measurement that are due to the kind of phased behavior you can sometimes see on an otherwise idle system. Even on an mostly idle system, a sizable number of threads can be waiting on the same clock interrupt (typically, polling the system state once per second), one of which also happens to be the Perfmon measurement thread, also cycling once per second. These sleeping threads tend to clump together so that they get woken up at the exact same time by the timer interrupt. (As I mentioned, this happens mainly when the machine is idling with little or no real work to do.) These awakened threads then flood the OS dispatching queue at exactly the same time. If one of these threads is the Perfmon measurement thread that gathers the Processor Queue Length measurement, you can see how this “clumping” behavior could distort the measurements. The Perfmon measurement thread executes at an elevated priority level of 15, so it is scheduled for execution ahead of any other User mode threads that were also awakened by the same Scheduler clock tick. The effect is that at the precise time when the Processor ready queue length is measured, there are likely to be a fair number of Ready Threads. Compared to the modeling assumption where processor scheduling is subject to random arrivals, one observes a disproportionate number of Ready Threads waiting for service, even (or especially) when the processor itself is not very busy overall.
This anomaly is best characterized as a low-utilization effect that perturbs the measurement when the machine is loafing. It generally ceases to be an issue when processor utilization climbs or there are more available processors on the machine. But this bunching of timer-based interrupts remains a serious concern, for instance, whenever Windows is running as a guest virtual machine under VMware or Hyper-V. (Please don’t get me started on that topic.) Another interesting side discussion is how this clumping of timer-based interrupts interacts with power management, but I do not intend to go there either.
To summarize, the CPU utilization measurements at the system, process and thread level in Windows are based on a sampling methodology. Similarly, the processor queue length is also sampled. Like any sampling approach, the data gathered is subject to typical sampling errors, including
· accumulating a sufficient number of sample observations to be able to make a reliable statistical inference about the underlying population, and
· ensuring that there aren’t systemic sources of sampling error that causes sub-classes of the underlying population to be under or over-sampled markedly
So, these CPU measurements face familiar issues with sampling size and the potential for systematic sampling bias, as well as the usual difficulty in ensuring that the sample data is representative of the underlying population (something known as non-sampling error). For example, the interpretation of the CPU utilization data that Perfmon gathers at the process and thread level is subject to limitations based on a small sample size for collection intervals less than, for example, 15 seconds. The Performance Guide has a more detailed discussion of these issues, if you are interested.
Event-driven measurement approaches.
What exactly Windows is going to do about this, I couldn't say -- I work in a different part of the company -- but I have consistently lobbied for a more accurate, event-driven approach to gathering CPU measurements. Efficient power management, for example, strongly argues for an event-driven approach. You do not want the OS to wake up periodically on an idle machine that could otherwise be powered down just to perform its CPU usage accounting duties, for example.
It is encouraging that recent versions of Windows have taken major steps in this direction, supporting a more accurate event-driven approach using instrumentation added to the Scheduler to measure CPU utilization. The improvements in this area have not gotten much notice, which is something I will try to rectify a bit here. Windows exploits a machine timing facility that is present in both x86 and x64 hardware, namely, the rdtsc Read TimeStamp Counter instruction. Moreover, these improvements position the Windows OS so it can replace its legacy CPU measurement facility with something more reliable and accurate sometime in the near future. Unfortunately, converting all existing features in Windows, including Perfmon and Task Manager, to support the new measurements is a big job, not without its complications and not always as straightforward as one would hope. (One of the complications is using rdtsc on older hardware where the hardware tick rate changes when there is a power management event. There are also issues of clock drift across multiprocessor cores when they are not re-synchronized periodically. I do not want to take the time to discuss these issues in detail here.)
To see this new processor measurement facility at work, access the new Resource Monitor application (resmon.exe) that is available beginning in Vista and Windows Server 2008. Resource Monitor can be launched directly from the command line, or from either Performance Monitor or Task Manager. In case you are not familiar with it, here is a screen shot that shows Resource Monitor in action on a Windows Server 2008 R2 machine, calculating CPU utilization over the last 60 seconds of operation, breaking out that utilization by process. The CPU utilization measurements that ResMon calculates are based on new OS Scheduler instrumentation. These measurements are very accurate, about as good as it gets from a vantage point inside the OS.
Figure 1. The Windows 7 Resource Manager application. (Click for full size image.)
The Resource Monitor measures CPU busy in real time using event-orient measurement data gathered by the OS Scheduler each time a context switch occurs. A context switch occurs in Windows whenever the processor switches the processor execution context to run a different thread. Context switches also occur as a result of high priority Interrupt Service Routines (ISRs), as well as the Deferred Procedure Calls (DPCs) that ISRs schedule to complete the interrupt processing. Starting in Windows 6, the OS Scheduler began issuing rdtsc instructions to get the internal processor clock each time a context switch occurs. (Note: The Windows OS Scheduler not only orders ready threads in its dispatcher queue by priority, it also lets a higher priority thread preemptively interrupt the execution of a lower priority one. This is known as Preemptive Scheduling with Priority Queuing. Once executing, a thread is also subject to a maximum time-slice interval, sometimes referred to as the quantum. When the Scheduler determines that a thread’s time-slice has expired, the thread is also subject to preemption in favor of another thread from the ready queue of equal priority.) Meanwhile, the hardware manufacturers have improved the performance of the rdtsc instruction, making it considerably more efficient for the Scheduler to gather these processor utilization measurements.
You will notice in the example screen shot shown here in Figure 1 that Resource Manager has calculated that System Interrupts (from ISR and DPC routines) accounts for most of the processor utilization during the last 60 second interval. The machine being monitored is mainly doing file I/O, so this is expected. Although the Resource Monitor display doesn’t say so explicitly, it is worth noting that this is all work being performed in kernel mode.
Conceptually, a context switch event is something like switch(oldThreadId, newThreadId), with an rdtsc time stamp identifying when the context switch occurred. The Context Switch event also provides the old thread’s Wait Reason code, which helps you to understand why the sequence of thread scheduling events occurred. For reference, a Windows context switch is defined here, while the contents of the ETW (Event Tracing for Windows) context switch event record are defined here. Also, see the series of articles by Insung Park and Alex Bendetov that were published in MSDN Magazine entitled “Core OS Events in Windows 7” for additional background and perspective.
(Note: You have to hunt a bit to find the current values of the KWait_Reason enumeration. Officially, the KWait_Reason enumeration is published in the Wdm.h header file available in the Windows Driver Kit (WDK). Unfortunately, in MSDN, ordinarily the most authoritative places to find something like this, the available documentation tends to lag recent releases of Windows. For instance, the version of the enum provided to the .NET developer available here is considerably out of date. The explain text in Perfmon for the Thread\Thread Wait Reason counter is also not current. If you do not have access to the WDK, try either ProcessHacker or NirSoft instead for more up-to-date documentation.)
The same calculation that the Resource Manager in Windows 7 uses can be performed after the fact using event data from ETW. In their article, Insung and Alex write, “In state machine construction, combining Context Switch, DPC and ISR events enables a very accurate accounting of CPU utilization.” This is the technique used in the Windows Performance Toolkit (WPT, which is also known as xperf), for example, to calculate its CPU utilization statistics. (For reference, there is a discussion that illustrates using the WPT to analyze ISR and DPC event data in a previous blog entry entitled “Mainstream NUMA and the TCP/IP stack” posted here.)
The new Concurrency Visualizer, which my colleague Hazim Shafi discusses in a recent MSDN Magazine article entitled “Performance Tuning with the Concurrency Visualizer in Visual Studio 2010 in the Visual Studio 2010 Profiler”, also consumes Context Switch events to calculate processor utilization for the application being profiled. An interesting twist in the Concurrency Visualizer’s CPU Utilization View is that the view pivots based on the application you are profiling, which is likely how a developer engaged in a performance investigation wants to see things. Based on the sequence of context switch trace events, the Concurrency Visualizer calculates processor utilization by the process, aggregates it for the current selection window, and displays it in the CPU Utilization View. In the CPU Utilization View, all other processor activity for processes (other than one being profiled) is lumped together under a category called “Other Processes.” System-processes and the “Idle process,” which is a bookkeeping mechanism, not an actual process that is dispatched, are also broken out separately. See Hazim’s article for more details.
Beyond its CPU utilization calculations, the Concurrency Visualizer’s primary focus is on being able to reconstruct the sequence of events that impact an application’s execution progress. The Concurrency Visualizer’s Threads View is the main display showing an application’s execution path. The view here is of execution progress on a thread by thread basis. For each thread in your application, the Concurrency Visualizer shows the precise sequence of context switch events that occurred. These OS Scheduler events reflect that thread’s execution state. See Figure 2 for an example of this view.
Figure 2 shows the execution path of six application threads, a Main thread, a generic worker thread, and 4 CLR worker threads that the application created by instantiating a ThreadPool object. (There were originally more threads than this, but I have hidden some that were inactive over the entire run.) For each thread, the execution state of the thread – whether it is running or whether it is blocked –is indicated over time. The upper part of the display is a timeline that shows the execution state of each thread over time. The execution progress of each thread display is constructed horizontally from left to right from rectangles that indicate the start and end of a particular thread state. An interval when the thread was running shows as green. An interval where the thread is sleeping is shown in blue. A ready thread that is blocked from executing because a higher priority thread is running is shown in yellow. (This state is labeled “preemption.”) A thread in a synchronization delay waiting on a lock is shown as red.
On the lower left of the display is a Visible Timeline Profile. This summarizes the state of all threads that are visible within the selected time window. In the screen shot in Figure 2, I have zoomed into a window that is approximately 150 milliseconds wide. During that interval, the threads shown were in a state where they were actively executing instruction only 11% of the time. For 25% of the time interval, threads were blocked waiting on a lock. There is a tabbed display at the lower right. If you click on the “Profile Report” tab, a histogram displays that summarizes the execution state of each individual thread over the time window. In the screen shot, I have clicked on the “Current stack,” which displays the call stack associated with the ETW context switch event. If the thread is blocked, the call stack indicates where in the code the thread will resume execution once it unblocks. We will drill into that call stack in a moment.
Note: The Threads View also displays call stacks from processor utilization samples that ETW gathers on a system-wide basis once per millisecond. These are available during any periods when the thread is executing instructions (and ETW execution sampling is active).
One of the other ETW events that the Concurrency Visualizer analyzes is the ReadyThread event. The interval between a ReadyThread event and a subsequent Context Switch that signals that a ready Thread is being dispatched measures CPU queue time delay directly. Using event data, it is possible to measure CPU queuing delays to a degree of precision that far exceeds anything that can be done using performance counters.
The Concurrency Visualizer screen shot in Figure 2 illustrates the calculation of CPU queue time delay. Thread 6920, which happens to be a CLR thread pool worker thread, is shown at a point in time where it was preempted by a higher priority task. The specific delay that I zoomed in on in the screen shot is preemption due to the scheduling of a high priority LPC or ISR – note this category also encompasses assorted APCs and DPCs. In this specific example, execution of Thread 6920 was delayed for 0.7718 milliseconds. According to the trace, that is the amount of time between Thread 6920 being preempted by a high priority system routine and a subsequent context switch where the ready thread was finally re-dispatched.
Figure 2. Screen shot of the Concurrency Visualizer illustrating thread preemption by a higher priority system routine. (Click for full size image.)
The tool also displays the call stack of the preempted thread. The call stack indicates that the CLR’s garbage collector (GC) was running at the time that thread execution was preempted. From the call stack, it looks like the GC is sweeping the Large Object Heap (LOH), trying to free up some previously allocated virtual memory. This is not an opportune time to get preempted. You can see that one of the other CLR worker threads, Thread 6420, is also delayed. Notice from the color coding that Thread 6420 is delayed waiting on a lock. Presumably, one of the other active CLR worker threads in the parent process holds the lock that Thread 6420 is waiting for.
This is one of those “Aha” moments. If you click on the synchronization delay that Thread 6420 is experiencing, as illustrated in Figure 3, you can see that the lock that Thread 6420 is trying to acquire is, in fact, currently held by Thread 6920. Clicking on the tab that says “Current Stack” (not shown) indicates that the duration of the synchronization delay that Thread 6420 suffered in this specific instance of lock contention was about 250 milliseconds.
The scenario here shows one CLR worker thread blocked on a lock that is held by another CLR worker thread, which in turn finds itself being delayed due to preemptions from higher priority Interrupt processing. We can see that whatever high priority work preempted Thread 6920 has the side effect of also delaying Thread 6420, since 6420 was waiting on a lock that Thread 6920 just happened to be holding at the time. The tool in Figure 3 displays the Unblocking stack from Thread 6920 which shows the original memory allocation from the Dictionary.Resize() method call being satisfied, releasing a global GC lock. When Thread 6920 resumed execution following its preemption, the GC operation completes, releasing the global GC lock. Thread 6920 continues to execute for another 25 microseconds or so, before it is preempted because its time slice expired. Even as Thread 6920 blocks, Thread 6420 continues to wait while a different CLR thread pool thread (4664) begins to execute instead. Finally, after another 25 microseconds delay, Thread 6420 resumes execution. For a brief period both 6420 and 4664 execute in parallel from approximately the 7640 to 7650 microsecond milestones. (However, they are subject to frequent preemptions during that period of overlapped execution.) Welcome to the wonderful world of indeterminancy in concurrent programming.
Figure 3. CLR Worker Thread 6420 blocked because it is waiting on a GC lock that happens to be held by Thread 6920, which is subject to preemption by higher priority system routines. (Click for full size image.)
I won’t take the time now to go into what this little concurrent CLR thread pool application is doing. Suffice to say that I wrote it to illustrate some of the performance issues developers can face trying to do parallel programming, which is the other topic I have been blogging about. (I should note that the test program puts the worker threads to sleep periodically to simulate synchronous I/O waits.) As I started to run the test app I developed using the Concurrency Visualizer, I was able to see blocking issues like this one where the Common Language Runtime introduced synchronization and locking considerations that are otherwise opaque to the developer. I eventually tweaked the test app into an especially ghoulish version I call the LockNestMonster program to shine an even brighter light on these issues. (More about this later when I resume blogging about concurrent programming in .NET.)
The Concurrency Visualizer also breaks out preemption due to the expiration of a thread’s time-slice, the duration of a time slice being one of the few tuning adjustments available in the OS. For the record, I normally recommend that system administrators not fiddle with this tuning knob, unless they have a whole lot of extra time on their hands. (This older KB article provides some flavor for what is involved.) For those of you that cannot resist the temptation, the Concurrency Visualizer Threads View is the first Windows performance tool that can help you determine if changing the OS default time-slice value is doing your application any good, or harm, for that matter.
In Figure 4, I clicked on the large yellow block on the right hand side of the execution time bar graph for Thread 6920 indicating another long delay. As in Figure 3, I have hidden all but the three active CLR thread pool threads. Using a combination of zooming to a point of interest in the event stream and filtering out extraneous threads, Figure 4 shows that the Concurrency Visualizer computes an execution time profile for just those events that are visible in the current window. Overall, the three active CLR worker threads are only able to execute 18% of the time, while they are delayed by synchronization 9% of the time and subject to preemption 39% of the time. (You can click on the Profile Report tab in the lower right portion of the display and see a profile report by thread.)
Figure 4. Using the Concurrency Visualizer to drill into thread preemption delays. (Click for full size image.)
At the point indicated in the tool, the time-slice quantum for Thread 6920 expired and the Scheduler preempted the executing thread in favor of some other ready thread. Looking at the visualization, it should be apparent that the ready thread the Scheduler chose to execute next was another CLR thread pool worker thread, namely Thread 4664, which then blocked Thread 6920 from continuing. The tool reports that a context switch(6920, 4664) occurred, and that Thread 6920 was delayed for about 275 milliseconds before it resumed execution as a result of being preempted.
As illustrated in this example, the Concurrency Visualizer uses the ETW-based event data from a profiling run to construct a state machine that reflects the precise execution state of each application thread over the time interval being monitored. It goes considerably beyond calculating processor queue time at the thread level. It understands how to weave the sequence of Ready Thread and Context switch events together to create this execution time profile. It summarizes the profiling data, calculating the precise amount time of time each thread is delayed by synchronous IO, page faults (i.e., Memory Management overhead), processor contention, preemption by priority work, and lock contention over the profiling interval. Furthermore, it analyzes the call stacks gathered at each Context Switch event, looking for signatures that identify the specific blocking reason. And, specifically, to help with lock contention issues, which are otherwise often very difficult to identify, it also identifies the thread that ultimately unblocks the thread that was found waiting to acquire a lock.
I will discuss another new facility for capturing CPU time accurately at the thread level as your program executes in the next blog post in this series.
-- Mark Friedman