March 2010

Volume 25 Number 03

Thread Diagnostics - Performance Tuning with the Concurrency Visualizer in Visual Studio 2010

By Hazim Shafi | March 2010

Multicore processors have become widely available, and single-threaded performance in new processors is likely to remain relatively flat. That means added pressure on software developers to improve application performance by taking better advantage of parallelism.

Parallel programming is challenging for many reasons, but in this article I’d like to focus on the performance aspects of parallel applications. Multithreaded applications are not only prone to common sources of inefficiency in sequential implementations, such as inefficient algorithms, poor cache behavior, and excessive I/O, but they can also suffer from parallel performance bugs. Parallel performance and scalability may be limited by load imbalance, excessive synchronization overhead, inadvertent serialization, or thread migration.

Understanding such performance bottlenecks used to require significant instrumentation and analysis by expert developers. Even for those elite programmers, performance tuning was a tedious and time-consuming process.

This is about to change for the better. Visual Studio 2010 includes a new profiling tool—the Concurrency Visualizer—that should significantly reduce the burden of parallel performance analysis. Moreover, the Concurrency Visualizer can help developers analyze their sequential applications to discover opportunities for parallelism. In this article, I present an overview of the features of the Concurrency Visualizer in Visual Studio 2010, along with some practical usage guidance.

CPU Utilization

The Concurrency Visualizer comprises several visualization and reporting tools. There are three main views: CPU Utilization, Threads, and Cores.

The CPU Utilization view, shown in Figure 1, is intended to be the starting point in the Concurrency Visualizer. The x axis shows the time elapsed from the start of the trace until the end of application activity (or the end of the trace, whichever is earlier). The y axis shows the number of logical processor cores in the system.


Figure 1 CPU Utilization View

Before I describe the purpose of the view, it is important that you understand what a logical core is. A single CPU chip today can include multiple microprocessor circuits, referred to as physical cores. Each physical core may be capable of running multiple application threads simultaneously. This is often referred to as simultaneous multithreading (SMT); Intel calls it Hyper-Threading Technology. Each hardware-supported thread on an SMT-capable core presents itself as a logical core to the operating system.

If you collect a trace on a quad-core system that does not support SMT, the y axis would show four logical cores. If each core in your quad-core system is capable of running two SMT threads, then the y axis would show eight logical cores. The point here is that the number of logical cores is a reflection of the number of threads that can simultaneously execute in your system, not the number of physical cores.
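The relationship between physical and logical cores can be sketched in a few lines of C++. This is only an illustration: the function names are made up for this example, and std::thread::hardware_concurrency is a hint from the C++ runtime that usually (but not necessarily) matches the logical core count the operating system reports.

```cpp
#include <cassert>
#include <thread>

// Logical cores exposed to the OS: one per hardware thread on each physical core.
unsigned logical_cores(unsigned physical_cores, unsigned smt_threads_per_core) {
    assert(physical_cores > 0 && smt_threads_per_core > 0);
    return physical_cores * smt_threads_per_core;
}

// Logical core count as reported by the C++ runtime for the current machine.
// hardware_concurrency() may return 0 when the value cannot be determined,
// so this sketch falls back to 1 in that case.
unsigned detected_logical_cores() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}
```

With these definitions, logical_cores(4, 1) gives 4 for the quad-core system without SMT, and logical_cores(4, 2) gives 8 for the same system with two SMT threads per core.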

Now, let’s get back to the view. There are four areas shown in the graph, as described in the legend. The green area depicts the average number of logical cores that the application being analyzed is using at any given time during the profiling run. The rest of the logical cores are either idle (shown in gray), used by the System process (shown in red), or used by other processes running on the system (shown in yellow).

The blue vertical bars in this view correspond to an optional mechanism that allows users to instrument their code in order to correlate the visualizations in the tool with application constructs. I will explain how this can be done later in this article.

The Zoom slider control at the top left allows you to zoom in on the view to get more details, and the graph control supports a horizontal scrollbar when zoomed. You can also zoom by clicking the left mouse button and dragging in the area graph itself.

This view has three main purposes. First, if you are interested in parallelizing an application, you can look for areas of execution that either exhibit significant serial CPU-bound work, shown as lengthy green regions at the single-core level on the y axis, or regions where there isn’t much CPU utilization, where the green doesn’t show or is considerably less than 1 on average. Both of these circumstances might indicate an opportunity for parallelization. CPU-intensive work can be sped up by leveraging parallelism, and areas of unexpected low CPU utilization might imply blocking (perhaps due to I/O) where parallelism may be used by overlapping other useful work with such delays.

Second, if you are trying to tune your parallel application, this view allows you to confirm the degree of parallelism that exists when your application is actually running. Hints of many common parallel performance bugs are usually apparent just by examining this graph. For example, you can observe load imbalances as stair-step patterns in the graph, or contention for synchronization objects as serial execution when parallelism is expected.
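To see where a stair-step pattern can come from, consider a loop whose iterations get progressively more expensive, split into equal contiguous blocks across threads. The following sketch (a made-up cost model, not something taken from the tool) computes how much work each thread receives under that static partitioning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work per thread when n iterations are split into equal contiguous blocks
// and iteration i costs (i + 1) units, as in a triangular loop nest.
std::vector<long> static_block_work(std::size_t n, std::size_t threads) {
    assert(threads > 0 && n % threads == 0);  // keeps the sketch simple
    std::vector<long> work(threads, 0);
    const std::size_t block = n / threads;
    for (std::size_t i = 0; i < n; ++i) {
        work[i / block] += static_cast<long>(i + 1);
    }
    return work;
}
```

For 8 iterations on 4 threads the work comes out as 3, 7, 11 and 15 units. The threads finish one after another, so the number of busy cores drops from four to three to two to one, which is the stair-step shape in the CPU Utilization graph.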

Third, since your application lives in a system that may be executing many other applications that are competing for its resources, it is important to understand whether your application’s performance is affected by other apps. When interference is unexpected, it is usually a good idea to reduce it by disabling applications or services to improve the fidelity of data, because performance tuning is usually an iterative process. Sometimes, interference is caused by other processes with which your application collaborates to deliver an experience. Either way, you will be able to use this view to discover whether such interference exists, and then identify the actual processes involved by using the Threads view, which I will discuss later.

Another way to reduce interference is to collect traces with the profiler command-line tools rather than from within the Visual Studio IDE.

Focus your attention on some window of execution that piques your interest, zoom in on it, and then switch to the Threads view for further analysis. You can always come back to this view to find the next region of interest and repeat the process.

Threads

The Threads view, shown in Figure 2, contains the bulk of the detailed analysis features and reports in the Concurrency Visualizer. This is where you’ll find information that explains behavior you identified in the CPU Utilization or Cores views. It is also where you can find data to link behavior to application source code when possible. There are three main components of this view: the timeline, the active legend and the reporting/details tab control.

Like the CPU Utilization view, the Threads view shows time on the x axis. (When switching between views in Concurrency Visualizer, the range of time shown on the x axis is preserved.) However, the Threads view y axis contains two types of horizontal channels.

The top channels are usually dedicated to physical disks on your system if they had activity in your application’s profile. There are two channels per disk, one each for reads and writes. These channels show disk accesses that are made by your application threads or by the System process threads. (It shows the System accesses because they can sometimes reflect work being done on behalf of your process, such as paging.) Every read or write is drawn as a rectangle. The length of the rectangle depicts the latency of the access, including queuing delays; therefore, multiple rectangles may overlap.

To determine which files were accessed at a given point in time, select a rectangle by clicking the left mouse button. When you do that, the reports view below will switch to the Current Stack tab, which is the standard location for displaying data interactively with the timeline. Its contents will list the names of files that were either read or written, depending on the disk channel selected. I will return to I/O analysis later.

One thing to be aware of is that not all file read and write operations performed by the application may be visible when they are expected to occur. This is because the operating system’s file system uses buffering, allowing some disk I/O operations to complete without accessing the physical disk device.

The remaining channels in the timeline list all the threads that existed in your application during the profile collection period. For each thread, if the tool detected any activity during the profiler run, it will display the state of the thread throughout the trace until it is terminated.

If a thread is running, which is depicted by the green Execution category, the Concurrency Visualizer shows you what the thread was doing by leveraging sample profile information. There are two ways to get at this data. One is by clicking on a green segment, in which case you’ll see the nearest (within +/- 1 ms) profile sample call stack in the Current Stack tab window.

You can also generate a sample profile report for the visible time range to understand where most of the work was spent. If you click on the Execution label in the active legend, the report will show up in the Profile Report tab. The profile report has two features that may be used to reduce complexity. One is a noise reduction feature that, by default, removes call stacks responsible for 2 percent or less of the profile samples. This threshold can be changed by the user. Another feature, called Just My Code, can be used to reduce the number of stack frames due to system DLLs in the report, if that’s desirable. I’ll cover the reports in more detail later.
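The effect of the noise reduction threshold can be illustrated with a small filter. This is a sketch of the idea only, not the tool's implementation; the function name and the representation of a call stack as a single string key are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Keeps only the call stacks whose share of all samples exceeds
// threshold_percent; stacks at or below the threshold are dropped.
std::map<std::string, long> reduce_noise(
        const std::map<std::string, long>& samples_by_stack,
        double threshold_percent = 2.0) {
    assert(threshold_percent >= 0.0 && threshold_percent <= 100.0);
    long total = 0;
    for (const auto& entry : samples_by_stack) total += entry.second;

    std::map<std::string, long> kept;
    for (const auto& entry : samples_by_stack) {
        // count / total > threshold / 100, rearranged to avoid a division.
        if (entry.second * 100.0 > threshold_percent * total) {
            kept.insert(entry);
        }
    }
    return kept;
}
```

With 100 samples split 95/3/2 across three stacks and the default threshold, the stack with exactly 2 percent of the samples is removed and the other two remain.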

Before going on, I’d like to point out a few more features for managing complexity in the reports and views. You will often encounter application scenarios consisting of many threads, some of which may not be doing anything useful in a given profiler run. Besides filtering reports based on the time range, the Concurrency Visualizer also allows you to filter by the threads that are active. If you’re interested in threads that do work, you can use the Sort By option to sort the threads by the percentage of time that they are in the Execution state. You can then select the group of threads that are not doing much useful work and hide them from the display either by right-clicking and selecting the Hide option from the context menu or by clicking the Hide button in the toolbar at the top of the view. You can sort by all thread state categories and can hide/unhide as you see fit.

The effect of hiding threads is that their contributions to all the reports will be removed, in addition to hiding their channels from the timeline. All statistics and reports in the tool are kept up-to-date dynamically as filtering is performed on threads and time range.

Blocking Categories

Threads can block for many reasons. The Threads view attempts to identify the reason why a thread blocked by mapping each instance to a set of blocking categories. I say attempts because this categorization can sometimes be inaccurate, as I’ll explain in a moment, so it should be viewed as a rough guide. That said, the Threads view shows all thread delays and accurately depicts execution periods. You should focus your attention on categories responsible for significant delays in the view based on your understanding of the application’s behavior.

In addition, the Threads view provides the call stack at which the thread stopped execution in the Current Stack tab if you click on a blocking event. Clicking on a stack frame in the Current Stack window takes you to the source code file (when available) and the line number where the next function is called. This is an important productivity feature of the tool.

Let’s take a look at the various blocking categories:

Synchronization Almost all blocking operations can be attributed to an underlying synchronization mechanism in Windows. The Concurrency Visualizer attempts to map blocking events due to synchronization APIs such as EnterCriticalSection and WaitForSingleObject to this category, but sometimes other operations that result in synchronization internally may be mapped to this category—even though they might make more sense elsewhere. Therefore, this is often a very important blocking category to analyze during performance tuning, not just because synchronization overheads are important but also because it can reflect other important reasons for execution delays.

Preemption This includes preemption due to quantum expiration when a thread’s share of time on its core expires. It also includes preemption due to OS scheduling rules, such as another process thread with a higher priority being ready to run. The Concurrency Visualizer also maps other sources of preemption here, such as interrupts and LPCs, which can result in interrupting a thread’s execution. At each such event, the user can get the process ID/name and thread ID that took over by hovering over a preemption region and examining the tooltip (or clicking on a yellow region and observing the Current Stack tab contents). This can be a valuable feature for understanding the root causes of yellow interference in the CPU Utilization view.

Sleep This category is used to report thread blocking events as a result of an explicit request by the thread to sleep or yield its core voluntarily.

Paging/Memory Management This category covers blocking events due to memory management, which includes any blocking operations started by the system’s memory manager as a response to an action by the application. Things like page faults, certain memory allocation contentions or blocking on certain resources would show up here. Page faults in particular are noteworthy because they can result in I/O. When you see a page fault blocking event, you should both examine the call stack and look for a corresponding I/O read event on the disk channel in case the page fault required I/O. Common sources of such page faults are DLL loads, memory-mapped I/O and normal virtual-memory paging by the kernel. You can identify whether this was a DLL load or paging by clicking on the corresponding I/O segment to get the filename involved.

I/O This category includes events such as blocking on file reads and writes, certain network socket operations and registry accesses. A number of operations considered by some to be network-related may not show up here, but rather in the synchronization category. This is because many I/O operations use synchronization mechanisms to block and the Concurrency Visualizer may not be looking for those API signatures in this category. Just as with the memory/paging category, when you see an I/O blocking event that seems to be related to accessing your disk drives, you should find out if there’s a corresponding disk access in the disk channels. To make this easier, you can use the arrow buttons in the toolbar to move your threads closer to the disk channel. To do this, select a thread channel by clicking on its label on the left, then click on the appropriate toolbar button.

UI Processing This is the only form of blocking that is usually desirable. It is the state of a thread that is pumping messages. If your UI thread spends most of its time in this state, this implies that your application is responsive. On the other hand, if the UI thread does excessive work or blocking for other reasons, from the application user’s perspective the UI will appear to hang. This category offers a great way to study the responsiveness of your application, and to tune it.

Inter-Thread Dependencies

One of the most valuable features of the Threads view is the ability to determine inter-thread synchronization dependencies. In Figure 2 I have selected a synchronization delay segment. The segment gets enlarged and its color is highlighted (in this case, it’s red). The Current Stack tab shows the call stack of the thread at that moment. By examining the call stack, you can determine the API that resulted in blocking the thread’s execution.


Figure 2 Threads View

Another visualization feature is a line that connects the blocking segment to an execution segment on a different thread. When this visualization is visible, it illustrates the thread that ended up unblocking the blocked thread. In addition, you can click on the Unblocking stack tab in this case to see what the unblocking thread was doing when it released the blocked thread.

As an example, if the blocking thread was waiting on a Win32 critical section, you would see the signature of EnterCriticalSection on its blocking call stack. When it is unblocked, you should see the signature of LeaveCriticalSection in the call stack of the unblocking thread. This feature can be very valuable when analyzing complex application behavior.
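The following sketch shows the kind of code that produces such a blocking and unblocking pair. It uses std::mutex as a portable stand-in for a Win32 critical section, so the API names on the call stacks in a real trace will differ from EnterCriticalSection and LeaveCriticalSection. The function name is hypothetical. Each worker does its computation on a private variable and takes the shared lock only once, which keeps the time other threads can spend blocked on it short.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Each of `threads` workers sums 1..n privately, then adds its result to a
// shared total under a lock that is held only for the final addition.
long long parallel_sum_low_contention(int threads, int n) {
    assert(threads > 0 && n >= 0);
    std::mutex total_lock;  // stands in for a Win32 critical section
    long long total = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&total_lock, &total, n] {
            long long local = 0;
            for (int i = 1; i <= n; ++i) local += i;  // no lock needed here
            // A thread arriving while another holds the lock blocks here.
            std::lock_guard<std::mutex> hold(total_lock);
            total += local;
            // The lock is released when `hold` goes out of scope, which is
            // the point at which a waiting thread can be unblocked.
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Moving the lock inside the inner loop would compute the same result but would make every iteration a potential blocking point, which is the sort of pattern that shows up as long synchronization segments in the Threads view.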

Reports

The profile reports offer a simple way of identifying major contributors to the performance behavior of your application. Whether you are interested in execution overheads, blocking overheads or disk I/O, these reports allow you to focus on the most significant items that may be worth investigating.

There are four types of reports in the Threads view: execution sampling profiles, blocking profiles, file operations and per-thread summaries. All the reports are accessed using the legend. For example, to get the execution profile report, click the execution legend entry. This produces a report in the Profile Report tab. The reports look similar to what is shown in Figure 3.


Figure 3 A Typical Profile Report

For an execution profile report, the Concurrency Visualizer analyzes all the call stacks collected when sampling your application’s execution (green segments) and collates them by identifying shared stack frames to assist the user in understanding the execution structure of the application. The tool also computes inclusive and exclusive costs for each frame. Inclusive samples account for all samples in a given execution path, including all paths below it. Exclusive samples count only the samples in which a frame was the leaf of the call stack, that is, samples taken while the function itself (rather than one of its callees) was executing.
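The difference between inclusive and exclusive counts is easiest to see in code. The sketch below is a simplification: it aggregates per function name rather than per path in a call tree as the report does, and the type and function names are invented for this example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct FrameCost {
    long inclusive = 0;  // samples where the frame appears anywhere on the stack
    long exclusive = 0;  // samples where the frame is the leaf of the stack
};

// Each sample is one call stack ordered from root (for example, main) to leaf.
std::map<std::string, FrameCost> frame_costs(
        const std::vector<std::vector<std::string>>& samples) {
    std::map<std::string, FrameCost> costs;
    for (const auto& stack : samples) {
        assert(!stack.empty());
        // Count a frame once per sample even if recursion repeats it.
        std::set<std::string> seen(stack.begin(), stack.end());
        for (const auto& frame : seen) costs[frame].inclusive += 1;
        costs[stack.back()].exclusive += 1;
    }
    return costs;
}
```

Given four samples with stacks main > compute, main > compute > multiply (twice) and main > log, main has an inclusive count of 4 and an exclusive count of 0, while multiply has 2 for both.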

To get a blocking profile, you click on the blocking category of interest in the legend. The generated report is constructed like the execution profile report, but the inclusive and exclusive columns now correspond to blocking time attributed to the call stacks or frames in the report. Another column shows the number of instances of blocking attributed to that stack frame in the call tree.

These reports offer a convenient way of prioritizing performance tuning efforts by identifying the parts of your application responsible for most delays. The preemption report is informational and usually does not offer any actionable data due to the nature of this category. All the reports allow you to jump to source code. You may do so by right-clicking on a stack frame of interest. The context menu that appears allows you to jump either to the function definition (the View Source option) or to the location in your application where that function was called (the View Call Sites option). If there were multiple callers, you will be presented with multiple options. This allows a seamless integration between the diagnostic data and the development process to tune your application’s behavior. The reports may also be exported for cross-profile comparisons.

The File Operations report shown in Figure 4 includes a summary of all file read and write operations visible in the current time range. For every file, the Concurrency Visualizer lists the application thread that accessed it, the number of read and write operations, the total bytes read or written, and the total read or write latency. Besides showing file operations directly attributed to the application, the Concurrency Visualizer also shows those performed by the System process. These are shown, as mentioned earlier, because they might include file operations performed by the system on behalf of your application. Exporting the report allows cross-profile comparisons during tuning efforts.


Figure 4 File Operations Report

The Per Thread Summary report, shown in Figure 5, presents a bar graph for each thread. The bar is divided into the various thread state categories. This can be a useful tool to track your performance tuning progress. By exporting the graph data across various tuning iterations, you can document your progress and provide a means of comparing runs. The graph will not show all threads for applications that have too many threads to fit within the view.
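If you export the data and want to compare runs numerically, the per-thread breakdown amounts to a percentage of time per state. The sketch below assumes a simple list of (state, duration) intervals for one thread; that layout and the names used are assumptions for illustration, not the tool's export format.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct StateInterval {
    std::string state;  // for example "Execution" or "Synchronization"
    long duration_us;   // time spent in that state, in microseconds
};

// Percentage of the thread's recorded time spent in each state.
std::map<std::string, double> state_percentages(
        const std::vector<StateInterval>& intervals) {
    long total = 0;
    for (const auto& iv : intervals) {
        assert(iv.duration_us >= 0);
        total += iv.duration_us;
    }
    std::map<std::string, double> pct;
    if (total == 0) return pct;  // nothing recorded for this thread
    for (const auto& iv : intervals) {
        pct[iv.state] += 100.0 * iv.duration_us / total;
    }
    return pct;
}
```

A thread that executes for 600 and 100 microseconds and waits on synchronization for 300 microseconds comes out at 70 percent Execution and 30 percent Synchronization.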


Figure 5 Per Thread Summary Report

Cores

Excessive context switches can have a detrimental effect on application performance, especially when threads migrate across cores or processor sockets when they resume execution. This is because a running thread loads instructions and data it needs (often referred to as the working set) into the cache hierarchy. When a thread resumes execution, especially on another core, it can suffer significant latency while its working set is reloaded from memory or other caches in the system.

There are two common ways to reduce this overhead. A developer can either reduce the frequency of context switches by resolving the underlying causes, or he can leverage processor or core affinity. The former is almost always more desirable because using thread affinity can be the source of other performance issues and should only be used in special circumstances. The Cores view is a tool that aids in identifying excessive context switches or performance bugs introduced by thread affinity.

As with the other views, the Cores view displays a timeline with time on the x axis. The logical cores in the system are shown on the y axis. Each thread in the application is allocated a color, and thread execution segments are drawn on the core channels. A legend and context switch statistics are shown in the bottom pane, as shown in Figure 6.


Figure 6 Cores View

The statistics help the user identify threads that have excessive context switches and those that incur excessive core migrations. The user can then use this view to focus her attention on areas of execution where the threads in question are interrupted, or jump back and forth across cores by following the visual color hints. Once a region that depicts the problem is identified, the user can zoom in on it and switch back to the Threads view to understand what triggered the context switches and fix them if possible (for example, by reducing contention for a critical section). Thread affinity bugs can also manifest themselves in some cases when two or more threads contend for a single core while other cores appear to be idle.
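The two statistics can be thought of as simple counts over the execution segments drawn in the view. The following sketch uses its own definitions (a context switch is any time a thread is scheduled back in after its first segment, and a migration is a context switch that resumes on a different core); the tool's exact definitions may differ, and the type and function names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct Segment {
    int thread_id;  // thread that ran during this execution segment
    int core;       // logical core it ran on
};

struct SwitchStats {
    long context_switches = 0;  // times the thread was scheduled back in
    long migrations = 0;        // of those, times it resumed on another core
};

// `timeline` lists execution segments in the order they started.
std::map<int, SwitchStats> switch_stats(const std::vector<Segment>& timeline) {
    std::map<int, SwitchStats> stats;
    std::map<int, int> last_core;
    for (const auto& seg : timeline) {
        assert(seg.core >= 0);
        auto prev = last_core.find(seg.thread_id);
        if (prev == last_core.end()) {
            stats[seg.thread_id];  // first segment: create a zeroed entry
        } else {
            stats[seg.thread_id].context_switches += 1;
            if (prev->second != seg.core) stats[seg.thread_id].migrations += 1;
        }
        last_core[seg.thread_id] = seg.core;
    }
    return stats;
}
```

A thread whose segments land on cores 0, 0 and 2 has two context switches and one migration under these definitions. A high migration count relative to context switches is the pattern to look for when following the color hints across core channels.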

Support for PPL, TPL and PLINQ

The Concurrency Visualizer supports the parallel programming models shipping in Visual Studio 2010 in addition to existing Windows native and managed programming models. Some of the new parallel constructs (parallel_for in the Parallel Pattern Library (PPL), Parallel.For in the Task Parallel Library (TPL) and PLINQ queries) include visualization aids in the performance tool that allow you to focus your attention on those regions of execution.

PPL requires turning on tracing for this functionality to be enabled, as shown in this example:

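#include <concrt.h>  // Concurrency::EnableTracing, Concurrency::DisableTracing
#include <ppl.h>     // Concurrency::parallel_for

// A, B and C are SIZE x SIZE matrices stored as flat arrays.
Concurrency::EnableTracing();   // start emitting trace events for PPL constructs
Concurrency::parallel_for (0, SIZE, 1, [&] (int i2) {
  for (int j2=0; j2<SIZE; j2++) {
    A[i2+j2*SIZE] = 1.0;
    B[i2+j2*SIZE] = 1.0;
    C[i2+j2*SIZE] = 0.0;
  }
});
Concurrency::DisableTracing();  // stop emitting trace events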

When tracing is enabled, the Threads and Cores views will depict the parallel_for execution region by drawing vertical markers at the beginning and end of its execution. The vertical bars are connected via horizontal bars at the top and bottom of the view. Hovering the mouse over the horizontal bars displays a tooltip showing the name of the construct, as shown in Figure 7.


Figure 7 An Example parallel_for Visual Marker in Threads View

TPL and PLINQ do not require manual enabling of tracing for the equivalent functionality in the Concurrency Visualizer.

Collecting a Profile

The Concurrency Visualizer supports both the application launch and attach methods for collecting a profile. The behavior is exactly the same as users of the Visual Studio Profiler are accustomed to. A new profiling session may be initiated through the Analyze menu option either by launching the Performance Wizard, shown in Figure 8, or via the Profiler | New Performance Session option. In both cases, the Concurrency Visualizer is activated by choosing the Concurrency profiling method and then selecting the “Visualize the behavior of a multithreaded application” option.


Figure 8 The Performance Wizard Profiling Method Dialog

The Visual Studio Profiler’s command-line tools allow you to collect Concurrency Visualizer traces and then analyze them using the IDE. This is useful in server scenarios where installing the IDE is not possible, because it lets you collect a trace with minimal intrusion.

You will notice that the Concurrency Visualizer does not have integrated support for profiling ASP.NET applications. However, it may be possible to attach to the host process (usually w3wp.exe) while running your ASP.NET application in order to analyze its performance.

Since the Concurrency Visualizer uses Event Tracing for Windows (ETW), it requires administrative privileges to collect data. You can either launch the IDE as an administrator, or you will be prompted to do so when necessary. In the latter case, the IDE will be restarted with administrator rights.

Linking Visualizations to Application Phases

Another feature in the Concurrency Visualizer is an optional instrumentation library that allows developers to customize the views by drawing markers for application phases they care about. This can be extremely valuable to allow easier correlation between visualizations and application behavior. The instrumentation library is called the Scenario library and is available for download from the MSDN Code Gallery Web site at code.msdn.microsoft.com/scenario. Here’s an example using a C++ application:

#include <tchar.h>     // _tmain, _TCHAR
#include "Scenario.h"

int _tmain(int argc, _TCHAR* argv[]) {
  // Create the scenario object whose phase markers will appear in the views
  Scenario *myScenario = new Scenario(0, L"Scenario Example", (LONG) 0);
  myScenario->Begin(0, TEXT("Initialization"));

  // Initialization code goes here

  myScenario->End(0, TEXT("Initialization"));
  myScenario->Begin(0, TEXT("Work Phase"));

  // Main work phase goes here

  myScenario->End(0, TEXT("Work Phase"));
  delete myScenario;
  return 0;
}

The usage is pretty simple; you include the Scenario header file and link the correct library. Then you create one or more Scenario objects and mark the beginning and end of each phase by invoking the Begin and End methods, respectively. You also specify the name of each phase to these methods. The visualization is identical to that shown in Figure 7, except that the tooltip will display the custom phase name you specify in your code. The scenario markers are also visible in the CPU Utilization view, which is not the case for other markers. An equivalent managed implementation is also provided.

A word of caution is in order here. Scenario markers should be used sparingly; otherwise, the visualizations can be completely obscured by them. In fact, to avoid this problem, the tool will significantly reduce or eliminate the number of markers displayed if it detects excessive usage. In such cases, you can zoom in to expose markers that have been elided in most views. Further, when nesting of Scenario markers takes place, only the innermost marker will be displayed.
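Because every Begin call needs a matching End, one option is to wrap the pair in a small scope guard so that a phase is closed even on an early return or an exception. The sketch below is not part of the Scenario library. PhaseGuard is a hypothetical helper, it assumes Begin and End accept an integer and a wide-character name as in the calls shown earlier (which corresponds to a Unicode build), and RecordingScenario is a stand-in used only so the example compiles without Scenario.h.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Calls Begin on construction and End on destruction. ScenarioT is any type
// with Begin(int, const wchar_t*) and End(int, const wchar_t*) methods.
template <typename ScenarioT>
class PhaseGuard {
public:
    PhaseGuard(ScenarioT& scenario, const wchar_t* name)
        : scenario_(scenario), name_(name) {
        assert(name != nullptr);
        scenario_.Begin(0, name_);
    }
    ~PhaseGuard() { scenario_.End(0, name_); }
    PhaseGuard(const PhaseGuard&) = delete;
    PhaseGuard& operator=(const PhaseGuard&) = delete;

private:
    ScenarioT& scenario_;
    const wchar_t* name_;
};

// Hypothetical stand-in for the real Scenario class: it records the calls
// it receives instead of emitting markers.
struct RecordingScenario {
    std::vector<std::wstring> calls;
    void Begin(int, const wchar_t* name) {
        calls.push_back(std::wstring(L"Begin:") + name);
    }
    void End(int, const wchar_t* name) {
        calls.push_back(std::wstring(L"End:") + name);
    }
};
```

A guard declared at the top of a block, for example PhaseGuard<RecordingScenario> phase(scenario, L"Initialization");, marks the phase for the lifetime of that block. Keeping guards at the level of coarse application phases fits the advice above about using markers sparingly.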

Resources and Errata

The Concurrency Visualizer includes many features to help you understand its views and reports. The most interesting such feature is the Demystify button shown in the top-right corner of all views. By clicking Demystify, you get a special mouse pointer allowing you to click on any feature in view that you’d like help on. This is our way of providing context-sensitive help in the tool.

In addition, there’s a Tips tab with more help content, including a link to a gallery of visualization signatures for some common performance issues.

As mentioned earlier, the tool leverages ETW. Some of the events required by the Concurrency Visualizer do not exist on Windows XP or Windows Server 2003, so the tool only supports Windows Vista, Windows Server 2008, Windows 7 and Windows Server 2008 R2. Both 32-bit and 64-bit variants of these operating systems are supported.

In addition, the tool supports both native C/C++ and .NET applications (excluding .NET 1.1 and earlier). If you are not running on a supported platform, you should explore another valuable concurrency tool in Visual Studio 2010, which is enabled by selecting the “Collect resource contention data” option.

In certain cases, when there’s a significant amount of activity in a profiling scenario or when there is contention for I/O bandwidth from other applications, important trace events may be lost. This results in an error during trace analysis. There are two ways to handle this situation. First, you could try profiling again with a smaller number of active applications, which is a good methodology to follow in order to minimize interference while you are tuning your application. The command-line tools are an additional option in this case.

Second, you can increase the number or size of ETW memory buffers. We provide documentation through a link in the output window to instructions on how to accomplish this. If you choose option two, please set the minimum total buffer size necessary to collect a good trace since these buffers will consume important kernel resources when in use.

Any diagnostic tool is only as good as the data it provides back to the user. The Concurrency Visualizer can help you pinpoint the root causes of performance issues with references to source code, but in order to do so, it needs access to symbol files. You can add symbol servers and paths in the IDE using the Tools | Options | Debugging | Symbols dialog. Symbols for your current solution will be implicitly included, but you should enable the Microsoft public symbol server as well as any other paths that are specific to the application under study where important symbol files may be found. It’s also a good idea to enable a symbol cache because that will significantly reduce profile analysis time as the cache gets populated with symbol files that you need.

Although ETW provides a low-overhead tracing mechanism, the traces collected by the Concurrency Visualizer can be large. Analyzing large traces can be very time-consuming and may result in performance overheads in the visualizations provided by the tool. Generally, profiles should be collected for durations not exceeding one to two minutes to minimize the chances of these issues affecting your experience. For most analysis scenarios, that duration is sufficient to identify the problem. The ability to attach to a running process is also an important feature in order to avoid collecting data before your application reaches the point of interest.

There are multiple sources of information on the Concurrency Visualizer. Please visit the Visual Studio Profiler forum (social.msdn.microsoft.com/forums/en-us/vstsprofiler/threads) for community and development team answers. Further information is available from the team blog at blogs.msdn.com/visualizeparallel and my personal blog at blogs.msdn.com/hshafi. Please feel free to reach out to me or my team if you have any questions regarding our tool. We love hearing from people using the Concurrency Visualizer, and your input helps us improve the tool.


 

Dr. Hazim Shafi is the parallel performance and correctness tools architect in the Parallel Computing Platform team at Microsoft. He has 15 years of experience in many aspects of parallel and distributed computing and performance analysis. He holds a B.S.E.E. from Santa Clara University, and M.S. and Ph.D. degrees from Rice University.

Thanks to the following technical experts for reviewing this article: Drake Campbell, Bill Colburn, Sasha Dadiomov and James Rapp.