June 2010

Volume 25 Number 06

Thread Performance - Resource Contention Concurrency Profiling in Visual Studio 2010

By Maxim Goldin | June 2010

As multicore processors become ever more commonplace, software developers are building multithreaded applications that take advantage of the additional processing capacity to achieve better performance. Using the power of parallel threads, you can split the overall work into separate tasks and execute those tasks in parallel.

Threads, however, often need to communicate with each other to complete a task, and sometimes need to synchronize their behavior if an algorithm or data access requires it. For example, simultaneous write access to the same data should be granted to threads in a mutually exclusive fashion to avoid data corruption.

Synchronization is frequently accomplished through the use of shared synchronization objects, where the thread acquiring the object is granted either shared or exclusive access to the sensitive code or data. When access is no longer required, the thread relinquishes ownership and other threads can attempt to acquire access. Depending on the type of synchronization used, simultaneous requests for ownership might allow multiple threads to access shared resources at the same time, or some of the threads might be blocked until the object is released from previous acquisition. Examples 
include critical sections in C/C++ that use EnterCriticalSection and LeaveCriticalSection access routines, the WaitForSingleObject function in C/C++ and the lock statement and the Monitor class in C#.

The choice of synchronization mechanism must be made with care, because improper synchronization between threads can reduce rather than enhance the performance gains that are the objective of multithreading. Thus, it is increasingly important to be able to detect situations where threads are blocked due to lock contentions that make no progress.

The performance tools in Visual Studio 2010 include a new profiling method—resource contention profiling—that helps you detect concurrency contention among threads. You can find a great first look at this feature in John Robbins’ Wintellect blog post at wintellect.com/CS/blogs/jrobbins/archive/2009/10/19/vs-2010-beta-2-concurrency-resource-profiling-in-depth-first-look.aspx.

In this article, I walk through a contention-profiling investigation and explain the data that can be collected using both the Visual Studio 2010 IDE and command-line tools. I’ll also show you how you can analyze the data in Visual Studio 2010, and you’ll see how to move from one analysis view to another while conducting your contention investigation. Then I will fix the code and compare profiling results of the modified application with the original profiling results to validate that the fix reduces the number of contentions.

Start with the Problem

As an example, I’ll use the same matrix multiplication application Hazim Shafi used in his blog post “Performance Pattern 1: Identifying Lock Contention” (blogs.msdn.com/hshafi/archive/2009/06/19/performance-pattern-1-identifying-lock-contention.aspx). The code example is written in C++, but the concepts I’ll discuss are equally applicable to managed code.

The example matrix multiplication application uses several threads to multiply two matrixes. Each thread gets a portion of the job and runs the following code snippet:

for (i = myid*PerProcessorChunk; 
     i < (myid+1)*PerProcessorChunk; 
     i++) {
  EnterCriticalSection(&mmlock);
  for (j=0; j<SIZE; j++) {
    for (k=0; k<SIZE; k++) {
      C[i][j] += A[i][k]*B[k][j];
    }
  }
  LeaveCriticalSection(&mmlock);
}

Each thread has its own ID (myid), and is responsible for calculating the number of rows (one or more) in the resulting matrix C, using matrixes A and B as an input. Close code inspection shows that no truly ambiguous write-sharing happens, and each thread writes to a different row of C. Yet the developer decided to guard the assignment to the matrix with a critical section. I thank the developer for this, as it gives me a good opportunity to demonstrate the new Visual Studio 2010 performance tools to easily find the redundant synchronization.

Profiling Data Collection

Assuming you have a Visual Studio project with the code shown earlier (although it’s not required—you can attach the profiler to any application that’s already running), you start contention profiling by clicking Launch Performance Wizard on the Analyze menu.

On the first page of the wizard, shown in Figure 1, choose Concurrency and make sure the “Collect resource contention data” option is checked. Note that resource contention concurrency profiling works on any version of the Windows OS. The “Visualize the behavior of a multithreaded application” option requires Windows Vista or Windows 7.

image: Enabling Concurrency Resource Profiling
Figure 1 Enabling Concurrency Resource Profiling

On the second page of the wizard, make sure the current project is targeted. On the last page of the wizard, make sure “Launch profiling after the wizard finishes” option is checked, then click Finish. The application starts running under the profiler. When it exits, the profiling data file appears in the Performance Explorer window (see Figure 2).

image: Performance Profiling Result File in Performance Explorer
Figure 2 Performance Profiling Result File in Performance Explorer

The profiling report automatically opens in Visual Studio and displays the performance investigation results in the Summary view, which is shown in Figure 3.

image: Summary View of the Profiling Report
Figure 3 Summary View of the Profiling Report

Profiling Data Analysis

Not all synchronization causes lock contention. If a lock is available, an attempt to take an ownership of the lock does not block thread execution and no contention happens. In Resource Contention Profiling mode, the profiler collects data only for synchronization events that cause contention and does not report successful (unblocked) resource acquisitions. If your application does not cause any contentions, no data will be collected. If you get data, it means your application has lock contentions.

For each contention, the profiler reports which thread was blocked, where the contention occurred (resource and call stack), when the contention occurred (timestamp) and the amount of time (length) that the thread was blocked trying to acquire a lock, enter a critical section, wait for a single object, and so on.

When you open the file, you’ll first see the Summary view (Figure 3), with three main areas you can use for brief diagnostics:

  1. The contentions chart shows the number of contentions per second plotted for the lifetime of your application. You can visually inspect contention spikes or select a time interval and either zoom into it or filter the results. Filtering re-analyzes the data and removes data outside the selected interval.
  2. The Most Contended Resources table lists the resources that caused the most detected contentions.
  3. The Most Contended Threads table lists the threads with the highest number of contentions. This table uses the number of contentions as a criterion, not the length of the contentions. Therefore, you might have a thread that’s blocked in a single contention for a long time, but it won’t be displayed in Summary view. On the other hand, a thread that experienced many very short contentions, with each contention blocking the thread for only a very short time, would be presented in the Summary view.

If you see a resource that’s responsible for the majority of contentions, inspect that resource in more detail. If you observe a thread that experiences a large number of contentions you did not expect, inspect the contentions of the thread.

For example, in Figure 3, you can see that Critical Section 1 is responsible for nearly all (99.90 percent) contentions in that application. Let’s investigate that resource more closely.

Resource names and thread IDs on the Summary view are hyperlinks. Clicking on Critical Section 1 transfers you to the Resource Details view (see Figure 4), where the context is set to the specific resource—Critical Section 1.

image: Resource Details View
Figure 4 Resource Details View

Resource Details

The upper part of the Resource Details view shows a time-based chart where each horizontal line belongs to a thread. The lines are labeled by the thread root function unless you name the managed thread in your code (for example, by using the C# System.Threading.Thread.Name property). A block on this line represents a contention of the thread on the resource. The block length is the contention length. Blocks from different lines might overlap in time, which means several threads blocked on the resource at the same time.

The Total line is special. It doesn’t belong to any specific thread, but contains all contentions of all threads on this resource (it is actually a projection of contention blocks to the line). As you can see, Critical Section 1 was quite busy—it doesn’t seem to have any empty slots on its Total line.

You can zoom into a specific portion of the chart by selecting a time range using the left mouse button (left-click at the point in the chart you want to start and then drag the pointer to the right). There are two links on the upper-right part of the chart—Zoom reset and Zoom out. Zoom reset restores the original chart view. Zoom out takes you back step by step, un-zooming the chart the same way you zoomed in.

The overall pattern of contention blocks might lead you to some conclusions about your application execution. For example, you can see that contentions of various threads are heavily overlapped in time, which hints at a less than optimal parallelization. Each thread is blocked on the resource much longer than it’s running, and it’s yet another indication of the application’s inefficiency.

Function Details

The bottom part of the Resource Details view is a contention call stack—no data is displayed until you select a specific contention. When you select a block, the corresponding stack shows up in the bottom panel. You can also hover over a contention block on the chart without clicking on it, and a pop-up window will give you the stack and the contention length.

As you can see in the contention call stack, one of the example app functions called MatMult is listed, so you know it was the cause of the contention. To determine which line of the function code is responsible for the contention, double-click the function name in the call stack panel. That takes you to the Function Details view, shown in Figure 5.

image: Function Details View
Figure 5 Function Details View

In this view you see a graphical presentation of functions that called MatMult, as well as functions that were called inside of it. The bottom section of the view clearly shows that EnterCriticalSection(&mmlock) is responsible for the threads being blocked all the time.

When you know which line of your code is responsible for contentions, you may reconsider your decision to implement synchronization that way. Is it the best way to protect your code? Is protection required at all?

In the example app, using a critical section in this code is unnecessary because threads don’t share writes to the same result matrix rows. The Visual Studio performance tools bring you to the point where you can comment out the use of mmlock, significantly speeding up the application. If only it were always that easy!

For a more in-depth description of Function Details view, see the Visual Studio Profiler Team blog at blogs.msdn.com/profiler/archive/2010/01/19/vs2010-investigating-a-sample-profiling-report-function-details.aspx.

Thread Details

As I mentioned earlier, Summary view provides a good starting point for your investigation. By looking at the Most Contended Resources and Most Contended Threads tables, you can decide how to proceed. If you find that one of the threads looks suspicious because you didn’t expect it to be in the top list of contended threads, you might decide to give the thread a closer look.

Click the thread ID on the Summary view to jump to the Thread Details view (see Figure 6). Though this view looks similar to the Resource Details view, it has a different meaning because it displays contentions in the context of the selected thread. Each horizontal line represents a resource the thread was contending for during the thread lifetime. On this chart you won’t see contention blocks overlapping in time because that would mean the same thread was blocked on more than one resource at the same time.

image: Thread Details View with Selected Contention Block
Figure 6 Thread Details View with Selected Contention Block

Note that WaitForMultipleObjects (which I’m not showing here) is handled separately and is represented with a single chart line for the set of objects. This is because the profiler treats all parameter objects of WaitForMultipleObjects as a single entity.

Any manipulations you can do in Resource Details view (zooming the chart in and out, selecting specific contentions and viewing the length in milliseconds, and the calling stack) are applicable to the Thread Details view as well. Double-click the function name in the Contention Call Stack panel to navigate to the Function Details view of that function.

In the example you can see that the thread spends more time being blocked than running in the early part of the execution, and then it’s blocked for a long time on some set of multiple handles. As the last block is caused by waiting for other threads to complete, early contentions indicate non-optimal thread usage, causing the thread to be in a blocked state more than in an executing state.

Chasing Down the Problem

As you might have noticed, the chart axis labels are hyperlinks. This allows switching between detailed views of resources and threads, setting the required context for the view each time. This can be helpful in an iterative approach to finding and solving a problem. For example, you can inspect resource R1 that blocked many threads. You can go from the Resource Details view to a detailed view of thread T1 and find out that it was blocked not only on R1, but sometimes also on resource R2. You can then dive into the details of R2 and observe all threads that were blocked by R2. Next you can click on the label of thread T2 that draws your attention to check all resources that blocked T2, and so on.

Contention profiling data won’t give you an explicit answer to the question of who is holding a lock at any given time. But  given fair use of synchronization object between threads, and your knowledge of the application’s behavior, you can identify a possible lock owner (a thread that succeeded in synchronization lock acquisition) by pivoting your data from resource details to thread details and back.

For example, suppose in the Thread Details view you see a thread T that’s blocked on resource R at time t. You can switch to the Resource Details view of R by clicking on the R label, and see all threads that were blocked on R during the application lifetime. At time t you’ll see a number of them (including T) blocked on R. A thread that’s not blocked on R at time t is a possible lock holder.

I noted earlier that the chart’s Total line is a projection of all contention blocks. The Total label is also a hyperlink, but from Resource Details view it takes you to the Contention view (see Figure 7), which is a collection of contention call trees per resource. The hot path of the appropriate resource call tree is activated for you. This view shows contentions and blocking-time statistics for each resource and for each node (function) in the resource call tree. Unlike the other views, this one aggregates contention stacks to the resource call tree, just like in other profiling modes, and gives you statistics for the whole run of the application.

image: Contention View with Hot Path Applied to Critical Section 1
Figure 7 Contention View with Hot Path Applied to Critical Section 1

 

From the Contention view, you can go back to the Resource Details view of any resource using the context menu. Point to a resource, right-click your mouse and select Show Contention Resource Details. Other interesting actions are also available in the context menu. As a general suggestion, explore the context menus in Profiler views—they can be quite helpful!

Click the Total label of the Thread Details view to display the Processes view, where the thread is selected (see Figure 8). In this view you can see when the thread was started relative to the application start time, when it was terminated, how long it ran, how many contentions it experienced, and how long it was blocked across all contentions (in milliseconds, and in percentage of the thread’s lifetime).

image: Processes View
Figure 8 Processes View

 

Again, it is possible to come back to the Thread Details view for any thread by using the context menu—select a thread of interest, right-click and select Show Thread Contention Details.

Another possible investigation flow would be to display the Processes view directly when the file is opened, sort the threads by clicking on the title of one of the available columns (for example, sorting threads by the number of contentions), select one of the threads and then switch to the contention details chart of the thread through the context menu.

Fix the Problem and Compare Results

After you find the root cause of the lock contentions in the application, you can comment out the mmlock critical section and then rerun profiling:

for (i = myid*PerProcessorChunk; 
     i < (myid+1)*PerProcessorChunk; 
     i++) {
  // EnterCriticalSection(&mmlock);
  for (j=0; j<SIZE; j++) {
    for (k=0; k<SIZE; k++) {
      C[i][j] += A[i][k]*B[k][j];
    }
  }
  // LeaveCriticalSection(&mmlock);
}

You would expect the number of contentions to decrease, and indeed profiling of the modified code reports only one lock contention, as shown in Figure 9.

Figure 9 Summary View of Profiling Results for Fixed CodeFigure 9 Summary View of Profiling Results for Fixed Code

We can also compare the new and previous performance results in Visual Studio. To do this, select both files in Performance Explorer (select one file, press Shift or Ctrl and then select another), then right-click and select Compare Performance Reports.

A Comparison Report appears, as shown in Figure 10. In the example app, you can see that number of Inclusive Contentions of the MatMult function dropped from 1,003 to 0.

image: Comparison Report
Figure 10 Comparison Report

Alternative Data Collection Methods

If you create your performance session for either Sampling or Instrumentation profiling, you can always convert it later to Concurrency mode. One way to do it quickly is to use the profiling mode menu in the Performance Explorer. Just select the mode you’d like to be in, and you’re good to go.

You can also go through the Properties setting of your session. Point to your session in Performance Explorer, right-click to display the context menu, then select Properties. The General tab of Property Pages gives you control over your profiling session mode and other profiling parameters.

Once your profiling mode is set for Concurrency (or Sampling, for that matter), you can either launch your application (it’s already in your Targets list if you used Performance Wizard, or you can add it there manually), or you can attach to an application that’s up and running. Performance Explorer gives you the controls to do these tasks, as Figure 11 shows.

image: Profiling Controls of Performance Explorer
Figure 11 Profiling Controls of Performance Explorer

The Visual Studio UI automates a number of the steps necessary to collect profiling data. However, it is possible to collect profiling data by using command-line tools, which can be useful for automated runs and scripts.

To launch your application in contention profiling mode, open the Visual Studio command prompt (which puts all profiler binaries in your path, either x86 or x64 tools), and then do the following:

  1. VSPerfCmd.exe /start:CONCURRENCY,RESOURCEONLY /output:<YourOutputFile>

  2. VSPerfCmd.exe /launch:<Your Application> /args:“<Your Application Arguments>”

  3. Run your scenario

  4. VSPerfCmd.exe /detach

    • This step is not required if your application terminates, but it causes no harm so you can add it to your scripts.
  5. VSPerfCmd.exe /shutdown

    Now you can open YourOutputFile.VSP in Visual Studio for analysis.

If you have an application that’s running already, you can attach the profiler to it using these steps:

  1. VSPerfCmd.exe /start:CONCURRENCY,RESOURCEONLY /output:<YourOutputFile>
  2. VSPerfCmd.exe /attach:<PID or Process Name>
  3. Run your scenario
  4. VSPerfCmd.exe /detach
  5. VSPerfCmd.exe /shutdown

A more detailed explanation of available command-line options can be found at msdn.microsoft.com/library/bb385768(VS.100).

A variety of Visual Studio views let you closely inspect the collected data. Some views give a picture of the application lifetime in whole, while others focus on specific contentions—use those you find most valuable.

When you analyze the results of profiling, you can use transitions from one view to another through hyperlinks, double-clicks or context menus, or you can switch directly to any available view through a drop-down menu. Figure 12 briefly describes each of the views.

Figure 12 Analysis Views

View Description
Summary Summary information is presented to serve as a starting point for your investigation. This is the first view you see, and it opens automatically after a profiling session is over and the result file is ready.
Call Tree An aggregated call tree of all contention stacks. Here you can see which stacks were responsible for your contentions.
Modules A list of modules that contain functions, each resulting in a contention. Each module has a list of relevant functions and the number of detected contentions.
Caller/Callee A three-panel view that presents function F, all functions that call F, and functions that are called by F (only calls resulted in contentions, of course).
Functions A list of all detected functions on any contention stack, with associated data.
Lines The function lines in the source code.
Resource Details Details about a specific resource (for example, a lock), showing all threads that were blocked on it during the application lifetime.
Thread Details Details about a specific thread, showing all resources (such as locks) the thread was blocked on.
Contention Similar to the call tree view, but here call trees are separated per contention resource. In other words, this view presents a set of call trees, each containing stacks that were blocked on a specific resource.
Marks A list of automatically and manually recorded marks, where each mark is associated with its timestamp and the values of Windows counters.
Processes A list of inspected processes, where each process has a list of its threads, and each thread is attributed with the number of contentions it experienced and the summarized length of blocked time.
Function Details Details about a specific function, including the functions it calls and collected data.
IPs A list of instruction pointers where contention happened (well, a list of functions like EnterCriticalSection, WaitForSingleObject and so on, because this is where contention actually happens).

The new resource contention profiling features in Visual Studio  should help you discover performance issues by using thread synchronization in the code, and allow you to improve your application runtime by changing, reducing or eliminating unnecessary synchronization.                                    


Maxim Goldin is a senior software design engineer at Microsoft. He has worked on the Visual Studio Engineering team since 2003. He can be reached at mgoldin@microsoft.com, and he blogs at blogs.msdn.com/b/mgoldin.

For additional information about Performance Investigation Flows see my companion blog post at blogs.msdn.com/b/mgoldin/archive/2010/04/22/resource-contention-concurrency-profiling-in-visual-studio-2010-performance-investigation-flows.aspx.

Thanks to the following technical experts for reviewing this article: Steve Carroll, Anna Galaeva, Daryush Laqab, Marc Popkin-Paine, Chris Schmich and Colin Thomsen