September 2009
Volume 24 Number 09
PARALLEL DEBUGGING - Debugging Task-Based Parallel Applications in Visual Studio 2010
By Daniel Moth | September 2009
Ask any CPU or GPU hardware manufacturer, and they will tell you that the future is manycore. Processor core speeds are not increasing at the exponential rates of the past four decades; instead, new machines are being built with more cores. As a result, the "free" performance improvements that application developers relied on year after year are gone. To regain the free lunch offered by better and better hardware -- and to enhance your applications with new performance-sensitive features -- you need to take advantage of multiple cores via parallelism.
In Visual C++ 10 and the Microsoft .NET Framework 4, both available with Visual Studio 2010, Microsoft is introducing new libraries and runtimes to significantly ease the process of expressing parallelism in your code base, together with new tool support for performance analysis and debugging of parallel applications. In this article, you will learn about debugging support in Visual Studio 2010, much of which is focused on task-based programming models.
The Need for Task-Based Programming
The reason to inject parallelism into your application is to take advantage of multiple cores. A single, sequential piece of work will run on only one core at a time. For an application to use multiple cores, multiple pieces of work are needed to enable multiple threads to process that work in parallel. Thus, given a single piece of work, achieving parallel acceleration through multicore execution requires partitioning that single piece of work into multiple units that can run concurrently.
The simplest schemes involve static partitioning: split the work into a fixed number of fixed-size units. Of course, you don't want to have to write your code for each configuration of hardware it will be executed on, and predetermining a fixed number of units ahead of time inhibits the scalability of your application as it runs on bigger and better machines. You can instead choose the number of units dynamically at run time based on details of the machine. For example, you can partition work into one unit per core. This way, if all the units are equal in size in terms of the processing time they require, and if you use one thread per unit, you should be able to saturate the machine.
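This dynamic, one-unit-per-core scheme can be sketched as follows. (This is our illustration, not code from the article; the summation workload and all names are assumptions made for the example.)

```csharp
using System;
using System.Threading;

public static class PartitionDemo
{
    // Sum an array by splitting it into one unit of work per core.
    // This assumes each unit costs roughly the same -- the very
    // assumption that breaks down for real-world workloads.
    public static long ParallelSum(int[] data)
    {
        int units = Environment.ProcessorCount;  // decided at run time
        long total = 0;
        var threads = new Thread[units];
        for (int u = 0; u < units; u++)
        {
            int start = u * data.Length / units;       // this unit's range
            int end = (u + 1) * data.Length / units;
            threads[u] = new Thread(() =>
            {
                long local = 0;
                for (int i = start; i < end; i++) local += data[i];
                Interlocked.Add(ref total, local);     // combine partial sums safely
            });
            threads[u].Start();
        }
        foreach (var t in threads) t.Join();
        return total;
    }
}
```

If every element costs the same to process, this saturates the machine; the next paragraph explains why that rarely holds in practice.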
This approach, however, still leaves a lot to be desired. It's rare that a real-world workload can be split in such a way that each unit is guaranteed to take the same amount of time, especially when you take into account external factors such as other workloads that might be running on the machine concurrently and consuming some of the machine's resources. In such cases, one-unit-per-core partitioning will likely end up distributing the work unevenly: some threads will complete their units before others, resulting in load imbalance, and some cores will sit idle while others finish. To address this, you want to over-partition the work, dividing the workload into the smallest feasible units so that all of the machine's resources can partake in the processing of the workload until it is complete.
If executing a unit of work incurred zero overhead, the solution just proposed would be ideal, but very few real-world operations incur zero overhead. Historically, threads have been the mechanism for executing such a unit of work: you create a thread for each unit of work, let it execute, and then tear down the thread. Unfortunately, threads are relatively heavyweight, and the overhead incurred from using them in this manner can prohibit the type of over-partitioning we've described. What you need is a lighter-weight mechanism for executing these partitioned units to minimize overhead -- a mechanism that will let you over-partition with less guilt. With this approach, rather than creating one thread per unit, you could utilize a scheduler that schedules individual units to be executed on threads that it manages, keeping the number of threads as small as possible while still ensuring maximum throughput.
What we've just described is a thread pool, which amortizes the cost of thread management across all the work items scheduled to it, thus minimizing the overhead associated with an individual work item. In Windows, such a thread pool is accessible through the QueueUserWorkItem function exported from Kernel32.dll. (Windows Vista introduced new thread pooling functionality as well.) In .NET Framework 4, such a pool is accessible through the System.Threading.ThreadPool class.
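In managed code, that amortization looks something like the following minimal sketch. The completion-counting logic is ours, added because the pool's work items are fire-and-forget and offer no built-in completion notification:

```csharp
using System.Threading;

public static class PoolDemo
{
    // Queue many small work items onto the shared thread pool. The pool's
    // threads are created once and reused, so each item avoids the cost
    // of creating and tearing down a dedicated thread.
    public static int ProcessItems(int count)
    {
        int processed = 0, remaining = count;
        using (var done = new ManualResetEvent(false))
        {
            for (int i = 0; i < count; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    Interlocked.Increment(ref processed);   // stand-in for real work
                    // Manual completion tracking: the pool itself won't tell us
                    // when all items are finished.
                    if (Interlocked.Decrement(ref remaining) == 0)
                        done.Set();
                });
            }
            done.WaitOne();
        }
        return processed;
    }
}
```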
While the previously mentioned APIs enable decomposition with relatively minimal overhead, they're largely targeted at "fire and forget" work. For example, the .NET Framework 4 ThreadPool class doesn't provide any consistent mechanism for exception handling, for cancellation of work, for waiting on work to complete, for receiving notifications when work completes, and so forth. These gaps are filled by new APIs in .NET Framework 4 and Visual C++ 10 designed for "task-based" programming in both managed and native code. Tasks represent units of work that can be executed efficiently by an underlying scheduler while still exposing rich functionality for working with and controlling aspects of their execution. In Visual C++ 10, these APIs are centered on the Concurrency::task_group and Concurrency::task_handle types. In .NET Framework 4, they are centered on the new System.Threading.Tasks.Task class.
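For example, a managed task can be waited on, can produce a result, and surfaces any exception from its delegate when that result is requested. A minimal sketch (the method and its parameter are illustrative):

```csharp
using System;
using System.Threading.Tasks;

public static class TaskDemo
{
    public static int ComputeOrRethrow(bool fail)
    {
        // StartNew hands the delegate to the underlying scheduler.
        Task<int> t = Task.Factory.StartNew(() =>
        {
            if (fail) throw new InvalidOperationException("oops");
            return 40 + 2;
        });

        // Unlike a raw thread-pool work item, the task supports waiting,
        // and accessing Result propagates any exception thrown by the
        // delegate, wrapped in an AggregateException.
        try
        {
            return t.Result;
        }
        catch (AggregateException ae)
        {
            throw ae.InnerException;   // unwrap the single inner exception
        }
    }
}
```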
Figure 1 Parallel Stacks Window
Figure 2 Parallel Tasks Window
Debugging in Visual Studio Today
The history of software development has demonstrated time and time again that programming models benefit greatly from exemplary debugging support, and Visual Studio 2010 delivers in this regard by providing two new debugging tool windows to assist with task-based parallel programming. But before we look at these new features, let's review the debugging experience in Visual Studio today to set the stage.
(For the rest of this article, we'll use the .NET task-based types for explanatory purposes, but the debugging support described applies equally to native code as well.)
The entry point to debugging a process in Visual Studio is, of course, attaching the debugger. This occurs by default when you press F5 (the equivalent of choosing the Debug > Start Debugging command) in a project that is open in Visual Studio. You can also manually attach the debugger to a process by choosing the Debug > Attach to Process menu command. Once the debugger is attached, the next step is to break into the debugger. This can occur in multiple ways, including hitting a user-defined break point, manually breaking (via the Debug > Break All command), the process requesting it (for example, in managed code, via a call to the System.Diagnostics.Debugger.Break method), or even when an exception is thrown.
Figure 3 Finding Primes
//C#
static void Main(string[] args)
{
var primes =
from n in Enumerable.Range(1,10000000)
.AsParallel()
.AsOrdered()
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
where IsPrime(n)
select n;
foreach (var prime in primes) Console.Write(prime + ", ");
}
public static bool IsPrime(int numberToTest) // WARNING: Buggy!
{
// 2 is a weird prime: it’s even. Test for it explicitly.
if (numberToTest == 2) return true;
// Anything that’s less than 2 or that’s even is not prime
if (numberToTest < 2 || (numberToTest & 1) == 0) return false;
// Test all odd numbers less than the sqrt of the target number.
// If the target is divisible by any of them, it’s not prime.
// We don’t test evens, because if the target is divisible
// by an even, the target is also even, which we already checked for.
int upperBound = (int)Math.Sqrt(numberToTest);
for (int i = 3; i < upperBound; i += 2)
{
if ((numberToTest % i) == 0) return false;
}
// It’s prime!
return true;
}
'Visual Basic
Shared Sub Main(ByVal args() As String)
Dim primes = From n In Enumerable.Range(1, 10000000).AsParallel().AsOrdered().WithMergeOptions(ParallelMergeOptions.NotBuffered)
Where IsPrime(n)
Select n
For Each prime In primes
Console.Write(prime & ", ")
Next prime
End Sub
Public Shared Function IsPrime(ByVal numberToTest As Integer) As Boolean ' WARNING: Buggy!
' 2 is a weird prime: it’s even. Test for it explicitly.
If numberToTest = 2 Then
Return True
End If
' Anything that’s less than 2 or that’s even is not prime
If numberToTest < 2 OrElse (numberToTest And 1) = 0 Then
Return False
End If
' Test all odd numbers less than the sqrt of the target number.
' If the target is divisible by any of them, it’s not prime.
' We don’t test evens, because if the target is divisible
' by an even, the target is also even, which we already checked for.
Dim upperBound = CInt(Fix(Math.Sqrt(numberToTest)))
For i = 3 To upperBound - 1 Step 2
If (numberToTest Mod i) = 0 Then
Return False
End If
Next i
' It’s prime!
Return True
End Function
After your process breaks into the debugger, all the threads in your application are halted: no code executes at that point until you continue execution (excluding threads that the debugger itself uses). This halt in execution allows you to inspect the state of your application at that moment. When inspecting application state, you often have a mental picture of what the state should be, and you can use the various debugger windows to spot a difference between expectation and reality.
The main debugging windows that developers use in Visual Studio are the Threads window, the Call Stack window, and the variable windows (Locals, Autos, Watch). The Threads window displays a list of all the threads in your process, including information such as the thread ID and priority, along with an indication (a yellow arrow) of the current thread, which by default is the thread that was executing when the debugger broke into the process. Probably the most important information about a thread is where it was executing when the debugger halted its execution, shown by the call stack frame in the Location column. Hovering your cursor over that column reveals the equally important call stack -- the series of method calls that the thread was in the process of executing before reaching the current location.
The Call Stack window, which displays the call stack of the current thread, provides much richer information about the call stack, including interaction opportunities.
To display the call stack of another thread in the Call Stack window, you have to make the other thread current by double-clicking it in the Threads window. The method it is currently executing in (which is at the top of the call stack) is indicated by a yellow arrow and is known as the "topmost frame," the "leaf frame," or the "active stack frame." This is the method from which the thread will continue execution when you leave the debugger and continue running the application. By default, the active stack frame is also the current stack frame -- in other words, the method that drives the variable inspection, which we'll describe next.
Figure 4 Setting Conditional Breakpoints
The variable windows are used to inspect the values of variables in your application. The variables of local methods are usually browsed in the Locals and Autos windows; global state (variables not declared in a method) can be examined by adding them to the Watch window. Starting with Visual Studio 2005, more and more developers examine state by hovering their mouse pointers over a variable of interest and reviewing the resulting pop-up DataTip (which can be thought of as a shortcut to the Quick Watch windows). It is important to note that values for variables can be displayed only if the variables are in the scope of the current stack frame (which, as we established earlier, is by default the active stack frame of the current thread).
To examine variables that were in scope earlier in the call stack of the thread, you need to change the current stack frame by double-clicking the stack frame you want to examine in the Call Stack window. At this point, the new current stack frame is indicated by a green curved tail arrow (while the active stack frame retains the yellow arrow). If you also want to examine variables on another thread, you need to change the current thread in the Threads window and then switch the current frame on its call stack in the Call Stack window.
In summary, when you break into your process in the debugger, you can very easily inspect the variables in scope at the executing method of one of the threads. However, to create a complete picture of where all your threads are executing, you need to individually examine the call stack of each thread by double-clicking each thread to make it current, looking at the Call Stack window, and then creating the holistic picture mentally. Furthermore, to examine variables on various stack frames of various threads, two levels of indirection are needed again: switch threads and then switch frames.
Parallel Stacks
When applications use more threads (which will become commonplace as people use machines with more processing resources), you need to be able to see in a single view where those threads are executing at any given moment. That is what the Parallel Stacks tool window in Visual Studio 2010 delivers.
To preserve screen real estate, but also to indicate methods of particular interest to parallelism scenarios, the window coalesces into the same node the call stack segments that are common among threads at their root. For example, in Figure 1, you can see the call stacks of three threads in a single view. The figure shows one thread that went from Main to A to B and two other threads that started from the same external code and then went to A. One of these threads continued to B and then to some external code, and the other thread continued to C and then to some AnonymousMethod. AnonymousMethod is also the active stack frame, and it belongs to the current thread. Many other features are supported in this window, such as zoom, a bird's-eye view, filtering of threads via flagging, and most of the same functionality already available in the Call Stack window.
Figure 5 Choosing the Freeze All Threads But This Command
Figure 6 Coalescing of Stack Frames
If your application creates tasks rather than threads, you can switch to a task-centric view. In this view, call stacks of threads not executing tasks are omitted. Additionally, call stacks for threads are trimmed to represent the real call stacks of tasks -- that is, a single-thread call stack could include two or three tasks that you want to split out and view separately. A special feature of the Parallel Stacks window allows you to pivot the diagram on a single method and clearly observe the callers and callees of that method context.
Figure 7 Task-Based Code with Dependencies
//C#
static void Main(string[] args) // WARNING: Buggy!
{
var task1a = Task.Factory.StartNew(Step1a);
var task1b = Task.Factory.StartNew(Step1b);
var task1c = Task.Factory.StartNew(Step1c);
Task.WaitAll(task1a, task1b, task1c);
var task2a = Task.Factory.StartNew(Step2a);
var task2b = Task.Factory.StartNew(Step2b);
var task2c = Task.Factory.StartNew(Step2c);
Task.WaitAll(task1a, task1b, task1c);
var task3a = Task.Factory.StartNew(Step3a);
var task3b = Task.Factory.StartNew(Step3b);
var task3c = Task.Factory.StartNew(Step3c);
Task.WaitAll(task3a, task3b, task3c);
}
'Visual Basic
Shared Sub Main(ByVal args() As String) ' WARNING: Buggy!
Dim task1a = Task.Factory.StartNew(Step1a)
Dim task1b = Task.Factory.StartNew(Step1b)
Dim task1c = Task.Factory.StartNew(Step1c)
Task.WaitAll(task1a, task1b, task1c)
Dim task2a = Task.Factory.StartNew(Step2a)
Dim task2b = Task.Factory.StartNew(Step2b)
Dim task2c = Task.Factory.StartNew(Step2c)
Task.WaitAll(task1a, task1b, task1c)
Dim task3a = Task.Factory.StartNew(Step3a)
Dim task3b = Task.Factory.StartNew(Step3b)
Dim task3c = Task.Factory.StartNew(Step3c)
Task.WaitAll(task3a, task3b, task3c)
End Sub
Parallel Tasks
In addition to looking at the real call stacks of tasks in the Parallel Stacks window, another new debugger window exposes additional information about tasks, including the task ID, the thread assigned to the task, the current Location, and the entry point (the delegate) passed to the task at creation. This window, called the Parallel Tasks window, exposes features similar to the Threads window, such as indicating the current task (the top-most task running on the current thread), the ability to switch the current task, flagging of tasks, and freezing and thawing threads.
Figure 8 Using Parallel Tasks to Find Dependency Problems
Perhaps the biggest value to developers is the Status column. The information provided in the Status column allows you to distinguish between running tasks and tasks that are waiting (on another task or on a synchronization primitive) or tasks that are deadlocked (a specialization of waiting tasks for which the tool detects a circular wait chain). The Parallel Tasks window also displays scheduled tasks, which are tasks that have not run yet but are sitting in some queue waiting to be executed by a thread. An example can be seen in Figure 2. For more information on both the Parallel Stacks and Parallel Tasks windows, see the blog posts at danielmoth.com/Blog/labels/ParallelComputing.html and the MSDN documentation at msdn.microsoft.com/dd554943(VS.100).aspx.
Figure 9 Deadlocking Code
//C#
static void Main(string[] args)
{
int transfersCompleted = 0;
Watchdog.BreakIfRepeats(() => transfersCompleted, 500);
BankAccount a = new BankAccount { Balance = 1000 };
BankAccount b = new BankAccount { Balance = 1000 };
while (true)
{
Parallel.Invoke(
() => Transfer(a, b, 100),
() => Transfer(b, a, 100));
transfersCompleted += 2;
}
}
class BankAccount { public int Balance; }
static void Transfer(BankAccount one, BankAccount two, int amount)
{
lock (one) // WARNING: Buggy!
{
lock (two)
{
one.Balance -= amount;
two.Balance += amount;
}
}
}
'Visual Basic
Shared Sub Main(ByVal args() As String)
Dim transfersCompleted = 0
Watchdog.BreakIfRepeats(Function() transfersCompleted, 500)
Dim a = New BankAccount With {.Balance = 1000}
Dim b = New BankAccount With {.Balance = 1000}
Do
Parallel.Invoke(Sub() Transfer(a, b, 100), Sub() Transfer(b, a, 100))
transfersCompleted += 2
Loop
End Sub
Friend Class BankAccount
Public Balance As Integer
End Class
Shared Sub Transfer(ByVal one As BankAccount, ByVal two As BankAccount, ByVal amount As Integer)
SyncLock one ' WARNING: Buggy!
SyncLock two
one.Balance -= amount
two.Balance += amount
End SyncLock
End SyncLock
End Sub
Find the Bug
One of the best ways to understand new tooling functionality is to see it in action. To do that, we've created a few buggy code snippets, and we'll use the new tool windows to find the underlying errors in the code.
Single-Stepping
First, consider the code shown in Figure 3. The goal of this code is to output the prime numbers between 1 and 10,000,000 and to do so in parallel. (The parallelization support is provided by Parallel LINQ; see blogs.msdn.com/pfxteam and msdn.microsoft.com/dd460688(VS.100).aspx for more information.) The implementation of IsPrime is buggy, as you can see by running the code and viewing the first few numbers output:
2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 23, 25, ...
Most of these numbers are primes, but 9, 15, and 25 are not. If this were a single-threaded application, you could easily step through the code to find the reason for the inaccurate results. However, when you perform single-stepping (choose Debug > Step Into, for example) in a multithreaded program, any thread in the program is eligible for stepping. This means that as you step you might be jumping between threads, making it more difficult to understand the control flow and the diagnostic information about the current location in the program. To assist with this, you can take advantage of several capabilities of the debugger. The first is to set conditional breakpoints.
As shown in Figure 4, you can set a breakpoint (in this case on the first line of the IsPrime method) and indicate to the debugger to break in only when a certain condition is met -- in this case, when one of the inaccurate "primes" is being evaluated.
We could have set the debugger to break in on just one of these values (rather than on any of them), but we can't make assumptions about the order in which PLINQ evaluates the values under the covers. Instead, we told the debugger to look for any of these values so as to minimize the wait time before it breaks.
After the debugger breaks in, we want to tell it to single-step just the current thread. To do that, we can take advantage of the debugger's ability to freeze and thaw threads and specify that frozen threads won't run until we thaw them. The new Parallel Tasks window makes it easy to find the thread that should be allowed to continue (look for the yellow arrow icon) and to freeze all other threads (via the ContextMenu), as shown in Figure 5.
With the irrelevant thread(s) frozen, we can now single-step through our buggy IsPrime. By debugging numberToTest==25, we can easily see what's gone wrong: the loop should include the upperBound value in its test, whereas this value is currently being excluded because the loop uses the less-than operator rather than less-than-or-equals. Here, the square root of 25 is 5, and 25 is evenly divisible by 5, but 5 won't be tested, so 25 is erroneously categorized as a prime.
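With that diagnosis in hand, the fix is a one-character change: use less-than-or-equals so that upperBound itself is tested. The corrected method reads:

```csharp
using System;

public static class Primes
{
    public static bool IsPrime(int numberToTest)   // fixed
    {
        // 2 is a weird prime: it's even. Test for it explicitly.
        if (numberToTest == 2) return true;
        // Anything that's less than 2 or that's even is not prime.
        if (numberToTest < 2 || (numberToTest & 1) == 0) return false;
        // Test all odd numbers up to AND INCLUDING the sqrt of the
        // target, so perfect squares such as 25 (sqrt = 5) are caught.
        int upperBound = (int)Math.Sqrt(numberToTest);
        for (int i = 3; i <= upperBound; i += 2)   // was: i < upperBound
        {
            if ((numberToTest % i) == 0) return false;
        }
        return true;
    }
}
```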
The Parallel Stacks window also provides a nice, consolidated view of what's happening in our program when we break. Figure 6 shows the current state of the application after we run it again, and this time explicitly break in using the debugger's Break All capability.
PLINQ is executing IsPrime in multiple tasks, and the value of numberToTest for all those tasks is visible in the pop-up, here showing that Task 1 is processing numberToTest==8431901, while Task 2 is processing numberToTest==8431607.
Dependency Problems
The code in Figure 7 shows an instance of a pattern common in parallel applications. This code forks off multiple operations (step1a, step1b, step1c, which are all methods of the form "void StepXx()") that might run in parallel and then joins on them. Subsequently, the application forks again with code that requires the previous operations to already be complete because of a dependency on the operations' side effects (such as writing data to some shared arrays).
Unfortunately, this code includes a bug, and the developer who wrote it is seeing some inaccurate results being computed by the third set of tasks. The implication is that even though the developer is waiting for all of the prior tasks to complete, something is amiss, and not all of the previous computations have actually completed their results. To debug the code, the developer sets a breakpoint on the last WaitAll call and uses the Parallel Tasks window to see the current state of the program, which is shown in Figure 8.
Sure enough, the Parallel Tasks window shows that the Task for Step2c is still running even though the tasks for Step 3 have been scheduled. A review of the second Task.WaitAll call demonstrates why: because of typing errors, task1a, task1b, and task1c are being waited on instead of their task2 counterparts.
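The corrected sequence, shown here with empty placeholder StepXx bodies so the sketch compiles on its own, simply waits on the right tasks at the second barrier:

```csharp
using System.Threading.Tasks;

public static class Pipeline
{
    // Placeholders standing in for the article's "void StepXx()" methods.
    static void Step1a() { } static void Step1b() { } static void Step1c() { }
    static void Step2a() { } static void Step2b() { } static void Step2c() { }
    static void Step3a() { } static void Step3b() { } static void Step3c() { }

    public static void Run()
    {
        var task1a = Task.Factory.StartNew(Step1a);
        var task1b = Task.Factory.StartNew(Step1b);
        var task1c = Task.Factory.StartNew(Step1c);
        Task.WaitAll(task1a, task1b, task1c);

        var task2a = Task.Factory.StartNew(Step2a);
        var task2b = Task.Factory.StartNew(Step2b);
        var task2c = Task.Factory.StartNew(Step2c);
        Task.WaitAll(task2a, task2b, task2c);   // was task1a/1b/1c -- the bug

        var task3a = Task.Factory.StartNew(Step3a);
        var task3b = Task.Factory.StartNew(Step3b);
        var task3c = Task.Factory.StartNew(Step3c);
        Task.WaitAll(task3a, task3b, task3c);
    }
}
```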
Deadlocks
Figure 9 provides the prototypical example of a deadlock scenario, which results from not paying attention to lock ordering. The main code is continually transferring money between bank accounts. The Transfer method is meant to be thread-safe so that it can be called concurrently from multiple threads. As such, it internally locks on the BankAccount objects handed to it simply by locking on the first and then locking on the second. Unfortunately, this behavior can lead to deadlocks, as running this code will demonstrate. Eventually, the debugger breaks in when it finds that no transfers are proceeding. (Breaking in is performed using code that issues a Debugger.Break if it notices that no new transfers have been completed after a certain amount of time. This code is included in the download that accompanies this article.)
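The watchdog itself isn't reproduced in the listing; one possible implementation (ours, for illustration -- the article's actual version ships with its code download) polls the counter on a background thread and breaks when it stops changing:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Watchdog
{
    // Break into the debugger if 'counter' reports the same value across
    // two samples taken 'intervalMs' apart -- i.e., if no progress has
    // been made during that window. Sketch only; matches the call shape
    // Watchdog.BreakIfRepeats(() => transfersCompleted, 500) in Figure 9.
    public static void BreakIfRepeats(Func<int> counter, int intervalMs)
    {
        var t = new Thread(() =>
        {
            int last = counter();
            while (true)
            {
                Thread.Sleep(intervalMs);
                int current = counter();
                if (current == last && Debugger.IsAttached)
                    Debugger.Break();   // no progress: likely deadlocked
                last = current;
            }
        });
        t.IsBackground = true;   // don't keep the process alive
        t.Start();
    }
}
```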
Figure 10 Information on Deadlocks in Parallel Tasks
Figure 11 Parallel Stacks Showing Deadlocks
Figure 12 Method View in Parallel Stacks
When you're working in the debugger, you immediately see a graphical representation demonstrating that there is a deadlock, as shown in Figure 10. The figure also demonstrates that hovering the pointer over the Waiting-Deadlocked status provides further details about exactly what's being waited on and which thread is holding the protected resource. Looking at the Thread Assignment column, you can see that Task 2 is waiting on a resource held by Task 1, and if you were to hover over Task 1, you would see the inverse.
This information can also be deduced from the Parallel Stacks tool window. Figure 11 shows the Task View in Parallel Stacks, which highlights that there are two tasks, each of which is blocked in a call to Monitor.Enter (due to the lock statements from Figure 9). And Figure 12 demonstrates the Method View available in the Parallel Stacks window (via the corresponding toolbar button). By focusing our view on the Transfer method, we can easily see that there are two tasks currently in Transfer, both of which have gone on to a call to Monitor.Enter. Hovering the pointer over that box provides further information on the deadlocked status of both tasks.
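The standard remedy for this class of deadlock is to impose a global lock order. Here is one sketch; the ordering key (identity hash codes) is our choice for illustration, and a unique per-account ID would be more robust in production:

```csharp
using System.Runtime.CompilerServices;

public class BankAccount { public int Balance; }

public static class SafeTransfers
{
    public static void Transfer(BankAccount one, BankAccount two, int amount)
    {
        // Always lock the account with the smaller identity hash first, so
        // concurrent Transfer(a, b) and Transfer(b, a) calls acquire the
        // locks in the same order and can never deadlock on each other.
        // (Identity hashes can collide; this is a sketch, not a guarantee.)
        object first = one, second = two;
        if (RuntimeHelpers.GetHashCode(one) > RuntimeHelpers.GetHashCode(two))
        {
            first = two;
            second = one;
        }
        lock (first)
        {
            lock (second)
            {
                one.Balance -= amount;
                two.Balance += amount;
            }
        }
    }
}
```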
Figure 13 Creating a Lock Convoy
//C#
static void Main(string[] args) // WARNING: Causes a lock convoy!
{
object obj = new object();
Enumerable.Range(1, 10).Select(i =>
{
var t = new Thread(() =>
{
while (true)
{
DoWork();
lock (obj) DoProtectedWork();
}
}) { Name = "Demo " + i };
t.Start();
return t;
}).ToList().ForEach(t => t.Join());
}
'Visual Basic
Public Shared Sub Run()
'Watchdog.BreakIn(3000);
Dim obj As New Object()
Enumerable.Range(1, 10).Select(Function(i)
Dim t = New Thread(Sub()
Do
DoWork()
SyncLock obj
DoProtectedWork()
End SyncLock
Loop
End Sub)
t.Name = "Demo " & i
t.Start()
Return t
End Function).ToList().ForEach(Sub(t)
t.Join()
End Sub)
End Sub
Lock Convoys
Lock convoys can occur when multiple threads repeatedly compete for the same protected region. (Wikipedia provides a good summary of lock convoys at en.wikipedia.org/wiki/Lock_convoy). The code in Figure 13 provides a quintessential example of a lock convoy: multiple threads repeatedly doing some amount of work outside a protected region but then taking a lock to do some additional work inside that protected region. Depending on the ratio between the work inside and outside the region, performance problems might result. These sorts of problems are visible after program execution by using a tool like the concurrency profiler that's available in Visual Studio 2010, but they can also be caught during execution by using a debugging tool like the Parallel Stacks window.
Figure 14 Lock Convoys with Parallel Stacks
Figure 14 shows an execution of the code in Figure 13; the debugger was broken in a few seconds after execution began. You can see at the top of the image that nine threads are currently blocked waiting on a monitor -- all waiting for one thread to exit DoProtectedWork so that one of them can continue into the protected region.
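Once a convoy is diagnosed, the usual remedies are to shrink the protected region or to batch the protected work so the lock is taken less often. A sketch of the batching approach (the structure and names are ours, for illustration):

```csharp
using System.Collections.Generic;

public static class ConvoyMitigation
{
    static readonly object sync = new object();
    static readonly List<int> shared = new List<int>();

    // Instead of taking the shared lock once per result, each thread
    // accumulates results locally and publishes them in batches,
    // dividing contention on the lock by the batch size.
    public static void Worker(IEnumerable<int> results, int batchSize)
    {
        var local = new List<int>(batchSize);
        foreach (int r in results)
        {
            local.Add(r);
            if (local.Count == batchSize)
            {
                lock (sync) shared.AddRange(local);   // one acquisition per batch
                local.Clear();
            }
        }
        if (local.Count > 0)
            lock (sync) shared.AddRange(local);       // flush the remainder
    }

    public static int SharedCount
    {
        get { lock (sync) return shared.Count; }
    }
}
```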
Wrapping Up
In this article, you've seen examples of how Visual Studio 2010 debugger tool windows can simplify the act of finding bugs in task-based code. The task-based APIs for managed and native code are richer than what we could show in the short examples in this article, and we encourage you to explore them further in .NET Framework 4 and Visual C++ 10. On the tools front, in addition to the two new debugger windows discussed, a new parallel performance analyzer is integrated into the existing profiler in Visual Studio.
To get your hands on all the bits above, download the beta of Visual Studio 2010 from msdn.microsoft.com/dd582936.aspx.
Daniel Moth and Stephen Toub both work on the Parallel Computing Platform team at Microsoft. Stephen is also a contributing editor to MSDN Magazine.