September 2009

Volume 24 Number 09

Synchronization Coverage - Code Coverage for Concurrency

By Chris Dern | September 2009

We are at another crossroads in the industry, as more and more transistor-laden processors demand multi-threaded code to realize their full potential. While machines from desktops to netbooks sport no less than dual-core processors under the hood, throngs of hungry transistors sit idle -- crying for multi-threaded applications to devour. To address the oncoming wave of concurrent applications, companies like Intel and Microsoft are racing to market with profilers, frameworks, debuggers and libraries. As multi-threaded applications proliferate, discussions of deadlocks, livelocks and data races will become increasingly common across the software industry. Software development professionals need to adopt new tools, techniques and metrics that can deal with multi-threaded software.

Code Coverage and Concurrency

Traditional code coverage metrics, such as statement, block, and branch, fail to address test adequacy concerns introduced by concurrency. Let's take statement coverage as an example. Although an imperfect metric, statement coverage is popular, and it is helpful to a degree in finding out how thoroughly you have tested your code. As a refresher, statement coverage measures how many statements of your application's code were executed. It is a simple metric that is used by many unit testing tools, and many software development teams include high statement coverage as a code-quality bar. Unfortunately, statement coverage does not give you any visibility as to how much of the concurrency in your code is being tested.

As an example, Figure 1 shows a simple implementation of a thread-safe queue type written in C# (throughout this article, we use C# and .NET for our examples, but the idea of synchronization coverage is applicable to native code as well).

Our example has only two methods, Enqueue and Dequeue, and wraps a .NET Queue<T> by surrounding every enqueue and dequeue operation with a lock. How might we test this queue implementation? The following code shows a simple unit test for our thread-safe queue:

void SimpleQueueTest() {
    ThreadSafeQueue<int> q = new ThreadSafeQueue<int>();
    q.Enqueue(10);
    Assert.AreEqual(10, q.Dequeue());
}

Note that this simple test gives us 100 percent statement coverage, since this one test exercises every statement in our queue implementation.

But, wait, we have a problem -- the queue is supposed to be thread-safe, but thus far we've tested it using only a single thread! We aren't testing the critical behavior that motivated us to implement this ThreadSafeQueue type in the first place. Statement coverage tells us that we have 100 percent coverage when we use one thread on a type meant to be accessed simultaneously. Clearly, statement coverage is inadequate to tell us how much of the concurrency aspects of the code we have tested. What are we missing?

Figure 1 A Thread-Safe Queue implementation in C#

public class ThreadSafeQueue<T> {
    object m_lock = new object();
    Queue<T> m_queue = new Queue<T>();

    public void Enqueue(T value) {
        lock (m_lock) {
            m_queue.Enqueue(value);
        }
    }

    public T Dequeue() {
        lock (m_lock) {
            return m_queue.Dequeue();
        }
    }
}

What we are missing is a branch hidden inside the lock statement. Imagine that we replace the lock with a custom locking primitive, such as the simplified implementation of a spin-lock method, as follows:

void Acquire() {
    while (!TryAcquire()) {
        // idle for a bit
        Spin();
    }
}

The Acquire method calls TryAcquire, and if it fails to acquire the lock (that is, if TryAcquire returns false), it spins a little and tries again. If we combined this custom locking code with our Enqueue method, how would that affect statement coverage? The following code shows how we can rewrite Enqueue with custom locking:

public void Enqueue(T value) {
    while (!TryAcquire()) {
        // idle for a bit
        Spin();
    }
    m_queue.Enqueue(value);
    Release();
}

Now, if we run our test, we suddenly see missing statement coverage. Any single-threaded test will miss the Spin statement, because TryAcquire only returns false when another thread already holds the lock. In other words, the method spins only if some other thread has entered the critical section; we can cover the Spin statement only if there was contention. This implicit branch inside the lock statement is the source of our coverage hole.

Synchronization Coverage

Because we don’t expect anybody to replace lock statements with their own mutual exclusion primitives (nor do we advocate doing so), we need to find a way to expose this hidden branch, so that we can measure the amount of contention that occurred during our tests. Researchers at IBM came up with a code coverage model called Synchronization Coverage that can do just this.

You can read the paper referenced in the More Information section below, but the core idea is simple. First, we take a synchronization primitive -- for example, the .NET Monitor type (which is used by the lock keyword in C# and the SyncLock keyword in Visual Basic). Then, at every point in the code where the lock is acquired, we record whether the acquisition either blocked (in the case of, say, Monitor.Enter) or failed outright (for example, with Monitor.TryEnter). Either outcome means that contention occurred. Thus, to say that we have coverage over a lock, it's not sufficient to have executed the lock statement; we must also execute it while some other thread is holding the lock.

While this technique applies to both native and managed applications, the rest of this article will discuss our managed solution -- a prototype synchronization coverage tool for .NET called Sync Cover.

Let's take a moment to clarify the concepts used by our coverage tool. A sync point is a specific lexical call site that invokes a synchronization method. Taking a quick look back at our sample thread-safe queue (Figure 1), with our C# compiler hat on, we see two such places hidden in the lock statements in Enqueue and Dequeue. For any single execution of a sync point, we are interested in two types of coverage: statement coverage, where the sync point executed but did not experience any contention; and contention coverage, where it was forced to block or wait for another thread to release the lock.
For the rest of this article we will focus specifically on System.Threading.Monitor, the workhorse of the .NET synchronization stable. However, note that this approach works equally well with other primitives, such as System.Threading.Interlocked.CompareExchange.
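To make that concrete, here is one way a coverage-aware wrapper for Interlocked.CompareExchange might look. This is only a sketch of the idea, not code from the tool: we treat a failed exchange (the value we saw wasn't the comparand, meaning another thread got there first) as contention, and the CoverageTracker.MarkPoint call is the bookkeeping method our Monitor wrapper uses later in this article.

    public static class CoverInterlocked {
        static readonly CoverageTracker coverageTracker = new CoverageTracker();

        public static int CompareExchange(ref int location, int value,
                                          int comparand, int syncId) {
            int original = Interlocked.CompareExchange(ref location, value, comparand);
            // If the observed value wasn't the comparand, another thread
            // raced us here -- record that as contention for this sync point.
            coverageTracker.MarkPoint(syncId, original != comparand);
            return original;
        }
    }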

Round Tripping with IL

Our goal for this synchronization coverage tool is to transparently instrument existing .NET assemblies so that we can intercept every Monitor.Enter call in the assembly and determine whether that sync point encountered contention. Like the majority of code coverage solutions in use today, our tool rewrites IL (the .NET Intermediate Language) to produce instrumented versions of the target assemblies. Although we spend most of our time in C#, because the tool deals directly with the IL it should, in theory, work with any .NET language that compiles to IL.

While there are a few possible technologies to enable code rewriting, including the newly released Common Compiler Infrastructure from Microsoft Research (MSR), for simplicity we chose to fall back to the trusty IL assembler and disassembler combo, ILASM and ILDASM. Working directly with the IL in plain text files proved to be a huge boon during the discovery, design and development phases of the project, enabling us to explore design choices with a "pay-for-play" time investment. We first proved the viability and practicality of our solution via a completely manual process consisting of batch files and Notepad. While this may not be the best choice for a production-quality tool, we recommend this approach for any similar prototype rewriting application.

With this overall design decision complete, let's take a quick tour of our tool chain, as shown in Figure 2.

Using ILDASM, we disassemble the target assembly into a text file of IL instructions. We load that source into memory behind an object-based façade, apply the required modifications and code injections, and emit the modified source back to disk. Then it is a simple matter of reassembling the covered assembly with ILASM and checking it with PEVerify to catch any invalid transformations we may have applied. Once we have our coverage assemblies, we perform our test pass as normal and collect the runtime coverage file for later analysis.

Our tool set consists of two main parts: Sync Cover, which automates this instrumentation process and provides the IL round-trip platform upon which we build our instrumentation; and Cover Viewer, a Windows Presentation Foundation (WPF) application that provides a familiar and intuitive way to explore the coverage files.

Capturing Contention

By now, you should be wondering how we can observe and record when a sync point experiences contention. Looking back at the "hidden-branch" mental model of how Monitor.Enter could be implemented provides a clue, if we apply it almost literally in our transformation. In the simplified spin-lock implementation shown earlier, the TryAcquire call captures exactly this information in its return value. All we need to do is observe it.

As much as we like rewriting other people's IL, inserting arbitrary branches promises to open a Pandora's box of problems we would rather avoid. And we still have a second problem: recording the sync point data and linking it back to a particular call site. Both suggest we reach into the programmer's toolbox for some indirection.

We solve both problems by introducing a wrapper class -- a synchronization coverage-aware façade over Monitor whose methods take a syncId argument. This syncId is a generated integer that uniquely identifies a sync point; we give every sync point a unique ID, and when we hit a sync point, we pass that ID to our wrapper class. Our coverage implementation begins by calling Monitor.TryEnter, recording the result and then, if needed, delegating to the original blocking Monitor.Enter. The sync point states are managed by a simple in-memory database class we call the CoverageTracker. Putting all the pieces together, our coverage version of Monitor.Enter is shown in Figure 3.

Figure 3 Coverage Version of Monitor.Enter

namespace System.Threading.SyncCover {
    public static class CoverMonitor {
        public static void Enter(object obj, int syncId) {
            if (Monitor.TryEnter(obj)) {
                coverageTracker.MarkPoint(syncId, false); // No contention
            } else {
                Monitor.Enter(obj);
                coverageTracker.MarkPoint(syncId, true);  // Contention
            }
        }
    }
}
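Figure 3 delegates its bookkeeping to the CoverageTracker, whose implementation isn't listed in this article. A minimal thread-safe sketch of what it might look like (the Dictionary-based counters here are our assumption, not the tool's actual code):

    public sealed class CoverageTracker {
        // syncId -> [total executions, executions that contended]
        readonly Dictionary<int, int[]> m_points = new Dictionary<int, int[]>();
        readonly object m_lock = new object();

        public void MarkPoint(int syncId, bool contended) {
            lock (m_lock) {
                int[] counts;
                if (!m_points.TryGetValue(syncId, out counts)) {
                    counts = new int[2];
                    m_points.Add(syncId, counts);
                }
                counts[0]++;                 // executed
                if (contended) counts[1]++;  // executed under contention
            }
        }
    }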

Note that the code in Figure 3 is intended to support the .NET 3.5 version of Monitor.Enter. The upcoming .NET 4 Monitor type includes changes that make it resilient against asynchronous exceptions -- see blogs.msdn.com/ericlippert/archive/2009/03/06/locks-and-exceptions-do-not-mix.aspx. Adding support for the .NET 4 overloads is just a matter of overloading the wrappers in a similar fashion.

Once we have our wrapper method in hand, the instrumentation process becomes fairly straightforward. Let's look at the IL produced by the .NET 3.5 C# compiler for the lock statement, as follows:

IL_0001:  ldfld  object class ThreadSafeQueue`1<!T>::m_lock
IL_0002:  call   void [mscorlib]System.Threading.Monitor::Enter(object)

We just need to look for calls to Monitor.Enter and replace them with calls to our CoverMonitor.Enter, while not forgetting to inject an additional syncId argument before the method invocation. The following code illustrates this transformation:

IL_0001:  ldfld  object class ThreadSafeQueue`1<!T>::m_lock
I_ADD01:  ldc.i4 1  // sync id
IL_0002:  call   void System.Threading.SyncCover.CoverMonitor::Enter(object, int32)

As a final part of this process, we should talk a bit about reporting, and about what type of information is useful. So far, the sync points are identified only by a syncId injected directly into the target code. It would be much more useful if we could map each sync point back to its source file and line number. We found that ILDASM with the /LINENUM option provides the information we need by extracting source file locations and line numbers from the program database (PDB).

When we encounter a new sync point during the instrumentation process, we generate the next syncId, and capture this contextual information in a map. This mapping, seen in the following code, is then emitted as a file at the end of the instrumentation:

T|ThreadSafeQueue.exe
SP|
0|D:\ThreadSafeQueue\ThreadSafeQueue.cs|9|ThreadSafeQueue`1<T>|Enqueue
1|D:\ThreadSafeQueue\ThreadSafeQueue.cs|15|ThreadSafeQueue`1<T>|Dequeue

Figure 4 Example Coverage Run

~|B3|~
A|ThreadSafeQueue.exe
C|ThreadSafeQueue.exe
D|20090610'091754
T|demo
M|DWEEZIL
O|Microsoft Windows NT 6.1.7100.0
P|4
SP|
0|1|1
1|1|0
~|E|~

Show Me the Data!

It's show time. Once we have our covered binaries, it's a simple matter of executing the program as many times as you like. This produces a runtime file containing the coverage metrics collected during the execution, as shown in Figure 4. Taking inspiration from other coverage tools in the market, we provide a simple WPF-based viewer to visualize the coverage results.
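As a rough illustration of what the viewer computes, the SP records in Figure 4 appear to be of the form syncId|executed|contended. Under that reading (our interpretation of the example, not a documented file-format spec), the contention-coverage percentage could be tallied like this:

    // Tally contention coverage from SP records such as "0|1|1" and "1|1|0".
    static double ContentionCoverage(IEnumerable<string> spRecords) {
        int points = 0, contended = 0;
        foreach (string record in spRecords) {
            string[] fields = record.Split('|');
            points++;
            if (int.Parse(fields[2]) > 0) {
                contended++;  // this sync point contended at least once
            }
        }
        return points == 0 ? 0.0 : 100.0 * contended / points;
    }

For the two records in Figure 4, this would report 50 percent: the Enqueue sync point contended, while Dequeue did not.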

But let's step back for a moment. We know that our single-threaded test case will give us zero percent synchronization coverage. How can we fix the test so that we can cause contention? Figure 5 shows a slightly improved test that should cause contention in the Enqueue method. After running the test, we can see the results, as shown in Figure 6.

Here we see the three possible coverage states for a given sync point. In this example, Enqueue experienced at least one count of contention and so appears in green. Dequeue, on the other hand, was executed but never contended, as shown in yellow. We've also added a new Count property, which was never called and shows as red.

Figure 5 A Test That Causes Contention on Enqueue

void ThreadedQueueTest() {
    ThreadSafeQueue<int> q = new ThreadSafeQueue<int>();
    Thread t1 = new Thread(() => {
        for (int i = 0; i < 10000; i++) {
            q.Enqueue(i);
        }
    });
    Thread t2 = new Thread(() => {
        for (int i = 0; i < 10000; i++) {
            q.Enqueue(i);
        }
    });
    t1.Start();
    t2.Start();
    t1.Join();
    t2.Join();
    Assert.AreEqual(0, q.Dequeue());
}

Using Synchronization Coverage

So when should we use synchronization coverage? Like any code coverage metric, synchronization coverage attempts to measure how well you have tested your code. Any locking in your application means the code was meant to be accessed by multiple threads simultaneously, and you should seriously ask whether your test suite actually exercises those locks under contention. Your team should aim for 100 percent synchronization coverage, especially if your application is expected to run with a high degree of parallelism.

Trying to practice what we preach, we have used this synchronization coverage tool in the course of testing the Parallel Extensions to the .NET Framework. It has helped us find testing holes and bugs during this development cycle, and we expect to continue using the metric going forward. Two scenarios where synchronization coverage has helped us are particularly interesting:

Non-concurrent multi-threaded tests

Synchronization coverage revealed that some tests we thought were running concurrently actually were not. Figure 7 shows a simple multi-threaded test similar to Figure 5, but this time each thread enqueues only 10 items.

Figure 7 A Concurrent Test That Might Not Be Concurrent After All

void ThreadedQueueTest() {
    ThreadSafeQueue<int> q = new ThreadSafeQueue<int>();
    Thread t1 = new Thread(() => {
        for (int i = 0; i < 10; i++) {
            q.Enqueue(10);
        }
    });
    Thread t2 = new Thread(() => {
        for (int i = 0; i < 10; i++) {
            q.Enqueue(12);
        }
    });
    t1.Start();
    t2.Start();
    t1.Join();
    t2.Join();
}

Just because we started two threads, however, does not mean that they actually ran in parallel. We found that in tests like these, where each thread does very little work, one thread more often than not runs to completion before the other even starts. We want a test that consistently experiences contention. One solution is to have each thread enqueue more items, making contention more likely. A better solution is to use a tool like CHESS, which forces every interleaving between the two threads to occur.
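A middle ground worth considering (our suggestion here, not a feature of the tool) is to hold both threads at a start gate so neither can finish before the other begins. Even this only raises the odds of contention -- it does not guarantee it:

    ThreadSafeQueue<int> q = new ThreadSafeQueue<int>();
    ManualResetEvent go = new ManualResetEvent(false);

    Thread t1 = new Thread(() => {
        go.WaitOne();  // wait at the gate
        for (int i = 0; i < 10; i++) q.Enqueue(10);
    });
    Thread t2 = new Thread(() => {
        go.WaitOne();  // wait at the gate
        for (int i = 0; i < 10; i++) q.Enqueue(12);
    });

    t1.Start();
    t2.Start();
    go.Set();   // release both threads at once
    t1.Join();
    t2.Join();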

Unnecessary locks

It is quite possible that some synchronization points are not being covered because they cannot be covered. In the following example, if the lock on b is only ever acquired while the lock on a is held, then no two threads can ever contend on b, and its sync point can never be covered:

lock (a) {
    lock (b) {
        // ... stuff ...
    }
}

Surprisingly, synchronization coverage has actually helped us find unnecessary locks like these. It turns out that a resource protected by a lock may no longer need protection. For example, a thread that once competed for the resource may have been removed from the code, while the corresponding lock in the remaining code path was not. While the extra lock is behaviorally harmless, it can still hurt performance. (Note, though, that just because a lock is never contended doesn't mean it's unnecessary; the data simply provides a starting point for an investigation into whether it's actually needed.)

Limitations and Future Work

Like any code coverage metric, synchronization coverage is not perfect -- it has its limitations. A chief limitation that synchronization coverage shares with other coverage metrics is that it cannot measure what is not there. Synchronization coverage cannot tell you that you need to lock over some resource, only that the resources you have locked have been contended over. Thus, even 100 percent synchronization coverage does not mean that your testing effort is done, only that you have achieved some level of thoroughness in your testing.

Another issue is that instrumenting your code for synchronization coverage can alter the scheduling of the threads in your tests, such that you get coverage in your instrumented run but not in your uninstrumented execution (although in our experience, this is usually not a problem). Again, a tool like CHESS can help.

Our prototype tool measures contention over Monitor and Interlocked operations. In the future, we plan to add support for other synchronization primitives, such as Semaphore and ManualResetEvent. We believe that a code coverage metric like synchronization coverage could become as useful and widespread for concurrent applications as statement coverage is for single-threaded applications.

Measurement Is a Must

You can't improve what you can't measure. Thus, as we develop more and more multi-threaded software applications, we must measure how thoroughly we have tested the concurrency aspects of our software. Synchronization coverage is a simple, practical way to do this. We hope that as you navigate this whole new world of multi-core computing, you can take the ideas in this article to improve your quality process.

More Information

  • Parallel Computing at Microsoft:
    msdn.microsoft.com/concurrency
  • The synchronization coverage paper:
    Bron, A., Farchi, E., Magid, Y., Nir, Y., and Ur, S. 2005. Applications of synchronization coverage. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Chicago, Ill., USA, June 15-17, 2005). PPoPP '05. ACM, New York, N.Y., 206-212.
  • The CHESS tool:
    research.microsoft.com/en-us/projects/chess/default.aspx
    Musuvathi, M. and Qadeer, S. 2007. Iterative context bounding for systematic testing of multi-threaded programs. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (San Diego, Calif., USA, June 10-13, 2007). PLDI '07. ACM, New York, N.Y., 446-455.
  • Common Compiler Infrastructure (CCI):
    ccimetadata.codeplex.com

Roy Tan is a software development engineer in test with the Parallel Computing Platform group at Microsoft. He received his Ph.D. in Computer Science at Virginia Tech in 2007.

Chris Dern is a concurrency software development engineer in test for the Parallel Computing Platform team at Microsoft -- where the only thing better than writing concurrency software is testing it.