CLR 4.0 ThreadPool Improvements: Part 1

This is the first in a series of posts about the improvements we are making to the CLR thread pool for CLR 4.0 (which will ship with Visual Studio 2010). This post will cover changes to the queuing infrastructure in the thread pool, which aim to enable high-performance fine-grained parallelism via the Task Parallel Library. Future posts will cover the “thread injection” algorithm, and any other topics that readers would like to see.

Please note that all the usual caveats apply here: I’m discussing pre-release software, and all details are subject to change before final release. In fact, one goal of this post is to solicit feedback, so we can know what changes we need to make before we ship. J

A thread pool basically has two functions: It maintains a queue (or queues) of work to be done, and a collection of threads which execute work from the queue(s). So designing a thread pool really comes down to a) finding ways to enqueue and dequeue work items very quickly (to keep the overhead of using the thread pool to a minimum) and b) developing an algorithm for choosing an optimal number of threads to service the queues. This post will cover part (a).

In all prior releases of the CLR, the thread pool exposed a single way to queue work: ThreadPool.QueueUserWorkItem (which I will abbreviate as QUWI from here on out). There are a couple of overloads of this method, and also a version called UnsafeQueueUserWorkItem, but these all amount to basically the same thing: you give us a delegate, and we stick it on a queue for later execution.

(Really there are even more ways to queue work. As mentioned in my previous post we really have two pools of threads – the “worker threads” and the “I/O threads.” Work is queued to the I/O threads in response to the completion of asynchronous I/O, or manually via ThreadPool.UnsafeQueueNativeOverlapped. We currently do not plan any significant changes to the I/O pool for CLR 4.0, as our focus for this release is on enabling fine-grained computational parallelism. For the remainder of this post, we will only discuss the mechanisms behind the “worker threads.”)

QUWI conveys basically zero information about each work item, aside from that it exists. This places some important constraints on the execution of these items. For example, the thread pool does not know whether individual work items are related or not, so it has to assume they are all completely independent, implying that we cannot reorder work to optimize its execution, as independent work items typically must be executed in FIFO order to ensure fairness. (Imagine each work item represents a request from a user - you would not want to keep earlier requests waiting while later requests are processed, as this would result in unacceptably long latency for the users who made their requests first).

This means that we are basically forced to use a single FIFO queue for all work queued via QUWI. In prior versions of the CLR, this queue was a simple linked list, protected by a Monitor lock. This incurs some overhead: we must allocate nodes for the list (and pay the cost of the GC having to traverse the list each time a GC occurs), and we must pay the cost of acquiring that lock every time we enqueue or dequeue a work item.

(Another aside: please do not take the above to mean that we make any guarantees about strict FIFO ordering of work item execution. In fact, we violate this “rule” already: since .NET 3.5, the CLR thread pool has maintained separate FIFO queues for each AppDomain in the process, and an additional independent FIFO queue for “native” work items such as those queued by a host ( being the prime user of this feature). We round-robin between these work queues, allowing each to execute work for some time before moving on to the next.

This strategy is motivated by performance concerns, as it greatly reduces the number of transitions between AppDomains, which are fairly expensive. But it is designed to maintain fairness, which is the chief concern which we can never completely abandon. We may make further changes in the future which further deviate from the strict FIFO model, but we are unlikely to ever make QUWI really unfair, which, as we will see, is crucial to achieving good performance for fine-grained workloads.)

This was fine for the kinds of workloads for which the thread pool was originally designed. These are relatively "coarse" workloads, where each work item represents a fairly large amount of work. The canonical example is an web application, where each work item represents the generation of an entire web page. In such workloads, the work itself takes long enough that the overhead of allocating queue nodes and acquiring locks is barely noticeable.

However, in the new world of machines with rapidly increasing core counts, there is increased interest in more "fine grained" work. Where before the job of the thread pool was to take a large number of independent, coarse-grained, tasks and funnel them onto a few threads, we are increasingly being asked to execute many very small tasks representing tiny pieces of some larger operation.

Anyone who has tried executing such a workload on the existing CLR thread pool has probably found that it's not a simple matter of calling QUWI for each piece of the calculation; with such tiny work items, the overhead of enqueuing and dequeuing the work can be much greater than the work itself, resulting in slower execution than if we had just run the work on a single thread to begin with! It is possible to make this work, by “batching” work into a smaller number of calls to QUWI. There are many strategies for this, all of which are fairly complex in the general case. We would like to make this easy, but the current QUWI is insufficient for this goal.

We can improve this situation in a couple of ways: we can implement a more efficient FIFO queue, and we can enhance the API to allow the user to give us more information, allowing us to turn to even more efficient queuing strategies. For CLR 4.0, we are doing both of these.

Faster FIFO

Recall that the overhead of the existing FIFO queue comes from the expense of allocating and traversing the data structure, and the cost of acquiring the lock on each enqueue and dequeue operation. For 4.0, we are switching to a lock-free data structure with much lower synchronization overhead. More importantly, this new queue is much friendlier to the GC; we still need to allocate a new object for each call to QUWI, but these objects are smaller, and are tracked in large “chunks” which are much easier for the GC to traverse than the simple linked list used previously. This new queue is virtually identical to System.Collections.Concurrent.ConcurrentQueue<T>, which is also new in 4.0.

Improving the performance of QUWI is nice, as it benefits existing applications which use the thread pool without requiring any changes to the application code. How much of a speedup you can expect will depend greatly on many factors, including your application’s workload and the details of the particular hardware on which it executes, but for fine-grained workloads on multi-core hardware the speedup should be significant.

However, we are still restricted in what we can do here – we still have very little information about the work we’re executing, and so we still need to use the same basic strategy to execute it. We can trim overhead here and there, but QUWI will probably never be a great way to execute very fine-grained workloads. We need a new API.

The Task Parallel Library

The Task Parallel Library (TPL) is a collection of new classes specifically designed to make it easier and more efficient to execute very fine-grained parallel workloads on modern hardware. TPL has been available separately as a CTP for some time now, and was included in the Visual Studio 2010 CTP, but in those releases it was built on its own dedicated work scheduler. For Beta 1 of CLR 4.0, the default scheduler for TPL will be the CLR thread pool, which allows TPL-style workloads to “play nice” with existing, QUWI-based code, and allows us to reuse much of the underlying technology in the thread pool - in particular, the thread-injection algorithm, which we will discuss in a future post.

I won’t discuss all of the details of the TPL API, which better covered by its authors. From the point of view of the performance of the thread pool, the important thing about TPL is that it is a much richer API than QUWI, giving the thread pool much more information about the work being executed. In particular, the new Task type exposes the notion of parent/child relationships, giving us some idea of the structure of the overall computation being performed by the individual work items. Having this information opens up possibilities for much more efficient execution of these tasks.

Even without parent/child relationships, Task is a major improvement over QUWI. QUWI returns nothing of use to the caller; it simply queues a delegate, and leaves it up to the implementation of that delegate to coordinate its activities with the rest of the application. QUWI provides no means of waiting for the completion of the work item, for handling exceptions, or getting the result of a computation. Task provides all of this in a very easy-to-use form, while adding very little overhead vs. QUWI.

The fact that Task has a Wait method is not just a convenience; it eliminates one of the most common problems people face when using QUWI. It is fairly common for one work item to need to wait for the execution of another work item to complete. If the second work item has not yet begun executing, it will be sitting in the queue waiting for a worker thread to pick it up. It is possible that there are no available worker threads – maybe they’re all waiting for other work items to complete! This can cause deadlock in the worst case, and very slow execution in the best, as the thread pool may be slow to add more worker threads to pick up these work items. Task.Wait, on the other hand, knows it’s waiting for another task, and is tightly integrated with the thread pool such that it is able to determine whether the task has started executing, and if not it executes it immediately, in-line on the current thread. This greatly improves performance and eliminates the possibility of deadlock in this situation.

For new code, Task is now the preferred way to queue work to the thread pool.

Top-level Tasks have no parent. These are Tasks created by non-thread-pool threads, or with certain options specified at Task-creation time. These tasks are queued to the same FIFO queue we use for QUWI, and thus benefit from the improvements we’ve made there – but they are also subject to the same limitations. Tasks queued in this way are simply a better QUWI – but now the fun starts: A parent task can create child tasks. This happens whenever a Task creates another Task (unless it overrides this behavior). These children are implicitly treated as sub-tasks of the larger task. We assume that sub-tasks can be executed in any order – fairness is not necessary – because all that matters is that the overall operation be completed as fast as possible. This lets us throw those FIFO restrictions out the window, and opens up the possibility for much more efficient work scheduling strategies.

Work Stealing

Since a child task is just a piece of a larger task, we don’t need to worry about execution order. We just need to execute these things quickly. One well-known strategy for fast execution of unordered work items is “work stealing.”  Joe Duffy and Daniel Moth explain this very well; click on the links if you’re interested.

The most important aspect of work-stealing is that it enables very fast enqueue and dequeue in the typical case, often requiring no synchronization at all. This virtually eliminates a large part of the overhead of QUWI, when working with child tasks. We still do need to allocate memory for the Task itself, and for the work-stealing queue, but like the improvements to the FIFO queue these data structures have been optimized for good GC performance. Parent tasks are fast; child tasks are much faster.

There are still some limitations to how quickly tasks can be executed. If all tasks are top-level (non-child) tasks, they are subject to the FIFO ordering constraints of QUWI (albeit with much richer functionality). And even with work-stealing, we need to allocate and queue Task instances for every work item. To get even better performance, we need even more information about the work. Which brings us to…

Parallel.For and PLINQ

While not, strictly speaking, features of the CLR thread pool, the methods of the new Parallel class and PLINQ are critical new features of the public concurrency APIs in CLR 4.0. In fine-grained parallel applications, it is very common to need to execute the same code, over and over, with different data inputs. With QUWI or Task, this means allocating and queuing separate workitems for each input. The thread pool infrastructure does not know that all of these work items do the same thing, so it literally has to execute them, one at a time, as if they were completely different tasks.

Parallel.For, Parallel.ForEach, and PLINQ, provide a better way. These are essentially different ways of expressing the same thing: here is some code that needs to execute N times, as quickly as possible.

Just as with the parent/child relationships that Task provides, this extra information enables more aggressive optimization. These “parallel loops” do not need to be broken down into separate work items for each loop iteration. All that is needed is to break them into enough chunks (“partitions”) that they can be efficiently load-balanced across all available machine resources. A typical scenario might be that 1,000,000 iterations need to be broken into, say, four work items.

There is, of course, some overhead introduced by the need to dynamically partition the data (done automatically by the framework). But this pales in comparison to the savings of not having to allocate, queue, and dequeue millions (or more!) work items. In a test I just tried, for a particular work load on one of my machines, Parallel.For executed more than 300 times as fast as the equivalent naïve usage of QUWI.

Where do we go from here?

By now you’ve probably got the general theme: the more information we have about a workload, the faster we are likely to be able to execute it. I expect this theme to continue in future releases. The challenge is to find new ways to easily express (or automatically extract) useful information about parallel workloads, which can be used by the thread pool (or higher-level abstractions like PLINQ) to enable more optimizations. I think this is going to be an interesting space to watch for quite some time.

In the meantime, please try out the new mechanisms we are providing in Visual Studio 2010 Beta 1. Try it with the kinds of workloads you expect to use in production; one of the biggest challenges we face is that we can really only guess at what developers will do with this stuff in the real world, so feedback from our customers is extremely important to ensure that these new mechanisms will meet your needs in the final product.

Feel free to post any questions you may have in the comments for this post; I’ll try to answer what I can. My next post will cover the thread pool’s “thread injection” algorithm, which is how we determine how many threads should be servicing the various queues I’ve discussed here. If there’s something else you’d like me to cover, please let me know.