CLR 4.0 ThreadPool Improvements: Part 1

This is the first in a series of posts about the improvements we are making to the CLR thread pool for CLR 4.0 (which will ship with Visual Studio 2010). This post will cover changes to the queuing infrastructure in the thread pool, which aim to enable high-performance fine-grained parallelism via the Task Parallel Library. Future posts will cover the “thread injection” algorithm, and any other topics that readers would like to see.

Please note that all the usual caveats apply here: I’m discussing pre-release software, and all details are subject to change before final release. In fact, one goal of this post is to solicit feedback, so we can know what changes we need to make before we ship. J

A thread pool basically has two functions: It maintains a queue (or queues) of work to be done, and a collection of threads which execute work from the queue(s). So designing a thread pool really comes down to a) finding ways to enqueue and dequeue work items very quickly (to keep the overhead of using the thread pool to a minimum) and b) developing an algorithm for choosing an optimal number of threads to service the queues. This post will cover part (a).

In all prior releases of the CLR, the thread pool exposed a single way to queue work: ThreadPool.QueueUserWorkItem (which I will abbreviate as QUWI from here on out). There are a couple of overloads of this method, and also a version called UnsafeQueueUserWorkItem, but these all amount to basically the same thing: you give us a delegate, and we stick it on a queue for later execution.

(Really there are even more ways to queue work. As mentioned in my previous post we really have two pools of threads – the “worker threads” and the “I/O threads.” Work is queued to the I/O threads in response to the completion of asynchronous I/O, or manually via ThreadPool.UnsafeQueueNativeOverlapped. We currently do not plan any significant changes to the I/O pool for CLR 4.0, as our focus for this release is on enabling fine-grained computational parallelism. For the remainder of this post, we will only discuss the mechanisms behind the “worker threads.”)

QUWI conveys basically zero information about each work item, aside from that it exists. This places some important constraints on the execution of these items. For example, the thread pool does not know whether individual work items are related or not, so it has to assume they are all completely independent, implying that we cannot reorder work to optimize its execution, as independent work items typically must be executed in FIFO order to ensure fairness. (Imagine each work item represents a request from a user - you would not want to keep earlier requests waiting while later requests are processed, as this would result in unacceptably long latency for the users who made their requests first).

This means that we are basically forced to use a single FIFO queue for all work queued via QUWI. In prior versions of the CLR, this queue was a simple linked list, protected by a Monitor lock. This incurs some overhead: we must allocate nodes for the list (and pay the cost of the GC having to traverse the list each time a GC occurs), and we must pay the cost of acquiring that lock every time we enqueue or dequeue a work item.

(Another aside: please do not take the above to mean that we make any guarantees about strict FIFO ordering of work item execution. In fact, we violate this “rule” already: since .NET 3.5, the CLR thread pool has maintained separate FIFO queues for each AppDomain in the process, and an additional independent FIFO queue for “native” work items such as those queued by a host (ASP.net being the prime user of this feature). We round-robin between these work queues, allowing each to execute work for some time before moving on to the next.

This strategy is motivated by performance concerns, as it greatly reduces the number of transitions between AppDomains, which are fairly expensive. But it is designed to maintain fairness, which is the chief concern which we can never completely abandon. We may make further changes in the future which further deviate from the strict FIFO model, but we are unlikely to ever make QUWI really unfair, which, as we will see, is crucial to achieving good performance for fine-grained workloads.)

This was fine for the kinds of workloads for which the thread pool was originally designed. These are relatively "coarse" workloads, where each work item represents a fairly large amount of work. The canonical example is an ASP.net web application, where each work item represents the generation of an entire web page. In such workloads, the work itself takes long enough that the overhead of allocating queue nodes and acquiring locks is barely noticeable.

However, in the new world of machines with rapidly increasing core counts, there is increased interest in more "fine grained" work. Where before the job of the thread pool was to take a large number of independent, coarse-grained, tasks and funnel them onto a few threads, we are increasingly being asked to execute many very small tasks representing tiny pieces of some larger operation.

Anyone who has tried executing such a workload on the existing CLR thread pool has probably found that it's not a simple matter of calling QUWI for each piece of the calculation; with such tiny work items, the overhead of enqueuing and dequeuing the work can be much greater than the work itself, resulting in slower execution than if we had just run the work on a single thread to begin with! It is possible to make this work, by “batching” work into a smaller number of calls to QUWI. There are many strategies for this, all of which are fairly complex in the general case. We would like to make this easy, but the current QUWI is insufficient for this goal.

We can improve this situation in a couple of ways: we can implement a more efficient FIFO queue, and we can enhance the API to allow the user to give us more information, allowing us to turn to even more efficient queuing strategies. For CLR 4.0, we are doing both of these.

Faster FIFO

Recall that the overhead of the existing FIFO queue comes from the expense of allocating and traversing the data structure, and the cost of acquiring the lock on each enqueue and dequeue operation. For 4.0, we are switching to a lock-free data structure with much lower synchronization overhead. More importantly, this new queue is much friendlier to the GC; we still need to allocate a new object for each call to QUWI, but these objects are smaller, and are tracked in large “chunks” which are much easier for the GC to traverse than the simple linked list used previously. This new queue is virtually identical to System.Collections.Concurrent.ConcurrentQueue<T>, which is also new in 4.0.

Improving the performance of QUWI is nice, as it benefits existing applications which use the thread pool without requiring any changes to the application code. How much of a speedup you can expect will depend greatly on many factors, including your application’s workload and the details of the particular hardware on which it executes, but for fine-grained workloads on multi-core hardware the speedup should be significant.

However, we are still restricted in what we can do here – we still have very little information about the work we’re executing, and so we still need to use the same basic strategy to execute it. We can trim overhead here and there, but QUWI will probably never be a great way to execute very fine-grained workloads. We need a new API.

The Task Parallel Library

The Task Parallel Library (TPL) is a collection of new classes specifically designed to make it easier and more efficient to execute very fine-grained parallel workloads on modern hardware. TPL has been available separately as a CTP for some time now, and was included in the Visual Studio 2010 CTP, but in those releases it was built on its own dedicated work scheduler. For Beta 1 of CLR 4.0, the default scheduler for TPL will be the CLR thread pool, which allows TPL-style workloads to “play nice” with existing, QUWI-based code, and allows us to reuse much of the underlying technology in the thread pool - in particular, the thread-injection algorithm, which we will discuss in a future post.

I won’t discuss all of the details of the TPL API, which better covered by its authors. From the point of view of the performance of the thread pool, the important thing about TPL is that it is a much richer API than QUWI, giving the thread pool much more information about the work being executed. In particular, the new Task type exposes the notion of parent/child relationships, giving us some idea of the structure of the overall computation being performed by the individual work items. Having this information opens up possibilities for much more efficient execution of these tasks.

Even without parent/child relationships, Task is a major improvement over QUWI. QUWI returns nothing of use to the caller; it simply queues a delegate, and leaves it up to the implementation of that delegate to coordinate its activities with the rest of the application. QUWI provides no means of waiting for the completion of the work item, for handling exceptions, or getting the result of a computation. Task provides all of this in a very easy-to-use form, while adding very little overhead vs. QUWI.

The fact that Task has a Wait method is not just a convenience; it eliminates one of the most common problems people face when using QUWI. It is fairly common for one work item to need to wait for the execution of another work item to complete. If the second work item has not yet begun executing, it will be sitting in the queue waiting for a worker thread to pick it up. It is possible that there are no available worker threads – maybe they’re all waiting for other work items to complete! This can cause deadlock in the worst case, and very slow execution in the best, as the thread pool may be slow to add more worker threads to pick up these work items. Task.Wait, on the other hand, knows it’s waiting for another task, and is tightly integrated with the thread pool such that it is able to determine whether the task has started executing, and if not it executes it immediately, in-line on the current thread. This greatly improves performance and eliminates the possibility of deadlock in this situation.

For new code, Task is now the preferred way to queue work to the thread pool.

Top-level Tasks have no parent. These are Tasks created by non-thread-pool threads, or with certain options specified at Task-creation time. These tasks are queued to the same FIFO queue we use for QUWI, and thus benefit from the improvements we’ve made there – but they are also subject to the same limitations. Tasks queued in this way are simply a better QUWI – but now the fun starts: A parent task can create child tasks. This happens whenever a Task creates another Task (unless it overrides this behavior). These children are implicitly treated as sub-tasks of the larger task. We assume that sub-tasks can be executed in any order – fairness is not necessary – because all that matters is that the overall operation be completed as fast as possible. This lets us throw those FIFO restrictions out the window, and opens up the possibility for much more efficient work scheduling strategies.

Work Stealing

Since a child task is just a piece of a larger task, we don’t need to worry about execution order. We just need to execute these things quickly. One well-known strategy for fast execution of unordered work items is “work stealing.” Joe Duffy and Daniel Moth explain this very well; click on the links if you’re interested.

The most important aspect of work-stealing is that it enables very fast enqueue and dequeue in the typical case, often requiring no synchronization at all. This virtually eliminates a large part of the overhead of QUWI, when working with child tasks. We still do need to allocate memory for the Task itself, and for the work-stealing queue, but like the improvements to the FIFO queue these data structures have been optimized for good GC performance. Parent tasks are fast; child tasks are much faster.

There are still some limitations to how quickly tasks can be executed. If all tasks are top-level (non-child) tasks, they are subject to the FIFO ordering constraints of QUWI (albeit with much richer functionality). And even with work-stealing, we need to allocate and queue Task instances for every work item. To get even better performance, we need even more information about the work. Which brings us to…

Parallel.For and PLINQ

While not, strictly speaking, features of the CLR thread pool, the methods of the new Parallel class and PLINQ are critical new features of the public concurrency APIs in CLR 4.0. In fine-grained parallel applications, it is very common to need to execute the same code, over and over, with different data inputs. With QUWI or Task, this means allocating and queuing separate workitems for each input. The thread pool infrastructure does not know that all of these work items do the same thing, so it literally has to execute them, one at a time, as if they were completely different tasks.

Parallel.For, Parallel.ForEach, and PLINQ, provide a better way. These are essentially different ways of expressing the same thing: here is some code that needs to execute N times, as quickly as possible.

Just as with the parent/child relationships that Task provides, this extra information enables more aggressive optimization. These “parallel loops” do not need to be broken down into separate work items for each loop iteration. All that is needed is to break them into enough chunks (“partitions”) that they can be efficiently load-balanced across all available machine resources. A typical scenario might be that 1,000,000 iterations need to be broken into, say, four work items.

There is, of course, some overhead introduced by the need to dynamically partition the data (done automatically by the framework). But this pales in comparison to the savings of not having to allocate, queue, and dequeue millions (or more!) work items. In a test I just tried, for a particular work load on one of my machines, Parallel.For executed more than 300 times as fast as the equivalent naïve usage of QUWI.

Where do we go from here?

By now you’ve probably got the general theme: the more information we have about a workload, the faster we are likely to be able to execute it. I expect this theme to continue in future releases. The challenge is to find new ways to easily express (or automatically extract) useful information about parallel workloads, which can be used by the thread pool (or higher-level abstractions like PLINQ) to enable more optimizations. I think this is going to be an interesting space to watch for quite some time.

In the meantime, please try out the new mechanisms we are providing in Visual Studio 2010 Beta 1. Try it with the kinds of workloads you expect to use in production; one of the biggest challenges we face is that we can really only guess at what developers will do with this stuff in the real world, so feedback from our customers is extremely important to ensure that these new mechanisms will meet your needs in the final product.

Feel free to post any questions you may have in the comments for this post; I’ll try to answer what I can. My next post will cover the thread pool’s “thread injection” algorithm, which is how we determine how many threads should be servicing the various queues I’ve discussed here. If there’s something else you’d like me to cover, please let me know.

Comments

Anonymous
April 23, 2009
PingBack from http://asp-net-hosting.simplynetdev.com/clr-40-threadpool-improvements-part-1/
Anonymous
April 23, 2009
The comment has been removed
Anonymous
April 23, 2009
Brandon, It definitely sounds like you may benefit from the improvements in QUWI. It may well be worth trying the naive QUWI solution, which almost certainly will outperform previous CLR versions. Otherwise, given your strict FIFO ordering requirements, there may be little that can be done. Optimization often requires reordering. If you are able to deviate a little from the FIFO goal, you may be able to take advantage of the additional perfomance unlocked by child tasks, or maybe even Parallel.ForEach. I, personally, would try QUWI first, given that this seems to require the least change in your existing code. One of our goals for 4.0 is to provide significant performance improvements even without changes in existing application code. I am very interested in hearing about the results. Once 4.0 Beta 1 has shipped, I hope you will have a chance to try it out. Please let me know how this turns out.
Anonymous
April 24, 2009
Eric, I figured that would be the case. Sounds like the queuing operations in the new ThreadPool will be ultra fast with the use of a lock free queue data structure (presumably using Interlocked methods). Its also nice to hear that this lock free queue will be publically available through the TPL's ConcurrentQueue<T> class. Given the information you provided about the huge performance increase attainable in the new TPL APIs by removing ordering requirements on the work items, we'll probably look for some ways to take advantage of this by looking to paralellize our work as early as possible. One thing that never really got answered from my original post - Is there a way to control how Task objects are created using the new TaskFactory class? If we decide to use the new TPL APIs, it would be nice to be able to return a Task from a pre-allocated pool of our own somehow and get rid of this allocation (I know we'll still have the queue node allocation from the ThreadPool though). I'm looking forward to the 4.0 Beta 1 release. I won't waste any time to test out our performance with the new ThreadPool and will be sure to let you know how it goes. Thanks again for your help Eric. Regards, Brandon
Anonymous
April 24, 2009
Eric - Can you give us a clue as to when VS 2010 Beta 1 will be available?
Anonymous
April 24, 2009
Brandon: Task objects are not currently re-usable. I believe this is partly due to the complications it would add to the programming model, but it also has important performance implications - though not what you might think! References to Task objects must be stored in many different places throughout a Task's lifetime, including the ThreadPool's queues. Much of these data structures are lock-free, as you know. Lock-free manipulation of object references is subject to the "ABA" problem (http://en.wikipedia.org/wiki/ABA_problem). This problem turns out to be a non-problem in a garbage-collected environment, so long as you never reuse objects - but if you try to do that, then you need to deal with this problem. There are techniques for dealing with this, but they inevitably require additional expensive synchronization (and other overhead). So while we do incur some overhead for allocating Task objects, we avoid that synchronization overhead; in the end, it usually works out for the better to not reuse these objects. Jordan: I'm sorry, but I don't have any information for you about the Beta 1 release date. I certainly do wish I did.
Anonymous
April 24, 2009
Eric, Thanks again for the insightful information. Its also interesting to know that the ThreadPool actually uses Task objects now as its "work item". Thanks, Brandon
Anonymous
April 28, 2009
The comment has been removed
Anonymous
April 29, 2009
Eric, I have a few questions please:

What's the limit of Parallel.For loops? I mean, how many iterations can I loop simultaneously? Does it depend on the amount of RAM, CPU?
If my issue is to run 1,000,000 threads in parallel, should I use Parallel.For? or TaskFactory would be better? Thanks a lot,

Anonymous
April 29, 2009
.NET C# COM Object for Use In JavaScript / HTML, Including Event Handling Show me the memory: Tool for
Anonymous
April 29, 2009
.NETC#COMObjectforUseInJavaScript/HTML,IncludingEventHandlingShowmethememory...
Anonymous
April 29, 2009
Thank you for submitting this cool story - Trackback from progg.ru
Anonymous
April 30, 2009
MDFD: Parallel.For is limited by the size of the index variable (i.e., an int can only count to 2^31), but otherwsie is not limited (as far as I know) to some number of iterations. If you have 1,000,000 peices of work to do, it is likely that Parallel.For is what you want. This will not create 1,000,000 threads, but rather will automatically partition the work into manageable chunks which are queued as just a few work items to the thread pool. There is also no reason why you cannot queue 1,000,000 Task objects to the thread pool, though it will use considerably more memory, and will have much higher CPU overhead, than Parallel.For - because we will need to allocate, queue, and dequeue each of those million Tasks. But if you have a million completely different tasks (which seems unlikely) then Task is the way to go - Parallel.For is generally only useful if you want to run the same code against many individual peices of data.
Anonymous
April 30, 2009
Thanks for your reply! I ran the same code against only (!) 1000 pieces of data, but the CPU reached 100% of usage, and got stuck. The whole process was blocked, and never terminated. (I've made several tests, even with less loops - e.g. 500, it sometimes got stuck sometimes not) I must note that the computation in the code (for each thread) is not complicated at all. Computer's configuration: OS Vista Processor: Intel(R) Pentium(R) D CPU 3.40GHz 3.39GHz Ram: 4.00 GB System type:32-bit OS What do you think is the reason? When I ran the code sequentially (by regular "for" statement), everything was all right. Thank you very much
Anonymous
April 30, 2009
miridfd: I'd need to understand better what exactly you're trying to do, before I could say anything about why it's not behaving the way you expect. The best thing would probably be to post a sample that exhibits this behavior on the TPL discussion forum: http://social.msdn.microsoft.com/Forums/en-US/parallelextensions.
Anonymous
May 05, 2009
Eric Eilebrecht , a developer on our team, has just started a multi-part series on TheadPool improvements
Anonymous
May 11, 2009
The comment has been removed
Anonymous
May 12, 2009
The comment has been removed
Anonymous
May 12, 2009
The comment has been removed
Anonymous
May 20, 2009
We’re very excited that the .NET Framework 4 Beta is now available for public download, as .NET 4 has
Anonymous
May 20, 2009
Visual Studio 2010 Beta 1 est disponible pour les abonnés MSDN depuis quelques jours mais maintenant
Anonymous
June 01, 2009
General purpose thread pools are more complicated to get right than you may think. In CLR 4 (the next
Anonymous
June 01, 2009
Charles from Channel 9 stopped by my office a couple of weeks ago to chat with Erika Parsons and I about
Anonymous
June 01, 2009
As we’ve mentioned previously, the .NET ThreadPool has undergone some serious renovations in .NET 4,
Anonymous
July 29, 2009
The comment has been removed
Anonymous
August 20, 2009
The comment has been removed
Anonymous
August 23, 2009
I read more about the thread pool(also the comments here) and you might think about my offer: You said to MDFD about the lot of memory for using a lot of tasks for the threadpool(for 1,000,000). I know you said him to use Parallel.For instead(and I agree) but think about how much memory is wasted. There is the MTA threads and the STA threads and look about the difference for the memory: MTA- using a lot of memory(becouse they many...) STA- using a little of memory(becouse it's the only one - main thread) The thread pool gather a lot of tasks(it's mean a lot of memory) usualy by the STA- main thread but it's doesn't realy matter STA or MTA. The point is that you may take think of making overload to the QUWI that block the current thread until there is a space in the thread pool and just than the current thread will go on. It's mean that now the threadpool don't be overloded by a tasks that can block the current thread! Think how much memory is wasted in the 1,000,000 tasks case! And another thing that is bugging me is the Parallel.For is that it's wait untill the all tasks are done. Why??? I think the answer is becouse if I had quad-core processor it wait becouse we want that we have 4 spaces one for each core(lenght/processor iteration for core) But if the task we wait for is heavy, it's waste of time! I suggest that it will start immediately when there is space for a task. When there is anouther sapce avilable divide the For loop task to 2 tasks in this fashion. I example a process for quad-core: 1.User ask for doing For loop(1,000 iteration). 2.The program wait until there is a free space for task. 3.Another task is done so the For loop begine with i=0 and max of 1,000. 4.Another task is doneand the for loop is currect in 600(unstarted) so the program divide the the remaining 400 in ratio of 1:3 so the max of the current loop is down to 700 and we create task that doing form 700 to 1,000. 5.Another task is done(no matter if it's the loop task or simple task) so the 700 to 1,000 loop is divide by ratio of 1:2 6.The same thing ratio of 1:1. 7.One of loop task is done so we still divided the same. How simple is this! Doesn't it? I hope my point was clear and I didn't make mistakes of my understanding.
Anonymous
February 17, 2012
The comment has been removed

Last updated on 2009-04-23

Share via

CLR 4.0 ThreadPool Improvements: Part 1

Comments

Additional resources