October 2010

Volume 25 Number 10

Thread Pools - Scalable Multithreaded Programming with Thread Pools

By Ron Fosner | October 2010

Programming is getting more challenging, particularly if you’re working in a field that requires you to tune your application for the fastest possible throughput. One contributing factor is that the last few years have seen a change in the way PCs are evolving. Instead of relying on the ever-increasing speed of a single processor, the computational power of PCs is now being distributed across multiple cores. This is a good thing. Hefty increases in latent processing power are now available at relatively low cost, and often with much lower power consumption and cooling needs.

But there’s a downside to the increasing prevalence of multiple-core systems. To use multiple processors, you need to delve into the world of parallel processing. That means it takes more work—and sometimes a significant learning curve—for programmers to take advantage of that latent processing power. It’s possible to throw a few compiler hints into your project and get the compiler to write some multithreaded code for you. However, to take advantage of the full power of multicore PCs, you’ll need to make some changes in the way that you program large jobs.

There are many different ways to distribute your work across multiple cores. One of the easiest and most robust is called task-based programming. Tasks allow you to take your application’s work and spread it across some or all of the CPU cores that are available. With a bit of thoughtful programming, you can minimize or even eliminate any data-dependency or time-synchronization constraints. To achieve this state of multicore bliss, you’ll have to reexamine some of your preconceptions of how to attack a programming problem and rethink them in terms of task-based programming.

To show you how this can work, I'll walk you through the steps—and the missteps—I took in converting a single-threaded application to one that can scale up to using all of the CPUs available in a computer. In this article, I'll present some of the concepts of multithreaded programming and show you some simple ways to introduce threaded execution into your code with OpenMP and thread pools. You'll also see how to use Visual Studio 2010 to measure the improvement in performance you gain from these techniques. In a future article, I'll build on that foundation to show you more sophisticated multithreaded execution with tasks.

From Threads to Tasks

The major challenge with making a program scale to the number of CPU cores is that you can't just decide to throw some jobs onto their own threads and let them run. In fact, this is what a lot of developers do, but it scales well only to the number of cores the application was designed for. It scales poorly when there are fewer or more cores than that target, and it builds obsolescence into the application.

A better way to make sure that your application scales well with a varying number of cores is to break larger jobs into smaller, thread-friendly sub-jobs called tasks. The most challenging part of converting a monolithic single-threaded program or a program that has a few dedicated threads into a task-based job system is actually breaking your jobs into tasks.

There are a few guidelines to keep in mind when converting a large single-threaded job into multithreaded tasks:

  • The large job can be arbitrarily divided into 1 to n tasks.
  • The tasks should be able to run in any order.
  • The tasks should be independent of each other.
  • The tasks must have associated context data.

If all of these conditions were easy to meet, you'd have no problem making your applications run on any number of cores. Unfortunately, not all problems can be broken so neatly into tasks that meet these guidelines.

These guidelines are important because, if followed, they enable you to run each task independently on a thread, with no dependence between tasks. Ideally tasks should be able to be run in any order, so eliminating or reducing interdependencies is paramount to getting a task-based system up and running.

Most real-world programs will go through various stages of processing, with each stage required to complete prior to starting the next stage. These synchronization points are often unavoidable, but with a task-based system the goal is to make sure you take advantage of whatever CPU power is immediately available. With some judicious breaking of large jobs into smaller ones, it’s often possible to start intermingling some completed task results with their next stage of processing while some of the initial tasks are still running.
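To make that overlap concrete, here's a minimal C# sketch, a hypothetical example rather than code from the sorting application later in this article, that uses the .NET 4 Task Parallel Library to start second-stage work on each result as soon as its first-stage task finishes, instead of waiting for the whole first stage. DoStageOne and DoStageTwo are placeholder methods standing in for real per-item work.

using System;
using System.Linq;
using System.Threading.Tasks;

class PipelineSketch {
  // Placeholder stage functions; stand-ins for real per-item work.
  static int DoStageOne(int item) { return item * item; }
  static int DoStageTwo(int item) { return item + 1; }

  static void Main() {
    int[] items = Enumerable.Range(0, 8).ToArray();

    // Stage one: each item becomes an independent task with its own context.
    Task<int>[] stageOne = items
      .Select(i => Task.Factory.StartNew(() => DoStageOne(i)))
      .ToArray();

    // Stage two: each continuation starts as soon as its stage-one task
    // completes, so second-stage work overlaps remaining first-stage work.
    Task<int>[] stageTwo = stageOne
      .Select(t => t.ContinueWith(done => DoStageTwo(done.Result)))
      .ToArray();

    Task.WaitAll(stageTwo); // The only hard synchronization point.
    foreach (Task<int> t in stageTwo)
      Console.Write("{0} ", t.Result);
    Console.WriteLine();
  }
}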

Simple Multithreading with OpenMP

Before converting your entire application to use tasks, you should be aware of other ways to get some of the benefits of multithreading without going through the rigorous exercise of making everything a task. There are numerous ways to incorporate multithreading into your application, many requiring little actual effort, that still let your code benefit from multiple cores.

OpenMP is one of the simplest ways to add parallel processing to your programs and has been supported by the Visual Studio C++ compiler since 2005. You enable OpenMP by adding pragmas to your code that indicate where and what kind of parallelism you’d like to add. For example, you can add parallelism to a simple Hello World program:

#include <omp.h>     // You need this or it won't work
#include <stdio.h>
#include <windows.h> // For GetCurrentThreadId and GetCurrentProcessorNumber

int main (int argc, char *argv[]) {
  #pragma omp parallel
  printf("Hello World from thread %d on processor %d\n",
    ::GetCurrentThreadId(),
    ::GetCurrentProcessorNumber());
  return 0;
}

The OpenMP pragma parallelizes the next block of code—in this case it’s just the printf—and runs it simultaneously across all hardware threads. The number of threads will depend on how many hardware threads are available on the machine. The output is a printf statement running on each hardware thread.

To get any OpenMP program to parallelize (and not silently ignore your OpenMP pragmas), you need to enable OpenMP for your program. First, you have to enable the /openmp compiler option (Properties | C/C++ | Language | OpenMP Support). Second, you need to include the omp.h header file.

Where OpenMP really shines is when your application spends most of its time looping over functions or data and you want to add multiprocessor support. For example, if you have a for loop that takes a while to execute, you can use OpenMP to easily parallelize the loop. Here’s an example that automatically breaks up array calculations and distributes them across the number of cores that are currently available:

#pragma omp parallel for
for (int i = 0; i < 50000; i++)
  array[i] = i * i;

OpenMP has other constructs that give you more control over the number of threads created, whether the distributed work needs to finish before the next block of code is executed, creating thread-local data, synchronization points, critical sections and so on.

As you can see, OpenMP is an easy way to gently introduce parallelism into an existing code base. However, while the simplicity of OpenMP is attractive, there are times you need more control over what your application is doing, such as when you want the program to dynamically adjust what it's doing or you need to make sure that a thread stays on a particular core. OpenMP is designed to easily integrate some aspects of multithreaded programming into your program, but it lacks some of the advanced features you'll find you need to get optimal use out of multiple cores. This is where tasks and thread pools come in.

Using the Thread Pool

Threads require a lot of bookkeeping by the OS, so it’s not a good idea to wantonly create and destroy them. There are not-insignificant costs associated with creating and destroying a thread, so if you do this constantly, it’s easy to lose any advantage you gain by multithreading.

Instead, it’s best to use an existing set of threads, recycled as needed, for all of your threaded activity. This design is called a thread pool, and Windows provides one for you to use. Using this thread pool relieves you of having to deal with thread creation, destruction, and management, all of which is handled for you by the thread pool. OpenMP uses a thread pool to distribute work with threads, and both Windows Vista and Windows 7 provide optimized versions of the thread pool for you to use directly.

While it’s entirely possible to create your own thread pool—which you might need to do if you’ve got some unusual scheduling requirements—you’re much better off using one provided by the OS or the Microsoft .NET Framework.
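To make that concrete, here's a minimal sketch, in C# using the .NET Framework thread pool rather than the native Windows Vista/Windows 7 pool, of queuing work items to the built-in pool. The work itself is just a placeholder message; a real application would queue meaningful tasks.

using System;
using System.Threading;

class ThreadPoolSketch {
  static void Main() {
    const int workItemCount = 8;
    using (CountdownEvent done = new CountdownEvent(workItemCount)) {
      for (int i = 0; i < workItemCount; i++) {
        // Each queued item is a software thread's worth of work; the pool
        // schedules it onto a pool thread when one becomes available.
        ThreadPool.QueueUserWorkItem(state => {
          Console.WriteLine("Work item {0} on managed thread {1}",
            (int)state, Thread.CurrentThread.ManagedThreadId);
          done.Signal();
        }, i);
      }
      done.Wait(); // Block until every queued work item has finished.
    }
  }
}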

At this point I need to clarify some terminology. When most people talk about a thread, they’re referring to the flow of execution through a single CPU core—in other words, a software thread. On a CPU, the flow of execution (the actual execution of instructions) occurs on hardware threads. The number of hardware threads is limited by the hardware your application is running on. In the old days, this used to be a single-threaded CPU, but now it’s common to find systems that have at least dual-core processors. A four-core CPU will have the ability to run four hardware threads, or eight if it’s hyperthreaded. High-end desktop systems boast as many as 16 hardware threads, and some server configurations have more than 100!

 While a hardware thread actually runs the instructions, a software thread refers to the entire context—register values, handles, security attributes and so on—required to actually execute the job on a hardware thread. It’s important to note that you can have many more software threads than hardware threads, and this is the underlying basis of a thread pool. It allows you to queue up tasks with software threads and then schedule them to execute on the actual hardware threads.

The advantage of using a thread pool instead of creating your own threads is that the OS will take care of scheduling tasks for you—your job will be to keep feeding tasks into the thread pool so that the OS will be able to keep all of the hardware threads busy. This is illustrated in Figure 1. Everything in the box makes up the thread pool and is outside the programmer's control. It’s up to the application to feed tasks into the thread pool, where they are placed into the thread queue and eventually scheduled to run on a hardware thread.

image: A Thread Pool

Figure 1  A Thread Pool

Now you get to the hard part: What’s the best way to structure jobs to keep the cores busy and the CPU utilization at its maximum? This depends on what your application needs to do.

I frequently work with video game companies and these are some of the most challenging types of applications to work with because there’s a lot of work to be done—usually some in a particular serial order—and it’s sensitive to delays. There’s usually a certain frame rate at which the program updates, so if the frame rates start to fall behind, the game experience suffers. As a result, there’s a lot of incentive to maximize use of the hardware.

On the other hand, if your application does one large thing at a time, it’s a little more obvious what you need to do, but it’s still challenging to try to distribute a single job across multiple cores.

Multithreaded Sorting

First let’s take a look at a monolithic job that’s frequently found in applications and see how you can go about transforming it into a more multithread-friendly form. I’m thinking, of course, about sorting. Sorting is a particularly good example because it’s got one major obstacle: How do you sort something and spread out the sorting across multiple cores so that the sorting on one core is independent from what’s being sorted on another core?

A naïve approach frequently seen is to lock access to any data that can be accessed by multiple cores using something like a mutex, semaphore or critical section. This will work. However, if it’s used as a panacea for not correctly scheduling access to shared data, at best you’ll kill any gains you might have achieved, because you’ll frequently block other threads from executing. At worst, you might introduce some subtle race condition that will be excruciatingly difficult to track down.

Luckily, you can design the application to eliminate most of the shared access to the data across threads by choosing the appropriate sorting algorithm.

A better approach is to give each core a subsection of the array to sort. This divide-and-conquer method is an easy way to distribute work on the same data set across multiple cores. Algorithms such as merge sort and quick sort work from a divide-and-conquer strategy and are straightforward to implement in a way that takes advantage of a multicore system.

Let’s look at how merge sort works on the string of random integers shown in Figure 2. The first step is to choose a midpoint in the array and divide it into two sublists. Keep dividing until you have lists that are zero or one element long.

image: Sorting a String of Random Integers

Figure 2 Sorting a String of Random Integers

In most implementations, there’s actually a list size limit below which you switch to an efficient algorithm designed for small lists, but the approach also works if you just keep dividing until you can’t divide any more. The important thing to note is that once you divide a list into two sublists, the lists are independent. This is illustrated by the dashed red lines in Figure 2. Once you divide the list into sublists, each sublist is independent and you can give each one to a CPU to manipulate as it likes without having to lock anything.

To make the sorting as efficient as possible, choose an algorithm that will take each sublist and sort it in place. This is important not only to prevent unnecessary copying of the data, but also to keep the data warm in the CPU’s L2 cache. As you strive to write more and more efficient parallel code, you eventually have to be aware of how data gets swapped into and out of the L2 cache, which is generally about 256KB in most modern processors.

There are many sorting algorithms that lend themselves to parallelization. Quick sort, merge sort and radix sort are all algorithms that subdivide the data and operate on the pieces independently. So let’s take a look at a serial implementation of a sorting routine and convert it into a parallel one.

In theory, if you keep subdividing an array recursively, you will eventually end up with single-element lists. At this point, there’s nothing to sort, so the algorithm moves on to the next step, which is merging sorted sublists together. The individual elements are merged into larger, sorted lists. These sorted sublists are then merged into even larger sorted lists until you have the original array in sorted order. As mentioned previously, it’s usually faster to switch to an algorithm that’s optimized to sort small lists when the list size reaches a certain threshold.
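The sample code later in this article uses quick sort, but to make the split-and-merge just described concrete, here's a minimal serial merge sort sketch in C#. Note that, unlike an in-place quick sort, the merge step here uses a temporary buffer.

// Minimal serial merge sort sketch: split until the sublists are trivially
// sorted, then merge pairs of sorted sublists into larger sorted lists.
static void MergeSort(int[] array, int left, int right) {
  if (left >= right)
    return;                         // Zero or one element: already sorted.
  int mid = left + (right - left) / 2;
  MergeSort(array, left, mid);      // The two halves are independent, so
  MergeSort(array, mid + 1, right); // each could go to a different core.
  Merge(array, left, mid, right);
}

static void Merge(int[] array, int left, int mid, int right) {
  int[] merged = new int[right - left + 1];
  int i = left, j = mid + 1, k = 0;
  while (i <= mid && j <= right)
    merged[k++] = (array[i] <= array[j]) ? array[i++] : array[j++];
  while (i <= mid) merged[k++] = array[i++];
  while (j <= right) merged[k++] = array[j++];
  Array.Copy(merged, 0, array, left, merged.Length);
}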

There are many ways to write a sorting algorithm, and I’ve chosen to write a simple quick sort routine in C#, as shown in Figure 3. This program fills a large array with the same sequence of random numbers and then sorts them using a quick sort routine, reporting how long it took.

Figure 3 Quick Sort

using System;
using System.Threading;
using System.Threading.Tasks;
using System.Diagnostics;
namespace ParallelSort {
  class Program {
    // For small arrays, use Insertion Sort
    private static void InsertionSort(
      int[] list, int left, int right) {
      for (int i = left; i <= right; i++) {
        int temp = list[i];
        int j = i;
        while ((j > left) && (list[j - 1] > temp)) {
          list[j] = list[j - 1];
          j = j - 1;
        }
        list[j] = temp;
      }
    }
    private static int Partition(
      int[] array, int i, int j) {
      int pivot = array[i];
      while (i < j) {
        while (array[j] >= pivot && i < j) {
          j--;
        }
        if (i < j) {
          array[i++] = array[j];
        }
        while (array[i] <= pivot && i < j) {
          i++;
        }
        if (i < j) {
          array[j--] = array[i];
        }
      }
      array[i] = pivot;
      return i;
    }
    static void QuickSort(
      int[] array, int left, int right) {
      // Single or 0 elements are already sorted
      if (left >= right)
        return;
      // For small arrays, use a faster serial routine
      if (right - left <= 32) {
        InsertionSort(array, left, right);
        return;
      }
      // Select a pivot, then quicksort each sub-array
      int pivot = Partition(array, left, right);
      QuickSort(array, left, pivot - 1);
      QuickSort(array, pivot + 1, right);
    }
    static void Main(string[] args) {
      const int ArraySize = 50000000;
      for (int iters = 0; iters < 1; iters++) {
        int[] array;
        Stopwatch stopwatch;
        array = new int[ArraySize];
        Random random1 = new Random(5);
        for (int i = 0; i < array.Length; ++i) {
          array[i] = random1.Next();
        }
        stopwatch = Stopwatch.StartNew();
        QuickSort(array, 0, array.Length - 1);
        stopwatch.Stop();
        // Verify it is sorted
        for (int i = 1; i < array.Length; ++i)
          if (array[i - 1] > array[i])
            throw new ApplicationException("Sort Failed");
        Console.WriteLine("Serial: {0} ms",
           stopwatch.ElapsedMilliseconds);
      }
    }
  }
}

If you look at the QuickSort function, you’ll see that it recursively divides the array in two until a threshold is reached, at which point it sorts the sublist without further subdivision. To change this into a parallel version, all you have to do is change these two lines:

QuickSort(array, left, pivot - 1);
QuickSort(array, pivot + 1, right);

The parallelized version is:

Parallel.Invoke(
  delegate { QuickSort(array, left, pivot - 1); },
  delegate { QuickSort(array, pivot + 1, right); }
);

The Parallel.Invoke method is part of the System.Threading.Tasks namespace in the .NET Framework 4 Task Parallel Library. It lets you specify actions that are run concurrently. In this case, I tell it that I want to run each sorting function on a separate thread.

While it would be more efficient to spawn only one new thread and to use the current thread of execution to sort the other sublist, I wanted to maintain symmetry and illustrate how easy it is to convert a serial program into a parallel program.
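For reference, here's a sketch of that asymmetric alternative: hand one sublist to a new task and sort the other on the current thread, then wait for the spawned task before returning.

// Sketch: spawn one task for the left sublist and sort the right sublist
// on the current thread, saving one task per level of recursion.
Task child = Task.Factory.StartNew(
  () => QuickSort(array, left, pivot - 1));
QuickSort(array, pivot + 1, right);
child.Wait();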

Core Utilization

The next obvious question is: Has this parallelization improved performance at all?

Visual Studio 2010 includes several tools to help you understand where your program is spending its time and how it behaves as a multithreaded application. There’s a great introduction to using these tools for measuring the performance of your multithreaded app with Visual Studio 2010 in the September 2009 MSDN Magazine article “Debugging Task-Based Parallel Applications in Visual Studio 2010” by Stephen Toub and Daniel Moth (msdn.microsoft.com/magazine/ee410778), plus there’s a good video introduction by Daniel Moth on Channel 9 (channel9.msdn.com/posts/DanielMoth/Parallel-Tasks--new-Visual-Studio-2010-debugger-window/).

Parallel programming requires you to actually make measurements to verify that you’ve truly improved performance and are utilizing all the hardware. To learn more about how parallelization was being used in my example application, let’s use these tools to measure the sorting routines in action. I launched the Visual Studio 2010 Performance Wizard to take concurrency measurements of my sorting application while it runs.

The first thing you want to look at is core utilization, which shows how the application made use of the CPU cycles available. My test program runs the serial sort, sleeps for a second, then runs the parallel version of the sort. On my 4-core machine I get the graph of core utilization shown in Figure 4. The green is my application, yellow is the OS and other programs, and gray is idle. The flat line at the 1-core level shows that I totally saturate the processing on a single core when I’m running the serial version, and that I get about 2.25 out of four cores when running the parallel version. Not too surprisingly, the time to execute the parallel sort is about 45 percent of the time required for the serial sort. Not too shabby for changing two lines of code.

image: Core Utilization

Figure 4 Core Utilization

Now, let’s switch from looking at the CPU utilization chart to the thread display shown in Figure 5, which shows how the application utilized threads that were available. Notice that there is a single thread for the majority of the execution time. It’s not until you start spawning off tasks that more threads are created. In this display the salmon color indicates a thread that’s been blocked by another thread.

image: Thread Work

Figure 5 Thread Work

In fact, the thread display shows that while I did get a significant increase in execution speed, I didn’t do it very efficiently. It’s perfectly OK to have a thread block while it waits for other threads, as the main thread does while waiting for the tasks to complete. However, what you really want to see is solid green on as many threads as you have CPU cores. So even though the CPU utilization chart shows improved utilization of the CPU cores, when you take a closer look at how tasks were spread throughout the thread pool, you see room for further optimization.

In fact, you should always measure your code’s performance after you’ve done some multithreading work—even work as simple as I’ve done here. For small jobs, you don’t want to multithread, because the overhead will overwhelm any gain from threading. For larger jobs, you want to break the work up across the CPU cores that are available to you, but no further, so that you don’t oversubscribe the thread pool.
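One way to apply that rule of thumb is sketched below: recurse in parallel only while a sublist is above some minimum size, and fall back to the serial QuickSort from Figure 3 otherwise. The 4,096-element threshold is an arbitrary illustrative value, not a tuned one.

// Sketch: parallelize only when a sublist is big enough to be worth a task.
private const int ParallelThreshold = 4096; // Arbitrary illustrative cutoff.

static void ParallelQuickSort(int[] array, int left, int right) {
  if (right - left < ParallelThreshold) {
    QuickSort(array, left, right);          // Small job: serial is cheaper.
    return;
  }
  int pivot = Partition(array, left, right);
  Parallel.Invoke(
    delegate { ParallelQuickSort(array, left, pivot - 1); },
    delegate { ParallelQuickSort(array, pivot + 1, right); });
}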

What Next?

There are a number of ways to squeeze even better performance out of the code, but that’s not the objective of this initial article. You did, however, see how to get 80 percent CPU utilization by making just a few changes that make the code thread-friendly. Rather than optimizing this code further, the next step is to focus on getting maximum performance out of all the CPUs on a system by architecting jobs a bit differently.

Sorting in the manner I’ve demonstrated here is particularly amenable to multithreading. You can calculate how far you’re going to divide the job and then give each sub-job to a thread. However, while I got a performance boost, I also left some performance behind.

In real applications, though, you might run into situations where many jobs hand you groups of unique tasks, or where you don’t know how long any particular task will run and have to schedule work around that uncertainty. That’s a particularly challenging problem. In my next article, I’m going to take a look at an architecture that takes a holistic approach to threading, allowing you to distribute multiple, perhaps dissimilar jobs. I’ll show you how to architect an application to make it multicore-aware from the start through the use of tasks and thread pools.


Ron Fosner has been optimizing high-performance applications and games on Windows for 20 years and is starting to get the hang of it. He’s a graphics and optimization expert at Intel and is happiest when he sees all CPU cores running flat out. You can reach him at Ron@directx.com.

Thanks to the following technical expert for reviewing this article: Stephen Toub