August 2011

Volume 26 Number 08

Windows With C++ - The Windows Thread Pool and Work

By Kenny Kerr | August 2011

Concurrency means different things to different people. Some folks think in terms of agents and messages—cooperating but asynchronous state machines. Others think in terms of tasks, usually in the form of functions or expressions that may execute concurrently. Still others think in terms of data parallelism, where the structure of the data enables concurrency. You might even consider these complementary or overlapping techniques. Regardless of how you view the world of concurrency, at the heart of any contemporary approach to concurrency is a thread pool of one form or another.

Threads are relatively expensive to create. An excessive number of threads introduces scheduling overhead that affects cache locality and overall performance. In most well-designed systems, the unit of concurrency is relatively short-lived. Ideally, there would be a simple way to create threads as needed, reuse them for additional work and intelligently cap their number so that the available computing power is used efficiently. Fortunately, that ideal exists today, and not in some third-party library, but right in the heart of the Windows API. Not only does the Windows thread pool API meet these requirements, but it also integrates seamlessly with many of the core building blocks of the Windows API. It takes much of the complexity out of writing scalable and responsive applications. If you’re a longtime Windows developer, you’re undoubtedly familiar with the I/O completion port, the cornerstone of Windows scalability. Take comfort in the fact that an I/O completion port sits at the heart of the Windows thread pool.

Keep in mind that a thread pool shouldn’t be viewed simply as a way to avoid calling CreateThread with all of its parameters and the requisite call to CloseHandle on the resulting handle. Sure, this may be convenient, but it can also be misleading. Most developers have expectations about the priority-driven, preemptive scheduling model that Windows implements. Threads at the same priority will typically share processor time. When a thread’s quantum—the amount of time it gets to run—comes to an end, Windows determines whether another thread with the same priority is ready to execute. Naturally, many factors influence thread scheduling, but given two threads that are created around the same time, with the same priority, both performing some compute-bound operation, one would expect them both to begin executing within a few quantums of each other.

Not so with the thread pool. The thread pool—and really any scheduling abstraction based on an I/O completion port—relies on a work-queuing model. The thread pool guarantees full core utilization but also prevents overscheduling. If two units of work are submitted around the same time on a single-core machine, then only the first is dispatched. The second will only start if the first finishes or blocks. This model is optimal for throughput because work will execute more efficiently with fewer interruptions, but it also means that there are no latency guarantees.

The thread pool API is designed as a set of cooperating objects. There are objects representing units of work, timers, asynchronous I/O and more. There are even objects representing challenging concepts such as cancellation and cleanup. Fortunately, the API doesn’t force developers to deal with all of these objects and, much like a buffet, you can consume as little or as much as needed. Naturally, this freedom introduces the risk of using the API inefficiently or in an inappropriate way. That’s why I’ll be spending the next few months on it in this column. As you begin to grasp the different roles that the various parts of the API play, you’ll discover that the code you need to write gets simpler rather than more complex.

In this first installment, I’m going to show you how to start submitting work to the thread pool. Functions are exposed to the thread pool as work objects. A work object consists of a function pointer as well as a void pointer, called a context, which the thread pool passes to the function every time it’s executed. A work object can be submitted multiple times for execution, but the function and context can’t be changed without creating a new work object.

The CreateThreadpoolWork function creates a work object. If the function succeeds, it returns an opaque pointer representing the work object. If it fails, it returns a null pointer value and provides more information via the GetLastError function. Given a work object, the CloseThreadpoolWork function informs the thread pool that the object may be released. This function doesn’t return a value, and for efficiency assumes the work object is valid. Fortunately, the unique_handle class template I introduced in last month’s column takes care of this. Here’s a traits class that can be used with unique_handle, as well as a typedef for convenience:

struct work_traits
{
  // A null pointer is the "invalid" value for a work object.
  static PTP_WORK invalid() throw()
  {
    return nullptr;
  }

  // Releasing a work object means closing it via the API.
  static void close(PTP_WORK value) throw()
  {
    CloseThreadpoolWork(value);
  }
};

typedef unique_handle<PTP_WORK, work_traits> work;

I can now create a work object and let the compiler take care of its lifetime, whether the object resides on the stack or in a container. Of course, before I can do so, I need a function for it to call, known as a callback. The callback is declared as follows:

void CALLBACK hard_work(PTP_CALLBACK_INSTANCE, void * context, PTP_WORK);

The CALLBACK macro ensures that the function implements the appropriate calling convention that the Windows API expects for callbacks, depending on the target platform. Creating a work object for this callback using the work typedef is straightforward and continues the pattern I highlighted in last month’s column, as shown here:

void * context = ... 
work w(CreateThreadpoolWork(hard_work, context, nullptr));
check_bool(w);
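To make the relationship between the context and the callback concrete, here’s a minimal sketch of what the callback body might look like. The application_state type is hypothetical, purely for illustration; the thread pool simply hands back whatever pointer was given to CreateThreadpoolWork:

// A sketch of the callback body. application_state is a hypothetical
// type; the context is whatever pointer was passed to CreateThreadpoolWork.
void CALLBACK hard_work(PTP_CALLBACK_INSTANCE, void * context, PTP_WORK)
{
  auto state = static_cast<application_state *>(context);

  // Perform the compute-bound operation using the state ...
}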

At this point, all I have is an object that represents some work to perform, but the thread pool itself isn’t yet involved, as the work callback hasn’t been submitted for execution. The SubmitThreadpoolWork function submits the work callback to the thread pool. It may be called multiple times with the same work object to allow multiple callbacks to run concurrently. The function is shown here:

SubmitThreadpoolWork(w.get());

Of course, even submitting the work doesn’t guarantee its prompt execution. The work callback is queued, but the thread pool may limit the level of concurrency—the number of worker threads—to improve efficiency. As this is all rather unpredictable, there needs to be a way to wait for outstanding callbacks, both those that may be currently executing as well as those that are still pending. Ideally, it would also be possible to cancel those work callbacks that have yet to be given an opportunity to execute. Usually any sort of blocking “wait” operation is bad news for concurrency, but it’s still necessary in order to perform predictable cancellation and shutdown. That’s the topic of an upcoming column, so I won’t spend much more time on it here. For now, the WaitForThreadpoolWorkCallbacks function meets the aforementioned requirements. Here’s an example:

bool cancel = ...
WaitForThreadpoolWorkCallbacks(w.get(), cancel);

The value of the second parameter determines whether pending callbacks will be canceled or whether the function waits for them to complete even if they haven’t yet begun to execute. I now have enough to build a basic functional pool, taking the thread pool API and a sprinkling of C++ 2011 to build something that’s a lot more enjoyable to use. Moreover, it provides a good example for using all of the functions I’ve introduced thus far.
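Before building that pool, here’s a compact sketch that ties together everything so far. It reuses the hypothetical application_state from the earlier callback sketch, along with the work typedef and check_bool from above:

// A minimal sketch combining the functions covered so far.
application_state state = {};

work w(CreateThreadpoolWork(hard_work, &state, nullptr));
check_bool(w);

// Queue three instances of the same callback; they may run concurrently.
SubmitThreadpoolWork(w.get());
SubmitThreadpoolWork(w.get());
SubmitThreadpoolWork(w.get());

// Wait for all three to complete, without canceling any that
// haven't yet started (false).
WaitForThreadpoolWorkCallbacks(w.get(), false);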

A simple functional pool should allow me to submit a function to execute asynchronously. I should be able to define this function using a lambda expression, a named function or a function object, as needed. One approach is to use a concurrent collection to store a queue of functions, passing this queue to a work callback. Visual C++ 2010 includes the concurrent_queue class template that will do the trick. I’m assuming that you’re using the updated implementation from Service Pack 1, as the original had a bug that resulted in an access violation if the queue wasn’t empty upon destruction.
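For reference, the class that follows assumes roughly these headers and declarations; this is a sketch based on the Visual C++ 2010 libraries, with unique_handle coming from last month’s column:

#include <windows.h>
#include <functional>          // std::function
#include <concurrent_queue.h>  // Concurrency::concurrent_queue

using std::function;
using Concurrency::concurrent_queue;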

I can go ahead and start defining the functional pool class as follows:

class functional_pool
{
  typedef concurrent_queue<function<void()>> queue;

  queue m_queue;
  work m_work;

  // The work callback: the context is a pointer to the queue, and
  // submit always pushes a function before submitting, so try_pop
  // is expected to succeed here.
  static void CALLBACK callback(PTP_CALLBACK_INSTANCE, void * context, PTP_WORK)
  {
    auto q = static_cast<queue *>(context);

    function<void()> function;
    q->try_pop(function);

    function();
  }

As you can see, the functional_pool class manages a queue of function objects as well as a single work object. The callback assumes that the context is a pointer to the queue and further assumes that at least one function is present in the queue. I can now create the work object for this callback and set the context appropriately, as shown here:

public:

  functional_pool() :
    m_work(CreateThreadpoolWork(callback, &m_queue, nullptr))
  {
    check_bool(m_work);
  }

A function template is needed to cater to the various types of functions that may be submitted. Its job is simply to queue the function and call SubmitThreadpoolWork to instruct the thread pool to submit the work callback for execution, as shown here:

template <typename Function>
void submit(Function const & function)
{
  // Queue the function first so the callback always finds one to pop.
  m_queue.push(function);
  SubmitThreadpoolWork(m_work.get());
}

Finally, the functional_pool destructor needs to ensure that no further callbacks will execute before allowing the queue to be destroyed, otherwise horrible things will happen. Here’s an example:

~functional_pool()
{
  // Cancel pending callbacks and wait for any executing ones
  // before the queue is destroyed.
  WaitForThreadpoolWorkCallbacks(m_work.get(), true);
}

I can now create a functional_pool object and submit work quite simply using a lambda expression:

functional_pool pool;

pool.submit([]
{
  // Do this asynchronously
});

Clearly, there’s going to be some performance penalty for explicitly queuing functions and implicitly queuing work callbacks. Using this approach in server applications, where the concurrency is typically quite structured, would probably not be a good idea. If you have only a handful of unique callbacks that handle the bulk of your asynchronous workloads, you’re probably better off just using function pointers. This approach may be useful in client applications, however. If there are many different short-lived operations that you’d like to handle concurrently to improve responsiveness, the convenience of using lambda expressions tends to be more significant.

Anyway, this article isn’t about lambda expressions but about submitting work to the thread pool. A seemingly simpler approach for achieving the same end is provided by the TrySubmitThreadpoolCallback function, as shown here:

void * context = ...
check_bool(TrySubmitThreadpoolCallback(
  simple_work, context, nullptr));

It’s almost as if the CreateThreadpoolWork and SubmitThreadpoolWork functions have been rolled into one, and that’s essentially what’s happening. The TrySubmitThreadpoolCallback function causes the thread pool to create a work object internally whose callback is immediately submitted for execution. Because the thread pool owns the work object, you don’t have to concern yourself with releasing it. Indeed, you can’t, because the work object is never exposed by the API. The callback’s signature provides further evidence, as shown here:

void CALLBACK simple_work(
  PTP_CALLBACK_INSTANCE, void * context);

The callback looks much the same as before except for the missing third parameter. At first, this seems ideal: a simpler API and less to worry about. However, there’s no obvious way to wait for the callback to complete, let alone to cancel it. Trying to write the functional_pool class in terms of TrySubmitThreadpoolCallback would be problematic and require additional synchronization. An upcoming column addresses how this can be achieved using the thread pool API. Even if you were able to solve these issues, a less obvious problem exists that’s potentially far more devastating in practice. Every call to TrySubmitThreadpoolCallback involves the creation of a new work object with its associated resources. With heavy workloads, this can quickly cause the thread pool to consume a great deal of memory and result in further performance penalties.

Using a work object explicitly also provides other benefits. The callback’s final parameter in its original form provides a pointer to the same work object that submitted the running instance. You can use it to queue up additional instances of the same callback, as the sketch below shows. You can even use it to release the work object. However, these sorts of tricks can get you into trouble, as it becomes increasingly difficult to know when it’s safe to submit work and when it’s safe to release application resources.
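Here’s a hedged sketch of that resubmission pattern; the remaining-chunk counter is a hypothetical context, not something the API prescribes:

// A sketch of a callback that resubmits itself through its PTP_WORK
// parameter. Notice how quickly it becomes hard to reason about when
// submission and cleanup are safe.
void CALLBACK chunked_work(PTP_CALLBACK_INSTANCE, void * context, PTP_WORK work)
{
  auto remaining = static_cast<long *>(context);

  // Process one chunk of the overall job here ...

  if (InterlockedDecrement(remaining) > 0)
  {
    SubmitThreadpoolWork(work); // queue another instance of this callback
  }
}

In next month’s column, I’ll examine the thread pool environment as I continue to explore the Windows thread pool API.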


Kenny Kerr is a software craftsman with a passion for native Windows development. Reach him at kennykerr.ca.

Thanks to the following technical experts for reviewing this article: Hari Pulapaka and Pedro Teixeira