January 2013

Volume 28 Number 01

Windows with C++ - The Evolution of Threads and I/O in Windows

By Kenny Kerr | January 2013

When starting a new project, do you ask yourself whether your program will be compute-bound or I/O-bound? You should. I’ve found that in most cases it’s either one or the other. You might be working on an analytics library that’s handed a pile of data and keeps a bunch of processors busy while crunching it all down to a set of aggregates. Alternatively, your code might spend most of its time waiting for stuff to happen: for data to arrive over the network, for a user to click on something or perform some complex six-fingered gesture. In this case, the threads in your program aren’t doing very much. Sure, there are cases where programs are heavy both on I/O and computation. The SQL Server database engine comes to mind, but that’s less typical of today’s computer programming. More often than not, your program is tasked with coordinating the work of others. It might be a Web server or client communicating with a SQL database, pushing some computation to the GPU or presenting some content for the user to interact with. Given all of these different scenarios, how do you decide what threading capabilities your program requires and what concurrency building blocks are necessary or useful? Well, that’s a tough question to answer generally and something you’ll need to analyze as you approach a new project. It’s helpful, however, to understand the evolution of threading in Windows and C++ so you can make an informed decision based on the practical choices that are available.

Threads, of course, provide no direct value whatsoever to the user. Your program is no more awesome if you use twice as many threads as another program. It’s what you do with those threads that counts. To illustrate these ideas and the way in which threading has evolved over time, let me take the example of reading some data from a file. I’ll skip over the C and C++ libraries because their support for I/O is mostly geared toward synchronous, or blocking, I/O, and this is typically not of interest unless you’re building a simple console program. Of course, there’s nothing wrong with that. Some of my favorite programs are console programs that do one thing and do it really well. Still, that’s not very interesting, so I’ll move on.

One Thread

To begin with, I’ll start with the Windows API and the good old—and even aptly named—ReadFile function. Before I can start reading the contents of a file, I need a handle to the file, and this handle is provided by the immensely powerful CreateFile function:

auto fn = L"C:\\data\\greeting.txt";

auto f = CreateFile(fn, GENERIC_READ, FILE_SHARE_READ, nullptr,
                    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);

ASSERT(INVALID_HANDLE_VALUE != f);

To keep the examples brief, I’ll just use the ASSERT and VERIFY macros as placeholders to indicate where you’ll need to add some error handling to manage any failures reported by the various API functions. In this code snippet, the CreateFile function is used to open rather than create the file. The same function is used for both operations. The Create in the name is more about the fact that a kernel file object is created, and not so much about whether or not a file is created in the file system. The parameters are pretty self-explanatory and not too relevant to this discussion, with the exception of the second-to-last, which lets you specify a set of flags and attributes indicating the type of I/O behavior you need from the kernel. In this case I used the FILE_ATTRIBUTE_NORMAL constant, which just indicates that the file should be opened for normal synchronous I/O. Remember to call the CloseHandle function to release the kernel’s lock on the file when you’re done with it. A handle wrapper class, such as the one I described in my July 2011 column, “C++ and the Windows API” (msdn.microsoft.com/magazine/hh288076), will do the trick.

I can now go ahead and call the ReadFile function to read the contents of the file into memory:

char b[64];
DWORD c;

VERIFY(ReadFile(f, b, sizeof(b), &c, nullptr));

printf("> %.*s\n", c, b);

As you might expect, the first parameter specifies the handle to the file. The next two describe the memory into which the file’s contents should be read. The fourth receives the actual number of bytes copied, should there be fewer bytes available than were requested. The final parameter is only used for asynchronous I/O, and I’ll get back to that in a moment. In this simple example, I then print out the characters that were actually read from the file. Naturally, you might need to call ReadFile multiple times if the file is larger than the buffer.

Two Threads

This model of I/O is simple to grasp and certainly quite useful for many small programs, particularly console-based programs. But it doesn’t scale very well. If you need to read two separate files at the same time, perhaps to support multiple users, you’ll need two threads. No problem—that’s what the CreateThread function is for. Here’s a simple example:

auto t = CreateThread(nullptr, 0, [] (void *) -> DWORD
{
  // Open and read the file here!
  return 0;
},
nullptr, 0, nullptr);

ASSERT(t);

WaitForSingleObject(t, INFINITE);

Here I’m using a stateless lambda instead of a callback function to represent the thread procedure. The Visual C++ 2012 compiler conforms to the C++11 language specification in that stateless lambdas must be implicitly convertible to function pointers. This is convenient, and the Visual C++ compiler does one better by automatically producing the appropriate calling convention on the x86 architecture, which sports a variety of calling conventions.

The CreateThread function returns a handle representing a thread that I then wait on using the WaitForSingleObject function. The thread itself blocks while the file is read. In this way I can have multiple threads performing different I/O operations in tandem. I could then call WaitForMultipleObjects to wait until all of the threads have finished. Remember also to call CloseHandle to release the thread-related resources in the kernel.

This technique, however, doesn’t scale beyond a handful of users or files or whatever the scalability vector is for your program. To be clear, it’s not that multiple outstanding read operations don’t scale. Quite the opposite. But it’s the threading and synchronization overhead that will kill the program’s scalability.

Back to One Thread

One solution to this problem is to use something called alertable I/O via asynchronous procedure calls (APCs). In this model your program relies on a queue of APCs that the kernel associates with each thread. APCs come in both kernel- and user-mode varieties. That is, the procedure, or function, that’s queued might belong to a program in user mode, or even to some kernel-mode driver. The latter is a simple way for the kernel to allow a driver to execute some code in the context of a thread’s user-mode address space so that it has access to its virtual memory. But this trick is also available to user-mode programmers. Because I/O is fundamentally asynchronous in hardware anyway (and thus in the kernel), it makes sense to begin reading the file’s contents and have the kernel queue an APC when it eventually completes.

To begin with, the flags and attributes passed to the CreateFile function must be updated to allow the file to provide overlapped I/O so that operations on the file aren’t serialized by the kernel. The terms asynchronous and overlapped are used interchangeably in the Windows API, and they mean the same thing. Anyway, the FILE_FLAG_OVERLAPPED constant must be used when creating the file handle:

auto f = CreateFile(fn, GENERIC_READ, FILE_SHARE_READ, nullptr,
                    OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);

ASSERT(INVALID_HANDLE_VALUE != f);

Again, the only difference in this code snippet is that I replaced the FILE_ATTRIBUTE_NORMAL constant with the FILE_FLAG_OVERLAPPED one, but the difference at run time is huge. To actually provide an APC that the kernel can queue at I/O completion, I need to use the alternative ReadFileEx function. Although ReadFile can be used to initiate asynchronous I/O, only ReadFileEx lets you provide an APC to call when it completes. The thread can then go ahead and do other useful work, perhaps starting additional asynchronous operations, while the I/O completes in the background.

Again, thanks to C++11 and Visual C++, a lambda can be used to represent the APC. The trick is that the APC will likely want to access the newly populated buffer, but this isn’t one of the parameters to the APC, and because only stateless lambdas are allowed, you can’t use the lambda to capture the buffer variable. The solution is to hang the buffer off the OVERLAPPED structure, so to speak. Because a pointer to the OVERLAPPED structure is available to the APC, you can then simply cast the result to a structure of your choice. Figure 1 provides a simple example.

Figure 1 Alertable I/O with APCs

struct overlapped_buffer
{
  OVERLAPPED o;
  char b[64];
};

overlapped_buffer ob = {};

VERIFY(ReadFileEx(f, ob.b, sizeof(ob.b), &ob.o,
  [] (DWORD e, DWORD c, OVERLAPPED * o)
{
  ASSERT(ERROR_SUCCESS == e);

  auto ob = reinterpret_cast<overlapped_buffer *>(o);

  printf("> %.*s\n", c, ob->b);
}));

SleepEx(INFINITE, true);

In addition to the OVERLAPPED pointer, the APC is also provided an error code as its first parameter and the number of bytes copied as its second. At some point, the I/O completes, but in order for the APC to run, the same thread must be placed in an alertable state. The simplest way to do that is with the SleepEx function, which wakes up the thread as soon as an APC is queued and executes any APCs before returning control. Of course, the thread may not be suspended at all if there are already APCs in the queue. You can also check the return value from SleepEx to find out what caused it to resume. You can even use a value of zero instead of INFINITE to flush the APC queue before proceeding without delay.

Using SleepEx is not, however, all that useful and can easily lead unscrupulous programmers to poll for APCs, and that’s never a good idea. Chances are that if you’re using asynchronous I/O from a single thread, this thread is also your program’s message loop. Either way, you can also use the MsgWaitForMultipleObjectsEx function to wait for more than just APCs and build a more compelling single-threaded runtime for your program. The potential drawback with APCs is that they can introduce some challenging re-entrancy bugs, so do keep that in mind.

One Thread per Processor

As you find more things for your program to do, you might notice that the processor on which your program’s thread is running is getting busier while the remaining processors on the computer are sitting around waiting for something to do. Although APCs are about the most efficient way to perform asynchronous I/O, they have the obvious drawback of only ever completing on the same thread that initiated the operation. The challenge then is to develop a solution that can scale this to all of the available processors. You might conceive of a design of your own, perhaps coordinating work among a number of threads with alertable message loops, but nothing you could implement would come close to the sheer performance and scalability of the I/O completion port, in large part because of its deep integration with different parts of the kernel.

While an APC allows asynchronous I/O operations to complete on a single thread, a completion port allows any thread to begin an I/O operation and have the results processed by an arbitrary thread. A completion port is a kernel object that you create before associating it with any number of file objects, sockets, pipes and more. The completion port exposes a queuing interface whereby the kernel can push a completion packet onto the queue when I/O completes and your program can dequeue the packet on any available thread and process it as needed. You can even queue your own completion packets if needed. The main difficulty is getting around the confusing API. Figure 2 provides a simple wrapper class for the completion port, making it clear how the functions are used and how they relate.

Figure 2 The Completion Port Wrapper

class completion_port
{
  HANDLE h;

  completion_port(completion_port const &);
  completion_port & operator=(completion_port const &);

public:

  explicit completion_port(DWORD tc = 0) :
    h(CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, tc))
  {
    ASSERT(h);
  }

  ~completion_port()
  {
    VERIFY(CloseHandle(h));
  }

  void add_file(HANDLE f, ULONG_PTR k = 0)
  {
    VERIFY(CreateIoCompletionPort(f, h, k, 0));
  }

  void queue(DWORD c, ULONG_PTR k, OVERLAPPED * o)
  {
    VERIFY(PostQueuedCompletionStatus(h, c, k, o));
  }

  void dequeue(DWORD & c, ULONG_PTR & k, OVERLAPPED *& o)
  {
    VERIFY(GetQueuedCompletionStatus(h, &c, &k, &o, INFINITE));
  }
};

The main confusion is around the double duty that the CreateIoCompletionPort function performs, first actually creating a completion port object and then associating it with an overlapped file object. The completion port is created once and then associated with any number of files. Technically, you can perform both steps in a single call, but this is only useful if you use the completion port with a single file, and where’s the fun in that?

When creating the completion port, the only consideration is the last parameter indicating the thread count. This is the maximum number of threads that will be allowed to dequeue completion packets concurrently. Setting this to zero means that the kernel will allow one thread per processor.

Adding a file is technically called an association; the main thing to note is the parameter indicating the key to associate with the file. Because you can’t hang extra information off the end of a handle like you can with an OVERLAPPED structure, the key provides a way for you to associate some program-specific information with the file. Whenever the kernel queues a completion packet related to this file, this key will also be included. This is particularly important because the file handle isn’t even included in the completion packet.

As I said, you can queue your own completion packets. In this case, the values you provide are entirely up to you. The kernel doesn’t care and won’t try to interpret them in any way. Thus, you can provide a bogus OVERLAPPED pointer and the exact same address will be stored in the completion packet.

In most cases, however, you’ll wait for the kernel to queue the completion packet once an asynchronous I/O operation completes. Typically a program creates one or more threads per processor and calls GetQueuedCompletionStatus, or my dequeue wrapper function, in an endless loop. You might queue a special control completion packet—one per thread—when your program needs to come to an end and you want these threads to terminate. As with APCs, you can hang more information off the OVERLAPPED structure to associate extra information with each I/O operation:

completion_port p;
p.add_file(f);

overlapped_buffer ob = {};

auto r = ReadFile(f, ob.b, sizeof(ob.b), nullptr, &ob.o);
ASSERT(r || ERROR_IO_PENDING == GetLastError());

Here I’m again using the original ReadFile function, but in this case I’m providing a pointer to the OVERLAPPED structure as its last parameter. A waiting thread might dequeue the completion packet as follows:

DWORD c;
ULONG_PTR k;
OVERLAPPED * o;

p.dequeue(c, k, o);

auto ob = reinterpret_cast<overlapped_buffer *>(o);

printf("> %.*s\n", c, ob->b);

A Pool of Threads

If you’ve been following my column for some time, you’ll recall that I spent five months last year covering the Windows thread pool in detail. It will also be no surprise to you that this same thread pool API is implemented using I/O completion ports, providing this same work-queuing model but without the need to manage the threads yourself. It also provides a host of features and conveniences that make it a compelling alternative to using the completion port object directly. If you haven’t already done so, I encourage you to read those columns to get up to speed on the Windows thread pool API. A list of my online columns is available at bit.ly/StHJtH.

At a minimum, you can use the TrySubmitThreadpoolCallback function to get the thread pool to create one of its work objects internally and have the callback immediately submitted for execution. It doesn’t get much simpler than this:

VERIFY(TrySubmitThreadpoolCallback([] (PTP_CALLBACK_INSTANCE, void *)
{
  // Work goes here!
},
nullptr, nullptr));

If you need a bit more control, you can certainly create a work object directly and associate it with a thread pool environment and cleanup group. This will also give you the best possible performance.

Of course, this discussion is about overlapped I/O, and the thread pool provides I/O objects just for that. I won’t spend much time on this, as I’ve already covered it in detail in my December 2011 column, “Thread Pool Timers and I/O” (msdn.microsoft.com/magazine/hh580731), but Figure 3 provides a new example.

Figure 3 Thread Pool I/O

OVERLAPPED o = {};
char b[64];

auto io = CreateThreadpoolIo(f, [] (PTP_CALLBACK_INSTANCE,
  void * b, void *, ULONG e, ULONG_PTR c, PTP_IO)
{
  ASSERT(ERROR_SUCCESS == e);

  printf("> %.*s\n", static_cast<unsigned>(c), static_cast<char *>(b));
},
b, nullptr);

ASSERT(io);

StartThreadpoolIo(io);

auto r = ReadFile(f, b, sizeof(b), nullptr, &o);

if (!r && ERROR_IO_PENDING != GetLastError())
{
  CancelThreadpoolIo(io);
}

WaitForThreadpoolIoCallbacks(io, false);

Given that CreateThreadpoolIo lets me pass an additional context parameter to the queued callback, I don’t need to hang the buffer off the OVERLAPPED structure, although I could certainly do that if needed. The main things to keep in mind here are that StartThreadpoolIo must be called prior to beginning the asynchronous I/O operation, and CancelThreadpoolIo must be called should the I/O operation fail or complete inline, so to speak.

Fast and Fluid Threads

Taking the concept of a thread pool to new heights, the new Windows API for Windows Store apps also provides a thread pool abstraction, although a much simpler one with far fewer features. Fortunately, nothing prevents you from using an alternative thread pool appropriate for your compiler and libraries. Whether you’ll get it past the friendly Windows Store curators is another story. Still, the thread pool for Windows Store apps is worth mentioning, and it integrates the asynchronous pattern embodied by the Windows API for Windows Store apps.

Using the slick C++/CX extensions provides a relatively simple API for running some code asynchronously:

ThreadPool::RunAsync(ref new WorkItemHandler([] (IAsyncAction ^)
{
  // Work goes here!
}));
Syntactically this is pretty straightforward. We can even hope that this will become simpler in a future version of Visual C++ if the compiler learns to automatically generate a C++/CX delegate from a lambda, much as it does today for function pointers.

Still, this relatively simple syntax belies a great deal of complexity. At a high level, ThreadPool is a static class, to borrow a term from the C# language, and thus can’t be created. It provides a few overloads of the static RunAsync method, and that’s it. Each takes at least a delegate as its first parameter. Here I’m constructing the delegate with a lambda. The RunAsync methods also return an IAsyncAction interface, providing access to the asynchronous operation.

As a convenience, this works pretty well and integrates nicely into the asynchronous programming model that pervades the Windows API for Windows Store apps. You can, for example, wrap the IAsyncAction interface returned by the RunAsync method in a Parallel Patterns Library (PPL) task and achieve a level of composability similar to what I described in my September and October columns, “The Pursuit of Efficient and Composable Asynchronous Systems” (msdn.microsoft.com/magazine/jj618294) and “Back to the Future with Resumable Functions” (msdn.microsoft.com/magazine/jj658968).

However, it’s useful and somewhat sobering to realize what this seemingly innocuous code really represents. At the heart of the C++/CX extensions is a runtime based on COM and its IUnknown interface. Such an interface-based object model can’t possibly provide static methods. There has to be an object for there to be an interface, and some sort of class factory to create that object, and indeed there is.

The Windows Runtime defines something called a runtime class that’s very much akin to a traditional COM class. If you’re old-school, you could even define the class in an IDL file and run it through a new version of the MIDL compiler specifically suited to the task, and it will generate .winmd metadata files and the appropriate headers.

A runtime class can have both instance methods and static methods. They’re defined with separate interfaces. The interface containing the instance methods becomes the class’s default interface, and the interface containing the static methods is attributed to the runtime class in the generated metadata. In this case the ThreadPool runtime class lacks the activatable attribute and has no default interface, but its activation factory can be queried for the static interface, and then those not-so-static methods may be called. Figure 4 gives an example of what this might entail. Keep in mind that most of this would be compiler-generated, but it should give you a good idea of what it really costs to make that simple static method call to run a delegate asynchronously.

Figure 4 The WinRT Thread Pool

class WorkItemHandler :
  public RuntimeClass<RuntimeClassFlags<ClassicCom>, IWorkItemHandler>
{
public:

  virtual HRESULT __stdcall Invoke(IAsyncAction *)
  {
    // Work goes here!
    return S_OK;
  }
};

auto handler = Make<WorkItemHandler>();

HSTRING_HEADER header;
HSTRING clsid;

auto hr = WindowsCreateStringReference(
  RuntimeClass_Windows_System_Threading_ThreadPool,
  _countof(RuntimeClass_Windows_System_Threading_ThreadPool) - 1,
  &header, &clsid);

ASSERT(S_OK == hr);

ComPtr<IThreadPoolStatics> tp;

hr = RoGetActivationFactory(
  clsid, __uuidof(IThreadPoolStatics),
  reinterpret_cast<void **>(tp.GetAddressOf()));

ASSERT(S_OK == hr);

ComPtr<IAsyncAction> a;

hr = tp->RunAsync(handler.Get(), a.GetAddressOf());

ASSERT(S_OK == hr);

This is certainly a far cry from the relative simplicity and efficiency of calling the TrySubmitThreadpoolCallback function. It’s helpful to understand the cost of the abstractions you use, even if you end up deciding that the cost is justified given some measure of productivity. Let me break it down briefly.

The WorkItemHandler delegate is actually an IUnknown-based IWorkItemHandler interface with a single Invoke method. The implementation of this interface isn’t provided by the API but rather by the compiler. This makes sense because it provides a convenient container for any variables captured by the lambda, and the lambda’s body would naturally reside within the compiler-generated Invoke method. In this example I’m simply relying on the Windows Runtime Library (WRL) RuntimeClass template class to implement IUnknown for me. I can then use the handy Make template function to create an instance of my WorkItemHandler. For stateless lambdas and function pointers, I’d further expect the compiler to produce a static implementation with a no-op implementation of IUnknown to avoid the overhead of dynamic allocation.

To create an instance of the runtime class, I need to call the RoGetActivationFactory function. However, it needs a class ID. Notice that this isn’t the CLSID of traditional COM but rather the fully qualified name of the type, in this case, Windows.System.Threading.ThreadPool. Here I’m using a constant array generated by the MIDL compiler to avoid having to count the string at run time. As if that weren’t enough, I need to also create an HSTRING version of this class ID. Here I’m using the WindowsCreateStringReference function, which, unlike the regular WindowsCreateString function, doesn’t create a copy of the source string. As a convenience, WRL also provides the HStringReference class that wraps up this functionality. I can now call the RoGetActivationFactory function, requesting the IThreadPoolStatics interface directly and storing the resulting pointer in a WRL-provided smart pointer.

I can now finally call the RunAsync method on this interface, providing it with my IWorkItemHandler implementation as well as the address of an IAsyncAction smart pointer representing the resulting action object.

It’s then perhaps not surprising that this thread pool API doesn’t provide anywhere near the amount of functionality and flexibility provided by the core Windows thread pool API or the Concurrency Runtime. The benefit of C++/CX and the runtime classes is, however, realized along the boundaries between the program and the runtime itself. As a C++ programmer, you can be thankful that Windows 8 isn’t an entirely new platform and the traditional Windows API is still at your disposal, if and when you need it.

Kenny Kerr is a software craftsman with a passion for native Windows development. Reach him at kennykerr.ca.

Thanks to the following technical expert for reviewing this article: James P. McNellis