CONCURRENCY

Synchronization Primitives New To Windows Vista

Robert Saccone and Alexander Taskov

Code download available at: Vista Synchronization 2007_06.exe

This article discusses:

  • Condition variables
  • Slim reader/writer lock
  • One-time initialization
  • Double-checked locking
This article uses the following technologies:
Windows Vista, C++

Contents

Condition Variables
Slim Reader/Writer Lock
One-Time Initialization
Conclusion

Windows Vista ships with a raft of new and exciting technologies for developers, including Windows® Presentation Foundation, Windows Communication Foundation, and Windows Workflow Foundation. In fact, Windows Vista™ has introduced so many new .NET-friendly technologies that it is easy to overlook all the new features and functionalities that native C/C++ developers can put to work in their apps.

In this article, we discuss some of the new capabilities in Windows Vista that impact native C/C++ developers. Specifically, we focus on several of the new thread synchronization features introduced with the new operating system: condition variables, slim reader/writer locks, and one-time initialization.

Condition Variables

The condition variable has existed in other threading libraries for some time and has been sorely missed from the Windows SDK. Basically, a condition variable is used to synchronize a group of threads based on the result of some conditional test. While this can be done using a combination of existing synchronization constructs, the condition variable has the ability to release an acquired lock and go into a sleep state in a single atomic operation. It also provides a much clearer and less error-prone way to implement the desired behavior.

The Windows SDK for Windows Vista (available for download) exposes condition variables with the CONDITION_VARIABLE structure. You can use the InitializeConditionVariable function to set up the structure. There is no function to clean up or destroy the CONDITION_VARIABLE structure as it’s not needed by the underlying implementation.
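
For illustration, here is a minimal sketch of the setup used by the producer/consumer samples that follow. The global names match the ones in the figures; the helper function is our own:

#include <windows.h>

CONDITION_VARIABLE g_condNotEmpty;
CRITICAL_SECTION   g_csLock;

void InitSynchronization()
{
    // The critical section requires its usual setup (and a matching
    // DeleteCriticalSection at shutdown); the condition variable needs
    // only this one call and no cleanup.
    InitializeCriticalSection(&g_csLock);
    InitializeConditionVariable(&g_condNotEmpty);
}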

Using the SleepConditionVariableCS function (when a critical section is used) or the SleepConditionVariableSRW function (when a slim reader/writer lock is used), you can have threads wait on a condition variable. These sleeping threads will be released when another thread calls WakeConditionVariable or WakeAllConditionVariable, depending on whether the calling thread wants one thread or all threads waiting on the condition variable to be released.

The common producer/consumer problem represents a scenario where condition variables would come in handy. This classic example refers to a scenario where a producer generates data, placing it in a buffer, while a consumer grabs pieces of data from the buffer to be processed. The problem refers to the need to make sure the producer doesn’t try to add data to a full buffer and the consumer doesn’t try to grab data from an empty buffer. We will work through this scenario to show you how the condition variable can help.

For this example, we will create a single producer thread that posts numerical data to a shared queue. We will then create five consumer threads. Each consumer thread will remove an item from the queue and process it. When finished with the current piece of data, the consumer thread will loop around, repeating the process infinitely.

In earlier versions of Windows, the producer/consumer problem could be solved with a combination of Win32 events and a critical section. The critical section protects the shared resource from concurrent access and an event signals when the resource is available for the consumer.

In our first attempt at solving this problem, we use a Standard Template Library (STL) list of integers as the shared resource. Since the list expands dynamically, we don’t need an event to signal when the list is not full; we only need to know when it’s not empty so the consumer knows that it has something to consume. (If you were to use a fixed-size array for the shared queue, then you would need a not-full event to make sure you don’t overwrite the bounds of the buffer.) We then declare and initialize a CRITICAL_SECTION object and an auto-reset event, which is used to indicate when the list is not empty.

The producer thread shown in Figure 1 will first try to acquire the critical section and, if it succeeds, will then insert an integer value at the end of the shared list. The thread then releases the critical section and sets the not-empty event. Only one of the threads waiting on this event will be released since we are using an auto-reset event. The consumer thread shown in Figure 1 will check to see if the queue is empty. If the queue is not empty, the thread will remove an item and release the critical section. If the queue is empty, the consumer thread will go back to sleep, waiting on the not-empty event. While the first consumer thread is busy processing the item it removed from the queue, the producer can wake up another consumer thread to pick up the next piece of work to keep the queue moving.

Figure 1 Solution That Doesn’t Use Condition Variable

Producer

unsigned _stdcall ProducerFunc(void *pParam)
{
    for (unsigned int i = 0;i < g_uiIterations;i++)
    {
        EnterCriticalSection(&g_csLock);
        // Produce work
        g_listWork.push_back(i);
        LeaveCriticalSection(&g_csLock);
        SetEvent(g_hNotEmpty);
        Sleep(g_uiProducerDelay); // Simulating work
    }
    return 0;
}

Consumer

while (true)
{
    EnterCriticalSection(&g_csLock);

    if (g_listWork.empty())
    {
        LeaveCriticalSection(&g_csLock);
        WaitForSingleObject(g_hNotEmpty,INFINITE);
    }
    else
    {
        i = g_listWork.front();
        g_listWork.pop_front();

        LeaveCriticalSection(&g_csLock);

        wcout << L"Thread " << iThread << L" Consumed: " << i << endl;
        Sleep(g_uiConsumerDelay); // Simulating work
    }
}

Although this solution will generally work, there are some limitations to this implementation. For one, getting the synchronization objects properly initialized can be tricky. For example, we had to decide whether the producer should wake only one consumer thread or all of them when data is pushed onto the list. This is controlled by how we initialized the event. If we use an auto-reset event, only one thread will be released. However, if we want to wake all the threads, we would use a manual-reset event and have to remember to call ResetEvent at the correct time.

It’s essential to make sure that when the queue is empty, the consumer thread releases the critical section before waiting on the not-empty event. We also have to make sure we don’t mistakenly use PulseEvent in the producer thread to signal the not-empty event because this would introduce a race condition. The problem occurs if the consumer thread is preempted immediately after it releases the critical section but before it calls WaitForSingleObject and then the producer thread calls PulseEvent. PulseEvent is not sticky and will only release those threads currently waiting on the event. When the preempted thread resumes, the event will not be signaled and the wakeup will be lost.

Condition variables make it much easier to get the solution right. This approach still uses the critical section, but it replaces the not-empty event with a condition variable. We initialize the condition variable in the main function by calling InitializeConditionVariable with a pointer to our CONDITION_VARIABLE structure.

The consumer thread shown in Figure 2 enters the critical section and checks to see if the queue is empty. If so, SleepConditionVariableCS is called. This function releases the critical section and puts the thread to sleep in one atomic operation, eliminating the race condition that can occur if the consumer is preempted in between these two tasks. The SleepConditionVariableCS function also takes a timeout parameter, allowing us to do something else if we don’t want to wait too long.

Figure 2 Solution That Uses Condition Variable

Producer


for (unsigned int i = 0;i < g_uiIterations;i++)
{
    EnterCriticalSection(&g_csLock);

    // Produce work
    g_listWork.push_back(i);

    LeaveCriticalSection(&g_csLock);

    WakeConditionVariable(&g_condNotEmpty);

    Sleep(g_uiProducerDelay);
}

Consumer

while (true)
{
    EnterCriticalSection(&g_csLock);

    while (g_listWork.empty())
    {
        SleepConditionVariableCS(&g_condNotEmpty,&g_csLock,INFINITE);
    }

    i = g_listWork.front();
    g_listWork.pop_front();

    LeaveCriticalSection(&g_csLock);

    wcout << L"Thread " << iThread << L" Consumed: " << i << endl;
    Sleep(g_uiConsumerDelay); // Simulating work
}

When the condition variable is signaled, the thread wakes up and automatically locks the critical section again before the SleepConditionVariableCS function returns. The consumer thread then loops back around and checks again to see if the queue is empty. It’s very important to test the condition again when the thread is released because its state could have been changed by another thread right before the consumer was released; in addition, the condition variable is subject to spurious wakeups so it may run before the condition has changed. We then remove the work item from the queue and release the critical section.

The producer thread shown in Figure 2 starts out by entering the critical section. Once it has the lock, the producer thread pushes a new work item onto the queue and then releases the critical section. Now it can release a consumer thread by calling WakeConditionVariable. This function releases only one thread, just like the solution that doesn’t use a condition variable. If we want to release all of the consumer threads, we would call the WakeAllConditionVariable function. The condition variable syntax makes it clear how the thread will be dispatched since the function names indicate what they will do. Using events makes this much more error-prone as the desired behavior is not specified with event-signaling functions but when the event is initialized.

If the producer thread is preempted right after it releases the critical section but before it calls the Wake function, there is an opening for another thread to modify the state of the queue before the consumer thread is released. This is why we said that the condition must be checked again by the consumer thread when it is released.

Condition variables can also be more efficient in some scenarios. The SleepConditionVariableCS and SleepConditionVariableSRW APIs try to avoid trips to kernel mode when possible. However, the example that doesn’t use condition variables will always incur at least one kernel-mode roundtrip when it calls WaitForSingleObject.

As a result, the native condition variable implementation provides many benefits over homegrown implementations without sacrificing flexibility. The resulting code is simpler and easier to read, and performance may be improved. This greatly reduces the probability of introducing many subtle bugs that can be very difficult to discover.

Slim Reader/Writer Lock

A reader/writer lock is used to protect a piece of data to which you want to allow multiple readers concurrent access, but in the case of an update, only exclusive access to the writer. Typically, these locks are best used in scenarios where the data needs to be read frequently and updated infrequently. Using reader/writer locks when appropriate should lead to increased scalability.

Reader/writer locks have been around for some time. Before the release of Windows Vista, however, if you were writing native user mode code on Windows and needed a reader/writer lock, the only choices you had were to write your own or to adapt an implementation from a textbook. Windows Vista includes a type of reader/writer lock called the slim reader/writer (SRW) lock. Let’s take a look at the feature set of this new synchronization primitive, examine the APIs that are available to use it, and compare its performance to an alternative implementation constructed using standard Win32® critical sections, semaphores, and events.

The SRW lock is designed to be fast and (as its name implies) small, while maintaining its resource efficiency as it is used. It is built on top of the Windows kernel keyed events mechanism, which was introduced to solve resource exhaustion problems that can occur when applications use large numbers of CRITICAL_SECTION objects. Some reader/writer locks are designed to favor readers over writers, and vice versa. But the SRW lock is designed to favor neither. This means that if your application requires that data updates take priority over data reads, you might want to consider a different reader/writer lock that favors writers. However, before you write your own lock, you should try out the SRW lock to see how it behaves in the context of your application.

Two other features that reader/writer locks sometimes support are the ability to acquire the lock recursively and the ability to upgrade (or downgrade) the access a thread has already been granted to the lock. First, regarding recursive acquires: if the locking policy you’ve designed for your application requires that synchronization objects be acquired recursively, this is very possibly a red flag telling you to re-examine your locking policy to eliminate the recursion. This is our opinion; it stems from the additional overhead of executing the lock acquisition and release code multiple times and, perhaps more importantly, from the fact that proving the lock releases correctly balance the lock acquisitions is often difficult.

The SRW lock doesn’t support recursive acquisition. Such support would impose additional overhead as per-thread counts would need to be tracked in order to maintain correctness. SRW locks also don’t support upgrading from shared to exclusive access, nor do they support downgrading (which is less common) from exclusive to shared access. Supporting the ability to upgrade would probably have caused unacceptable complexity and extra overhead that would impact even the common cases of shared and exclusive acquisition code in the lock. It would also require that policy be defined with respect to how to choose from waiting readers, waiting writers, and readers waiting to upgrade, which would violate the basic design goal of no favoritism.

The first step in using a slim reader/writer lock is to declare an SRWLOCK structure and initialize it using InitializeSRWLock:

VOID WINAPI InitializeSRWLock(PSRWLOCK SRWLock);

The SRW lock breaks away from the customary Win32 pattern where each object has an initialize and cleanup function. When you’re finished with an SRW lock, there is no cleanup function that needs to be called.

Once initialized, the basic APIs that are now available are:

VOID WINAPI AcquireSRWLockExclusive(PSRWLOCK SRWLock);

VOID WINAPI ReleaseSRWLockExclusive(PSRWLOCK SRWLock);

VOID WINAPI AcquireSRWLockShared(PSRWLOCK SRWLock);

VOID WINAPI ReleaseSRWLockShared(PSRWLOCK SRWLock);

The AcquireSRWLockExclusive function, as the name implies, acquires the lock exclusively for the caller. Once exclusive access is granted, all other threads requesting either type of access will be blocked until the lock is released using the complementary ReleaseSRWLockExclusive function. AcquireSRWLockShared, on the other hand, acquires the lock with shared access. In this case, other threads that also request shared access do not have to wait if the lock is un-owned or has already been acquired by another thread with shared access.
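
To make this concrete, here is a minimal sketch of guarding a simple shared value; the names are our own, not from the sample program. Any number of threads may execute ReadSharedValue concurrently, while WriteSharedValue waits until it can have the value to itself:

#include <windows.h>

SRWLOCK g_lock;
LONG    g_value = 0;

void InitSharedValue()
{
    InitializeSRWLock(&g_lock); // No cleanup call will ever be needed.
}

LONG ReadSharedValue()
{
    // Shared access: concurrent readers do not block one another.
    AcquireSRWLockShared(&g_lock);
    LONG value = g_value;
    ReleaseSRWLockShared(&g_lock);
    return value;
}

void WriteSharedValue(LONG newValue)
{
    // Exclusive access: waits for all readers and any other writer.
    AcquireSRWLockExclusive(&g_lock);
    g_value = newValue;
    ReleaseSRWLockExclusive(&g_lock);
}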

Slim reader/writer locks can be used with condition variables using the SleepConditionVariableSRW function:

BOOL WINAPI SleepConditionVariableSRW(
  PCONDITION_VARIABLE ConditionVariable,
  PSRWLOCK SRWLock,
  DWORD dwMilliseconds,
  ULONG Flags
);

SleepConditionVariableSRW releases an SRW lock and waits on the specified condition variable as one atomic operation. The function doesn’t return until the condition variable has been signaled or the timeout specified in dwMilliseconds has elapsed. If dwMilliseconds is INFINITE, the function will never time out. If the Flags parameter specifies CONDITION_VARIABLE_LOCKMODE_SHARED, the function assumes that the SRW lock is held with shared access; otherwise, exclusive access is assumed. Upon successful return, the lock will have been re-acquired with the specified access.
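
As an illustration, the consumer from Figure 2 could be adapted to pair the condition variable with an SRW lock instead of a critical section. This is a sketch that assumes a global SRWLOCK named g_srwLock, already set up with InitializeSRWLock; the remaining globals are the same ones used in Figure 2:

unsigned _stdcall ConsumerFuncSRW(void *pParam)
{
    int iThread = (int)(INT_PTR)pParam;

    while (true)
    {
        AcquireSRWLockExclusive(&g_srwLock);

        while (g_listWork.empty())
        {
            // Atomically releases the lock and sleeps; the lock is
            // re-acquired exclusively (Flags is 0) before returning.
            SleepConditionVariableSRW(&g_condNotEmpty, &g_srwLock,
                                      INFINITE, 0);
        }

        int i = g_listWork.front();
        g_listWork.pop_front();

        ReleaseSRWLockExclusive(&g_srwLock);

        wcout << L"Thread " << iThread << L" Consumed: " << i << endl;
        Sleep(g_uiConsumerDelay); // Simulating work
    }

    return 0;
}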

While experimenting with these APIs, we discovered some issues that warrant extra care when coding. Notice that none of the lock acquisition or release APIs are typed as returning a result. If the SRW lock is not currently owned by a thread and a call to either ReleaseSRWLockShared or ReleaseSRWLockExclusive is made, a STATUS_RESOURCE_NOT_OWNED structured exception will be thrown. This is nice because the error is surfaced in a manner that is obvious to the developer.

Earlier we mentioned that recursive acquisition of the lock is not supported. If a thread that holds exclusive access attempts to acquire exclusive access again, the second call to AcquireSRWLockExclusive will never return; the thread deadlocks with itself, blocked inside the call waiting for the lock to be released. You would probably need to attach a debugger to the process to see the condition and figure out what’s going on. Calling AcquireSRWLockShared from a thread that has already acquired shared access will also lead to a deadlock if another thread has attempted to acquire the lock exclusively in between the AcquireSRWLockShared calls.

Extra care should also be taken to make sure that the correct acquire and release function pairs are used; mismatching AcquireSRWLockExclusive with ReleaseSRWLockShared (or vice versa) doesn’t cause any exceptions to be raised. It would have been helpful if exceptions were raised for these error scenarios, but detecting the errors may have imposed unwanted resource and performance overhead.
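
In C++, one way to defend against mismatched pairs (and against early returns that skip the release entirely) is a small scope-guard class. This is our own sketch, not part of the SRW API:

class SRWExclusiveGuard
{
public:
    explicit SRWExclusiveGuard(PSRWLOCK pLock) : m_pLock(pLock)
    {
        AcquireSRWLockExclusive(m_pLock);
    }

    ~SRWExclusiveGuard()
    {
        // The destructor guarantees the matching release is called.
        ReleaseSRWLockExclusive(m_pLock);
    }

private:
    // Not copyable; a copy would cause a double release.
    SRWExclusiveGuard(const SRWExclusiveGuard&);
    SRWExclusiveGuard& operator=(const SRWExclusiveGuard&);

    PSRWLOCK m_pLock;
};

An analogous guard for shared access would simply call the shared acquire and release pair instead.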

The sample code for this article includes the source for a program called ReaderWriterExample, which allows for experimentation with different reader/writer lock implementations. The lock types that the program supports are the SRW lock, two variants of a reader/writer lock that we implemented, and a critical section. The custom reader/writer locks were both built using a Win32 critical section to protect the lock’s data, an event for signaling waiting readers, and a semaphore for waiting writers. The difference between the two custom locks is that one favors writers and one favors neither readers nor writers.

All of our anecdotal testing was done using a 64-bit version of Windows Vista on a system with two dual-core Intel Xeon 3.20 GHz processors. For these tests, the system was configured with hyperthreading disabled.

Test 1 consisted of 2,000,000 iterations of each thread where the work done while the locks are held is minimal. Therefore, the results reflect the actual cost of lock acquisition and release. The results are summarized in Figure 3.

Figure 3 Testing 2,000,000 Iterations (results measured in seconds)

Lock Type                        | 1 Reader | 1 Writer | 2 Readers/2 Writers | 3 Readers/1 Writer | 4 Readers | 4 Writers
SRW Lock                         | 0.126    | 0.119    | 0.589               | 0.667              | 0.871     | 0.517
Custom R/W Lock - Favors Neither | 1.238    | 0.257    | 27.095              | 4.076              | 2.466     | 53.567
Custom R/W Lock - Favors Writers | 1.228    | 0.260    | 31.306              | 6.083              | 2.307     | 53.567
Critical Section                 | 0.134    | 0.133    | 1.084               | 1.021              | 1.036     | 1.009

There are some interesting findings here. First, when using just one reader thread or one writer thread (the first two columns of results), there is no lock contention and the performance of the SRW lock is very close to that of the critical section. In the case of four writer threads, all competing for exclusive access to the lock, the SRW lock takes only about half the time the critical section approach takes. The SRW lock used as an exclusive mode lock appears to perform better than the critical section and may be worth considering as a substitute. Notice that in the columns labeled 2 Readers/2 Writers and 3 Readers/1 Writer, the SRW lock performs significantly faster than when using a critical section. This demonstrates the benefit of a reader/writer lock allowing data readers to work in parallel.

What can we say about our homegrown reader/writer locks? Their performance looks terrible compared to the two built-in Windows locks, especially when operating under a heavy load of mixed contention from readers and writers. However, take a look at the results for 2 Readers/2 Writers and 3 Readers/1 Writer. Notice that the policy of the lock does have an effect on the overall behavior and performance. The lock that favored writers performed slower than the lock that favored neither readers nor writers. This is because parallelism was reduced to give writers preference when they were waiting to update. Of course, the goal is often to make sure the readers see the most current data and so using a lock that favors writers is the correct choice in that case.

Another issue with the custom-built locks is exhibited when four writers are competing for the lock. Why is the performance so terrible when there’s a lot of contention for exclusive access? The problem is that each thread is competing for access to the critical section that protects the lock’s internal data. Contention on the critical section causes a trip to kernel mode, where the thread is put to sleep on an event. Once the critical section has been entered, the state of the lock is examined to see if any readers are holding shared access or if another writer already has exclusive access. In either of these cases, the thread that wants exclusive access has to leave the critical section and then wait on the semaphore, which will be signaled when the thread has exclusive access and can continue to run. As you can see, frequently having to wait on multiple synchronization objects in order to get exclusive access really hurts performance.
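
To see where the multiple waits come from, here is a rough sketch of the exclusive-acquire and release paths in a lock organized like ours; the member names are illustrative, and construction of the members is elided:

#include <windows.h>

class CustomRWLock
{
public:
    void AcquireExclusive();
    void ReleaseExclusive();
    // Shared-access methods and member initialization elided.
private:
    CRITICAL_SECTION m_csState;   // Protects the fields below.
    LONG m_cActiveReaders;
    LONG m_cWaitingWriters;
    bool m_bWriterActive;
    HANDLE m_hWriterSemaphore;    // Signaled to release one waiting writer.
};

void CustomRWLock::AcquireExclusive()
{
    EnterCriticalSection(&m_csState);  // First possible wait: contention
                                       // here alone costs a kernel trip.
    bool bMustWait = (m_cActiveReaders > 0) || m_bWriterActive;
    if (bMustWait)
        m_cWaitingWriters++;
    else
        m_bWriterActive = true;
    LeaveCriticalSection(&m_csState);

    if (bMustWait)
    {
        // Second wait: ownership is handed over when the releasing
        // thread signals the semaphore.
        WaitForSingleObject(m_hWriterSemaphore, INFINITE);
    }
}

void CustomRWLock::ReleaseExclusive()
{
    EnterCriticalSection(&m_csState);
    if (m_cWaitingWriters > 0)
    {
        // Hand ownership directly to one waiting writer;
        // m_bWriterActive remains true.
        m_cWaitingWriters--;
        ReleaseSemaphore(m_hWriterSemaphore, 1, NULL);
    }
    else
    {
        m_bWriterActive = false;
        // Waking any waiting readers via the event is elided.
    }
    LeaveCriticalSection(&m_csState);
}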

The ReaderWriterExample application includes a number of command-line switches so that its behavior can be tailored to allow for experimentation with different scenarios using the different types of reader/writer locks. The data that is protected by the lock is a simple LONG type. However, the application allows for the specification of an additional amount of work to be done while holding the lock in order to simulate accessing or updating a more complex data structure. The additional work parameter can be specified individually for readers and writers, allowing for the simulation of data structures that incur more overhead from updates than from reads. There are also parameters to specify how much work a reader/writer should do in between each access to the lock. Work is simulated by spinning in a loop performing a calculation.
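
The simulated work might look something like the following sketch; the function name and the particular calculation are our illustration rather than the sample’s exact code:

// Burns roughly uiUnits units of CPU time. The volatile accumulator
// keeps the compiler from optimizing the loop away.
void SimulateWork(unsigned int uiUnits)
{
    volatile unsigned int uiAccumulator = 0;
    for (unsigned int i = 0; i < uiUnits; i++)
    {
        uiAccumulator += i * i;
    }
}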

We already know that the homegrown locks don’t perform well when they are under heavy load with lots of contention from both readers and writers. Furthermore, the SRW lock performed well in all of the cases in the first example. Does this mean you’ll never want to build your own reader/writer locks now that SRW locks are available? Not necessarily. Let’s look at what happens when the locks are used in scenarios that are more typical of how we expect the locks to be used in an application.

In these scenarios, the number of shared accesses by a reader is 1,000,000 while the number of exclusive accesses by the writers is 150,000. This keeps with what we said earlier—that reader/writer locks make sense when the ratio of shared accesses to exclusive updates is high. Additionally, readers hold the lock for 2000 units of work to simulate a read request, while writers hold the lock for 3000 work units to simulate the added expense of an update to the data. A reader thread performs 100 units of work between each access to the lock so that the spacing between accesses is short while the writer performs 10,000 work units before attempting exclusive access to cause a longer amount of time between updates. This reduces the overall contention on the locks.

We ran this test using 2 readers/2 writers and 3 readers/1 writer. The results are summarized in Figures 4 and 5. The numbers in the tables are the elapsed time, measured in seconds, needed for each thread to complete its work. For reader threads, the second number is the number of updates that the reader thread observed to the data.

Figure 4 Testing Higher Reads to Updates in a 2 Readers/2 Writers Scenario (results measured in seconds)

2 Readers/2 Writers Scenario | Slim R/W              | Critical Section       | Custom R/W Lock - Favors Neither | Custom R/W Lock - Favors Writers
Reader Thread 1              | 1.892 (1,185 updates) | 5.044 (19,689 updates) | 1.868 (7,728 updates)            | 5.664 (133,215 updates)
Reader Thread 2              | 1.920 (1,177 updates) | 3.906 (16,223 updates) | 1.861 (7,402 updates)            | 5.559 (139,283 updates)
Writer Thread 1              | 7.575                 | 9.996                  | 7.372                            | 7.321
Writer Thread 2              | 7.574                 | 10.250                 | 7.378                            | 7.317
Total Execution Time         | 7.575                 | 10.250                 | 7.378                            | 7.321

Figure 5 Testing Higher Reads to Updates in a 3 Readers/1 Writer Scenario (results measured in seconds)

3 Readers/1 Writer Scenario | Slim R/W            | Critical Section       | Custom R/W Lock - Favors Neither | Custom R/W Lock - Favors Writers
Reader Thread 1             | 2.345 (761 updates) | 7.650 (21,266 updates) | 6.313 (97,997 updates)           | 8.365 (132,290 updates)
Reader Thread 2             | 2.282 (631 updates) | 7.486 (11,466 updates) | 6.267 (102,102 updates)          | 8.170 (140,144 updates)
Reader Thread 3             | 2.324 (633 updates) | 7.581 (20,013 updates) | 6.321 (98,100 updates)           | 8.134 (126,568 updates)
Writer Thread 1             | 7.970               | 11.990                 | 8.010                            | 8.446
Total Execution Time        | 7.970               | 11.990                 | 8.010                            | 8.446

These results show more parity between our homegrown locks and the SRW lock. Notice how the different policies affect the results. The lock that favors writers causes the readers to wait while writers are allowed access, which means that updates were given higher priority and in turn allowed the readers to see more of the updates. Prioritizing updates can be vital in an application so understanding the policy of the lock and its performance characteristics under the expected load is an important consideration.

SRW locks are a great new synchronization primitive on the Windows platform. For the first time, Windows offers a built-in reader/writer lock for native systems programmers. It performs well under many different circumstances and should be the first reader/writer lock you use. There will be times when your application will benefit from a different lock policy but, as we demonstrated, building a custom reader/writer lock that performs well in a wide range of scenarios isn’t straightforward.

One-Time Initialization

The problem of how to ensure correct initialization of an object or resource that is to be shared by multiple threads is one that arises time and time again in building multithreaded systems. C and C++ offer no help in solving this problem because the language standard makes no mention of multithreading support. Consider an example where there’s to be one instance of a logger object used for logging messages and where one of the requirements is that the object be instantiated on demand instead of created at the start of program execution. The single-threaded initialization code shown in Figure 6 will not execute correctly in the presence of multiple threads simultaneously executing inside the GetLogger function to access the logger object in our system.

Figure 6 Initialization

Single-Threaded Initialization

Logger* GetLogger()
{
    // C++ will execute the Logger’s constructor
    // only the first time this method is called.
    static Logger logger;

    return &logger;
}

Thread-Safe Initialization

Logger* GetLogger()
{
    static Logger* pLogger = 0;

    EnterCriticalSection(&cs);

    if (pLogger == 0)
    {
        try
        {
            pLogger = new Logger();
        }
        catch (...)
        {
            // Something went wrong.
        }
    }

    LeaveCriticalSection(&cs);

    return pLogger;
}

The initialization of the static Logger object may actually occur more than once because the compiler doesn’t place any synchronization around its construction. This may result in a corrupted logger object that in a best-case scenario will generate an exception when used—but there’s no guarantee the failure will be obvious and the system may run for a long time before anyone notices there’s a problem. One way to fix this problem is to rewrite the function and introduce synchronization so that it works correctly in the presence of multiple threads, as shown by the thread-safe initialization code in Figure 6.

Now each thread that enters the GetLogger function will attempt to enter the critical section, which means that only one thread at a time will be allowed inside the protected code block. A thread that’s executing in the critical section checks the value of pLogger and only if it is NULL will an instance of the Logger object be created. This will occur only once when the first thread enters the critical section. All other threads that subsequently enter will find that pLogger is not NULL and exit the critical section without doing any further work. By the time the return statement is reached, the value of pLogger will be non-NULL and can be returned to the caller.

On the surface, the thread safe initialization code seems like a reasonable solution. However, there is a penalty to be paid here, particularly if many threads end up calling the GetLogger function at the same time. Once the first thread has made it through allocating, constructing, and setting the pLogger pointer, there really isn’t any need for subsequent threads to enter the critical section object anymore. The pointer is always going to be valid. This realization has led to a design pattern in C++ programming called the double-checked locking pattern.

Figure 7 shows an implementation of GetLogger that uses the double-checked locking pattern. The pattern dictates that the pLogger variable be tested up to two times. The first check is done outside the critical section; if pLogger is found to be NULL, the thread enters the critical section. Once inside, it tests pLogger against NULL a second time before instantiating and setting the variable, because another thread may have won the race and completed the initialization while this thread was blocked waiting on the critical section.

Figure 7 Double-Checked Locking

Logger* GetLogger()
{
    static Logger* volatile pLogger = 0;

    if (pLogger == NULL)
    {
        EnterCriticalSection(&cs);

        if (pLogger == NULL)
        {
            try
            {
                pLogger = new Logger();
            }
            catch (...)
            {
                // Something went wrong.
            }
        }

        LeaveCriticalSection(&cs);
    }

    return pLogger;
}

It appears that the double-checked locking version offers the best of all worlds. Only the few threads that enter GetLogger before the object is instantiated will be forced to be synchronized; those that come after won’t have to enter the critical section at all. What more could we want?

The double-checked locking pattern, while conceptually simple, has up until now proven to be very difficult to code correctly. This is because the C++ standard doesn’t define a threading model. It assumes only one thread of execution, and it doesn’t define a way for a developer to express constraints on relative instruction ordering, which gives the compiler latitude to reorder the reads and writes to memory. In a multithreaded environment, the reordering can cause a thread to observe a write to memory before all the statements that occur before it in the source code have actually executed. In the case of the double-checked locking code, it is possible that the pLogger variable will be updated with the address of the memory location allocated for the Logger object before the Logger constructor has executed. A second thread that observed the non-NULL value would never enter the critical section and return the address of an object that isn’t fully constructed. (You can read about similar issues in managed code in Vance Morrison’s article "Understand the Impact of Low-Lock Techniques in Multithreaded Apps".) In past releases of the Visual Studio® C++ compiler, even using the volatile qualifier on the variable pLogger wasn’t enough to ensure correct behavior in a multithreaded scenario. However, with the release of Visual Studio 2005, the double-checked locking pattern can be made to reliably execute on the Windows platform as long as the variable pLogger is qualified with the keyword volatile.

Even though Visual Studio 2005 makes it possible to correctly implement the double-checked locking pattern, it’s still very easy to get the implementation wrong simply by omitting the volatile qualifier on the variable holding the instance pointer or omitting the check inside the critical region; the program will still compile and appear to work. Having a mechanism that just works would be a valuable addition to C++. But rather than waiting for a new revision from the standards body, Windows Vista provides just such a facility called one-time initialization. No matter what hardware platform you’re using Windows on, this is guaranteed to work. One-time initialization allows for both synchronous and asynchronous initialization. First we’ll take a look at synchronous initialization.

Conceptually, synchronous initialization works the same way the double-checked locking pattern works. Of the first n threads that attempt to simultaneously execute initialization, only one will actually instantiate the resource—the others remain blocked until initialization is completed. Once initialization has completed, subsequent threads that attempt to access the resource will not be blocked; the stored resource will simply be returned.

The first thing that needs to be done is to declare and initialize an instance of the INIT_ONCE structure. You use the InitOnceInitialize function to perform the initialization. This must be done for both synchronous and asynchronous one-time initialization, and it must be done before the structure is used with any other one-time initialization function. (Alternatively, the structure can be initialized by assigning it the constant INIT_ONCE_STATIC_INIT; a static INIT_ONCE, such as the ones in the figures that follow, is zero-initialized, which is equivalent.) Its definition is:

VOID WINAPI InitOnceInitialize(
  PINIT_ONCE InitOnce
);

The following function, meanwhile, is used to perform synchronous one-time initialization:

BOOL WINAPI InitOnceExecuteOnce(
  PINIT_ONCE InitOnce,
  PINIT_ONCE_FN InitFn,
  PVOID Parameter,
  LPVOID* Context
);

The first parameter, InitOnce, is a pointer to the instance of the INIT_ONCE structure. All threads that pass the address of the same INIT_ONCE structure will be synchronized with respect to each other while initialization is occurring. The second parameter, InitFn, is a pointer to a function that the developer writes; this function performs the actual initialization. The third parameter, handily named Parameter, is an optional value that will be passed back to the InitFn when it is called. Here you can specify any information needed to perform the initialization. The final parameter, Context, is a pointer to a void pointer. This is where the address of the initialized object will be stored if the function succeeds. Note that we’re using the word "object" to describe the result of the one-time initialization. It doesn’t necessarily have to be a pointer to a C++ object; it can be anything that fits into a pointer-size value on the platform upon which execution is taking place. InitOnceExecuteOnce returns TRUE if it succeeds, FALSE if it does not.

Now we’ll look at the function pointed to by InitFn. The callback function has the following signature:

BOOL CALLBACK InitOnceCallback(
  PINIT_ONCE InitOnce,
  PVOID Parameter,
  PVOID* Context
);

The parameters are the same as those in the InitOnceExecuteOnce function. The InitOnceCallback function is expected to store the address of the initialized object in Context. Figure 8 contains the GetLogger function rewritten to use synchronous one-time initialization. Notice that the callback function indicates failure by returning FALSE. By returning FALSE from the callback, the system allows another thread that attempts initialization to execute the callback function in hopes that it will be successful. This continues until either there are no more initialization attempts or the callback returns TRUE.

Figure 8 Synchronous One-Time Initialization

BOOL WINAPI InitLoggerFunction(PINIT_ONCE initOncePtr,
                               PVOID Parameter,
                               PVOID* contextPtr)
{
    try
    {
        Logger* pLogger = new Logger();
        *contextPtr = pLogger;
        return TRUE;
    }
    catch (...)
    {
        // Something went wrong.
        return FALSE;
    }
}

Logger* GetLogger()
{
    static INIT_ONCE initOnce;

    PVOID contextPtr;
    BOOL status;

    status = InitOnceExecuteOnce(&initOnce,
                                 InitLoggerFunction,
                                 NULL,
                                 &contextPtr);

    if (status)
    {
        return (Logger*)contextPtr;
    }

    return NULL;
}

Asynchronous one-time initialization is somewhat more complicated than the synchronous version, but it allows for one-time initialization without causing any of the threads performing the initialization to be blocked while waiting for initialization to complete. Until initialization has completed, all threads that attempt initialization are allowed to execute concurrently; each races through the initialization function, initializes its own private copy of the object, and then attempts to register that copy as the one and only initialized object. Only one thread will be the so-called winner of the object registration; the remaining threads (the losers) must destroy their private object instances and then query the system for the winning object.

The two APIs that are used for asynchronous one-time initialization are:

BOOL WINAPI InitOnceBeginInitialize(
  LPINIT_ONCE InitOnce,
  DWORD dwFlags,
  PBOOL fPending,
  LPVOID* Context
);

BOOL WINAPI InitOnceComplete(
  LPINIT_ONCE lpInitOnce,
  DWORD dwFlags,
  LPVOID lpContext
);

Figure 9 shows the same GetLogger function that was used above, but now it has been rewritten to use asynchronous one-time initialization. We’ll go through this routine step by step. The first thing a thread does is call InitOnceBeginInitialize. The thread must supply the pointer to the INIT_ONCE structure to be used, and the dwFlags parameter should be set to INIT_ONCE_ASYNC to specify that this thread is attempting to begin asynchronous one-time initialization. The state of initialization is communicated back to the caller using the remaining two parameters, fPending and Context, which are both specified as pointers so the function can update them. InitOnceBeginInitialize returns TRUE if it is successful; otherwise it returns FALSE, indicating that something went wrong and initialization cannot proceed. The fPending parameter must be examined to determine whether another thread has already completed the initialization. If fPending is FALSE, initialization has already completed and the initialized object has been stored in the Context parameter. In this case, the only work that remains is for GetLogger to convert Context to a Logger* and return it to the caller.

Figure 9 Asynchronous One-Time Initialization

Logger* GetLogger()
{
    static INIT_ONCE initOnce;

    PVOID contextPtr;
    BOOL    fStatus;
    BOOL    fPending;

    fStatus = InitOnceBeginInitialize(&initOnce,INIT_ONCE_ASYNC, 
        &fPending, &contextPtr);

    // Failed?
    if (!fStatus)
    {
        return NULL;
    }

    // Initialization already completed?
    if (!fPending)
    {
        // Pointer to the logger is contained in the context pointer.
        return (Logger*)contextPtr;
    }

    // Reaching this point means that the logger needs to be created.

    try
    {
        Logger* pLogger = new Logger();

        fStatus = InitOnceComplete(
            &initOnce,INIT_ONCE_ASYNC,(PVOID)pLogger);

        if (fStatus)
        {
            // The Logger that was created was successfully 
            // set as the logger instance to be used.
            return pLogger;
        }

        // Reaching this point means fStatus is FALSE and the object this
        // thread created is not the sole instance. 
        // Clean up the object this thread created.
        delete pLogger;
    }
    catch (...) 
    {
        // Instantiating the logger failed.
        // Fall through and see if any of 
        // the other threads were successful.
    }

    // Call again but this time ask only for 
    // the result of one-time initialization.
    fStatus = InitOnceBeginInitialize(&initOnce,INIT_ONCE_CHECK_ONLY, 
        &fPending,&contextPtr);

    // Function succeeded and initialization is complete.
    if (fStatus)
    {
        // The pointer to the logger is in the contextPtr.
        return (Logger*) contextPtr;
    }

    // InitOnceBeginInitialize failed, return NULL.
    return NULL;
}

Things get more interesting when fPending comes back with a value of TRUE. This indicates that initialization should proceed and that other threads may be running the initialization code at the same time as the caller’s thread. In Figure 9, this results in the creation of a new Logger object, followed by an attempt to register that instance as the one and only result of the asynchronous one-time initialization. This can be seen in the InitOnceComplete call that occurs after the new statement. The first parameter specifies the pointer to the same INIT_ONCE structure that was used previously, the INIT_ONCE_ASYNC flag is passed as the second parameter, and the pointer to the logger instance is supplied in the third parameter. If InitOnceComplete returns TRUE, then this instance is the one that will be returned to all callers of GetLogger.

Should InitOnceComplete return FALSE, a logger created by a different thread was stored and the caller should destroy the instance it created so that any resources it consumes are not stranded. After this, InitOnceBeginInitialize is called again, but this time instead of using the INIT_ONCE_ASYNC flag, the INIT_ONCE_CHECK_ONLY flag is used. The INIT_ONCE_CHECK_ONLY flag is used to check to see if initialization has completed. If it has, the initialization value that was stored is copied into the supplied Context pointer parameter. If a valid initialization value is returned, the function will return TRUE and the fPending parameter will be set to FALSE.

So how do you choose between synchronous and asynchronous one-time initialization? In a uniprocessor system, it makes sense to use synchronous since only one thread can be executing at a time. In a multiprocessor or multicore system, you’ll need to try to quantify the costs that may be incurred if multiple object instances are created. Costs can include time, memory (or other resource types that may be scarce), and the number of threads that will typically be concurrently executing the initialization code before an object has been successfully initialized. Synchronous initialization might be a better choice if the object takes more than a negligible amount of time to construct or if it uses significant amounts of memory either per object or aggregated across all objects that might be simultaneously created. Additionally, if the process of object creation can cause the threads performing the initialization to block, then synchronous initialization is the better choice since the opportunity for concurrency is then diminished and little would likely be gained from asynchronous initialization.

The source code for this article includes a sample, called OneTimeInit.exe, that demonstrates both synchronous and asynchronous one-time initialization of a shared event handle. The program takes the number of threads to run and a flag indicating whether to perform synchronous or asynchronous initialization. The specified number of threads are spawned and each thread attempts to retrieve a handle to a Win32 event object used to signal when the thread should terminate. The progress of each thread is sent to stdout, allowing you to see exactly how the initialization progresses. Each thread does some contrived work until the terminate event is signaled by the main thread. To get program usage information, simply run the program without any parameters.

There are a couple of loose ends that need to be mentioned with regard to one-time initialization. We noted that with synchronous initialization the callback function will continue to be called until it succeeds. Suppose your application scenario dictates that the initialization should only be attempted once and never again if it fails. One way to do this is to have the callback function still return TRUE but also set a special value in lpContext that other threads can check for as an indication that a failure occurred. Another mechanism is to use the InitOnceBeginInitialize/InitOnceComplete APIs for synchronous initialization instead of the InitOnceExecuteOnce API. To do this, omit the INIT_ONCE_ASYNC flag from the dwFlags parameter and place the initialization code between the calls to InitOnceBeginInitialize and InitOnceComplete, rather than in a separate callback function. In the case of failure, the INIT_ONCE_INIT_FAILED flag is used with the InitOnceComplete API to indicate that no further initialization attempts should occur. An optional failure value can be set using lpContext, which other threads can check for.
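
Here is a minimal sketch of that second mechanism as just described, reusing the Logger example:

Logger* GetLogger()
{
    static INIT_ONCE initOnce;

    PVOID contextPtr;
    BOOL  fPending;

    // Omitting INIT_ONCE_ASYNC makes the call synchronous: the first
    // caller proceeds with fPending set to TRUE while later callers wait.
    if (!InitOnceBeginInitialize(&initOnce, 0, &fPending, &contextPtr))
    {
        return NULL;
    }

    if (!fPending)
    {
        // Initialization has already completed; contextPtr holds the result.
        return (Logger*)contextPtr;
    }

    try
    {
        Logger* pLogger = new Logger();
        InitOnceComplete(&initOnce, 0, pLogger);
        return pLogger;
    }
    catch (...)
    {
        // Record the failure; we pass NULL since no object was created.
        InitOnceComplete(&initOnce, INIT_ONCE_INIT_FAILED, NULL);
        return NULL;
    }
}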

The second loose end is that one-time initialization doesn’t help with cleanup of the object that was created. There are a number of ways to deal with this and no one solution fits every situation. If the object is something that the operating system cleans up after the process terminates, consider leaking it. But beware that this can show up as an issue if you use tools that monitor your programs for resource leaks. In cases where the object needs to have its cleanup function called to ensure proper behavior, you’re left with little choice but to coordinate the call of the function after all threads that are using the object have terminated. However, you may be able to use a technique similar to the one used in the sample. The sample uses a static object that the initialized HANDLE is registered with so that it can be automatically destroyed at the end of the program execution after all the threads have terminated.
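
The registration technique might look something like this sketch; the class and member names are ours, and the sample’s actual implementation may differ:

#include <windows.h>

// A static instance of this class is destroyed at program exit, after
// the worker threads using the handle have terminated, so the handle
// is closed exactly once.
class HandleRegistrar
{
public:
    HandleRegistrar() : m_hHandle(NULL) {}

    ~HandleRegistrar()
    {
        if (m_hHandle != NULL)
            CloseHandle(m_hHandle);
    }

    void Register(HANDLE hHandle) { m_hHandle = hHandle; }

private:
    HANDLE m_hHandle;
};

static HandleRegistrar g_handleRegistrar;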

Conclusion

In this article, we’ve shown you that there are a lot of important enhancements to the thread synchronization primitives in Windows Vista for native C/C++ developers. These enhancements allow developers to more easily address thread synchronization problems. However, this article has only scratched the surface, and we encourage you to investigate these and other changes further. For starters, check out the thread pool API enhancements; we don’t have space to discuss them here, but they are a valuable addition worth careful consideration.

We’d like to thank Neill Clift and Arun Kishan for answering our questions and providing insightful feedback. In addition, we’d also like to thank Jerry Albro for reviewing the content and providing us with valuable feedback.

Robert Saccone is an Architect in the Forefront Server Security group. His areas of interest are large-scale software design, distributed systems, and operating systems implementations.

Alexander Taskov is a Senior Software Development Engineer on the Forefront Server Security team at Microsoft. He is responsible for the design and development of security features on the Forefront Server Security product.