February 2009

Volume 24 Number 02

Windows With C++ - Visual C++ 2010 And The Parallel Patterns Library

By Kenny Kerr | February 2009

This column is based on a prerelease version of Visual Studio 2010. All information is subject to change.

Contents

Language Enhancements
Parallel Algorithms
Tasks and Task Groups

Visual C++ is getting a major upgrade in the 2010 release of Visual Studio. Many of the new language and library features are designed purely to make it easier and more natural to express your desires in code. But as has always been the case with C++, the combination of these features is what makes C++ such a powerful and expressive language.

So this month I am going to introduce some of the additions to the C++ language that Visual C++ has added as part of the forthcoming C++0x standard. I'll then look at the Parallel Patterns Library (PPL) that Microsoft has developed over and above the C++0x standard to introduce parallelism to your applications in a manner that naturally complements the Standard C++ Library.

Language Enhancements

In the May 2008 article "C++ Plus: Beef Up Windows Apps with the Visual C++ 2008 Feature Pack," I introduced the additions to the Standard C++ Library from Technical Report 1 (TR1), which originally shipped with the Visual C++ 2008 Feature Pack and are now included with Visual Studio 2008 SP1. In that article, I demonstrated the support for function objects through the function template class and the bind template function. Being able to treat functions polymorphically solved many of the awkward problems C++ developers often ran into when writing or using generic algorithms.

As a recap, here's an example of how to initialize a function object with the standard plus functor:

function<int (int, int)> f = plus<int>();

int result = f(4, 5);

With the help of the bind function, you can transform a function that provides the necessary functionality but doesn't quite have the right signature.

In the following example, I'm using the bind function with placeholders to initialize the function object with a member function:

    struct Adder
    {
        int Add(int x, int y, void* /*reserved*/)
        {
            return x + y;
        }
    };

    Adder adder;

    function<int (int, int)> f = bind(&Adder::Add, &adder, _1, _2, static_cast<void*>(0));

    int result = f(4, 5);

There are two problems that arise with the use of these library additions that cannot be easily overcome without language enhancements. For starters, it's often inefficient to explicitly define a function object, as it adds certain overhead that the compiler could otherwise have avoided. It can also be quite redundant and tedious to re-declare the function prototype when the compiler clearly knows the signature that best matches the initialization expression.

This is where the new auto keyword helps you. It can be used in lieu of explicitly defining the type of a variable, which is useful in template metaprogramming where the specific types are either hard to define or complicated to express. Here's what it looks like:

auto f = plus<int>();

The definitions of the functions themselves can also benefit from some improvement. Often you can reuse helpful algorithms, such as those in the Standard C++ Library. More often than not, however, you need to write some domain-specific function for a very specific purpose that is not generally reusable.

But because the function must be defined elsewhere, you must consider both logical and physical design. Wouldn't it be nice if the definition could be placed in the very location where it is needed, simplifying the understanding of the code by improving the locality of the logic and improving the encapsulation of the overall design? The addition of lambda expressions allows exactly that:

auto f = [](int x, int y) { return x + y; };

Lambda expressions define unnamed function objects, sometimes called closures. The [] is the hint that tells the compiler that a lambda expression is beginning. This is known as the lambda introducer, and it is followed by a parameter list. The parameter list can in turn be followed by an explicit return type, although the return type is often omitted when the compiler can deduce it unambiguously, as is the case in the previous code snippet. In fact, the parameter list itself can be omitted if the function accepts no parameters. The lambda expression concludes with any number of C++ statements within braces.
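To make those two variations concrete, here is a minimal sketch (the lambdas themselves are purely illustrative) showing an explicit return type and a lambda that omits the parameter list:

// explicit return type declared after the parameter list
auto divide = [](double x, double y) -> double { return x / y; };

// no parameters, so the parameter list is omitted entirely
auto five = [] { return 5; };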

Lambda expressions can also capture variables for use within the function from the scope in which the lambda expression is defined. Here's an example that uses a lambda expression to calculate the sum of the values in a container:

int sum = 0;

for_each(values.begin(), values.end(), [&sum](int value)
{
    sum += value;
});

Granted, this could have been performed more succinctly with the accumulate function, but that would be missing the point. This example demonstrates how the sum variable is captured by reference and used within the function.
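For comparison, the accumulate version is a one-liner (a sketch assuming values is a standard container of int and that the <numeric> header is included):

int sum = accumulate(values.begin(), values.end(), 0);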

Parallel Algorithms

The PPL introduces a set of task-oriented parallelism constructs as well as a number of parallel algorithms similar to those available in OpenMP today. The PPL algorithms, however, are written using C++ templates rather than pragma directives and, as a result, are far more expressive and flexible. The PPL also differs fundamentally from OpenMP in that it promotes a set of primitives and algorithms that are more composable and reusable as a set of patterns, whereas OpenMP is inherently more declarative, is explicit in matters such as scheduling, and ultimately is not part of C++ proper. The PPL is built on top of the Concurrency Runtime, allowing for greater potential interoperability with other libraries based on the same runtime. Let's look at the PPL algorithms and then see how you can use the underlying functionality directly for task-oriented parallelism.

In the October 2008 installment of this column ("Exploring High-Performance Algorithms"), I demonstrated the benefit of efficient algorithms and the effects of locality and cache-conscious designs on performance. I showed how an inefficient, single-threaded algorithm for converting a large image to grayscale took 46 seconds, while an efficient implementation, still using only a single thread, took just 2 seconds. With a little sprinkling of OpenMP I was able to parallelize the algorithm over the Y axis and reduce the time even further. Figure 1 shows the code for the OpenMP algorithm.

Figure 1 Grayscale Algorithm Using OpenMP

    struct Pixel
    {
        BYTE Blue;
        BYTE Green;
        BYTE Red;
        BYTE Alpha;
    };

    void MakeGrayscale(Pixel& pixel)
    {
        const BYTE scale = static_cast<BYTE>(0.30 * pixel.Red +
                                             0.59 * pixel.Green +
                                             0.11 * pixel.Blue);

        pixel.Red = scale;
        pixel.Green = scale;
        pixel.Blue = scale;
    }

    void MakeGrayscale(BYTE* bitmap,
                       const int width,
                       const int height,
                       const int stride)
    {
        #pragma omp parallel for
        for (int y = 0; y < height; ++y)
        {
            for (int x = 0; x < width; ++x)
            {
                const int offset = x * sizeof(Pixel) + y * stride;

                Pixel& pixel = *reinterpret_cast<Pixel*>(bitmap + offset);

                MakeGrayscale(pixel);
            }
        }
    }

The PPL includes a parallel_for algorithm that can very naturally, with a little help from lambda expressions, replace the use of OpenMP in the MakeGrayscale function from Figure 1:

parallel_for(0, height, [&] (int y)
{
    for (int x = 0; x < width; ++x)
    {
        // omitted for brevity
    }
});

As you can see, the for loop as well as the OpenMP pragma have been replaced with the parallel_for function. The function's first two parameters define the range of iteration, just like for the previous for loop. Unlike OpenMP, which places heavy restrictions on the for directive, parallel_for is a template function so you can, for example, iterate over unsigned types or even complex iterators from the standard containers. The last parameter is a function object, which I define as a lambda expression.

You'll notice that the lambda introducer includes only an ampersand without explicitly declaring any variables to capture. This tells the compiler to capture all possible variables by reference. Since the statements within the lambda expression use a number of variables, I used this as a shorthand. Be careful, however, because the compiler may not be able to optimize away every captured variable that goes unused, which can hurt runtime performance. I could have explicitly captured just the variables I needed with the following capture list:

[&bitmap, width, stride]

Just as the parallel_for function is a parallel alternative to the for loop, providing parallel iteration over a range of indices, the PPL also provides the parallel_for_each template function as a parallel alternative to the standard for_each function. It provides parallel iteration over a range of elements defined by a pair of iterators, such as those provided by the standard containers. Although it made more sense for the previous example to use the parallel_for function with explicit indices, it is often more natural to use iterators to define a range of elements. Given an array of numbers, you could double their values using the parallel_for function as follows:

array<int, 5> values = { 1, 2, 3, 4, 5 };

parallel_for(0U, values.size(), [&values] (size_t i)
{
    values[i] *= 2;
});

But this approach is overly verbose, requires the array itself to be captured by the lambda expression, and, depending on the type of container, may be inefficient. The parallel_for_each function solves these problems nicely:

parallel_for_each(values.begin(), values.end(), [] (int& value)
{
    value *= 2;
});

If you just want to run a number of functions in parallel, or as parallel as is possible based on the number of hardware threads available, you can use the parallel_invoke template function. There are overloads available to accept anywhere from 2 to 10 function objects. Here's an example of running 3 lambda functions in parallel:

    combinable<int> sum;

    parallel_invoke([&] { sum.local() += 1; },
                    [&] { sum.local() += 2; },
                    [&] { sum.local() += 3; });

    int result = sum.combine([] (int left, int right)
    {
        return left + right;
    });

    ASSERT(6 == result);

This example also illustrates another helper class provided by the PPL. The combinable class makes it incredibly easy to combine the results of a number of parallel tasks with minimal locking. By providing a local copy of the value for each thread and then only combining the results of each thread after the parallel work has concluded, the combinable class avoids much of the locking that normally occurs in such cases.
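As a small sketch of how combinable composes with the algorithms shown earlier (reusing the values container from the parallel_for_each example purely for illustration), you could sum the elements without any explicit locking:

combinable<int> sum;

parallel_for_each(values.begin(), values.end(), [&sum] (int value)
{
    sum.local() += value; // each thread accumulates into its own copy
});

int total = sum.combine([] (int left, int right)
{
    return left + right; // merge the per-thread copies after the parallel work completes
});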

Tasks and Task Groups

The actual parallelism in the algorithms I've discussed is achieved by means of a simple task-oriented API that you are free to use directly. Tasks are defined with the task_handle class initialized with a function object. Tasks are grouped together with the task_group class, which runs the tasks and waits for them to complete. Of course, task_group provides helpful overloads so that in many cases you don't even have to define the task_handle objects yourself and can let the task_group object allocate and manage their lifetimes for you. Here's an example of how to use a task_group to replace the parallel_invoke function from the previous example:

task_group group;

group.run([&] { sum.local() += 1; });
group.run([&] { sum.local() += 2; });
group.run([&] { sum.local() += 3; });

group.wait();
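If you do want to manage the tasks yourself, the same example looks roughly like this with explicit task handles. This is a sketch assuming a make_task helper that wraps a function object in a task_handle; since these are prerelease bits, the details may change, and the handles must remain alive until wait returns:

combinable<int> sum;
task_group group;

// create the task_handle objects up front; they must outlive the call to wait
auto first = make_task([&] { sum.local() += 1; });
auto second = make_task([&] { sum.local() += 2; });
auto third = make_task([&] { sum.local() += 3; });

group.run(first);
group.run(second);
group.run(third);

group.wait();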

In addition to the algorithms and APIs I've discussed here, other parallel algorithms and helper classes may well be included when the Parallel Patterns Library is finally released with Visual C++ 2010. To keep up to date on the latest in concurrency, visit Parallel Programming in Native Code.

Send your questions and comments to mmwincpp@microsoft.com.

Kenny Kerr is a software craftsman specializing in software development for Windows. He has a passion for writing and teaching developers about programming and software design. Reach Kenny at weblogs.asp.net/kennykerr.