Parallel Algorithms
The Parallel Patterns Library (PPL) provides algorithms that concurrently perform work on collections of data. These algorithms resemble those provided by the Standard Template Library (STL).
The parallel algorithms are composed from existing functionality in the Concurrency Runtime. For example, the Concurrency::parallel_for algorithm uses a Concurrency::structured_task_group object to perform the parallel loop iterations. The parallel_for algorithm partitions work in an optimal way given the available number of computing resources.
Sections
This topic describes the following parallel algorithms in detail:
parallel_for Algorithm
parallel_for_each Algorithm
parallel_invoke Algorithm
parallel_for Algorithm
The Concurrency::parallel_for algorithm repeatedly performs the same task in parallel. Each of these tasks is parameterized by an iteration value. This algorithm is useful when you have a loop body that does not share resources among iterations of that loop.
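As a minimal sketch of a loop body that shares no resources among iterations, the following function lets each iteration write only to its own element of a result vector. The helper name squares is hypothetical, and the _MSC_VER guard with its sequential stand-in is an assumption added here so that the sketch also builds where ppl.h is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>  // PPL ships with Visual C++
#endif

// Hypothetical helper: returns the squares of 0 .. count-1.
// Each iteration touches only result[i], so iterations share no state.
std::vector<int> squares(int count)
{
    assert(count >= 0);
    std::vector<int> result(static_cast<std::size_t>(count));
#if defined(_MSC_VER)
    Concurrency::parallel_for(0, count, [&result](int i) {
        result[static_cast<std::size_t>(i)] = i * i;
    });
#else
    // Sequential stand-in (assumption) for toolchains without PPL.
    for (int i = 0; i < count; ++i) {
        result[static_cast<std::size_t>(i)] = i * i;
    }
#endif
    return result;
}
```

Calling squares(5) should yield the values 0, 1, 4, 9, and 16 regardless of the order in which the iterations run, because no iteration reads or writes another iteration's element.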
The parallel_for algorithm partitions tasks in an optimal way for parallel execution. It uses a work-stealing algorithm to rebalance these partitions when workloads are uneven. When one loop iteration blocks cooperatively, the runtime redistributes the range of iterations that is assigned to the current thread to other threads or processors. Similarly, when a thread completes a range of iterations, the runtime redistributes work from other threads to that thread. The parallel_for algorithm also supports nested parallelism. When one parallel loop contains another parallel loop, the runtime coordinates processing resources between the loop bodies in an efficient way for parallel execution.
The parallel_for algorithm has two overloaded versions. The first version takes a start value, an end value, and a work function (a lambda expression, function object, or function pointer). The second version takes a start value, an end value, a value by which to step, and a work function. The first version of this function uses 1 as the step value.
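The version that takes a step value can be sketched as follows. The function name sum_stepped is hypothetical, and the std::atomic accumulator is used only to make the visited values observable; it is not a recommended way to share state in a real loop body. The _MSC_VER guard and the sequential stand-in are assumptions added so that the sketch also builds without ppl.h.

```cpp
#include <atomic>
#include <cassert>
#if defined(_MSC_VER)
#include <ppl.h>  // PPL ships with Visual C++
#endif

// Hypothetical helper: sums first, first + step, first + 2*step, ...
// for every value that is less than last.
long long sum_stepped(int first, int last, int step)
{
    assert(step >= 1);  // the step must be at least 1
    std::atomic<long long> total(0);
#if defined(_MSC_VER)
    // Four-argument version: start value, end value, step, work function.
    Concurrency::parallel_for(first, last, step, [&total](int i) {
        total += i;
    });
#else
    // Sequential stand-in (assumption) for toolchains without PPL.
    for (int i = first; i < last; i += step) {
        total += i;
    }
#endif
    return total.load();
}
```

With these definitions, sum_stepped(0, 10, 3) visits 0, 3, 6, and 9, and sum_stepped(1, 6, 1) visits the same values as the three-argument version called with 1 and 6.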
You can convert many for loops to use parallel_for. However, the parallel_for algorithm differs from the for statement in the following ways:
The parallel_for algorithm does not execute the tasks in a pre-determined order.
The parallel_for algorithm does not support arbitrary termination conditions. The loop covers the half-open range [_First, _Last): the last iteration value that it processes is the largest one that is less than _Last (one less than _Last when the step value is 1).
The _Index_type type parameter must be an integral type. This integral type can be signed or unsigned.
The loop iteration must be forward. The parallel_for algorithm throws an exception of type std::invalid_argument if the _Step parameter is less than 1.
The exception-handling mechanism for the parallel_for algorithm differs from that of a for loop. If multiple exceptions occur simultaneously in a parallel loop body, the runtime propagates only one of the exceptions to the thread that called parallel_for. In addition, when one loop iteration throws an exception, the runtime does not immediately stop the overall loop. Instead, the loop is placed in the canceled state and the runtime discards any tasks that have not yet started. For more information about exception handling and parallel algorithms, see Exception Handling in the Concurrency Runtime.
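The exception behavior described in the list above can be sketched as follows: a try block around the call to parallel_for catches the exception that the runtime propagates to the calling thread. The function name first_failure and the message text are hypothetical, and the _MSC_VER guard with its sequential stand-in is an assumption added so that the sketch also builds without ppl.h.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#if defined(_MSC_VER)
#include <ppl.h>  // PPL ships with Visual C++
#endif

// Hypothetical helper: runs a loop whose body throws at bad_index and
// returns the message of the exception that reaches the caller,
// or an empty string if the loop completes without an exception.
std::string first_failure(int count, int bad_index)
{
    assert(count >= 0);
    try {
#if defined(_MSC_VER)
        Concurrency::parallel_for(0, count, [bad_index](int i) {
            if (i == bad_index) {
                throw std::runtime_error("bad index");
            }
        });
#else
        // Sequential stand-in (assumption) for toolchains without PPL.
        for (int i = 0; i < count; ++i) {
            if (i == bad_index) {
                throw std::runtime_error("bad index");
            }
        }
#endif
    } catch (const std::runtime_error& e) {
        // The exception surfaces on the thread that called the loop.
        return e.what();
    }
    return std::string();
}
```

Other iterations might already have run by the time the exception is caught, so a loop body should not assume that a failure in one iteration prevents side effects in the others.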
Although the parallel_for algorithm does not support arbitrary termination conditions, you can use cancellation to stop all tasks. For more information about cancellation, see Cancellation in the PPL.
Note
The benefits of executing the loop body in parallel might not outweigh the scheduling cost that results from load balancing and from support for features such as cancellation, especially when the loop body is relatively small.
Example
The following example shows the basic structure of the parallel_for algorithm. This example prints to the console each value in the range [1, 5] in parallel.
// parallel-for-structure.cpp
// compile with: /EHsc
#include <ppl.h>
#include <array>
#include <sstream>
#include <iostream>

using namespace Concurrency;
using namespace std;

int wmain()
{
   // Print each value from 1 to 5 in parallel.
   parallel_for(1, 6, [](int value) {
      wstringstream ss;
      ss << value << L' ';
      wcout << ss.str();
   });
}
This example produces the following sample output:
1 2 4 3 5
Because the parallel_for algorithm acts on each item in parallel, the order in which the values are printed to the console will vary.
For a complete example that uses the parallel_for algorithm, see How to: Write a parallel_for Loop.
parallel_for_each Algorithm
The Concurrency::parallel_for_each algorithm performs tasks in parallel on the elements of an iterable container, such as those provided by the STL. It uses the same partitioning logic that the parallel_for algorithm uses.
The parallel_for_each algorithm resembles the STL std::for_each algorithm, except that the parallel_for_each algorithm executes the tasks concurrently. Like other parallel algorithms, parallel_for_each does not execute the tasks in a specific order.
Although the parallel_for_each algorithm works on both forward iterators and random access iterators, it performs better with random access iterators.
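As a minimal sketch of the resemblance to std::for_each, the following function counts the even values in a std::vector, which provides random access iterators. The function name count_even is hypothetical, and the std::atomic counter is needed only because every task updates the same variable. The _MSC_VER guard is an assumption added so that the sketch also builds without ppl.h; in that case it falls back to std::for_each with the same lambda.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>  // PPL ships with Visual C++
#endif

// Hypothetical helper: counts the even values in a vector.
int count_even(const std::vector<int>& values)
{
    std::atomic<int> evens(0);
    auto body = [&evens](int value) {
        if (value % 2 == 0) {
            ++evens;  // atomic increment; tasks may run concurrently
        }
    };
#if defined(_MSC_VER)
    Concurrency::parallel_for_each(values.begin(), values.end(), body);
#else
    // Sequential stand-in (assumption) for toolchains without PPL.
    std::for_each(values.begin(), values.end(), body);
#endif
    return evens.load();
}
```

The call site has the same shape in both branches: a begin iterator, an end iterator, and a work function that takes one element.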
Example
The following example shows the basic structure of the parallel_for_each algorithm. This example prints to the console each value in a std::array object in parallel.
// parallel-for-each-structure.cpp
// compile with: /EHsc
#include <ppl.h>
#include <array>
#include <sstream>
#include <iostream>

using namespace Concurrency;
using namespace std;

int wmain()
{
   // Create an array of integer values.
   array<int, 5> values = { 1, 2, 3, 4, 5 };

   // Print each value in the array in parallel.
   parallel_for_each(values.begin(), values.end(), [](int value) {
      wstringstream ss;
      ss << value << L' ';
      wcout << ss.str();
   });
}
This example produces the following sample output:
4 5 1 2 3
Because the parallel_for_each algorithm acts on each item in parallel, the order in which the values are printed to the console will vary.
For a complete example that uses the parallel_for_each algorithm, see How to: Write a parallel_for_each Loop.
parallel_invoke Algorithm
The Concurrency::parallel_invoke algorithm executes a set of tasks in parallel. It does not return until each task finishes. This algorithm is useful when you have several independent tasks that you want to execute at the same time.
The parallel_invoke algorithm takes as its parameters a series of work functions (lambda functions, function objects, or function pointers). The parallel_invoke algorithm is overloaded to take between two and ten parameters. Every function that you pass to parallel_invoke must take zero parameters.
Like other parallel algorithms, parallel_invoke does not execute the tasks in a specific order. The topic Task Parallelism (Concurrency Runtime) explains how the parallel_invoke algorithm relates to tasks and task groups.
Example
The following example shows the basic structure of the parallel_invoke algorithm. This example concurrently calls the twice function on three local variables and prints the result to the console.
// parallel-invoke-structure.cpp
// compile with: /EHsc
#include <ppl.h>
#include <string>
#include <iostream>

using namespace Concurrency;
using namespace std;

// Returns the result of adding a value to itself.
template <typename T>
T twice(const T& t) {
   return t + t;
}

int wmain()
{
   // Define several values.
   int n = 54;
   double d = 5.6;
   wstring s = L"Hello";

   // Call the twice function on each value concurrently.
   parallel_invoke(
      [&n] { n = twice(n); },
      [&d] { d = twice(d); },
      [&s] { s = twice(s); }
   );

   // Print the values to the console.
   wcout << n << L' ' << d << L' ' << s << endl;
}
This example produces the following output:
108 11.2 HelloHello
For complete examples that use the parallel_invoke algorithm, see How to: Use parallel_invoke to Write a Parallel Sort Routine and How to: Use parallel_invoke to Execute Parallel Operations.
Related Topics
How to: Write a parallel_for Loop
Shows how to use the parallel_for algorithm to perform matrix multiplication.
How to: Write a parallel_for_each Loop
Shows how to use the parallel_for_each algorithm to compute the count of prime numbers in a std::array object in parallel.
How to: Use parallel_invoke to Write a Parallel Sort Routine
Shows how to use the parallel_invoke algorithm to improve the performance of the bitonic sort algorithm.
How to: Use parallel_invoke to Execute Parallel Operations
Shows how to use the parallel_invoke algorithm to improve the performance of a program that performs multiple operations on a shared data source.
Parallel Patterns Library (PPL)
Describes the PPL, which provides an imperative programming model that promotes scalability and ease-of-use for developing concurrent applications.
Cancellation in the PPL
Explains the role of cancellation in the PPL, how to cancel parallel work, and how to determine when a task group is canceled.
Exception Handling in the Concurrency Runtime
Explains the role of exception handling in the Concurrency Runtime.