How to: Use parallel_invoke to Write a Parallel Sort Routine
This document describes how to use the parallel_invoke algorithm to improve the performance of the bitonic sort algorithm. The bitonic sort algorithm recursively divides the input sequence into smaller sorted partitions. It can run in parallel because each partition operation is independent of all other operations.
Although the bitonic sort is an example of a sorting network that sorts all combinations of input sequences, this example sorts sequences whose lengths are a power of two.
The following example shows the serial version of the bitonic sort algorithm. The bitonic_sort function divides the sequence into two partitions, sorts those partitions in opposite directions, and then merges the results. This function calls itself two times recursively to sort each partition.
C++
const bool INCREASING = true;
const bool DECREASING = false;

// Comparator function for the bitonic sort algorithm.
template <class T>
void compare(T* items, int i, int j, bool dir)
{
   if (dir == (items[i] > items[j]))
   {
      swap(items[i], items[j]);
   }
}

// Sorts a bitonic sequence in the specified order.
template <class T>
void bitonic_merge(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }
      bitonic_merge(items, lo, m, dir);
      bitonic_merge(items, lo + m, m, dir);
   }
}

// Sorts the given sequence in the specified order.
template <class T>
void bitonic_sort(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      // Divide the array into two partitions and then sort
      // the partitions in different directions.
      int m = n / 2;
      bitonic_sort(items, lo, m, INCREASING);
      bitonic_sort(items, lo + m, m, DECREASING);

      // Merge the results.
      bitonic_merge(items, lo, n, dir);
   }
}

// Sorts the given sequence in increasing order.
template <class T>
void bitonic_sort(T* items, int size)
{
   bitonic_sort(items, 0, size, INCREASING);
}
Using parallel_invoke to Perform Bitonic Sort in Parallel
This section describes how to use the parallel_invoke algorithm to perform the bitonic sort algorithm in parallel.
To perform the bitonic sort algorithm in parallel
Add a #include directive for the header file ppl.h.
C++
#include <ppl.h>
Add a using directive for the concurrency namespace.
C++
using namespace concurrency;
Create a new function, called parallel_bitonic_merge, which uses the parallel_invoke algorithm to merge the sequences in parallel if there is a sufficient amount of work to do. Otherwise, call bitonic_merge to merge the sequences serially.
C++
// Sorts a bitonic sequence in the specified order.
template <class T>
void parallel_bitonic_merge(T* items, int lo, int n, bool dir)
{
   // Merge the sequences concurrently if there is sufficient work to do.
   if (n > 500)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }

      // Use the parallel_invoke algorithm to merge the sequences in parallel.
      parallel_invoke(
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo, m, dir); },
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo + m, m, dir); }
      );
   }
   // Otherwise, perform the work serially.
   else if (n > 1)
   {
      bitonic_merge(items, lo, n, dir);
   }
}
Perform a process that resembles the one in the previous step, but for the bitonic_sort function.
C++
// Sorts the given sequence in the specified order.
template <class T>
void parallel_bitonic_sort(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      // Divide the array into two partitions and then sort
      // the partitions in different directions.
      int m = n / 2;

      // Sort the partitions in parallel.
      parallel_invoke(
         [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
         [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
      );

      // Merge the results.
      parallel_bitonic_merge(items, lo, n, dir);
   }
}
Create an overloaded version of the parallel_bitonic_sort function that sorts the array in increasing order.
C++
// Sorts the given sequence in increasing order.
template <class T>
void parallel_bitonic_sort(T* items, int size)
{
   parallel_bitonic_sort(items, 0, size, INCREASING);
}
The parallel_invoke algorithm reduces overhead by performing the last of the series of tasks on the calling context. For example, in the parallel_bitonic_sort function, the first task runs on a separate context, and the second task runs on the calling context.
C++
// Sort the partitions in parallel.
parallel_invoke(
   [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
   [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
);
The following complete example performs both the serial and the parallel versions of the bitonic sort algorithm. The example also prints to the console the time that is required to perform each computation.
C++
// parallel-bitonic-sort.cpp
// compile with: /EHsc
#include <windows.h>
#include <algorithm>
#include <iostream>
#include <random>
#include <ppl.h>

using namespace concurrency;
using namespace std;

// Calls the provided work function and returns the number of milliseconds
// that it takes to call that function.
template <class Function>
__int64 time_call(Function&& f)
{
   __int64 begin = GetTickCount();
   f();
   return GetTickCount() - begin;
}

const bool INCREASING = true;
const bool DECREASING = false;

// Comparator function for the bitonic sort algorithm.
template <class T>
void compare(T* items, int i, int j, bool dir)
{
   if (dir == (items[i] > items[j]))
   {
      swap(items[i], items[j]);
   }
}

// Sorts a bitonic sequence in the specified order.
template <class T>
void bitonic_merge(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }
      bitonic_merge(items, lo, m, dir);
      bitonic_merge(items, lo + m, m, dir);
   }
}

// Sorts the given sequence in the specified order.
template <class T>
void bitonic_sort(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      // Divide the array into two partitions and then sort
      // the partitions in different directions.
      int m = n / 2;
      bitonic_sort(items, lo, m, INCREASING);
      bitonic_sort(items, lo + m, m, DECREASING);

      // Merge the results.
      bitonic_merge(items, lo, n, dir);
   }
}

// Sorts the given sequence in increasing order.
template <class T>
void bitonic_sort(T* items, int size)
{
   bitonic_sort(items, 0, size, INCREASING);
}

// Sorts a bitonic sequence in the specified order.
template <class T>
void parallel_bitonic_merge(T* items, int lo, int n, bool dir)
{
   // Merge the sequences concurrently if there is sufficient work to do.
   if (n > 500)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }

      // Use the parallel_invoke algorithm to merge the sequences in parallel.
      parallel_invoke(
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo, m, dir); },
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo + m, m, dir); }
      );
   }
   // Otherwise, perform the work serially.
   else if (n > 1)
   {
      bitonic_merge(items, lo, n, dir);
   }
}

// Sorts the given sequence in the specified order.
template <class T>
void parallel_bitonic_sort(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      // Divide the array into two partitions and then sort
      // the partitions in different directions.
      int m = n / 2;

      // Sort the partitions in parallel.
      parallel_invoke(
         [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
         [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
      );

      // Merge the results.
      parallel_bitonic_merge(items, lo, n, dir);
   }
}

// Sorts the given sequence in increasing order.
template <class T>
void parallel_bitonic_sort(T* items, int size)
{
   parallel_bitonic_sort(items, 0, size, INCREASING);
}

int wmain()
{
   // For this example, the size must be a power of two.
   const int size = 0x200000;

   // Create two large arrays and fill them with random values.
   int* a1 = new int[size];
   int* a2 = new int[size];

   mt19937 gen(42);
   for (int i = 0; i < size; ++i)
   {
      a1[i] = a2[i] = gen();
   }

   __int64 elapsed;

   // Perform the serial version of the sort.
   elapsed = time_call([&] { bitonic_sort(a1, size); });
   wcout << L"serial time: " << elapsed << endl;

   // Now perform the parallel version of the sort.
   elapsed = time_call([&] { parallel_bitonic_sort(a2, size); });
   wcout << L"parallel time: " << elapsed << endl;

   delete[] a1;
   delete[] a2;
}
When run on a computer that has four processors, the parallel version of the sort typically completes in considerably less time than the serial version.
To compile the code, copy it and then paste it in a Visual Studio project, or paste it in a file that is named parallel-bitonic-sort.cpp and then run the following command in a Visual Studio Command Prompt window.
cl.exe /EHsc parallel-bitonic-sort.cpp
Robust Programming
This example uses the parallel_invoke algorithm instead of the concurrency::task_group class because the lifetime of each task group does not extend beyond a function. We recommend that you use parallel_invoke when you can because it has less execution overhead than task group objects, and therefore lets you write better-performing code.
The parallel versions of some algorithms perform better only when there is sufficient work to do. For example, the parallel_bitonic_merge function calls the serial version, bitonic_merge, if there are 500 or fewer elements in the sequence. You can also plan your overall sorting strategy based on the amount of work. For example, it might be more efficient to use the serial version of the quick sort algorithm if the array contains fewer than 500 items, as shown in the following example:
C++
template <class T>
void quick_sort(T* items, int lo, int n)
{
   // TODO: The function body is omitted for brevity.
}

template <class T>
void parallel_bitonic_sort(T* items, int lo, int n, bool dir)
{
   // Use the serial quick sort algorithm if there are relatively few
   // items to sort. The associated overhead for running few tasks in
   // parallel may not overcome the benefits of parallel processing.
   if (n - lo + 1 <= 500)
   {
      quick_sort(items, lo, n);
   }
   else if (n > 1)
   {
      // Divide the array into two partitions and then sort
      // the partitions in different directions.
      int m = n / 2;

      // Sort the partitions in parallel.
      parallel_invoke(
         [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
         [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
      );

      // Merge the results.
      parallel_bitonic_merge(items, lo, n, dir);
   }
}
As with any parallel algorithm, we recommend that you profile and tune your code as appropriate.