Aliased Invocation of parallel_for_each in C++ AMP
We have already talked about how array, array_view, texture, and writeonly_texture_view are captured and passed into a parallel_for_each for use on the accelerator_view. In this post, I’m going to draw your attention to a particular pattern of data capture and explain the caveats associated with it.
Aliased Invocation of parallel_for_each
Let’s first take a look at a typical parallel_for_each invocation:
array<int, 1> arr1(size); // code to initialize “arr1” is elided
array<int, 1> arr2(size);
parallel_for_each(arr2.extent, [&arr1, &arr2] (index<1> idx) restrict(amp) {
arr2[idx] = arr1[idx];
});
The C++ AMP compiler analyzes the parallel_for_each invocation and generates both host-side code, for data marshaling and for launching the computation on the accelerator, and device-side code, for performing the computation on the accelerator. Note that the compiler only inspects the parallel_for_each call and does not try to inspect the rest of the code, so its compile-time knowledge is limited to what the parallel_for_each invocation itself can tell. In the above example, the compiler concludes that two “arrays” are supplied to the parallel_for_each and generates code assuming that two different buffers underlie them. This assumption is correct for the example above, but it is not always true. For example, I can refactor the parallel_for_each in the above example into a function:
void assign(const array<int, 1> & src, array<int, 1> & dst)
{
parallel_for_each(dst.extent, [&src, &dst] (index<1> idx) restrict(amp) {
dst[idx] = src[idx];
});
}
Then I can call this function as,
array<int, 1> arr1(size); // code to initialize “arr1” is elided
array<int, 1> arr2(size);
assign(arr1, arr2); // assign the content of arr1 to arr2
assign(arr1, arr1); // assign the content of arr1 to arr1
The first invocation is the same as the original example: the two arrays captured by the parallel_for_each inside “assign” are indeed different. In the second invocation, however, the two arrays are the same, which breaks the assumption that there are two different buffers. You can also see that whether two arrays captured by reference are the same or different cannot be known at compile time.
We refer to arrays and textures that are captured by reference, and array_views and writeonly_texture_views that are captured by value, as captured containers.
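To make that concrete, here is a small sketch (the names are made up for illustration) that captures all four kinds of captured containers in a single parallel_for_each; none of them alias each other:

std::vector<int> vec(size);                     // code to initialize “vec” is elided
array<int, 1> arr(size);
texture<int, 1> tex_src(size);                  // code to initialize “tex_src” is elided
texture<int, 1> tex_dst(size);
array_view<int, 1> av(size, vec);               // top-level array_view, captured by value
writeonly_texture_view<int, 1> wo_tv(tex_dst);  // captured by value
parallel_for_each(arr.extent, [=, &arr, &tex_src] (index<1> idx) restrict(amp) {
    arr[idx] = av[idx];            // “arr” is captured by reference, “av” by value
    wo_tv.set(idx, tex_src[idx]);  // “tex_src” is captured by reference, “wo_tv” by value
});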
We say two captured containers are aliased if a parallel_for_each invocation satisfies both of the following conditions:
- both refer to the same container on the accelerator_view where the parallel_for_each is launched, and
- at least one of them is writable inside the parallel_for_each.
Note that if both are only readable, we don’t consider them aliased (strictly speaking, that is also a type of aliasing, namely read-read aliasing). A parallel_for_each invocation that uses aliased containers is called an aliased invocation.
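For example, in the following sketch the same array is captured twice through two const references, but neither is writable inside the kernel, so the invocation is not an aliased invocation:

array<int, 1> arr1(size);   // code to initialize “arr1” is elided
array<int, 1> arr2(size);
// “src1” and “src2” refer to the same container, but both are only read inside
// the parallel_for_each, so this is merely read-read aliasing.
const array<int, 1> & src1 = arr1;
const array<int, 1> & src2 = arr1;
parallel_for_each(arr2.extent, [&src1, &src2, &arr2] (index<1> idx) restrict(amp) {
    arr2[idx] = src1[idx] + src2[idx];
});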
Here is another example of aliased invocation triggered by using array_views:
array<int, 1> arr3(size * 2); // code to initialize “arr3” is elided
array_view<int, 1> first_half = arr3.section(0, size);
array_view<int, 1> second_half = arr3.section(size, size);
parallel_for_each(second_half.extent, [=] (index<1> idx) restrict(amp) {
second_half[idx] = first_half[idx];
});
Both “first_half” and “second_half” are created using the section method of the array class. The two array_views represent two sections of the same underlying container (i.e. “arr3”). In this case, the invocation of parallel_for_each is also an aliased invocation.
We call array_views created directly on top of host memory or host containers top-level array_views. All top-level array_views are considered distinct when captured by a parallel_for_each. Even though they might overlap the same host memory region, the C++ AMP runtime creates and manages a distinct copy for each of them on the accelerator_view where the parallel_for_each is launched. As a result, top-level array_views do not cause aliased invocations. For example,
std::vector<int> vec(size * 2); // code to initialize “vec” is elided
array_view<int, 1> av_all(size * 2, vec);
array_view<int, 1> av_first_half(size, vec);
parallel_for_each(av_first_half.extent, [=] (index<1> idx) restrict(amp) {
av_all[idx + size] = av_first_half[idx];
});
Even though “av_all” and “av_first_half” overlap on the host container “vec”, the invocation is not an aliased invocation since both are top-level array_views. However, if we change
array_view<int, 1> av_first_half(size, vec);
to
array_view<int, 1> av_first_half = av_all.section(0, size);
Then the invocation becomes an aliased one, because “av_first_half” is no longer a top-level array_view; it shares the same copy as “av_all” on the accelerator_view.
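Putting the change together, the full modified example (the same code as above with only the declaration of “av_first_half” swapped) reads:

std::vector<int> vec(size * 2);   // code to initialize “vec” is elided
array_view<int, 1> av_all(size * 2, vec);
array_view<int, 1> av_first_half = av_all.section(0, size);   // no longer a top-level array_view
parallel_for_each(av_first_half.extent, [=] (index<1> idx) restrict(amp) {
    av_all[idx + size] = av_first_half[idx];   // aliased invocation
});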
So now you know what an aliased invocation of a parallel_for_each is, and you may appreciate the good news that C++ AMP supports aliased invocations when capturing array references and array_views. However, there are three caveats, detailed in the following three sections.
Caveat 1: Performance of aliased kernel
The C++ AMP compiler generates two DirectX compute shaders for a parallel_for_each: one for the non-aliased invocation and the other for the aliased invocation. The C++ AMP runtime examines the input, decides whether there is aliasing, and then selects the corresponding shader to execute.
The non-aliased shader assumes there is no aliasing, so efficient code is generated. For the aliased shader, on the other hand, our current implementation on top of DirectX can be a lot more conservative in code generation, for reasons that are beyond the scope of this post. Depending on the characteristics and complexity of the code, the driver, and the accelerator, the aliased shader could be much slower than the non-aliased shader. So it’s important to keep this in mind and try to avoid aliased invocations if you can.
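One simple way to avoid an accidental aliased invocation when a kernel is wrapped in a function, as in the “assign” example above, is to check whether the two captured containers are actually the same before launching; the check below is just a sketch, not something the runtime does for you:

void assign_checked(const array<int, 1> & src, array<int, 1> & dst)
{
    // If the caller passed the same array for both parameters, there is nothing
    // to do; returning early avoids launching the slower aliased shader.
    if (&src == &dst)
        return;
    parallel_for_each(dst.extent, [&src, &dst] (index<1> idx) restrict(amp) {
        dst[idx] = src[idx];
    });
}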
When we talked about how to measure the performance of a parallel_for_each invocation, we mentioned that there is a just-in-time (JIT) compilation overhead for the first invocation of a parallel_for_each. It’s worth noting that this JIT cost is paid for the non-aliased shader at the first non-aliased invocation, and similarly for the aliased shader at the first aliased invocation.
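In practice, that means a benchmark needs a separate warm-up run for each shader it will exercise; here is a sketch using the “assign” function from earlier:

array<int, 1> arr1(size);   // code to initialize “arr1” is elided
array<int, 1> arr2(size);
assign(arr1, arr2);             // first non-aliased invocation: JITs the non-aliased shader
assign(arr1, arr1);             // first aliased invocation: JITs the aliased shader
arr1.accelerator_view.wait();   // drain outstanding work before starting any timers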
Caveat 2: Doesn’t work with interop
For arrays created using DirectX interop, the C++ AMP runtime does not detect whether there is aliasing involving such arrays. If there is no other aliasing, the runtime simply selects the non-aliased shader, which can lead to either a runtime exception or undefined results.
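As a sketch, suppose “d3d_device” and “d3d_buffer” stand in for an ID3D11Device* and an ID3D11Buffer* of “size” ints that you have created through the Direct3D 11 API (creation code elided); wrapping the same buffer twice then looks like this:

accelerator_view dx_av = concurrency::direct3d::create_accelerator_view(d3d_device);
array<int, 1> a = concurrency::direct3d::make_array<int, 1>(extent<1>(size), dx_av, d3d_buffer);
array<int, 1> b = concurrency::direct3d::make_array<int, 1>(extent<1>(size), dx_av, d3d_buffer);
// “a” and “b” wrap the same DirectX buffer, but the runtime cannot detect the aliasing,
// so the non-aliased shader is selected and the result is undefined.
parallel_for_each(b.extent, [&a, &b] (index<1> idx) restrict(amp) {
    b[idx] = a[idx];
});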
Caveat 3: Texture aliasing scenario not supported
In this release, C++ AMP does not support aliased invocation when the source of aliasing is the use of texture and/or writeonly_texture_view. For example, assume I have a function like:
void assign(const texture<int, 1> & a, texture<int, 1> & b)
{
parallel_for_each(b.extent, [&a, &b] (index<1> idx) restrict(amp)
{
b.set(idx, a[idx]);
});
}
…and I call the function with the following code,
texture<int, 1> tex1(16); // code to initialize “tex1” is elided
assign(tex1, tex1);
Inside “assign”, both “a” and “b” actually refer to the same “tex1”. In this case, you will get a runtime_exception with the following message:
Read/Write aliasing is detected between two textures (texture_view's). RW-aliasing is not allowed for texture.
Similarly, the following code will also trigger the exception:
texture<int_2, 1> tex2(16); // code to initialize “tex2” is elided
writeonly_texture_view<int_2, 1> wo_tv2(tex2);
parallel_for_each(extent<1>(16), [=, &tex2] (index<1> idx) restrict(amp) {
wo_tv2.set(idx, tex2[idx]); // read from tex2, write via wo_tv2
});
This is because “wo_tv2” is a view over “tex2”, so both refer to the same underlying texture object. So this is also a case of aliasing.
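If you need this kind of texture-to-texture transformation, one way to sidestep the restriction (a sketch, assuming a separate output texture is acceptable for your scenario) is to write the results to a second texture rather than updating the source in place:

texture<int, 1> tex_in(16);    // code to initialize “tex_in” is elided
texture<int, 1> tex_out(16);
writeonly_texture_view<int, 1> wo_out(tex_out);
parallel_for_each(tex_in.extent, [=, &tex_in] (index<1> idx) restrict(amp) {
    wo_out.set(idx, tex_in[idx]);   // no aliasing: source and destination are different textures
});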
In Closing
In this post, I explained what an aliased invocation of parallel_for_each is, and showed that C++ AMP does support such invocations, with caveats. As usual, you are welcome to ask questions and provide feedback below or on our MSDN Forum.