tiled_extent::pad in C++ AMP

In my previous post about the tiled_extent divisibility requirement, I mentioned that you could satisfy the requirement by padding the tiled_extent and modifying your code to account for the extraneous threads. I'll now describe this technique in greater detail, in the context of the matrix transpose sample.

Using the tiled_extent::pad function

Padding a tiled_extent is easy. You simply call .pad() on a (potentially unevenly divisible) tiled_extent, and it returns a tiled_extent instance that is padded up to a tile-size multiple in each dimension. This is demonstrated by the following code and its output:

extent<2> e(999,666);
tiled_extent<16,16> tiled_e = e.tile<16,16>();
tiled_extent<16,16> padded_e = tiled_e.pad();

std::cout << "tiled_e extents are: (" << tiled_e[0] << "," << tiled_e[1] << ")" << std::endl;
std::cout << "padded_e extents are: (" << padded_e[0] << "," << padded_e[1] << ")" << std::endl;

Program output:

tiled_e extents are: (999,666)
padded_e extents are: (1008,672)
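
Under the hood, pad simply rounds each dimension up to the next multiple of the tile size. As a rough sketch of that arithmetic (a hypothetical helper for illustration, not the library's actual implementation):

// Hypothetical helper illustrating the rounding that pad() performs per dimension.
int round_up_to_tile(int dimension, int tile)
{
    return ((dimension + tile - 1) / tile) * tile;
}

// round_up_to_tile(999, 16) == 63 * 16 == 1008
// round_up_to_tile(666, 16) == 42 * 16 == 672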

Example – matrix transpose

As we have discussed before, in the matrix transpose sample we compute the transpose of an input matrix. In the sample, a matrix whose extent is evenly divisible by the tile size is transposed with this function:

template <typename value_type>
void transpose_tiled_even(
    const array_view<const value_type,2>& A,
    const array_view<value_type,2>& Atranspose)
{
    assert(A.extent == transpose(Atranspose.extent));
    assert(A.extent % tile_size == extent<2>(0,0));
    Atranspose.discard_data();
    parallel_for_each(
        A.extent.tile<tile_size,tile_size>(),
        [=] (tiled_index<tile_size,tile_size> tidx) restrict(amp) {
            tile_static value_type t1[tile_size][tile_size];
            t1[tidx.local[1]][tidx.local[0]] = A[tidx.global];
            tidx.barrier.wait();
            index<2> idxdst(transpose(tidx.tile_origin) + tidx.local);
            Atranspose[idxdst] = t1[tidx.local[0]][tidx.local[1]];
        }
    );
}

In order to handle matrix extents that are not evenly divisible by the tile size, we will do the following:

First, we will create helper functions (guarded_read and guarded_write) for reading and writing values from and into array views, but only if the specified index is within the bounds of the array view. If a read is requested which is out of bounds, a default value is returned. Similarly, a request to write a value to a location which is out of bounds is ignored.

template <typename value_type>
value_type guarded_read(const array_view<const value_type,2>& A, const index<2>& idx) restrict(cpu,amp)
{
    return A.extent.contains(idx) ? A[idx] : value_type();
}

template <typename value_type>
void guarded_write(const array_view<value_type,2>& A, const index<2>& idx, const value_type& val) restrict(cpu,amp)
{
    if (A.extent.contains(idx))
        A[idx] = val;
}
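
For illustration, here is a hypothetical CPU-side usage of these helpers (they are marked restrict(cpu,amp), so they can run on the host as well). The container and values below are made up for this example, and assume <amp.h> and <vector> are included and using namespace concurrency is in effect, as in the rest of the sample:

std::vector<float> data(4 * 4, 1.0f);
array_view<const float, 2> av(4, 4, data);

float inside  = guarded_read(av, index<2>(1, 1));   // in bounds: returns the stored 1.0f
float outside = guarded_read(av, index<2>(10, 10)); // out of bounds: returns float(), i.e. 0.0f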

Second, we will use tiled_extent::pad in order to round the number of threads that we ask to schedule up to the nearest multiple of the tile size. The resulting tiled_extent is passed to parallel_for_each:

parallel_for_each(
        A.extent.tile<tile_size,tile_size>().pad(), ...

Finally, we will modify the original lambda so that every global memory read is replaced with a guarded read and every global memory write is replaced with a guarded write. This gives us the following function:

template <typename value_type>
void transpose_tiled_pad(
    const array_view<const value_type,2>& A,
    const array_view<value_type,2>& Atranspose)
{
    assert(A.extent == transpose(Atranspose.extent));
    Atranspose.discard_data();
    parallel_for_each(
        A.extent.tile<tile_size,tile_size>().pad(),
        [=] (tiled_index<tile_size,tile_size> tidx) restrict(amp) {
            tile_static value_type t1[tile_size][tile_size];
            t1[tidx.local[1]][tidx.local[0]] = guarded_read(A, tidx.global);
            tidx.barrier.wait();
            index<2> idxdst(transpose(tidx.tile_origin) + tidx.local);
            guarded_write( Atranspose, idxdst, t1[tidx.local[0]][tidx.local[1]]);
        }
    );
}

Essentially, this ensures that the extra threads have zero side-effects. It’s as if they were never there! As you can see, this change is fairly mechanical and simple to apply.
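
To put it all together, here is a hypothetical sketch of calling transpose_tiled_pad on the unevenly divisible 999x666 extent from the earlier example (again assuming <amp.h> and <vector> are included and using namespace concurrency; the variable names are made up for illustration):

const int M = 999, N = 666;
std::vector<float> a_data(M * N, 1.0f), at_data(N * M);

array_view<const float, 2> A(M, N, a_data);
array_view<float, 2> Atranspose(N, M, at_data);

transpose_tiled_pad(A, Atranspose);
Atranspose.synchronize(); // copy the transposed results back into at_data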

In the next blog post in this series, we will discuss how the same problem can be solved by subtracting threads rather than adding them, using tiled_extent::truncate.