Using Tiles

02/21/2013

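You can use tiling to maximize the acceleration of your app. Tiling divides threads into equal rectangular subsets, or tiles. If you use an appropriate tile size and a tiled algorithm, you can get even more acceleration from your C++ AMP code. The basic components of tiling are: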

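tile_static variables. The primary benefit of tiling is the performance gain from tile_static access. Access to data in tile_static memory can be significantly faster than access to data in the global space (array or array_view objects). An instance of a tile_static variable is created for each tile, and all threads in the tile have access to the variable. In a typical tiled algorithm, data is copied into tile_static memory once from global memory and then accessed many times from tile_static memory.
tile_barrier::wait method. A call to tile_barrier::wait suspends execution of the current thread until all of the threads in the same tile reach the call to tile_barrier::wait. You cannot guarantee the order in which the threads run, only that no thread in the tile executes past the call to tile_barrier::wait until all of the threads have reached the call. This means that by using the tile_barrier::wait method, you can perform tasks on a tile-by-tile basis rather than a thread-by-thread basis. A typical tiled algorithm has code that initializes the tile_static memory for the whole tile, followed by a call to tile_barrier::wait. The code that follows tile_barrier::wait contains the computations that require access to all of the tile_static values.
Local and global indexing. You have access to the index of the thread relative to the entire array_view or array object and to the index relative to the tile. Using the local index can make your code easier to read and debug. Typically, you use local indexing to access tile_static variables, and global indexing to access array and array_view variables.
tiled_extent class and tiled_index class. You use a tiled_extent object instead of an extent object in the parallel_for_each call. You use a tiled_index object instead of an index object in the parallel_for_each call.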

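To take advantage of tiling, your algorithm must partition the compute domain into tiles and then copy the tile data into tile_static variables for faster access.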

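Example of Global, Tile, and Local Indices

The following diagram represents an 8 x 9 matrix of data that is arranged in 2 x 3 tiles.

[Figure: an 8 x 9 matrix divided into 2 x 3 tiles]

The following example displays the global, tile, and local indices of this tiled matrix. An array_view object is created with elements of type Description. Each Description holds the global, tile, and local indices of one element of the matrix. The code in the call to parallel_for_each sets the global, tile, and local indices of each element. The output displays the values in the Description structures.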

#include <iostream>
#include <iomanip>
#include <vector>
#include <Windows.h>
#include <amp.h>
using namespace concurrency;

const int ROWS = 8;
const int COLS = 9;

// tileRow and tileColumn specify the tile that each thread is in.
// globalRow and globalColumn specify the location of the thread in the array_view.
// localRow and localColumn specify the location of the thread relative to the tile.
struct Description {
    int value;
    int tileRow;
    int tileColumn;
    int globalRow;
    int globalColumn;
    int localRow;
    int localColumn;
};

// A helper function for formatting the output.
void SetConsoleColor(int color) {
    int colorValue = (color == 0) ? 4 : 2;
    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colorValue);
}

// A helper function for formatting the output.
void SetConsoleSize(int height, int width) {
    COORD coord;
    coord.X = static_cast<SHORT>(width);
    coord.Y = static_cast<SHORT>(height);
    SetConsoleScreenBufferSize(GetStdHandle(STD_OUTPUT_HANDLE), coord);
    // Use a stack variable so that nothing has to be freed.
    SMALL_RECT rect;
    rect.Left = 0;
    rect.Top = 0;
    rect.Right = static_cast<SHORT>(width);
    rect.Bottom = static_cast<SHORT>(height);
    SetConsoleWindowInfo(GetStdHandle(STD_OUTPUT_HANDLE), TRUE, &rect);
}

// This method creates an 8x9 matrix of Description structures. In the call to parallel_for_each, the structure is updated
// with tile, global, and local indices.
void TilingDescription() {
    // Create 72 (8x9) Description structures.
    std::vector<Description> descs;
    for (int i = 0; i < ROWS * COLS; i++) {
        Description d = {i, 0, 0, 0, 0, 0, 0};
        descs.push_back(d);
    }

    // Create an array_view from the Description structures.
    extent<2> matrix(ROWS, COLS);
    array_view<Description, 2> descriptions(matrix, descs);

    // Update each Description with the tile, global, and local indices.
    parallel_for_each(descriptions.extent.tile< 2, 3>(),
         [= ] (tiled_index< 2, 3> t_idx) restrict(amp) 
    {
        descriptions[t_idx].globalRow = t_idx.global[0];
        descriptions[t_idx].globalColumn = t_idx.global[1];
        descriptions[t_idx].tileRow = t_idx.tile[0];
        descriptions[t_idx].tileColumn = t_idx.tile[1];
        descriptions[t_idx].localRow = t_idx.local[0];
        descriptions[t_idx].localColumn= t_idx.local[1];
    });

    // Print out the Description structure for each element in the matrix.
    // Tiles are displayed in red and green to distinguish them from each other.
    SetConsoleSize(100, 150);
    for (int row = 0; row < ROWS; row++) {
        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Value: " << std::setw(2) << descriptions(row, column).value << "      ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Tile:   " << "(" << descriptions(row, column).tileRow << "," << descriptions(row, column).tileColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Global: " << "(" << descriptions(row, column).globalRow << "," << descriptions(row, column).globalColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Local:  " << "(" << descriptions(row, column).localRow << "," << descriptions(row, column).localColumn << ")  ";
        }
        std::cout << "\n";
        std::cout << "\n";
    }
}

int main() {
    TilingDescription();
    char wait;
    std::cin >> wait;
}

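The main work of the example is in the definition of the array_view object and the call to parallel_for_each.

The vector of Description structures is copied into an 8 x 9 array_view object.
The parallel_for_each method is called with a tiled_extent object as the compute domain. The tiled_extent object is created by calling the extent::tile() method on the extent of the descriptions variable. The type parameters of the call to extent::tile(), <2,3>, specify that 2 x 3 tiles are created. Thus, the 8 x 9 matrix is divided into 12 tiles, arranged in four rows and three columns.
The parallel_for_each method is called with a tiled_index<2,3> object (t_idx) as the index. The type parameters of the index (t_idx) must match the type parameters of the compute domain (descriptions.extent.tile< 2, 3>()).
When each thread executes, the index t_idx returns information about which tile the thread is in (the tiled_index::tile property) and the location of the thread within the tile (the tiled_index::local property).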

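Tile Synchronization: tile_static and tile_barrier::wait

The previous example illustrates the tile layout and the indices, but is not in itself very useful. Tiling becomes useful when the tiles are integral to the algorithm and exploit tile_static variables. Because all threads in a tile have access to tile_static variables, calls to tile_barrier::wait are used to synchronize access to them. Although all of the threads in a tile have access to the tile_static variables, the order in which the threads in the tile execute is not guaranteed. The following example shows how to use tile_static variables and the tile_barrier::wait method to calculate the average value of each tile. Here are the keys to understanding the example:

The rawData is stored in an 8 x 8 matrix.
The tile size is 2 x 2. This creates a 4 x 4 grid of tiles, and the averages can be stored in a 4 x 4 matrix by using an array object. Only a limited number of types can be captured by reference in an AMP-restricted function. The array class is one of them.
The matrix size and the sample size are defined by using #define statements, because the type parameters to array, array_view, extent, and tiled_index must be constant values. You can also use static const int declarations. As an additional benefit, it is trivial to change the sample size to calculate the average over 4 x 4 tiles.
A tile_static 2 x 2 array of float values is declared for each tile. Although the declaration is in the code path of every thread, only one array is created for each tile in the matrix.
One line of code copies the values in each tile to the tile_static array. For each thread, after its value is copied to the array, execution of the thread stops at the call to tile_barrier::wait.
When all of the threads in a tile have reached the barrier, the average can be calculated. Because the code executes for every thread, there is an if statement so that the average is calculated on only one thread. The average is stored in the averages variable. The barrier is essentially the construct that controls calculations by tile, much as you might use a for loop.
The data in the averages variable, because it is an array object, must be copied back to the host. This example uses the vector conversion operator.
In the complete example, you can change SAMPLESIZE to 4 and the code executes correctly without any other changes.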

#include <iostream>
#include <vector>
#include <amp.h>
using namespace concurrency;

#define SAMPLESIZE 2
#define MATRIXSIZE 8
void SamplingExample() {

    // Create data and array_view for the matrix.
    std::vector<float> rawData;
    for (int i = 0; i < MATRIXSIZE * MATRIXSIZE; i++) {
        rawData.push_back((float)i);
    }
    extent<2> dataExtent(MATRIXSIZE, MATRIXSIZE);
    array_view<float, 2> matrix(dataExtent, rawData);

    // Create the array for the averages.
    // There is one element in the output for each tile in the data.
    std::vector<float> outputData;
    int outputSize = MATRIXSIZE / SAMPLESIZE;
    for (int j = 0; j < outputSize * outputSize; j++) {
        outputData.push_back((float)0);
    }
    extent<2> outputExtent(MATRIXSIZE / SAMPLESIZE, MATRIXSIZE / SAMPLESIZE);
    array<float, 2> averages(outputExtent, outputData.begin(), outputData.end());

    // Use tiles that are SAMPLESIZE x SAMPLESIZE.
    // Find the average of the values in each tile.
    // The only reference-type variable you can pass into the parallel_for_each call
    // is a concurrency::array.
    parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
         [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp) 
    {
        // Copy the values of the tile into a tile-sized array.
        tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
        tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

        // Wait for the tile-sized array to load before you calculate the average.
        t_idx.barrier.wait();

        // If you remove the if statement, then the calculation executes for every
        // thread in the tile, and makes the same assignment to averages each time.
        if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
            for (int trow = 0; trow < SAMPLESIZE; trow++) {
                for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                    averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
                }
            }
            averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
        }
    });

    // Print out the results.
    // You cannot access the values in averages directly. You must copy them
    // back to a CPU variable.
    outputData = averages;
    for (int row = 0; row < outputSize; row++) {
        for (int col = 0; col < outputSize; col++) {
            std::cout << outputData[row*outputSize + col] << " ";
        }
        std::cout << "\n";
    }
    // Output for SAMPLESIZE = 2 is:
    //  4.5  6.5  8.5 10.5
    // 20.5 22.5 24.5 26.5
    // 36.5 38.5 40.5 42.5
    // 52.5 54.5 56.5 58.5

    // Output for SAMPLESIZE = 4 is:
    // 13.5 17.5
    // 45.5 49.5
}

int main() {
    SamplingExample();
}

Race Conditions

It might be tempting to create a tile_static variable named total and increment that variable on each thread, like this:

// Do not do this.
tile_static float total;
total += matrix[t_idx];
t_idx.barrier.wait();
averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);

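The first problem with this approach is that tile_static variables cannot have initializers. The second problem is that there is a race condition on the assignment to total, because all of the threads in the tile have access to the variable in no particular order. You could program the algorithm so that only one thread accesses total at each barrier, as shown next. However, this solution is not extensible.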

// Do not do this.
tile_static float total;
if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
    total = matrix[t_idx];
}
t_idx.barrier.wait();

if (t_idx.local[0] == 0 && t_idx.local[1] == 1) {
    total += matrix[t_idx];
}
t_idx.barrier.wait();

// etc.

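Memory Fences

Two kinds of memory access must be synchronized: global memory access and tile_static memory access. A concurrency::array object allocates only global memory. A concurrency::array_view can reference global memory, tile_static memory, or both, depending on how it was constructed. The two kinds of memory that must be synchronized are:

global memory
tile_static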

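A memory fence ensures that memory accesses are available to the other threads in the thread tile, and that memory accesses are executed in program order. To ensure this, compilers and processors do not reorder reads and writes across the fence. In C++ AMP, a memory fence is created by a call to one of these methods:

tile_barrier::wait method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_all_memory_fence method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_global_memory_fence method: Creates a fence around only global memory.
tile_barrier::wait_with_tile_static_memory_fence method: Creates a fence around only tile_static memory.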

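Calling the specific fence that you require can improve the performance of your app. The barrier type affects how the compiler and the hardware reorder statements. For example, if you use a global memory fence, it applies only to global memory accesses, so the compiler and the hardware might reorder reads and writes to tile_static variables across the two sides of the fence.

In the next example, the barrier synchronizes the writes to tileValues, a tile_static variable. In this example, tile_barrier::wait_with_tile_static_memory_fence is called instead of tile_barrier::wait.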

// Using a tile_static memory fence.
parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
     [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp) 
{
    // Copy the values of the tile into a tile-sized array.
    tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
    tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

    // Wait for the tile-sized array to load before calculating the average.
    t_idx.barrier.wait_with_tile_static_memory_fence();

    // If you remove the if statement, then the calculation executes for every
    // thread in the tile, and makes the same assignment to averages each time.
    if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
        for (int trow = 0; trow < SAMPLESIZE; trow++) {
            for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
            }
        }
        averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
    }
});
