Using Tiles

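You can use tiling to maximize the acceleration of your app. Tiling divides threads into equal rectangular subsets, or tiles. If you use an appropriate tile size and a tiled algorithm, you can get even more acceleration from your C++ AMP code. The basic components of tiling are: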

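tile_static variables. The primary benefit of tiling is the performance gain from tile_static access. Access to data in tile_static memory can be significantly faster than access to data in the global space (array or array_view objects). An instance of a tile_static variable is created for each tile, and all threads in the tile have access to the variable. In a typical tiled algorithm, data is copied into tile_static memory once from global memory and then accessed many times from tile_static memory.
tile_barrier::wait method. A call to tile_barrier::wait suspends execution of the current thread until all of the threads in the same tile reach the call to tile_barrier::wait. You cannot guarantee the order in which the threads run, only that no thread in the tile executes past the call to tile_barrier::wait until all of the threads have reached the call. This means that by using the tile_barrier::wait method, you can perform tasks on a tile-by-tile basis rather than a thread-by-thread basis. A typical tiled algorithm has code to initialize the tile_static memory for the whole tile, followed by a call to tile_barrier::wait. The code that follows tile_barrier::wait contains computations that require access to all of the tile_static values.
Local and global indexing. You have access to the index of the thread relative to the entire array_view or array object, and to the index relative to the tile. Using the local index can make your code easier to read and debug. Typically, you use local indexing to access tile_static variables, and global indexing to access array and array_view variables.
tiled_extent class and tiled_index class. You use a tiled_extent object instead of an extent object in the parallel_for_each call. You use a tiled_index object instead of an index object in the parallel_for_each call.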

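To take advantage of tiling, your algorithm must partition the compute domain into tiles and then copy the tile data into tile_static variables for faster access.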

Example of Global, Tile, and Local Indices

Note

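C++ AMP headers are deprecated starting with Visual Studio 2022 version 17.0. Including any AMP headers generates build errors. Define _SILENCE_AMP_DEPRECATION_WARNINGS before including any AMP headers to silence the warnings.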

The following diagram represents an 8x9 matrix of data that is arranged in 2x3 tiles.

Diagram of an 8x9 matrix divided into 2x3 tiles.

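The following example displays the global, tile, and local indices of this tiled matrix. An array_view object is created by using elements of type Description. A Description holds the global, tile, and local indices of the element in the matrix. The code in the call to parallel_for_each sets the values of the global, tile, and local indices of each element. The output displays the values in the Description structures.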

#include <iostream>
#include <iomanip>
#include <vector>
#include <Windows.h>
#include <amp.h>
using namespace concurrency;

const int ROWS = 8;
const int COLS = 9;

// tileRow and tileColumn specify the tile that each thread is in.
// globalRow and globalColumn specify the location of the thread in the array_view.
// localRow and localColumn specify the location of the thread relative to the tile.
struct Description {
    int value;
    int tileRow;
    int tileColumn;
    int globalRow;
    int globalColumn;
    int localRow;
    int localColumn;
};

// A helper function for formatting the output.
void SetConsoleColor(int color) {
    int colorValue = (color == 0) ? 4 : 2;
    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colorValue);
}

// A helper function for formatting the output.
void SetConsoleSize(int height, int width) {
    COORD coord;

    coord.X = width;
    coord.Y = height;
    SetConsoleScreenBufferSize(GetStdHandle(STD_OUTPUT_HANDLE), coord);

    // A stack-allocated rectangle avoids leaking heap memory.
    SMALL_RECT rect;
    rect.Left = 0;
    rect.Top = 0;
    rect.Right = width;
    rect.Bottom = height;
    SetConsoleWindowInfo(GetStdHandle(STD_OUTPUT_HANDLE), TRUE, &rect);
}

// This method creates an 8x9 matrix of Description structures.
// In the call to parallel_for_each, the structure is updated
// with tile, global, and local indices.
void TilingDescription() {
    // Create 72 (8x9) Description structures.
    std::vector<Description> descs;
    for (int i = 0; i < ROWS * COLS; i++) {
        Description d = {i, 0, 0, 0, 0, 0, 0};
        descs.push_back(d);
    }

    // Create an array_view from the Description structures.
    extent<2> matrix(ROWS, COLS);
    array_view<Description, 2> descriptions(matrix, descs);

    // Update each Description with the tile, global, and local indices.
    parallel_for_each(descriptions.extent.tile< 2, 3>(),
        [=] (tiled_index< 2, 3> t_idx) restrict(amp)
    {
        descriptions[t_idx].globalRow = t_idx.global[0];
        descriptions[t_idx].globalColumn = t_idx.global[1];
        descriptions[t_idx].tileRow = t_idx.tile[0];
        descriptions[t_idx].tileColumn = t_idx.tile[1];
        descriptions[t_idx].localRow = t_idx.local[0];
        descriptions[t_idx].localColumn = t_idx.local[1];
    });

    // Print out the Description structure for each element in the matrix.
    // Tiles are displayed in red and green to distinguish them from each other.
    SetConsoleSize(100, 150);
    for (int row = 0; row < ROWS; row++) {
        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Value: " << std::setw(2) << descriptions(row, column).value << "      ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Tile:   " << "(" << descriptions(row, column).tileRow << "," << descriptions(row, column).tileColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Global: " << "(" << descriptions(row, column).globalRow << "," << descriptions(row, column).globalColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Local:  " << "(" << descriptions(row, column).localRow << "," << descriptions(row, column).localColumn << ")  ";
        }
        std::cout << "\n";
        std::cout << "\n";
    }
}

int main() {
    TilingDescription();
    char wait;
    std::cin >> wait;
}

The main work of the example is in the definition of the array_view object and the call to parallel_for_each.

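The vector of Description structures is copied into an 8x9 array_view object.
The parallel_for_each method is called with a tiled_extent object as the compute domain. The tiled_extent object is created by calling the extent::tile() method on the extent of the descriptions variable. The type parameters of the call to extent::tile(), <2,3>, specify that 2x3 tiles are created. Thus, the 8x9 matrix is tiled into 12 tiles, arranged in four rows and three columns.
The parallel_for_each method is called by using a tiled_index<2,3> object (t_idx) as the index. The type parameters of the index (t_idx) must match the type parameters of the compute domain (descriptions.extent.tile< 2, 3>()).
When each thread is executed, the index t_idx returns information about which tile the thread is in (tiled_index::tile property) and the location of the thread within the tile (tiled_index::local property).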

Tile Synchronization: tile_static and tile_barrier::wait

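The previous example illustrates the tile layout and indices, but is not in itself very useful. Tiling becomes useful when the tiles are integral to the algorithm and exploit tile_static variables. Because all threads in a tile have access to tile_static variables, calls to tile_barrier::wait are used to synchronize access to the tile_static variables. Although all of the threads in a tile have access to the tile_static variables, there is no guaranteed order of execution of threads in the tile. The following example shows how to use tile_static variables and the tile_barrier::wait method to calculate the average value of each tile. Here are the keys to understanding the example: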

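The rawData is stored in an 8x8 matrix.
The tile size is 2x2. This creates a 4x4 grid of tiles, and the averages can be stored in a 4x4 matrix by using an array object. There are only a limited number of types that you can capture by reference in an AMP-restricted function. The array class is one of them.
The matrix size and sample size are defined by using #define statements, because template arguments, such as the tile dimensions passed to extent::tile and tiled_index, must be constant values. You can also use static const int declarations. As an additional benefit, it is trivial to change the sample size to calculate the average over 4x4 tiles.
A tile_static 2x2 array of float values is declared for each tile. Although the declaration is in the code path for every thread, only one array is created for each tile in the matrix.
There is a line of code to copy the values in each tile to the tile_static array. For each thread, after the value is copied to the array, execution on the thread stops at the call to tile_barrier::wait.
When all of the threads in a tile have reached the barrier, the average can be calculated. Because the code executes for every thread, there is an if statement so that the average is calculated on only one thread. The average is stored in the averages variable. The barrier is essentially the construct that controls calculations by tile, much as you might use a for loop.
The data in the averages variable, because it is an array object, must be copied back to the host. This example uses the vector conversion operator.
In the complete example, you can change SAMPLESIZE to 4 and the code executes correctly without any other changes.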

#include <iostream>
#include <vector>
#include <amp.h>
using namespace concurrency;

#define SAMPLESIZE 2
#define MATRIXSIZE 8
void SamplingExample() {

    // Create data and array_view for the matrix.
    std::vector<float> rawData;
    for (int i = 0; i < MATRIXSIZE * MATRIXSIZE; i++) {
        rawData.push_back((float)i);
    }
    extent<2> dataExtent(MATRIXSIZE, MATRIXSIZE);
    array_view<float, 2> matrix(dataExtent, rawData);

    // Create the array for the averages.
    // There is one element in the output for each tile in the data.
    std::vector<float> outputData;
    int outputSize = MATRIXSIZE / SAMPLESIZE;
    for (int j = 0; j < outputSize * outputSize; j++) {
        outputData.push_back((float)0);
    }
    extent<2> outputExtent(MATRIXSIZE / SAMPLESIZE, MATRIXSIZE / SAMPLESIZE);
    array<float, 2> averages(outputExtent, outputData.begin(), outputData.end());

    // Use tiles that are SAMPLESIZE x SAMPLESIZE.
    // Find the average of the values in each tile.
    // The only reference-type variable you can pass into the parallel_for_each call
    // is a concurrency::array.
    parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
        [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
    {
        // Copy the values of the tile into a tile-sized array.
        tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
        tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

        // Wait for the tile-sized array to load before you calculate the average.
        t_idx.barrier.wait();

        // If you remove the if statement, then the calculation executes for every
        // thread in the tile, and makes the same assignment to averages each time.
        if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
            for (int trow = 0; trow < SAMPLESIZE; trow++) {
                for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                    averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
                }
            }
            averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
        }
    });

    // Print out the results.
    // You cannot access the values in averages directly. You must copy them
    // back to a CPU variable.
    outputData = averages;
    for (int row = 0; row < outputSize; row++) {
        for (int col = 0; col < outputSize; col++) {
            std::cout << outputData[row*outputSize + col] << " ";
        }
        std::cout << "\n";
    }
    // Output for SAMPLESIZE = 2 is:
    //  4.5  6.5  8.5 10.5
    // 20.5 22.5 24.5 26.5
    // 36.5 38.5 40.5 42.5
    // 52.5 54.5 56.5 58.5

    // Output for SAMPLESIZE = 4 is:
    // 13.5 17.5
    // 45.5 49.5
}

int main() {
    SamplingExample();
}

Race Conditions

It might be tempting to create a tile_static variable named total and increment that variable for each thread, like this:

// Do not do this.
tile_static float total;
total += matrix[t_idx];
t_idx.barrier.wait();

averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);

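The first problem with this approach is that tile_static variables cannot have initializers. The second problem is that there is a race condition on the assignment to total, because all of the threads in the tile have access to the variable in no particular order. You could program an algorithm that allows only one thread to access total at each barrier, as shown next. However, this solution is not extensible.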

// Do not do this.
tile_static float total;
if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
    total = matrix[t_idx];
}
t_idx.barrier.wait();

if (t_idx.local[0] == 0 && t_idx.local[1] == 1) {
    total += matrix[t_idx];
}
t_idx.barrier.wait();

// etc.

Memory Fences

A concurrency::array object allocates only global memory. A concurrency::array_view can reference global memory, tile_static memory, or both, depending on how it was constructed. There are two kinds of memory that must be synchronized:

global memory
tile_static

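A memory fence ensures that memory accesses are visible to other threads in the thread tile, and that memory accesses are executed according to program order. To ensure this, compilers and processors do not reorder reads and writes across the fence. In C++ AMP, a memory fence is created by a call to one of these methods: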

tile_barrier::wait method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_all_memory_fence method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_global_memory_fence method: Creates a fence around only global memory.
tile_barrier::wait_with_tile_static_memory_fence method: Creates a fence around only tile_static memory.

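Calling the specific fence that you require can improve the performance of your app. The barrier type affects how the compiler and the hardware reorder statements. For example, if you use a global memory fence, it applies only to global memory accesses, and therefore the compiler and the hardware might reorder reads and writes to tile_static variables on the two sides of the fence.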

In the next example, the barrier synchronizes the writes to tileValues, a tile_static variable. In this example, tile_barrier::wait_with_tile_static_memory_fence is called instead of tile_barrier::wait.

// Using a tile_static memory fence.
parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
    [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
{
    // Copy the values of the tile into a tile-sized array.
    tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
    tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

    // Wait for the tile-sized array to load before calculating the average.
    t_idx.barrier.wait_with_tile_static_memory_fence();

    // If you remove the if statement, then the calculation executes
    // for every thread in the tile, and makes the same assignment to
    // averages each time.
    if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
        for (int trow = 0; trow < SAMPLESIZE; trow++) {
            for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
            }
        }
        averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
    }
});

See also

C++ AMP (C++ Accelerated Massive Parallelism)
tile_static Keyword

Last updated on 2018-11-19