Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.


  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • PrintPrint
Share this Page URL
Help

4. Tiling > Effects of Tile Size

Effects of Tile Size

For the matrix multiplication example used in this chapter, the larger the tiles, the more times each value in the tile_static memory gets used. So the obvious question is: should you automatically use the largest possible tile size?

To test tile size, here is the structure of a simple application that initializes arrays with random float values and runs the same calculation multiple ways:

int main()
{
    const int M = 64;
    const int N = 512;
    const int W = 256;

    std::vector<float> vA(M * W);
    std::vector<float> vB(W * N);
    std::vector<float> vC(M * N);
    std::vector<float> vRef(M * N);

    std::random_device rd;
    std::default_random_engine engine(rd());
    std::uniform_real_distribution<float> rand(0.0f, 1.0f);

    std::generate(vA.begin(), vA.end(), [&rand, &engine](){ return rand(engine); });
    std::generate(vB.begin(), vB.end(), [&rand, &engine](){ return rand(engine); });

    // Calculate a reference result on the CPU for comparison.
    for (int row = 0; row < M; ++row)
    {
        for (int col = 0; col < N; ++col)
        {
            float result = 0.0f;
            for (int i = 0; i < W; ++i)
            {
                int idxA = row * W + i;
                int idxB = i * N + col;
                result += vA[idxA] * vB[idxB];
            }
            vRef[row * N + col] = result;
        }
    }

    MatrixMultiply(vC, vA, vB, M, N, W);

    MatrixMultiplyTiled(vC, vA, vB, M, N, W);

    return 0;
}

(For space reasons, the code to query performance counters and report results was elided in this example, along with code to check that the results are the same in all three calculations.) This code was run for a variety of matrix sizes and tile sizes. The largest possible tile size here is 32 because 32 x 32 = 1,024, which is the limit for the product of the tile dimensions. On a system with a fairly ordinary video card, these times were recorded (in milliseconds):

 

C++ AMP, simple

Tiled, TileSize=4

Tiled, TileSize=8

Tiled, TileSize=16

Tiled, TileSize=32

M=64, N=4096, W=64

17

39

14

13

13

M=128, N=4096, W=128

33

135

30

25

26

M=256, N=4096, W=256

90

522

96

73

80

M=512, N=4096, W=512

307

2015

330

235

266

Here are a few notes on timing, which will be addressed more thoroughly in Chapter 7. First, any time over two seconds is not meaningful; by default, the process is stopped after two seconds of execution on the GPU. (You can learn more about this time-out in the Time-Out Detection and Recovery section of Chapter 12. Second, when running executables to time them, be sure to run them several times in quick succession. The first run is usually much slower than subsequent ones. Ignore any timing results from your first runs. Chapter 7 and Chapter 8, have more details on how to time your code correctly.

The CPU implementation here is simple and doesn’t use memory very efficiently, so no CPU results are included. With these numbers, you can compare simple (nontiled) C++ AMP results with those for various tile sizes.

The first conclusion is that tile sizes that are too small can produce results that are worse than the simple case. Tiles of 4 x 4 do not reuse the tile_static data enough to make up for the cost of copying, and they make inefficient use of the GPU architecture by not occupying a full multiprocessor. For the arrays used in these sample runs, tiles of 8 x 8 manage to just break even, although because the code is more complex, it doesn’t seem worth the effort. Tiles of 16 x 16 look like the sweet spot from this set of runs, with a definite improvement over the simple solution. It seems—although more runs would be required to be sure—that 32 is consistently a little worse than 16 for array sizes in this range. For a different problem or even a different set of array sizes, you’ll see different optimal tile sizes. It’s even possible that you might find different optimal tile sizes on different video cards, which can make choosing tile size rather challenging.

What won’t vary from problem to problem is the poor performance that results from choosing a tile size of 4 x 4 or 8 x 8. To understand why, consider two aspects of GPU and C++ AMP architecture that were mentioned in Chapter 1.

A GPU typically queues thousands of threads for work and does so in clumps or bundles. NVIDIA calls these a warp, for example, and they are 16 or 32 threads. AMD calls them a wavefront. The word “warp” will be used here.

Elements of an array that differ only in their least significant index are next to each other in memory (for example, A00 and A01 in the diagrams in this chapter).

The fastest and most efficient arrangement for a GPU is when all the threads in the warp are accessing consecutive memory and performing the same operations on that memory. If the number of threads in the tile is smaller than the warp size, then different threads within the warp are in different tiles and are accessing entirely different areas of memory; the threads have diverged. This is inefficient. Different hardware devices have different warp sizes—typically 32 and 64—but in the future this may change and 64 will probably be more common. As a result, you will not see top performance when you choose a tile size that makes the number of threads per tile (the product of the dimensions) smaller than 32 or 64. The same effect can happen with tiles where the number of threads is not a multiple of 32 or 64 because they will involve “remainders” that take up only part of a vector-wide warp. For example, 40 threads in a tile might mean 32 threads in one warp and 8 in another. That causes some of the same issues as a tile size of four.

Further, if the tile size in the least significant dimension is at least the size of the warp (mostly 32 today and 64 in the future), all the threads in the warp will be accessing consecutive memory locations, and that’s the sweet spot for GPU work. In the examples shown here, both the 16 x 16 and 32 x 32 tile-size runs gain this benefit, which you can see is significant. When you use the right tile size, your tiled calculations will be much faster than your nontiled ones, and that gives you the maximum performance gain from using C++ AMP.