This article is part 4 of an introduction to CUDA-GPU programming, covering concurrent execution and streams.
Concurrent execution and streams
Parallelism between the GPU and the CPU comes essentially for free: the two compute asynchronously with respect to each other. Whenever the two need to be synchronized, call:
cudaDeviceSynchronize ()
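As a minimal sketch of this pattern (mykernel and host_side_work are placeholder names), a kernel launch returns immediately, the host can do other work in the meantime, and cudaDeviceSynchronize() blocks until the device has finished:
// Host code:
mykernel <<<Nblocks, Nthreads>>> ();  // returns immediately; the GPU works in the background
host_side_work ();                    // CPU work, overlapped with the kernel
cudaDeviceSynchronize ();             // block the host until all device work is done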
For memory copies between GPU and CPU, small transfers are asynchronous with respect to the host by default, while large ones are synchronous (blocking). We can, however, use the Async-suffixed variants to force any memory copy to be issued asynchronously:
cudaMemcpyAsync()
cudaMemcpyToSymbolAsync()
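For example (a sketch: the __constant__ symbol coef and the host buffers are assumed for illustration), the Async variants take the same arguments as their blocking counterparts, plus an optional stream argument:
__constant__ float coef[16];  // device constant-memory symbol (assumed for this example)

// Host code:
float h_coef[16];
cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice);  // default stream 0
cudaMemcpyToSymbolAsync (coef, h_coef, sizeof(coef), 0, cudaMemcpyHostToDevice);
Note that such a copy only overlaps with other device work if the host memory is page-locked (pinned), as discussed below.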
Conversely, if for debugging purposes we want all device operations to execute strictly one after another, we can set the environment variable CUDA_LAUNCH_BLOCKING to 1 before running the program.
Concurrency between different device operations (kernels and/or memory copying) is a completely different story.
In hardware, modern GPUs certainly allow multiple device operations, meaning kernels and/or memory copies, to run concurrently. By default, however, all device operations are executed serially, with no concurrency. To exploit the hardware's concurrency, we have to use CUDA streams.
• A stream is a sequence of commands (possibly issued by different host threads) that execute in order.
• If the stream ID is omitted, it is assumed to be the “0” (default) stream. For non-default streams, the IDs have to be used explicitly:
mykernel <<<Nblocks, Nthreads, 0, ID>>> ();
cudaMemcpyAsync (d_A, h_A, size, cudaMemcpyHostToDevice, ID);
Like a variable, a stream must be declared and created before use, and destroyed once we are done with it. The basic code is as follows:
// Host code
cudaStream_t ID[2];
// Creating streams:
for (int i = 0; i < 2; ++i) cudaStreamCreate (&ID[i]);
// These two commands will run concurrently on GPU:
mykernel <<<Nblocks, Nthreads, 0, ID[0]>>> ();
cudaMemcpyAsync (d_A, h_A, size, cudaMemcpyHostToDevice, ID[1]);
// Destroying streams:
for (int i = 0; i < 2; ++i) cudaStreamDestroy (ID[i]);
This stream approach does have its limitations.
First, for a memory copy to run concurrently with a kernel, or with a copy in the opposite direction, the host memory involved must be page-locked (also known as pinned); that is, the host buffer cannot be allocated with plain C malloc, but must instead use:
cudaMallocHost()
To pin a statically allocated host variable, use cudaHostRegister().
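A short sketch of both options (buffer names and sizes are illustrative):
// Option 1: allocate pinned host memory directly:
float *h_a;
cudaMallocHost ((void**)&h_a, size);  // pinned; enables truly asynchronous copies
// ... use h_a in cudaMemcpyAsync calls ...
cudaFreeHost (h_a);                   // pinned memory is freed with cudaFreeHost

// Option 2: pin an existing (e.g. static) host buffer:
static float h_b[1024];
cudaHostRegister (h_b, sizeof(h_b), cudaHostRegisterDefault);
// ... use h_b in cudaMemcpyAsync calls ...
cudaHostUnregister (h_b);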
Second, for kernels, at most 16 can run concurrently (a hardware-dependent limit), and even this concurrency is not guaranteed: when GPU resources are scarce, a kernel that cannot obtain the resources it needs will simply not run concurrently with the others.
A few other stream-related commands:
• cudaDeviceSynchronize() : global synchronization (across all the streams and the host);
• cudaStreamSynchronize (ID) : synchronize stream ID with the host;
• cudaStreamQuery (ID) : tests if the stream ID has finished running.
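A sketch of how these could be used together (the stream array ID is the one created above; kernel names are placeholders):
// Launch work in two streams, then wait selectively:
kernel1 <<<Nblocks, Nthreads, 0, ID[0]>>> ();
kernel2 <<<Nblocks, Nthreads, 0, ID[1]>>> ();

if (cudaStreamQuery (ID[0]) == cudaSuccess)
  { /* non-blocking check: kernel1's results are ready */ }

cudaStreamSynchronize (ID[1]);  // block the host until stream ID[1] is done
cudaDeviceSynchronize ();       // block until all streams and all device work are done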
Streams can be used to overlap memory transfers between GPU and CPU with computation. Two scenarios follow.
Scenario 1:
overlapping a transfer with host computation
// On host (h_a should be pinned for the copy to be reliably asynchronous):
// This memory copying will be asynchronous only in regards to the host code:
cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice, 0);
// This host code will be executed in parallel with memory copying:
serial_computation ();
// The kernel is issued after the serial code returns; since it is in stream 0,
// it will start only after the copying is done:
kernel <<<N, M>>> (d_a);
Scenario 2:
overlapping a transfer with a kernel
// This memory copying will be asynchronous in regards to the host and stream ID[1]:
cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice, ID[0]);
// The kernel doesn't need d_a, and will run concurrently:
kernel1 <<<N, M, 0, ID[1]>>> ();
// This kernel needs d_a, but doesn't need the result of kernel1; it will run after
// the memcpy operation, and concurrently with kernel1:
kernel2 <<<N, M, 0, ID[0]>>> ();
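Putting the pieces together, here is a minimal self-contained sketch of scenario 2 (kernel bodies, grid sizes, and buffer sizes are placeholders for illustration):
#include <cuda_runtime.h>

__global__ void kernel1 () { /* work that doesn't involve d_a */ }
__global__ void kernel2 (float *d_a) { /* work that uses d_a */ }

int main ()
{
    const size_t size = 1024 * sizeof(float);
    float *h_a, *d_a;
    cudaMallocHost ((void**)&h_a, size);  // pinned host memory: required for overlap
    cudaMalloc ((void**)&d_a, size);

    cudaStream_t ID[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate (&ID[i]);

    // Copy in stream ID[0]; kernel1 runs concurrently in stream ID[1]:
    cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice, ID[0]);
    kernel1 <<<64, 128, 0, ID[1]>>> ();
    // kernel2 waits for the copy (same stream), but overlaps with kernel1:
    kernel2 <<<64, 128, 0, ID[0]>>> (d_a);

    cudaDeviceSynchronize ();  // wait for all streams to finish

    for (int i = 0; i < 2; ++i) cudaStreamDestroy (ID[i]);
    cudaFree (d_a);
    cudaFreeHost (h_a);
    return 0;
}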