This article is part 4 of an introduction to CUDA-GPU programming, covering concurrent execution and streams.
Concurrent execution and streams
Parallelism between the GPU and the CPU comes essentially for free: the two compute asynchronously with respect to each other. Whenever the two need to be synchronized, call:
cudaDeviceSynchronize ()
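As a minimal sketch of this pattern (mykernel and host_side_work are placeholder names), a kernel launch returns immediately, the host can do other work in the meantime, and cudaDeviceSynchronize() blocks until the device has finished:
// Host code:
mykernel <<<Nblocks, Nthreads>>> ();  // returns immediately; the GPU works in the background
host_side_work ();                    // CPU work, overlapped with the kernel
cudaDeviceSynchronize ();             // block the host until all device work is done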
For memory copies between GPU and CPU, small transfers are asynchronous with respect to the host by default, while large ones are synchronous (blocking). We can, however, use the Async-suffixed variants to force any memory copy to be issued asynchronously:
cudaMemcpyAsync()
cudaMemcpyToSymbolAsync()
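For example (a sketch: the __constant__ symbol coef and the host buffers are assumed for illustration), the Async variants take the same arguments as their blocking counterparts, plus an optional stream argument:
__constant__ float coef[16];  // device constant-memory symbol (assumed for this example)

// Host code:
float h_coef[16];
cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice);  // default stream 0
cudaMemcpyToSymbolAsync (coef, h_coef, sizeof(coef), 0, cudaMemcpyHostToDevice);
Note that such a copy only overlaps with other device work if the host memory is page-locked (pinned), as discussed below.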
Conversely, if for debugging purposes we want all device operations to execute strictly one after another, we can set the environment variable CUDA_LAUNCH_BLOCKING to 1 before running the program.
Concurrency between different device operations (kernels and/or memory copying) is a completely different story.
In hardware, modern GPUs certainly allow multiple device operations, meaning kernels and/or memory copies, to run concurrently. By default, however, all device operations are executed serially, with no concurrency. To exploit the hardware's concurrency, we have to use CUDA streams.
• A stream is a sequence of commands (possibly issued by different host threads) that execute in order.
• If the stream ID is omitted, it is assumed to be the “0” (default) stream. For non-default streams, the IDs have to be used explicitly:
mykernel <<<Nblocks, Nthreads, 0, ID>>> ();
cudaMemcpyAsync (d_A, h_A, size, cudaMemcpyHostToDevice, ID);
Like a variable, a stream must be declared and created before use, and destroyed once we are done with it. The basic code is as follows:
// Host code
cudaStream_t ID[2];
// Creating streams:
for (int i = 0; i < 2; ++i) cudaStreamCreate (&ID[i]);
// These two commands will run concurrently on GPU:
mykernel <<<Nblocks, Nthreads, 0, ID[0]>>> ();
cudaMemcpyAsync (d_A, h_A, size, cudaMemcpyHostToDevice, ID[1]);
// Destroying streams:
for (int i = 0; i < 2; ++i) cudaStreamDestroy (ID[i]);
This stream approach does have its limitations.
First, for a memory copy to run concurrently with a kernel, or with a copy in the opposite direction, the host memory involved must be page-locked (also known as pinned); that is, the host buffer cannot be allocated with plain C malloc, but must instead use:
cudaMallocHost()
To pin a statically allocated host variable, use cudaHostRegister().
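A short sketch of both options (buffer names and sizes are illustrative):
// Option 1: allocate pinned host memory directly:
float *h_a;
cudaMallocHost ((void**)&h_a, size);  // pinned; enables truly asynchronous copies
// ... use h_a in cudaMemcpyAsync calls ...
cudaFreeHost (h_a);                   // pinned memory is freed with cudaFreeHost

// Option 2: pin an existing (e.g. static) host buffer:
static float h_b[1024];
cudaHostRegister (h_b, sizeof(h_b), cudaHostRegisterDefault);
// ... use h_b in cudaMemcpyAsync calls ...
cudaHostUnregister (h_b);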
Second, for kernels, at most 16 can run concurrently (a hardware-dependent limit), and even this concurrency is not guaranteed: when GPU resources are scarce, a kernel that cannot obtain the resources it needs will simply not run concurrently with the others.
A few other stream-related commands:
• cudaDeviceSynchronize() : global synchronization (across all the streams and the host);
• cudaStreamSynchronize (ID) : synchronize stream ID with the host;
• cudaStreamQuery (ID) : tests if the stream ID has finished running.
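A sketch of how these could be used together (the stream array ID is the one created above; kernel names are placeholders):
// Launch work in two streams, then wait selectively:
kernel1 <<<Nblocks, Nthreads, 0, ID[0]>>> ();
kernel2 <<<Nblocks, Nthreads, 0, ID[1]>>> ();

if (cudaStreamQuery (ID[0]) == cudaSuccess)
  { /* non-blocking check: kernel1's results are ready */ }

cudaStreamSynchronize (ID[1]);  // block the host until stream ID[1] is done
cudaDeviceSynchronize ();       // block until all streams and all device work are done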
Streams can be used to overlap memory transfers between GPU and CPU with computation. Two scenarios follow.
Scenario 1:
overlapping a transfer with host computation
// On host (h_a should be pinned for the copy to be reliably asynchronous):
// This memory copying will be asynchronous only in regards to the host code:
cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice, 0);
// This host code will be executed in parallel with memory copying:
serial_computation ();
// The kernel is issued after the serial code returns; since it is in stream 0,
// it will start only after the copying is done:
kernel <<<N, M>>> (d_a);
Scenario 2:
overlapping a transfer with a kernel
// This memory copying will be asynchronous in regards to the host and stream ID[1]:
cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice, ID[0]);
// The kernel doesn't need d_a, and will run concurrently:
kernel1 <<<N, M, 0, ID[1]>>> ();
// This kernel needs d_a, but doesn't need the result of kernel1; it will run after
// the memcpy operation, and concurrently with kernel1:
kernel2 <<<N, M, 0, ID[0]>>> ();
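Putting the pieces together, here is a minimal self-contained sketch of scenario 2 (kernel bodies, grid sizes, and buffer sizes are placeholders for illustration):
#include <cuda_runtime.h>

__global__ void kernel1 () { /* work that doesn't involve d_a */ }
__global__ void kernel2 (float *d_a) { /* work that uses d_a */ }

int main ()
{
    const size_t size = 1024 * sizeof(float);
    float *h_a, *d_a;
    cudaMallocHost ((void**)&h_a, size);  // pinned host memory: required for overlap
    cudaMalloc ((void**)&d_a, size);

    cudaStream_t ID[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate (&ID[i]);

    // Copy in stream ID[0]; kernel1 runs concurrently in stream ID[1]:
    cudaMemcpyAsync (d_a, h_a, size, cudaMemcpyHostToDevice, ID[0]);
    kernel1 <<<64, 128, 0, ID[1]>>> ();
    // kernel2 waits for the copy (same stream), but overlaps with kernel1:
    kernel2 <<<64, 128, 0, ID[0]>>> (d_a);

    cudaDeviceSynchronize ();  // wait for all streams to finish

    for (int i = 0; i < 2; ++i) cudaStreamDestroy (ID[i]);
    cudaFree (d_a);
    cudaFreeHost (h_a);
    return 0;
}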