CUDA-GPU programming Introduction (1)

2024-06-20 05:58


Basic positioning:
CPU parallelism is about running multiple tasks at once (task parallelism) and aims to minimize latency, whereas GPU parallelism is data parallelism within a single task and aims to maximize throughput. A substantial part of a CPU is devoted to control and scheduling; a GPU is mostly an accumulation of compute units, i.e. large-scale SIMD (Single Instruction, Multiple Data).

Traditional GPUs served graphics processing, and their defining trait is stream computing: thanks to the large number of compute units, they can operate on large amounts of mutually independent data at the same time. Modern GPUs lean more toward general-purpose computing:

Shared memory and thread synchronization primitives eliminate the need for data independence;
Gather and scatter operations allow kernels to read and write data at arbitrary locations.
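As a small illustration of these two points (my own sketch, not from the article), the kernel below gathers a block-sized chunk into shared memory, synchronizes the block's threads with __syncthreads(), and then scatters the values back in reverse order; without the shared-memory staging, the concurrent reads and writes would race:

/* Reverses each block-sized chunk of data in place: threads gather their
   elements into shared memory, synchronize, then scatter them back in
   reverse order. (Hypothetical illustration; assumes blockDim.x <= 256.) */
__global__ void reverse_chunks(float *data, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = data[i];        /* gather into shared memory */
    __syncthreads();                        /* wait until the whole tile is loaded */

    int j = blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x);
    if (i < n && j < n)
        data[j] = tile[threadIdx.x];        /* scatter to a different location */
}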

CUDA programming model:
The CPU, as the main processor, is called the host; the GPU, acting as a coprocessor, is called the device. The host invokes kernels to hand the large, parallelizable parts of the computation over to the device. Host and device each have their own memory, which the other cannot access directly; data must be moved between them with explicit transfers. The host is responsible for allocating its own memory as well as the device's, for the data transfers in both directions, and for kernel invocation.
The schematic diagram is shown below:
gpu diagram
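In skeleton form (a sketch of mine, not taken from the article; some_kernel, memsize, nBlocks, blockSize and n are placeholder names), the host-side flow described above looks like this; the complete worked example appears later in the article:

/* skeleton of the host/device flow */
float *a_host, *a_dev;
a_host = (float *)malloc(memsize);                           /* host memory            */
cudaMalloc((void **)&a_dev, memsize);                        /* device memory          */
cudaMemcpy(a_dev, a_host, memsize, cudaMemcpyHostToDevice);  /* host -> device copy    */
some_kernel<<<nBlocks, blockSize>>>(a_dev, n);               /* kernel invocation      */
cudaMemcpy(a_host, a_dev, memsize, cudaMemcpyDeviceToHost);  /* device -> host copy    */
cudaFree(a_dev);
free(a_host);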

Basic hardware notes:
A GPU contains multiple multiprocessors, and each multiprocessor contains multiple stream processors (SPs), also called cores (CUDA cores). These basic parameters can be inspected with the CUDA deviceQuery sample. For example, my PC has a fairly ordinary NVIDIA GT 630:
gt630
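If you would rather query these properties from your own program than run the deviceQuery sample, the CUDA runtime exposes them via cudaGetDeviceProperties; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       /* properties of device 0 */
    printf("name: %s\n", prop.name);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("warp size: %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}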

During execution, each block defined in the CUDA launch is assigned to a multiprocessor, and each thread can be thought of as running on a core; that is a reasonable first approximation. All actual execution, however, proceeds in units of warps. On NVIDIA's CUDA-capable GPUs the warp size has been 32 threads throughout (the half-warp of 16 threads only mattered for memory-access coalescing on early architectures). Physically, the number of threads a multiprocessor executes truly simultaneously is the warp size. All threads of a block share that multiprocessor's resources (registers and shared memory).

At runtime, a thread can determine the block that it belongs to, the block dimensions, and the thread index within the block.

We will come back to thread and block indexing shortly.

CUDA programming:

• CUDA provides a set of extensions to the C programming language
– new storage qualifiers, kernel invocation syntax, intrinsics, vector types, etc.
• CUDA source code is saved in .cu files
– host and device code can coexist in the same file
– storage qualifiers determine the type of code
• Compiled to object files using the nvcc compiler
– object files contain executable host and device code
• Can be linked with object files generated by other C/C++ compilers

Example:

__global__ void saxpy_gpu(float *vecY, float *vecX, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
    if (i < n)
        vecY[i] = alpha * vecX[i] + vecY[i];
}

key points:
1. The __global__ qualifier identifies this function as a kernel that executes on the device.
2. blockIdx, blockDim and threadIdx are built-in variables that uniquely identify a thread's position in the execution environment
– they are used to compute an offset into the data array
3. The host specifies the number of blocks and the block size during kernel invocation:

saxpy_gpu<<<numBlocks, blockSize>>>(y_d, x_d, alpha, n);
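A common idiom (a sketch of mine, not from the original) is to round the block count up so that all n elements are covered; the if (i < n) guard in the kernel absorbs the extra threads:

/* illustrative launch-configuration calculation; 256 is just an example block size */
int blockSize = 256;
int numBlocks = (n + blockSize - 1) / blockSize;   /* round up to cover all n elements */
saxpy_gpu<<<numBlocks, blockSize>>>(y_d, x_d, alpha, n);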

Basic indexing diagram:
indexing diagram
key difference:
• No need to explicitly loop over array elements – each element is processed in a separate thread
• The element index is computed from the block index, block width and thread index within the block
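The host code later in this article calls a CPU baseline named saxpy_cpu that the original text never shows; a plausible (hypothetical) implementation, which makes the contrast explicit, is simply the loop the GPU version avoids:

/* hypothetical CPU baseline corresponding to the saxpy_gpu kernel above */
void saxpy_cpu(float *vecY, float *vecX, float alpha, int n)
{
    int i;
    for (i = 0; i < n; i++)            /* the explicit loop the GPU version does not need */
        vecY[i] = alpha * vecX[i] + vecY[i];
}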

basic scheme on host:

The host performs the following operations:
1. initialize device
2. allocate and initialize input arrays in host DRAM
3. allocate memory on device
4. upload input data to device
5. execute kernel on device
6. download results
7. check results
8. clean-up

example code:

#include <cuda.h>    /* CUDA runtime API */
#include <cstdio>
#include <cstdlib>   /* malloc, rand */

int main(int argc, char *argv[])
{
    float *x_host, *y_host;   /* arrays for computation on host */
    float *x_dev, *y_dev;     /* arrays for computation on device */
    float *y_shadow;          /* host-side copy of device results */
    int n = 32*1024;
    float alpha = 0.5f;
    int nerror;
    size_t memsize;
    int i, blockSize, nBlocks;

    /* here could add some code to check if a GPU device is present */

    memsize = n * sizeof(float);

    /* allocate arrays on host */
    x_host = (float *)malloc(memsize);
    y_host = (float *)malloc(memsize);
    y_shadow = (float *)malloc(memsize);

    /* allocate arrays on device */
    cudaMalloc((void **) &x_dev, memsize);
    cudaMalloc((void **) &y_dev, memsize);
    /* add checks to catch any errors */

    /* initialize arrays on host */
    for (i = 0; i < n; i++)
    {
        x_host[i] = rand() / (float)RAND_MAX;
        y_host[i] = rand() / (float)RAND_MAX;
    }

    /* copy arrays to device memory (synchronous) */
    cudaMemcpy(x_dev, x_host, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, memsize, cudaMemcpyHostToDevice);

    /* set up device execution configuration */
    blockSize = 512;
    nBlocks = n / blockSize + (n % blockSize > 0);

    /* execute kernel (asynchronous!) */
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    /* could add a check that the launch succeeded */

    /* execute host version (i.e. baseline reference results) */
    saxpy_cpu(y_host, x_host, alpha, n);

    /* retrieve results from device (synchronous) */
    cudaMemcpy(y_shadow, y_dev, memsize, cudaMemcpyDeviceToHost);

    /* ensure synchronization (cudaMemcpy is synchronous in most cases, but not all) */
    cudaDeviceSynchronize();

    /* check results */
    nerror = 0;
    for (i = 0; i < n; i++)
    {
        if (y_shadow[i] != y_host[i]) nerror = nerror + 1;
    }
    printf("test comparison shows %d errors\n", nerror);

    /* free memory on device */
    cudaFree(x_dev);
    cudaFree(y_dev);

    /* free memory on host */
    free(x_host);
    free(y_host);
    free(y_shadow);

    return 0;
} /* main */

Compiling:

• nvcc -arch=sm_20 -O2 program.cu -o program.x
• -arch=sm_20 means code is targeted at Compute Capability 2.0 architecture
• -O2 optimizes the CPU portion of the program
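As an illustrative sketch (the file names kernels.cu and main.cpp are assumptions, and depending on the installation the CUDA library path, e.g. -L/usr/local/cuda/lib64, may also be needed), device code can be compiled separately with nvcc and then linked by an ordinary host compiler:

nvcc -arch=sm_20 -O2 -c kernels.cu -o kernels.o
g++ -O2 -c main.cpp -o main.o
g++ main.o kernels.o -o program.x -lcudart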

Be aware of memory bandwidth bottlenecks:

• The connection between CPU and GPU has low bandwidth
– need to minimize data transfers
– important to use asynchronous transfers if possible (overlap computation and transfer)
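A minimal sketch of such an overlap (my own example, reusing the saxpy_gpu arrays from above; for the copies to be truly asynchronous the host buffers must be pinned, as described in the next section):

cudaStream_t stream;
cudaStreamCreate(&stream);

/* enqueue the transfers and the kernel in the same stream; the calls return immediately */
cudaMemcpyAsync(x_dev, x_host, memsize, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(y_dev, y_host, memsize, cudaMemcpyHostToDevice, stream);
saxpy_gpu<<<nBlocks, blockSize, 0, stream>>>(y_dev, x_dev, alpha, n);

/* the CPU is free to do other work here, overlapping with the transfers and the kernel */

cudaStreamSynchronize(stream);   /* wait for everything queued in the stream to finish */
cudaStreamDestroy(stream);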

Using pinned memory:
• The transfer between host and device is very slow compared to access to memory within either the CPU or the GPU
• One way to speed it up, by a factor of roughly 2, is to use pinned memory on the host when allocating arrays that will be transferred to the GPU

int main(int argc, char *argv[])
{
    cudaMallocHost((void **) &a_host, memsize_input);
    ...
    cudaFreeHost(a_host);   /* pinned host memory is released with cudaFreeHost, not cudaFree */
}

Timing GPU accelerated codes
• Presents specific difficulties because the CPU and GPU can be computing independently in parallel, i.e. asynchronously
• On the CPU one can use the standard function gettimeofday(…) (microsecond precision) and process the result
• If trying to time events on the GPU with this function, one must ensure synchronization
• This can be done with a call to cudaDeviceSynchronize()
• Memory copies to/from the device are synchronous (in most cases), so they can also be used for timing.
• Timing GPU kernels on the CPU may be insufficiently accurate
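A minimal sketch of this CPU-side timing (gettimeofday comes from the POSIX header sys/time.h; the launch parameters are those used earlier):

#include <sys/time.h>

struct timeval t0, t1;
gettimeofday(&t0, NULL);

saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
cudaDeviceSynchronize();            /* without this, only the (asynchronous) launch is timed */

gettimeofday(&t1, NULL);
double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000.0
                  + (t1.tv_usec - t0.tv_usec) / 1000.0;
printf("kernel took %f ms (measured on the CPU)\n", elapsed_ms);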

Using mechanisms on the GPU for timing
• CUDA events provide this; it is highly accurate on the GPU side, and very useful for optimizing kernels

sample code:

    ...
    cudaEvent_t start, stop;
    float kernel_timer;
    ...
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&kernel_timer, start, stop);
    printf("Test Kernel took %f ms\n", kernel_timer);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
