Tiled Matrix Multiplication

2019-08-24
Tags: deep learning, matrix multiplication, CUDA, parallelism


Let's talk about tiled matrix multiplication today. This is an algorithm performed on GPUs due to the parallel nature of matrix multiplication. We will specifically look at a technique called "tiling," which is used to reduce global memory accesses by taking advantage of the shared memory on the GPU. Tiling can be seen as a way to boost the execution efficiency of the kernel. We will then examine the CUDA kernel code that does exactly what we see in the visualization, which shows what each thread within a block is doing to compute the output.

Keep in mind that this post is not meant to teach you CUDA coding; rather, it is meant to help readers gain some visual intuition about what each thread is doing in a basic tiled matrix multiplication algorithm. I strongly believe that writing the code (launching the kernel, index calculations...) will come easily once you understand and can see what you are trying to code.

Why should you care?

The efficiency of matrix multiplication is the backbone of everything. Everything as in rendering graphics and machine learning. Ever heard of tensors? Yeah...everything is matrix multiplication, I swear.

Some background

The main idea of using GPUs for computation is simple: get more work done in less time. Imagine you have an assignment with 4 math problems, each taking 1 hour to solve. You could spend 4 hours and do all 4 problems by yourself. But what if you have 3 friends with the same assignment? You tell each friend to solve 1 problem, and then you all share the solutions...because sharing is caring. This means that in 1 hour, the assignment is finished.

To finish off this analogy, each one of your friends is a worker, or a unit of execution: a thread. When you have a lot of workers (threads) to manage, you need a way to organize them. Below is the organization of threads in CUDA terms.

  • Thread: a single unit of execution --- each thread has its own private memory called registers
  • Block: a group of threads --- all threads in a block have access to a common memory region called shared memory
  • Grid: a group of blocks --- all threads in a grid have access to global memory and constant memory
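As a quick illustration of these three memory levels, here is a toy kernel of mine (not from the original post; the name memory_levels and the 32-thread block are made up for illustration). It just reverses 32 floats within a block:

    // Assumes a 1D block of exactly 32 threads.
    __global__ void memory_levels(const float* in, float* out) {
        // in and out point to global memory, visible to every thread in the grid
        float x = in[threadIdx.x];          // x lives in a register, private to this thread
        __shared__ float tile[32];          // shared memory, visible to all threads in this block
        tile[threadIdx.x] = x;
        __syncthreads();                    // block-wide barrier (more on this later in the post)
        out[threadIdx.x] = tile[31 - threadIdx.x];   // safely read what another thread loaded
    }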

Problem setup

Given a 4x4 input matrix A and a 4x4 input matrix B, I want to calculate a 4x4 output matrix C. Since C consists of 16 elements, where each element is computed through a dot product of a row of A and a column of B, let's launch 16 threads, where each thread calculates 1 output element. For the sake of this example, let's say the threads are organized into 2x2 blocks, so there are 4 blocks in the grid.
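In CUDA terms, this setup is a 2x2 grid of 2x2 blocks. Here is a minimal sketch of the launch configuration (tile_matrix_multiply is the kernel shown at the end of this post, and A, B, and C are assumed to already be device pointers):

    dim3 blockDim(2, 2);   // 2x2 = 4 threads per block, indexed by threadIdx.x and threadIdx.y
    dim3 gridDim(2, 2);    // 2x2 = 4 blocks in the grid, indexed by blockIdx.x and blockIdx.y
    // 16 threads in total, one per element of the 4x4 output C:
    tile_matrix_multiply<<<gridDim, blockDim>>>(A, B, C, 4);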

Visualization

Let's see what each thread within each block is doing. From the visualization below, you can see that each thread is responsible for loading input elements into shared memory. Remember that shared memory is shared within each block: each of the four threads in a block in this example can see what the other three threads loaded into shareA and shareB. We are essentially doing a mini matrix multiplication using shared memory, storing the partial result, and then adding in the partial result of the next mini matrix multiplication. When all of the mini matrix multiplications are finished, each thread writes its accumulated result to the output element of C that it is mapped to.

Keep in mind that we are only looking at the threads in one block; don't forget that the threads in the other three blocks are also doing their version of the calculations AT THE SAME TIME. Just think about it...
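To make the phases concrete, here is a hand trace for block00, the block that computes the top-left 2x2 tile of C (indices are 0-based; the loads follow directly from the kernel code at the end of this post):

    Phase i = 0:  the four threads cooperatively load the tile A[0..1][0..1] into shareA and
                  the tile B[0..1][0..1] into shareB; each thread then accumulates
                  temp += shareA[ty][0]*shareB[0][tx] + shareA[ty][1]*shareB[1][tx]
    Phase i = 1:  the threads load A[0..1][2..3] into shareA and B[2..3][0..1] into shareB,
                  then accumulate the same way
    Finish:       the thread at (ty, tx) writes temp to C[ty][tx], which now holds the
                  complete dot product of row ty of A and column tx of B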

Let's compare global memory accesses with and without tiling. A global memory access here means reading an element of input A or input B from global memory.

Without tiling: in order to calculate one output element, a thread needs to read one entire row of input A and one entire column of input B to compute the dot product. In our example, that is 8 accesses per thread, as the sketch below illustrates.
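Here is a sketch of a naive, non-tiled kernel for comparison (my own, not from the original post); both operands of every multiply are read straight from global memory:

    __global__ void naive_matrix_multiply(float* A, float* B, float* C, int width) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float temp = 0;
        for (int k = 0; k < width; ++k) {
            temp += A[row*width + k] * B[k*width + col];   // 2 global reads per iteration
        }
        C[row*width + col] = temp;   // 2 * width global reads total, 8 in our 4x4 example
    }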

With tiling: each thread ends up loading two elements from input A and two elements from input B (one of each per phase), which totals 4 accesses per thread.

In general, the reduction in global memory accesses from tiled matrix multiplication is proportional to the tile dimension: each element loaded into shared memory is reused by N threads instead of being fetched from global memory again. This means that with NxN blocks (tiles), global memory traffic is cut by a factor of N. So in our example, since we used 2x2 blocks, global memory accesses with tiling are 1/2 of the global memory accesses without tiling.
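Plugging in the numbers from our example (width = 4, tile width N = 2):

    Without tiling:  2 * width       = 2 * 4       = 8 global reads per thread
    With tiling:     2 * (width / N) = 2 * (4 / 2) = 4 global reads per thread
    Reduction:       8 / 4 = 2 = N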

Kernel code

Many of the coding details are abstracted away in the explanation above, since I want you to build an intuitive visual understanding of what is going on. However, I will include the CUDA kernel code below if you are curious! Try to match each numbered line with the visualization!

  • In the code below, the width argument is the width of the output matrix C (all three matrices in this example are width x width).

  • You can see that the number "2" is used throughout the code; this matches our simple example above, as 2 is the block width, or tile width.

  • threadIdx is specific to a block, and blockIdx is specific to a grid. Since our matrix multiplication example has a two-dimensional output, it is easiest to organize the threads in 2D. So the four threads in a block are indexed as thread00, thread01, thread10, thread11, where the first and second numbers correspond to the row index and the column index within the block. The blocks (block00, block01, block10, block11) are indexed the same way within the grid.

  • __syncthreads() is a barrier synchronization call: no thread in the block can continue executing past it until every thread in the block has reached it. This is super important for the correctness of this algorithm. Without the first barrier, a thread could start reading shareA and shareB before the other threads have finished loading them; without the second, a thread could overwrite a tile with the next phase's values while another thread is still reading it.

    __global__ void tile_matrix_multiply(float* A, float* B, float* C, int width)
    {
        __shared__ float shareA[2][2];                    // 1
        __shared__ float shareB[2][2];                    // 2
        int bx = blockIdx.x; int by = blockIdx.y;         // 3
        int tx = threadIdx.x; int ty = threadIdx.y;       // 4
        int row = by * 2 + ty;                            // 5
        int col = bx * 2 + tx;                            // 6
        float temp = 0;                                   // 7
        for (int i = 0; i < width/2; ++i) {               // 8: one iteration per phase
            shareA[ty][tx] = A[row*width + (i*2 + tx)];   // 9: load one element of the A tile
            shareB[ty][tx] = B[(i*2 + ty)*width + col];   // 10: load one element of the B tile
            __syncthreads();                              // 11: wait until both tiles are fully loaded
            for (int k = 0; k < 2; ++k) {                 // 12
                temp += shareA[ty][k] * shareB[k][tx];    // 13: mini matrix multiplication
            }
            __syncthreads();                              // 14: wait before the tiles are overwritten
        }
        C[row*width + col] = temp;                        // 15: write the accumulated dot product
    }
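If you want to actually run the kernel, here is a minimal host-side driver of my own (not part of the original post), multiplying two all-ones 4x4 matrices with the 2x2 grid of 2x2 blocks from our example:

    #include <cstdio>

    int main() {
        const int width = 4;
        const int bytes = width * width * sizeof(float);
        float hA[16], hB[16], hC[16];
        for (int i = 0; i < 16; ++i) { hA[i] = 1.0f; hB[i] = 1.0f; }  // every dot product equals 4

        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        dim3 blockDim(2, 2);                               // 4 threads per block
        dim3 gridDim(2, 2);                                // 4 blocks in the grid
        tile_matrix_multiply<<<gridDim, blockDim>>>(dA, dB, dC, width);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0][0] = %.1f\n", hC[0]);                 // prints 4.0
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }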

Please do not hesitate to comment down below if you have any questions regarding the visualization or code!


Created by Penny Xu, © 2020




