本文主要是介绍memory prefetch浅析,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
最近在用vtune分析程序性能瓶颈时,发现一些内存访问的地方竟然成了cpu热点。经过仔细分析,发现这些热点主要是对大数组非连续位置的访问的引起的。比较消耗cpu的原因应该是cache不命中。因为像这样局部性很差的内存访问逻辑,对cache是很不友好的。于是想到了prefetch……
现象
#include <xmmintrin.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/time.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <math.h>void usage() {printf("usage: BIN file step prefetch\n");exit(1); }inline int calcu(int input) { #ifdef EMPTYCALCreturn input; #endifint val = (input % 99) * (input / 98);val = val ? val : 1; #ifdef HEAVYCALCdouble d = (double)input / (double)val;return (int)pow(d, 1999.9); #endifdouble n = sqrt(sqrt((double)(unsigned)input * 1.3));double m = sqrt(sqrt((double)(unsigned)val * 0.9));return (int)((double)input * (double)val * m / (n ? n : 1.1)); }int run_withprefetch(const int *array, int size, int step, int prefetch) {int result = 0;printf("run with prefetch(%d)...\n", prefetch);for (int i = 0; i < step; i++) {for (int j = i; j < size; j += step) {int k = j + step * prefetch;if (k < size) {_mm_prefetch(&array[k], _MM_HINT_T0);//const int *addr = &array[k];//__asm__ __volatile__ ("mov (%0), %%eax"::"r"(addr):"eax");//__asm__ __volatile__ ("mov %0, %%eax"::"r"(k):"eax");}result += calcu(array[j]);}}return result; }int run(const int *array, int size, int step) {int result = 0;printf("run...\n");for (int i = 0; i < step; i++) {for (int j = i; j < size; j += step) {//asm volatile("lfence");result += calcu(array[j]);}}return result; }int main(int argc, const char *argv[]) {if (argc != 4) {usage();}int step = atoi(argv[2]);int prefetch = atoi(argv[3]);int fd = open(argv[1], O_RDONLY);if (fd == -1) {usage();}struct stat st;int ret = fstat(fd, &st);if (ret != 0) {usage();}int array_size = st.st_size / sizeof(int);printf("array size: %d, step: %d. ", array_size, step);const int *array = (const int *)mmap(NULL, st.st_size, PROT_READ, MAP_POPULATE|MAP_SHARED, fd, 0);if (array == MAP_FAILED) {usage();}struct timeval tv1, tv2;gettimeofday(&tv1, NULL);int result = 0;if (prefetch == 0) {result = run(array, array_size, step);}else if (prefetch > 0) {result = run_withprefetch(array, array_size, step, prefetch);}gettimeofday(&tv2, NULL);tv2.tv_sec -= tv1.tv_sec;tv2.tv_usec -= tv1.tv_usec;if (tv2.tv_usec < 0) {tv2.tv_usec += 1000000;tv2.tv_sec--;}printf("time cost: %d.%06d, ", tv2.tv_sec, tv2.tv_usec);printf("result: %d\n", result);return result; }
$ ll -h test.tar.gz -rw-rw-r-- 1 xiangy xiangy 1.8G Jun 27 09:37 test.tar.gz $ g++ -O2 prefetch.cpp -DHEAVYCALC -o prefetch.heavy $ g++ -O2 prefetch.cpp -DEMPTYCALC -o prefetch.empty $ g++ -O2 prefetch.cpp -o prefetch.normal $ ./prefetch.normal usage: BIN file step prefetch
(选择不同复杂度的计算逻辑,编译成以不同后缀命名的可执行文件。)
[case-1]$ ./prefetch.empty test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 23.980005, result: 692002678
(空计算+跳读内存。预期内存访问基本上都是cache miss,而计算逻辑基本上又不花时间,所以最终花费的时间主要就是读内存时间。记住这个值,下面的case经常需要参考它。)
[case-2]$ ./prefetch.normal test.tar.gz 1 0 array size: 468787200, step: 1. run... time cost: 22.846302, result: 1309150882 [case-3]$ ./prefetch.normal test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 66.041256, result: 1309150882 [case-4]$ ./prefetch.normal test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 28.247350, result: 1309150882
(以上是普通计算的运行情况。case-2顺序读内存预期基本上都是cache hit的,最终花费的时间主要是执行计算逻辑的时间;case-3跳读内存时间花费大量增加;case-4加了预取之后,时间花费基本上恢复到case-2的水平。)
[case-5]$ ./prefetch.heavy test.tar.gz 1 0 array size: 468787200, step: 1. run... time cost: 47.386533, result: 1625037789 [case-6]$ ./prefetch.heavy test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 107.783801, result: 1625037789 [case-7]$ ./prefetch.heavy test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 51.492479, result: 1625037789
(以上是复杂计算的运行情况。跟前面的表现基本一致,跳读带来了大量的时间增长,而预取又基本恢复到顺序读时的水平。)
[case-8]$ ./prefetch.empty test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 24.253892, result: 692002678
(空计算+跳读内存,预取效果跟不加预取时差不多。)
[case-9]$ ./prefetch.normal test.tar.gz 1 4 array size: 468787200, step: 1. run with prefetch(4)... time cost: 22.896790, result: 1309150882
(普通计算+顺序读内存+预取,效果跟不加预取时也差不多。)
prefetch原理
load-1 load-1 calc-1 => calc-1 load-2load-2 calc-2 load-3calc-2 calc-3
但是实际上却又并非如此简单。
calc-11 calc-12 calc-13 calc-21calc-22calc-23 calc-31calc-32calc-33 calc-41calc-42calc-43 calc-51calc-52calc-53
(每个loop约花费2个单位时间。)
load-11 load-12 calc-11 calc-12 calc-13 load-21load-22calc-21calc-22 calc-23 load-31load-32calc-31calc-32calc-33
(每个loop约花费4个单位时间。注,假设CPU执行到calc-N3的时候才看到有load-(N+1)1。)
load-11 load-12 calc-11 prefetch-21 calc-12 prefetch-22 calc-13 calc-21 prefetch-31calc-22 prefetch-32calc-23 calc-31 prefetch-41calc-32 prefetch-42calc-33 calc-41 prefetch-51calc-42 prefetch-52calc-43 calc-51calc-52calc-53
(每个loop约花费2个单位时间。注,在calc-N1之前prefetch-(N+X)1就已经发起了。)
例子引出的问题
int run(const int *array, int size, int step) {int result = 0;printf("run...\n");for (int i = 0; i < step; i++) {for (int j = i; j < size; j += step) {asm volatile("lfence");result += calcu(array[j]);}}return result; }
再次运行case-1:
[case-1.1]$ g++ -O2 prefetch.cpp -DEMPTYCALC [case-1.1]$ ./a.out test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 63.999864, result: 692002678
(强制让load不能并行之后,case-1.1的耗时直接变成了case-3的水平。说明在原本的case-1中load是存在很大的并行度的。)
[case-6.1]$ g++ -O2 prefetch.cpp -DHEAVYCALC [case-6.1]$ ./a.out test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 114.787379, result: 1625037789
(果然如此。)
[case-2.2]$ ./prefetch.normal test.tar.gz 1 0 | ./prefetch.normal test.tar.gz 1 0 array size: 468787200, step: 1. run... time cost: 22.964594, result: 1309150882
(两个顺序读内存的普通计算一起运行,因为总是cache hit,所以跟单个运行的时间差不多。)
[case-1.2]$ ./a.out test.tar.gz 1024 0 | ./a.out test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 63.973557, result: 692002678
(两个加了lfence的进程一起运行,由于进程内的内存访问已经串行化了,两个进程可以各自使用一个内存通道,所以运行时间跟单个进程运行时差不多。)
[case-1.3]$ ./prefetch.empty test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 24.083044, result: 692002678 [case-1.4]$ ./prefetch.empty test.tar.gz 1024 0 | ./prefetch.empty test.tar.gz 1024 0 array size: 468787200, step: 1024. run... time cost: 37.948864, result: 692002678
(而用之前没加过lfence的程序再试一下,两个进程同时运行时,由于争抢内存带宽,运行时间就会受影响。)
prefetch提前量
$ for x in 1 2 4 8 16 32; do ./prefetch.normal test.tar.gz 1024 $x; done array size: 468787200, step: 1024. run with prefetch(1)... time cost: 36.262511, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(2)... time cost: 29.902517, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 28.052798, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(8)... time cost: 26.040215, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(16)... time cost: 26.198825, result: 1309150882 array size: 468787200, step: 1024. run with prefetch(32)... time cost: 25.910506, result: 1309150882
prefetch提前量从1增大到32。从结果看,当提前量小的时候,prefetch效果不明显。为什么呢?
用mov代替prefetch?
int run_withprefetch(const int *array, int size, int step, int prefetch) {int result = 0;printf("run with prefetch(%d)...\n", prefetch);for (int i = 0; i < step; i++) {for (int j = i; j < size; j += step) {int k = j + step * prefetch;if (k < size) {const int *addr = &array[k];asm volatile("mov (%0), %%eax"::"r"(addr):"eax");}result += calcu(array[j]);}}return result; }
重跑case-4:
[case-4.1]$ g++ -O2 prefetch.cpp [case-4.1]$ ./a.out test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 37.312423, result: 1309150882
确实比不加prefetch的情况case-3(66.041256)要好很多,但还是比不上原来的case-4(28.247350)。
if (k < size) {_mm_prefetch(&array[k], _MM_HINT_T0);__asm__ __volatile__ ("mov %0, %%eax"::"r"(k):"eax");}
[case-4.2]$ g++ -O2 prefetch.cpp [case-4.2]$ ./a.out test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 28.051848, result: 1309150882
可见仅仅多占用一个寄存器,貌似并没有什么影响。(在tomasulo算法中,这里实际上并没有多占用寄存器,而是多占用了RS。)
[case-6.2]$ g++ -O2 prefetch.cpp -DHEAVYCALC [case-6.2]$ ./a.out test.tar.gz 1024 4 array size: 468787200, step: 1024. run with prefetch(4)... time cost: 100.910547, result: 1625037789
果然,这个结果比起原本的case-6(107.783801)已经没有优势了。
其他问题
关于硬件prefetch
$ for x in 1 2 4 8 16 32 64 128 256 512 1024; do ./prefetch.normal test.tar.gz $x 0; done array size: 468787200, step: 1. run... time cost: 22.863716, result: 1309150882 array size: 468787200, step: 2. run... time cost: 23.438035, result: 1309150882 array size: 468787200, step: 4. run... time cost: 23.846166, result: 1309150882 array size: 468787200, step: 8. run... time cost: 24.171723, result: 1309150882 array size: 468787200, step: 16. run... time cost: 25.502980, result: 1309150882 array size: 468787200, step: 32. run... time cost: 37.461018, result: 1309150882 array size: 468787200, step: 64. run... time cost: 39.829086, result: 1309150882 array size: 468787200, step: 128. run... time cost: 44.291904, result: 1309150882 array size: 468787200, step: 256. run... time cost: 65.332225, result: 1309150882 array size: 468787200, step: 512. run... time cost: 64.764071, result: 1309150882 array size: 468787200, step: 1024. run... time cost: 65.952260, result: 1309150882
随着step的逐步增大,可以看出时间消耗分为三个档次:
关于TLB cache miss
int run(const int *array, int size, int step, int blocksize) {int result = 0;blocksize *= 4096;if (blocksize == 0) {blocksize = size;}printf("run... (block=%d pages)\n", blocksize/4096);int start = 0;int nsize = blocksize;while (nsize == blocksize) {if (start + blocksize > size)nsize = size - start;for (int i = 0; i < step; i+=32) {for (int j = i; j < nsize; j += step) {int thissize = j + 32 < nsize ? j + 32 : nsize;for (int k = j; k < thissize; k++) {result += calcu(array[start+k]);}}}start += nsize;}return result; }
改造run函数把整个文件按blocksize划分成若干个块,每个块单独完成跳读逻辑。
$ for x in 128 256 512 1024 0; do ./a.out test.tar.gz 1024 $x; done array size: 468787200, step: 1024. run... (block=128 pages) time cost: 22.501363, result: 1309150882 array size: 468787200, step: 1024. run... (block=256 pages) time cost: 22.627935, result: 1309150882 array size: 468787200, step: 1024. run... (block=512 pages) time cost: 25.064514, result: 1309150882 array size: 468787200, step: 1024. run... (block=1024 pages) time cost: 24.976720, result: 1309150882 array size: 468787200, step: 1024. run... (block=114450 pages) time cost: 24.900870, result: 1309150882
看起来TLB cache miss所带来的影响不大。
关于L1 cache
int run(const int *array, int size, int step, int blocksize) {int result = 0;blocksize *= 4096;if (blocksize == 0) {blocksize = size;}printf("run... (block=%d pages)\n", blocksize/4096);int start = 0;int nsize = blocksize;while (nsize == blocksize) {if (start + blocksize > size)nsize = size - start;for (int i = 0; i < step; i++) {for (int j = i; j < nsize; j += step) {result += calcu(array[start+j]);}}start += nsize;}return result; }
在一个块内反复跳读,如果以块的大小刚好能被cache住,则程序运行时间会很短;否则又得经历漫长的读内存过程。
$ for x in 1 2 4 8 16 32 64 128 256 512 1024; do ./a.out test.tar.gz 1024 $x; done array size: 468787200, step: 1024. run... (block=1 pages) time cost: 1.614654, result: 692002678 array size: 468787200, step: 1024. run... (block=2 pages) time cost: 1.554286, result: 692002678 array size: 468787200, step: 1024. run... (block=4 pages) time cost: 1.625566, result: 692002678 array size: 468787200, step: 1024. run... (block=8 pages) time cost: 2.621453, result: 692002678 array size: 468787200, step: 1024. run... (block=16 pages) time cost: 2.697908, result: 692002678 array size: 468787200, step: 1024. run... (block=32 pages) time cost: 2.724401, result: 692002678 array size: 468787200, step: 1024. run... (block=64 pages) time cost: 2.710056, result: 692002678 array size: 468787200, step: 1024. run... (block=128 pages) time cost: 3.864916, result: 692002678 array size: 468787200, step: 1024. run... (block=256 pages) time cost: 4.241000, result: 692002678 array size: 468787200, step: 1024. run... (block=512 pages) time cost: 20.216653, result: 692002678 array size: 468787200, step: 1024. run... (block=1024 pages) time cost: 24.361176, result: 692002678
随着block size的逐渐增大,程序运行时间显现出三个层次。分别代表着L1 cache hit(1〜4)、L2 cache hit(8〜256)、cache miss(512〜)三种状态。看起来L1 cache在本例中影响并不大。
这篇关于memory prefetch浅析的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!