xmemcpy, Improved Version

2024-02-26 02:48

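This is an improved version of xmemcpy. It exploits the speed of movdqu and uses inlining and compile-time-constant sizes to optimize memcpy performance for small memory blocks.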
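A minimal sketch of the "constant size plus inlining" idea, under stated assumptions: `copy_fixed` is a hypothetical name, not part of xmemcpy, and it swaps in `std::memcpy` with a compile-time length where the original listing uses struct assignment through a pointer cast. With a constant length an optimizing compiler is free to expand the call into a handful of wide moves, although which instructions it picks depends on the compiler and target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the original listing): copies exactly N bytes.
// N is known at compile time, so the compiler can inline the copy instead of
// calling the general-purpose memcpy routine.
template <std::size_t N>
inline void* copy_fixed(void* dest, const void* src)
{
    static_assert(N > 0, "N must be positive");
    assert(dest != nullptr && src != nullptr);
    std::memcpy(dest, src, N);  // constant length: eligible for inline expansion
    return dest;
}
```

A call such as `copy_fixed<sizeof(src)>(dst, src)` fixes the length at the call site, which is the same role `com::xmemcopy<sizeof(dest)>::copy` plays in the listing.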

xmemcpy comes from beyondszine/progs/C/c_progs/memcpy.c on GitHub. I am not sure whether that is the original author; this version includes some improvements.

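------2016-2-28 Note 1: The buffers below are read over and over, so they always stay in the L1 cache, much like stack memory. If the data always lies in memory beyond the cache, memory speed becomes the bottleneck and the improved version can hardly pull ahead of memcpy, although it still helps somewhat.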
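To see the effect described in Note 1, the source data has to come from a region that is not already cache resident. The following is only a sketch under assumptions: `time_streaming_copies` and `StreamResult` are hypothetical names, the pool size is chosen by the caller, and sequential access may still benefit from hardware prefetching, so the numbers it produces are indicative at best.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical result type: elapsed time plus a checksum that keeps the
// copies from being optimized away.
struct StreamResult {
    double seconds;
    unsigned long long checksum;
};

// Copies `chunk` bytes from successive offsets of `pool` into a small local
// buffer. With a pool much larger than the cache, the source bytes are
// generally not in L1 when they are first read.
inline StreamResult time_streaming_copies(const std::vector<char>& pool,
                                          std::size_t chunk)
{
    char dest[256] = {};
    assert(chunk > 0 && chunk <= sizeof(dest));
    unsigned long long checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t off = 0; off + chunk <= pool.size(); off += chunk) {
        std::memcpy(dest, pool.data() + off, chunk);
        checksum += static_cast<unsigned char>(dest[0]);  // consume the result
    }
    const auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), checksum};
}
```

Running it once with a pool of a few kilobytes and once with a pool of several hundred megabytes gives a rough feel for how much of the copy cost is memory latency rather than instruction overhead.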

------2016-2-28 Note 2: In DEBUG builds it is very slow unless you turn off /GS or use #pragma runtime_checks( "s", restore ) (this pragma has no effect on templates)
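As an illustration of Note 2, the sketch below brackets a non-template helper with MSVC's `runtime_checks` pragma, using `off` to disable the checks and `restore` to bring back the command-line setting. It is an assumption-laden sketch: `copy80` is a hypothetical helper, the pragma governs the /RTC run-time checks (the /GS buffer security check is a separate switch), and other compilers skip the guarded lines entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#ifdef _MSC_VER
#pragma runtime_checks("", off)      // MSVC only: disable /RTC checks from here
#endif

// Hypothetical hot-path helper (not from the original listing).
// It is deliberately not a template, since the note above says the pragma
// does not take effect for templates.
inline void* copy80(void* dest, const void* src)
{
    assert(dest != nullptr && src != nullptr);
    return std::memcpy(dest, src, 80);
}

#ifdef _MSC_VER
#pragma runtime_checks("", restore)  // MSVC only: back to the /RTC defaults
#endif
```

The pragma has to sit outside function bodies, so it is placed around the whole definition rather than around the single `memcpy` call.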

------2016-3-5  Note 3: See the improved zmemcpy version, which is considerably faster in debug mode: http://blog.csdn.net/superzmy/article/details/50810343


Expected results:

All time to memcpy 80 * 100M is 0.248s in 3GHz (xmemcopy)
All time to memcpy 80 * 100M is 0.476s in 3GHz (xmemcpy)
All time to memcpy 80 * 100M is 0.778s in 3GHz (xmemcpy unknownSize)
All time to memcpy 80 * 100M is 0.232s in 3GHz (movdq)
All time to memcpy 80 * 100M is 0.257s in 3GHz (movdq  unalign)
All time to memcpy 81 * 100M is 0.298s in 3GHz (movdq)
All time to memcpy 81 * 100M is 0.264s in 3GHz (movdq unalign)
All time to memcpy 400 * 100M is 1.334s in 3GHz (xmemcopy)
All time to memcpy 400 * 100M is 1.236s in 3GHz (xmemcopy unalign)
All time to memcpy 400 * 100M is 1.819s in 3GHz (xmemcpy)
All time to memcpy 400 * 100M is 3.051s in 3GHz (rep movs)
All time to memcpy 400 * 100M is 2.984s in 3GHz (rep movs unalign)
All time to memcpy 400 * 100M is 3.015s in 3GHz (rep movs handwrite asm)
All time to memcpy 401 * 100M is 3.093s in 3GHz (rep movs)
All time to memcpy 401 * 100M is 3.193s in 3GHz (rep movs handwrite asm)
All time to memcpy 80 * 100M is 1.216s in 3GHz (rep movs handwrite asm)
All time to memcpy 4000 * 100M is 15.254s in 3GHz (rep movs handwrite asm)
All time to memcpy 80 * 100M is 1.824s in 3GHz (call _memcpy)
All time to memcpy 81 * 100M is 1.828s in 3GHz (call _memcpy)
All time to memcpy 81 * 100M is 1.779s in 3GHz (call _memcpy unalign)
All time to memcpy 400 * 100M is 2.554s in 3GHz (call _memcpy)
All time to memcpy 401 * 100M is 2.777s in 3GHz (call _memcpy)
All time to memcpy 401 * 100M is 2.725s in 3GHz (call _memcpy unalign)
All time to memcpy 4000 * 100M is 14.379s in 3GHz (call _memcpy)
The code was compiled with VS2013 and run on an E3 1230V2.
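The result lines are easier to compare once converted to a per-copy cost. The sketch below is an illustration only (`CopyCost` and `per_copy_cost` are hypothetical names): it takes the printed seconds, the number of copies and the block size, and assumes the same 3 GHz divisor the benchmark applies to its `__rdtsc()` tick count. For the first line (0.248 s for 100M copies of 80 bytes) it gives roughly 2.5 ns and about 7.4 ticks per copy.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one benchmark line.
struct CopyCost {
    double ns_per_copy;     // nanoseconds per copy
    double ticks_per_copy;  // rdtsc ticks per copy (seconds * hz / copies)
    double gb_per_s;        // throughput in 10^9 bytes per second
};

// seconds: value printed by the benchmark; copies: number of iterations;
// bytes: block size; hz: divisor used by the benchmark (3e9 in the listing).
inline CopyCost per_copy_cost(double seconds, double copies,
                              std::size_t bytes, double hz = 3e9)
{
    assert(seconds > 0.0 && copies > 0.0);
    CopyCost c;
    c.ns_per_copy = seconds / copies * 1e9;
    c.ticks_per_copy = seconds * hz / copies;
    c.gb_per_s = static_cast<double>(bytes) * copies / seconds / 1e9;
    return c;
}
```

Because the benchmark derives its seconds from tick counts, `ticks_per_copy` recovers the measured ticks exactly, whatever the real clock rate of the machine was.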

// ConsoleApplication3.cpp : defines the entry point for the console application.
//#include "stdafx.h"
#include <windows.h>
#include <intrin.h>
#include <assert.h>
#include <stdio.h>   // printf (normally pulled in through stdafx.h)
#include <string.h>  // memcpy, memcmp, memset
#include <tchar.h>   // _tmain, _TCHAR

char data80[80] = "abcdefghijklmnopqrstuvwxyz0123456789";
char data400[400] =
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"012345678901234567890123456789012345678";
char data4000[4000] =
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
;
char data401[401] =
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
"abcdefghijklmnopqrstuvwxyz0123456789"
;
char data81[81] = "abcdefghijklmnopqrstuvwxyz0123456789";

// optimize memcpy less than 120bytes
// char a[32], b[32]; a = b;  is faster than memcpy(a, b, sizeof(b));
namespace com
{
    const static size_t _MAXSIZE_ = 80;
    extern void* (*g_base[_MAXSIZE_ + 1])(void *dest, const void *src);
}

inline void *xmemcpy(void *dest, const void *src, size_t len);

namespace com
{
    // A struct of `size` ints; assigning one to another copies the whole
    // block with a length known at compile time.
    template <size_t size>
    struct xmemcpy_t
    {
        int data[size];
    };

    template <>
    struct xmemcpy_t<0>
    {
    };

    template <size_t size>
    class xmemcopy
    {
    public:
        inline static void * copy(void *dest, const void *src)
        {
            // Sizes above _MAXSIZE_: copy _MAXSIZE_-byte chunks, then the remainder.
            if (size > _MAXSIZE_)
            {
                size_t i = 0;
                for (; i + _MAXSIZE_ <= size; i += _MAXSIZE_)
                    xmemcopy<_MAXSIZE_>::copy((char*)dest + i, (const char*)src + i);
                if (size % _MAXSIZE_)
                    xmemcopy<size % _MAXSIZE_>::copy((char*)dest + i, (const char*)src + i);
                return dest;
            }
            // Copy the whole ints by struct assignment, then the 1 to 3 trailing bytes.
            typedef xmemcpy_t<((size - 1) % _MAXSIZE_ + 1) / sizeof(int)> type_t;
            *((type_t *)dest) = *((type_t *)src);
            if ((size % sizeof(int)) > 0) {
                ((char *)dest)[size - 1] = ((char *)src)[size - 1];
            }
            if ((size % sizeof(int)) > 1) {
                ((char *)dest)[size - 2] = ((char *)src)[size - 2];
            }
            if ((size % sizeof(int)) > 2) {
                ((char *)dest)[size - 3] = ((char *)src)[size - 3];
            }
            return dest;
        }
    };

    template <>
    class xmemcopy<0>
    {
    public:
        static void * copy(void *dest, const void *src) { return dest; }
    };

    // Table of copy routines indexed by length (0.._MAXSIZE_).
    void* (*g_base[_MAXSIZE_ + 1])(void *dest, const void *src);

    // Fills g_base[len] down to g_base[0] by template recursion.
    template <size_t len>
    void init()
    {
        g_base[len] = xmemcopy<len>::copy;
        init<len - 1>();
    }

    template <>
    void init<0>()
    {
        g_base[0] = xmemcopy<0>::copy;
    }

    // The constructor of this static object fills the table before main runs.
    struct xmem_monitor
    {
        xmem_monitor() { init<_MAXSIZE_>(); }
    };
    static xmem_monitor g_monitor;
}

// Runtime-length entry point: table dispatch up to _MAXSIZE_ bytes,
// chunked copies up to 10 * _MAXSIZE_ bytes, plain memcpy beyond that.
inline void *xmemcpy(void *dest, const void *src, size_t len)
{
    if (len <= com::_MAXSIZE_) {
        return com::g_base[len](dest, src);
    }
    else if (len <= com::_MAXSIZE_ * 10)
    {
        size_t i = 0;
        for (; i + com::_MAXSIZE_ < len; i += com::_MAXSIZE_)
            com::xmemcopy<com::_MAXSIZE_>::copy((char*)dest + i, (const char*)src + i);
        com::g_base[len - i]((char*)dest + i, (const char*)src + i);
        return dest;
    }
    return ::memcpy(dest, src, len);
}

int _tmain(int argc, _TCHAR* argv[])
{
    SetProcessAffinityMask(GetCurrentProcess(), 2);
    char buffer[10000] = {};

    com::xmemcopy<com::_MAXSIZE_ * 2>::copy(buffer, data400);
    if (memcmp(buffer, data400, com::_MAXSIZE_ * 2))
        __asm int 3;
    com::xmemcopy<com::_MAXSIZE_ * 2 + 1>::copy(buffer, data400);
    if (memcmp(buffer, data400, com::_MAXSIZE_ * 2 + 1))
        __asm int 3;
    com::xmemcopy<400>::copy(buffer, data400);
    if (memcmp(buffer, data400, 400))
        __asm int 3;

    char* volatile pb = buffer;
    char* volatile pb1 = buffer + 1;
    size_t volatile size40 = sizeof(data80);
    size_t volatile size41 = sizeof(data81);
    assert((int)pb % 4 == 0);
    assert((int)pb1 % 4 == 1);
    assert((int)data80 % 8 == 0);
    assert((int)data400 % 8 == 0);
    assert((int)data4000 % 8 == 0);

    for (int i = 0; i < 10; ++i)
    {
        memcpy(pb, data80, size40);
        memcpy(pb, data81, size41);
        memcpy(pb, data400, sizeof(data400));
        memcpy(pb, data401, sizeof(data401));
        memcpy(pb, data4000, sizeof(data4000));
    }
    printf("\n");
    enum { Count = 100000000 };
#if(1)
    {
        auto& dest = data80;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            com::xmemcopy<sizeof(dest)>::copy(pb, dest);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (xmemcopy)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data80;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            xmemcpy(pb, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (xmemcpy)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data80;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            xmemcpy(pb, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (xmemcpy unknownSize)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data80;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (movdq)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data80;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb1, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (movdq  unalign)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data81;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (movdq)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data81;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb1, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (movdq unalign)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data400;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            com::xmemcopy<sizeof(dest)>::copy(pb, dest);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (xmemcopy)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data400;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            com::xmemcopy<sizeof(dest)>::copy(pb1, dest);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (xmemcopy unalign)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
#endif
    memset(pb, 0, 400);
    {
        auto& dest = data400;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            xmemcpy(pb, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (xmemcpy)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data400;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data400;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb1, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs unalign)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data400;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
        {
            __asm
            {
                mov         edi, dword ptr[pb];
                mov         ecx, size data400 / 4;
                mov         esi, dest;
                rep movs    dword ptr es : [edi], dword ptr[esi];
            }
        }
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs handwrite asm)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data401;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, sizeof(dest));
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data401;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
        {
            __asm
            {
                mov         edi, dword ptr[pb];
                mov         ecx, size data401 / 4;
                mov         esi, dest;
                rep movs    dword ptr es : [edi], dword ptr[esi];
                movs        byte ptr es : [edi], byte ptr[esi]
            }
        }
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs handwrite asm)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data80;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
        {
            __asm
            {
                mov         edi, dword ptr[pb];
                mov         ecx, size data80 / 4;
                mov         esi, dest;
                rep movs    dword ptr es : [edi], dword ptr[esi];
            }
        }
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs handwrite asm)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data4000;
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
        {
            __asm
            {
                mov         edi, dword ptr[pb];
                mov         ecx, size data4000 / 4;
                mov         esi, dest;
                rep movs    dword ptr es : [edi], dword ptr[esi];
            }
        }
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (rep movs handwrite asm)\n", sizeof(dest), Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data80;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy)\n", size, Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data81;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy)\n", size, Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data81;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb1, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy unalign)\n", size, Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data400;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy)\n", size, Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data401;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy)\n", size, Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data401;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb1, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy unalign)\n", size, Count / 1000000, t / 3000000000.0);
    }
    {
        auto& dest = data4000;
        size_t volatile size = sizeof(dest);
        __int64 t = __rdtsc();
        for (int i = 0; i < Count; ++i)
            memcpy(pb, dest, size);
        t = __rdtsc() - t;
        printf("All time to memcpy %d * %dM is %0.3fs in 3GHz (call _memcpy)\n", size, Count / 1000000, t / 3000000000.0);
    }
    return 0;
}
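
The listing's g_base table maps every length from 0 to _MAXSIZE_ to a copy routine specialized for that length, so a runtime length becomes one indexed indirect call. A portable sketch of the same dispatch idea follows, under stated assumptions: the `sketch` namespace and the names `kMaxSize`, `copy_n`, `make_table` and `small_copy` are hypothetical, the table is built with C++14 `std::index_sequence` instead of the recursive `init<len>()`, and each entry uses `std::memcpy` with a constant length rather than struct assignment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

namespace sketch {

constexpr std::size_t kMaxSize = 80;  // same limit as _MAXSIZE_ in the listing
using CopyFn = void* (*)(void*, const void*);

// One routine per length; N is a compile-time constant inside each of them.
template <std::size_t N>
void* copy_n(void* dest, const void* src)
{
    std::memcpy(dest, src, N);
    return dest;
}

// Expands to { &copy_n<0>, &copy_n<1>, ..., &copy_n<kMaxSize> }.
template <std::size_t... I>
std::array<CopyFn, sizeof...(I)> make_table(std::index_sequence<I...>)
{
    return {{ &copy_n<I>... }};
}

// Runtime-length entry point: table lookup for short copies, memcpy otherwise.
inline void* small_copy(void* dest, const void* src, std::size_t len)
{
    assert(dest != nullptr && src != nullptr);
    static const auto table = make_table(std::make_index_sequence<kMaxSize + 1>{});
    if (len <= kMaxSize)
        return table[len](dest, src);
    return std::memcpy(dest, src, len);
}

}  // namespace sketch
```

The indirect call is what separates the "xmemcpy unknownSize" line in the results from the "xmemcopy" line: the table saves the length checks inside memcpy, but the call itself cannot be inlined the way the constant-size template can.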




