Talking nonsense: write a faster memcpy

Writing code is sometimes like religious belief: when a belief collapses, it is the most painful thing. Early on I read Cloud Wind's "VC's optimization of memcpy" and "Efficiency geek 2: copying data in C/C++, optimisation", so I could hardly believe that anyone could write a memcpy faster than the one in the C runtime library. But recently two things made me doubt that conviction.

The first is that I recently read the lz4 code. lz4 is probably the fastest in-memory compression algorithm; in some evaluations it is faster than snappy (this analysis is based on the lz4 implementation). Studying its code, I found one important place where it differs from other code: its memory copy is done with a macro instead of memcpy. It casts the pointers to uint64_t and copies by direct assignment, and my personal guess is that this is one of the places where it gains speed. The copy code works roughly as follows; here we do not care about the part that runs past the requested length.

    //ZBYTE_TO_UINT64(ptr) is assumed to access the 8 bytes at ptr as a uint64_t, e.g. (*(uint64_t *)(ptr)).
    //You must ensure that dst has enough space: the copy runs in whole 8-byte blocks,
    //so it writes sz rounded up to a multiple of 8 bytes (and at least 8 bytes).
    #define ZEN_TEST_FAST_COPY(dst,src,sz)  {\
        char *_cpy_dst = (dst); \
        const char *_cpy_src = (src); \
        size_t _cpy_size = (sz); \
        do \
        { \
            ZBYTE_TO_UINT64(_cpy_dst) = ZBYTE_TO_UINT64(_cpy_src); \
            _cpy_dst += sizeof(uint64_t); \
            _cpy_src += sizeof(uint64_t); \
        } while (_cpy_size > sizeof(uint64_t) && (_cpy_size -= sizeof(uint64_t))); \
    }
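
The cast in the macro reads and writes a uint64_t through a char pointer, which the C standard does not guarantee to work for unaligned addresses. As a minimal sketch of the same 8-bytes-at-a-time idea, the version below moves each block through a fixed-size memcpy, which compilers commonly reduce to a single load and store. The names `wild_copy8` and `padded_size` are hypothetical and do not come from lz4 or from the macro above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of bytes wild_copy8 touches for a request
 * of sz bytes, i.e. sz rounded up to a multiple of 8, and at least 8. */
static size_t padded_size(size_t sz)
{
    size_t blocks = (sz + sizeof(uint64_t) - 1) / sizeof(uint64_t);
    return (blocks ? blocks : 1) * sizeof(uint64_t);
}

/* Hypothetical sketch: copy sz bytes in whole 8-byte blocks.
 * Both buffers must be at least padded_size(sz) bytes long, because the
 * last block may run past sz. The buffers must not overlap. */
static void wild_copy8(void *dst, const void *src, size_t sz)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t left = padded_size(sz);

    assert(dst != NULL && src != NULL);
    while (left) {
        uint64_t block;
        memcpy(&block, s, sizeof block);  /* unaligned-safe load */
        memcpy(d, &block, sizeof block);  /* unaligned-safe store */
        d += sizeof block;
        s += sizeof block;
        left -= sizeof block;
    }
}
```

As with the macro, the caller pays for the speed with over-allocation: a 9-byte request writes 16 bytes.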

The second is that I read an article, "Which memcpy is faster?". It appears to find the Linux standard library memcpy beaten by a hand-written copy, which overturned my belief. The original code has a problem: it handles the part that is not a multiple of 8 bytes at the beginning. If the dst and src addresses passed to memcpy are aligned, starting with the odd bytes throws that alignment away, which clearly does not help speed. I modified the code as follows:

    void *ZEN_OS::fast_memcpy(void *dst, const void *src, size_t sz)
    {
        void *r = dst;

        //Copy in uint64_t units first; in the common case the addresses are aligned
        size_t n = sz & ~(sizeof(uint64_t) - 1);
        const uint64_t *src_u64 = (const uint64_t *) src;
        uint64_t *dst_u64 = (uint64_t *) dst;

        while (n)
        {
            *dst_u64++ = *src_u64++;
            n -= sizeof(uint64_t);
        }

        //Copy the remaining bytes that do not fill a whole 8-byte unit,
        //continuing from where the uint64_t loop stopped
        n = sz & (sizeof(uint64_t) - 1);
        const uint8_t *src_u8 = (const uint8_t *) src_u64;
        uint8_t *dst_u8 = (uint8_t *) dst_u64;
        while (n--)
        {
            *dst_u8++ = *src_u8++;
        }

        return r;
    }

The code also contains a similar function; the difference is that it copies two uint64_t values in each loop iteration. For convenience, I call it fast_memcpy2.

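    //n must be a multiple of 2 * sizeof(uint64_t) here, e.g.
    //n = sz & ~(sizeof(uint64_t) * 2 - 1); otherwise n wraps around past zero
    while (n)
    {
        *dst_u64++ = *src_u64++;
        *dst_u64++ = *src_u64++;
        n -= sizeof(uint64_t) * 2;
    }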
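
For reference, here is a sketch of what the complete two-words-per-iteration function might look like in plain C. The rounding of `n` to a multiple of 16 and the 0..15 byte tail loop are my assumptions about how the remainder is handled; only the inner loop comes from the snippet above. Like the word-copy code in this post, it assumes the buffers may legitimately be accessed as uint64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the variant that copies two uint64_t values per iteration.
 * Assumes dst and src are suitably aligned for uint64_t access and do
 * not overlap. */
static void *fast_memcpy2(void *dst, const void *src, size_t sz)
{
    const size_t step = 2 * sizeof(uint64_t);
    size_t n = sz & ~(step - 1);        /* bytes handled by the unrolled loop */
    uint64_t *dst_u64 = (uint64_t *)dst;
    const uint64_t *src_u64 = (const uint64_t *)src;

    assert(dst != NULL && src != NULL);
    while (n) {
        *dst_u64++ = *src_u64++;
        *dst_u64++ = *src_u64++;
        n -= step;
    }

    /* Remaining 0..15 bytes, continuing where the word loop stopped. */
    uint8_t *dst_u8 = (uint8_t *)dst_u64;
    const uint8_t *src_u8 = (const uint8_t *)src_u64;
    for (n = sz & (step - 1); n; n--) {
        *dst_u8++ = *src_u8++;
    }
    return dst;
}
```

Because `n` is rounded down to a multiple of 16 before the loop, the subtraction can never step past zero.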

To find out for certain, I had to test it myself. I tested copies of 8, 16, ..., 64K, 1M, and 4M bytes on Linux 64 (GCC 4.3, -O3), Windows 7 x64 (Visual C++ 2010, Release), and Windows 7 Win32 (Release). The tests used a high-precision timer to collect the data.
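
A minimal sketch of how one such measurement could be taken is shown below. The post does not say which timer was used, so the choice of C11 `timespec_get` is an assumption, and `copy_fn` and `ns_per_copy` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* memcpy, the baseline to pass in */
#include <time.h>

/* Any function with memcpy's signature can be measured. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t sz);

/* Hypothetical helper: average nanoseconds per call of fn for one size.
 * timespec_get(TIME_UTC) is a wall clock, so other system activity and
 * clock adjustments can disturb the result. */
static double ns_per_copy(copy_fn fn, void *dst, const void *src,
                          size_t sz, unsigned long iterations)
{
    struct timespec t0, t1;

    assert(fn != NULL && iterations > 0);
    timespec_get(&t0, TIME_UTC);
    for (unsigned long i = 0; i < iterations; i++) {
        fn(dst, src, sz);
    }
    timespec_get(&t1, TIME_UTC);

    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling `ns_per_copy(memcpy, dst, src, size, iterations)` once per size and per copy routine would give one row of the comparison. An optimizing compiler may shorten a loop of identical copies, so the numbers are worth a sanity check before trusting them.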

The first group of tests used aligned data. The raw numbers are boring, so I will not paste them and go straight to the charts. The slowest method is taken as 100% and the others are plotted as a ratio relative to it, so the higher a line is, the better.
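
The normalisation described above can be sketched as follows, assuming the raw measurement is a time per copy, so that speed is its inverse. The name `relative_speed` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: turn per-method times for one copy size into
 * relative speeds, with the slowest method fixed at 100%. */
static void relative_speed(const double *time_ns, double *percent, size_t count)
{
    double slowest = 0.0;

    assert(time_ns != NULL && percent != NULL);
    for (size_t i = 0; i < count; i++) {
        assert(time_ns[i] > 0.0);
        if (time_ns[i] > slowest) {
            slowest = time_ns[i];
        }
    }
    for (size_t i = 0; i < count; i++) {
        /* Speed is inversely proportional to time. */
        percent[i] = 100.0 * slowest / time_ns[i];
    }
}
```

For example, times of 2, 1, and 4 ns become 200%, 400%, and 100%: the 4 ns method is the slowest and serves as the baseline.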

Linux 64 (GCC 4.3, -O3), copy speed comparison for the aligned case, as in the following chart:

clip_image002

Windows 7 x64, Visual C++ 2010 Release, copy speed for the aligned case, as in the following chart:

clip_image004

Looking at the comparison charts above, you can see that in the aligned case memcpy is a good choice at any size.

But another common kind of memory copy is the unaligned one, and lz4, as a compression algorithm, will probably face unaligned copies often, so I tested that case as well.
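
The post does not show how the unaligned buffers were produced. One simple way, sketched here with hypothetical helper names, is to take a pointer a few bytes past an 8-byte boundary inside a larger buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p sits on an 8-byte boundary. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p % sizeof(uint64_t)) == 0;
}

/* Hypothetical helper: return a pointer inside buf whose address is
 * offset bytes (1..7) past an 8-byte boundary. The result lies fewer
 * than 16 bytes into buf, so buf needs that much spare room. */
static unsigned char *misaligned_in(unsigned char *buf, size_t offset)
{
    uintptr_t addr = (uintptr_t)buf;
    uintptr_t aligned = (addr + 7u) & ~(uintptr_t)7u;  /* round up to 8 */

    assert(offset > 0 && offset < sizeof(uint64_t));
    return buf + (aligned - addr) + offset;
}
```

Using the same offset for source and destination tests a different situation from using two different offsets, so it is worth recording which one a benchmark measures.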

Linux 64 (GCC 4.3, -O3), copy speed for the unaligned case, as in the following chart:

clip_image006

Windows 7 x64, Visual C++ 2010 Release, copy speed for the unaligned case, including the macro copy, as in the following chart:

clip_image008

In the unaligned case, when the copy length is less than 256 bytes, the 8-byte assignment method is slightly faster than memcpy, which is probably why lz4 uses it. For larger copies, memcpy is the better choice. But the copies lz4 faces are mostly short, probably less than 256 bytes, so in theory its macro copy does gain some advantage.

Conclusion:

When the data is aligned, memcpy is the best choice almost all of the time.

Some methods are indeed a little faster than memcpy under specific conditions. But if you do not know which to choose, choose memcpy as the default. The same applies to memset.

On Windows, when the data length reaches 1M or 4M, performance is better than on Linux, especially on the 64-bit platform. (Of course, the Linux test environment was a virtual machine; does that have some effect?) As for possible causes, my rough guess is that the instruction-level optimization on Windows may be ahead of GCC's.

Writing a memcpy faster than the runtime library's is in itself a tall order: your opponent has compiler optimization, code optimization, and instruction-level optimization on its side. The article "Which memcpy is faster?" is probably wrong. Of course, it may have its own context (compiled without optimization?) that the author did not mention.

Reference documents and background reading:

"VC's optimization of memcpy": Cloud Wind's explanation of how the VC compiler optimizes memcpy for various copy lengths.

"Efficiency geek 2: copying data in C/C++, optimisation": a geek's evaluation of several methods; the first part, on memset, is written in more detail. Some of the geek techniques are worth a look.

"The various versions of the memcpy (the underlying optimization)": explains some data copy optimization methods.

"Memcpy C Optimizing Memcpy improves database operation speed": the memcpy in it is not the one we are talking about, but the article also explains ways to speed up data copies.

"Which memcpy is faster?": the article that misled me. I think what it describes must apply to a special scenario. Or is the C library memcpy it talks about not the same thing as the one I have in mind?

"C/C++ tip: How to copy memory quickly": this article also covers the problem, and its conclusion is similar to ours.

[The author is Yanduhantan. In the spirit of freedom, you may reprint this document in full as long as it is not for profit; when reproducing it, please attach the blog link: , or else pay one dollar per word and one hundred per figure, no bargaining. Baidu Library and 360doc pay double.]


Posted by Webster at April 08, 2014 - 4:02 PM