Replacing C's memcpy with a D implementation

Mon Jun 11 04:44:00 UTC 2018

On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

> memcpyD: 1 ms, 725 μs, and 1 hnsec
> memcpyD2: 587 μs and 5 hnsecs
> memcpyASM: 119 μs and 5 hnsecs
>
> Still, the ASM version is much faster.
>

rep movsd is very CPU dependent and needs some preconditions to be
fast. For relatively short memory blocks it sucks on many CPUs
other than the latest Intel ones.

See what Agner Fog has to say about it:

16.10 String instructions (all processors)

String instructions without a repeat prefix are too slow and
should be replaced by simpler instructions. The same applies to
LOOP on all processors and to JECXZ on some processors. REP MOVSD
and REP STOSD are quite fast if the repeat count is not too
small. Always use the largest word size possible (DWORD in 32-bit
mode, QWORD in 64-bit mode), and make sure that both source and
destination are aligned by the word size. In many cases, however,
it is faster to use XMM registers. Moving data in XMM registers
is faster than REP MOVSD and REP STOSD in most cases, especially
on older processors. See page 164 for details.

Note that while the REP MOVS instruction writes a word to the
destination, it reads the next word from the source in the same
clock cycle. You can have a cache bank conflict if bit 2-4 are
the same in these two addresses on P2 and P3. In other words, you
will get a penalty of one clock extra per iteration if
ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid
cache bank conflicts is to align both source and destination by
8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit
mode. On many processors, REP MOVS and REP STOS can perform fast
by moving 16 bytes or an entire cache line at a time. This
happens only when certain conditions are met. Depending on the
processor, the conditions for fast string instructions are,
typically, that the count must be high, both source and
destination must be aligned, the direction must be forward, the
distance between source and destination must be at least the
cache line size, and the memory type for both source and
destination must be either write-back or write-combining (you
can normally assume the latter condition is met). Under these
conditions, the speed is as high as you can obtain with vector
register moves or even faster on some processors.

While the string instructions can be quite convenient, it must be
emphasized that other solutions are faster in many cases. If the
above conditions for fast move are not met then there is a lot to
gain by using other methods. See page 164 for alternatives to REP
MOVS
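
To make the quoted conditions a bit more concrete, here is a rough
sketch in C (not D, and not the code that was benchmarked above).
All names are made up for illustration, and the 512-byte threshold
and 64-byte cache line are assumptions that would have to be
measured per CPU. It shows a check for the "fast string"
preconditions and a copy through XMM registers as the alternative.
The SSE2 path is only compiled in when the compiler defines
__SSE2__; otherwise it falls back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical tuning values: the real break-even count and the
   cache line size depend on the CPU and must be measured. */
enum { REP_MIN_BYTES = 512, CACHE_LINE = 64 };

/* Rough check of the preconditions listed in the quote: high count,
   aligned source and destination, and a distance of at least one
   cache line. Forward direction and a write-back or write-combining
   memory type are simply assumed here. */
int rep_movs_likely_fast(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t dist = d > s ? d - s : s - d;

    return n >= REP_MIN_BYTES
        && d % 16 == 0
        && s % 16 == 0
        && dist >= CACHE_LINE;
}

/* Copy through XMM registers, 16 bytes per iteration, using
   unaligned loads and stores so any pointer is accepted. The
   remaining 0..15 bytes are copied one at a time. The buffers
   must not overlap. */
void *copy_xmm(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__SSE2__)
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
#endif
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A memcpy replacement could call something like
rep_movs_likely_fast first and only take a rep movs path when it
returns true, using the XMM loop for everything else. Whether that
actually wins on a given CPU is something only a benchmark can
tell.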