Replacing C's memcpy with a D implementation
Patrick Schluter
Patrick.Schluter at bbox.fr
Mon Jun 11 04:44:00 UTC 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
> On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
> memcpyD: 1 ms, 725 μs, and 1 hnsec
> memcpyD2: 587 μs and 5 hnsecs
> memcpyASM: 119 μs and 5 hnsecs
>
> Still, the ASM version is much faster.
>
rep movsd is very CPU dependent and needs some preconditions to be
fast. For relatively short memory blocks it sucks on many CPUs
other than the latest Intel ones.
See what Agner Fog has to say about it:
16.10 String instructions (all processors)
String instructions without a repeat prefix are too slow and
should be replaced by simpler instructions. The same applies to
LOOP on all processors and to JECXZ on some processors. REP MOVSD
and REP STOSD are quite fast if the repeat count is not too small.
Always use the largest word size possible (DWORD in 32-bit mode,
QWORD in 64-bit mode), and make sure that both source and
destination are aligned by the word size. In many cases, however,
it is faster to use XMM registers. Moving data in XMM registers is
faster than REP MOVSD and REP STOSD in most cases, especially on
older processors. See page 164 for details.
Note that while the REP MOVS instruction writes a word to the
destination, it reads the next word from the source in the same
clock cycle. You can have a cache bank conflict if bit 2-4 are
the same in these two addresses on P2 and P3. In other words, you
will get a penalty of one clock extra per iteration if
ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache
bank conflicts is to align both source and destination by 8.
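The divisibility rule above is easy to check in code. What follows is only a sketch: the helper name rep_movs_bank_conflict is made up for illustration, and it simply restates the quoted P2/P3 condition with src standing in for ESI and dst for EDI.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the quoted manual): returns true when
   src + word_size - dst is divisible by 32, which is the condition the
   quoted text gives for a cache bank conflict on P2 and P3. */
static bool rep_movs_bank_conflict(const void *src, const void *dst,
                                   size_t word_size)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;

    /* Unsigned arithmetic wraps modulo a power of two, so the test for
       divisibility by 32 stays valid even when dst is above src. */
    return ((s + word_size - d) % 32u) == 0;
}
```

With 4-byte words and both pointers aligned by 8, src - dst is a multiple of 8, so src + 4 - dst is never a multiple of 8 and therefore never of 32. That is why aligning by 8 avoids the penalty.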
Never use MOVSB or MOVSW in optimized code, not even in 16-bit
mode. On many processors, REP MOVS and REP STOS can perform fast by
moving 16 bytes or an entire cache line at a time. This happens
only when certain conditions are met. Depending on the processor,
the conditions for fast string instructions are, typically, that
the count must be high, both source and destination must be
aligned, the direction must be forward, the distance between source
and destination must be at least the cache line size, and the
memory type for both source and destination must be either
write-back or write-combining (you can normally assume the latter
condition is met). Under these conditions, the speed is as high as
you can obtain with vector register moves or even faster on some
processors.
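The conditions listed above could be turned into a rough pre-check before choosing the string-instruction path. This is a sketch under assumptions: the function name and the three constants (64-byte cache line, 8-byte word, 512-byte minimum count) are illustrative guesses, not values from the quoted text, and the real thresholds depend on the processor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tuning values for illustration only; they vary by CPU. */
#define ASSUMED_CACHE_LINE 64u
#define ASSUMED_WORD_SIZE  8u
#define ASSUMED_MIN_COUNT  512u

/* Hypothetical pre-check: true when the size, alignment and distance
   conditions for fast string instructions appear to hold. */
static bool rep_movs_likely_fast(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t dist = d > s ? d - s : s - d;

    bool count_high = n >= ASSUMED_MIN_COUNT;
    bool aligned    = (d % ASSUMED_WORD_SIZE == 0) &&
                      (s % ASSUMED_WORD_SIZE == 0);
    bool far_apart  = dist >= ASSUMED_CACHE_LINE;

    /* Not checked here: the copy direction (assumed forward) and the
       memory type, which the quoted text says can normally be assumed
       to be write-back or write-combining. */
    return count_high && aligned && far_apart;
}
```

A memcpy replacement could call this once per copy and fall back to a vector-register loop when it returns false.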
While the string instructions can be quite convenient, it must be
emphasized that other solutions are faster in many cases. If the
above conditions for fast move are not met then there is a lot to
gain by using other methods. See page 164 for alternatives to REP
MOVS.
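For comparison, moving data through XMM registers might look like the sketch below. It is not the code from page 164 of the manual and not the memcpyD from this thread; copy_xmm is a made-up name, the buffers are assumed not to overlap, and unaligned SSE2 loads and stores are used so no alignment is required. On targets without SSE2 it just falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical XMM-based copy for non-overlapping buffers. */
static void *copy_xmm(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    /* Move 16 bytes per iteration through one XMM register. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Remaining tail bytes, or the whole copy when SSE2 is unavailable. */
    memcpy(d, s, n);
    return dst;
}
```

Whether this beats rep movsd has to be measured on the target CPU, which is the whole point of the quoted section.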