Replacing C's memcpy with a D implementation
Guillaume Piolat
first.last at gmail.com
Mon Jun 11 18:17:49 UTC 2018
BTW, the way memcpy is (was?) implemented in the C runtime that
comes with the Intel C++ compiler was really enlightening about
the sheer difficulty of such a task.
First of all, there isn't one loop but many, selected depending
on the source and destination alignment.
- If both are aligned on 16-byte boundaries, the loads and the
stores would both be done with MOVAPS/MOVDQA, nothing special.
- If only the source or the destination was misaligned, the
function would dispatch to a variant whose core loop loads
16-byte aligned and writes 16-byte unaligned, using the PALIGNR
instruction. However, since PALIGNR takes its byte shift as an
immediate and can't take a runtime value, this variant was
_replicated 16 times_, once per shift amount.
- I don't remember what it did when both source and destination
are misaligned, but you can reduce this case to the one above.
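To make the "replicated 16 times" point concrete, here is a rough sketch in portable C. It is not the Intel code, only an illustration under assumptions: `block16`, `alignr`, `copy_shift_N` and `copy_from_aligned` are hypothetical names, and `alignr` is a plain-C stand-in for what PALIGNR (the `_mm_alignr_epi8` intrinsic) computes. In real SSE code the shift has to be a compile-time constant, which is why each variant below is stamped out with its own constant and selected through a table. `base` stands for the source address rounded down to 16 bytes, and it must hold at least (nblocks + 1) * 16 readable bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block type standing in for an XMM register. */
typedef struct { uint8_t b[16]; } block16;

/* Plain-C stand-in for PALIGNR: concatenate hi:lo into 32 bytes and
   return the 16 bytes starting at byte offset `shift` (0..15). */
static block16 alignr(block16 hi, block16 lo, unsigned shift) {
    uint8_t cat[32];
    block16 out;
    memcpy(cat, lo.b, 16);
    memcpy(cat + 16, hi.b, 16);
    memcpy(out.b, cat + shift, 16);
    return out;
}

/* Shift 0 is the fully aligned case: a straight block copy. */
static void copy_shift_0(uint8_t *dst, const uint8_t *base, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++)
        memcpy(dst + 16 * i, base + 16 * i, 16);
}

/* One copy of the core loop per constant shift S. Each iteration loads
   the next 16-byte block from `base`, stitches it to the previous one,
   and stores 16 bytes to `dst`. */
#define DEFINE_VARIANT(S)                                              \
static void copy_shift_##S(uint8_t *dst, const uint8_t *base,          \
                           size_t nblocks) {                           \
    block16 lo, hi;                                                    \
    memcpy(lo.b, base, 16);                                            \
    for (size_t i = 0; i < nblocks; i++) {                             \
        memcpy(hi.b, base + 16 * (i + 1), 16);                         \
        block16 out = alignr(hi, lo, S);                               \
        memcpy(dst + 16 * i, out.b, 16);                               \
        lo = hi;                                                       \
    }                                                                  \
}

DEFINE_VARIANT(1)  DEFINE_VARIANT(2)  DEFINE_VARIANT(3)
DEFINE_VARIANT(4)  DEFINE_VARIANT(5)  DEFINE_VARIANT(6)
DEFINE_VARIANT(7)  DEFINE_VARIANT(8)  DEFINE_VARIANT(9)
DEFINE_VARIANT(10) DEFINE_VARIANT(11) DEFINE_VARIANT(12)
DEFINE_VARIANT(13) DEFINE_VARIANT(14) DEFINE_VARIANT(15)

typedef void (*copy_fn)(uint8_t *, const uint8_t *, size_t);

/* Dispatch table: the runtime shift only picks which variant runs. */
static const copy_fn variants[16] = {
    copy_shift_0,  copy_shift_1,  copy_shift_2,  copy_shift_3,
    copy_shift_4,  copy_shift_5,  copy_shift_6,  copy_shift_7,
    copy_shift_8,  copy_shift_9,  copy_shift_10, copy_shift_11,
    copy_shift_12, copy_shift_13, copy_shift_14, copy_shift_15
};

/* Copies nblocks * 16 bytes starting at base + shift into dst. */
void copy_from_aligned(uint8_t *dst, const uint8_t *base,
                       unsigned shift, size_t nblocks) {
    variants[shift & 15u](dst, base, nblocks);
}
```

In a real implementation the caller would compute the shift as the low 4 bits of the source address, and the loads from `base` would be true aligned loads. The portable version above does not depend on alignment at all; it only shows the shape of the dispatch.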
Each of these loops had a complicated prelude that does the first
iteration, and those are very hard to write by hand.
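For what a prelude has to do in its simplest form, here is a minimal sketch in portable C, again under assumptions and not taken from the Intel runtime: `head_bytes` and `copy_with_prelude` are hypothetical names. The prelude copies just enough leading bytes to bring the destination to a 16-byte boundary, the core loop then moves whole 16-byte blocks, and a short tail finishes the remainder. The real preludes were far more involved, since they also had to set up the state for the PALIGNR loops.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of leading bytes to copy so that addr + result is 16-byte
   aligned, capped at n so short copies never overrun. */
static size_t head_bytes(const void *addr, size_t n) {
    size_t mis = (size_t)((uintptr_t)addr & 15u);
    size_t head = mis ? 16u - mis : 0u;
    return head < n ? head : n;
}

void *copy_with_prelude(void *dst, const void *src, size_t n) {
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Prelude: byte copy until the destination is 16-byte aligned. */
    size_t head = head_bytes(d, n);
    for (size_t i = 0; i < head; i++)
        d[i] = s[i];
    d += head;
    s += head;
    n -= head;

    /* Core loop: whole 16-byte blocks, destination now aligned. */
    for (size_t blocks = n / 16; blocks > 0; blocks--) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
    }

    /* Tail: the remaining 0..15 bytes. */
    for (size_t i = 0; i < n % 16; i++)
        d[i] = s[i];

    return dst;
}
```

Even this toy version has three separate code paths and an edge case for copies shorter than the head, which gives some idea of why the hand-written ones were painful.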
It was also the only piece of assembly I've seen that
(apparently) successfully used the "prefetch" instructions.
This was just the SSE version; the AVX one was different.
I don't know if someone really wrote this code, or if it was all
from intrinsics.