Replacing C's memcpy with a D implementation

Mon Jun 11 18:17:49 UTC 2018

BTW, the way memcpy is (was?) implemented in the C runtime that 
ships with the Intel C++ compiler was really enlightening about 
the sheer difficulty of such a task.

First of all, there isn't one loop but many, depending on the 
source and destination alignment.

- If both are aligned on 16-byte boundaries, the source and 
destination operands would be accessed with MOVAPS/MOVDQA, 
nothing special.
- If only the source or destination was misaligned, the function 
would dispatch to a variant whose core loop loads 16 bytes 
aligned and writes 16 bytes unaligned, using the PALIGNR 
instruction. However, since PALIGNR takes its shift count as an 
immediate and not as a runtime value, this variant was 
_replicated 16 times_.
- I don't remember what happened when both source and destination 
were misaligned, but you can reduce that case to the one above.
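
To make the PALIGNR point concrete, here is a minimal sketch in C with intrinsics. It is not Intel's code. All names (copy_shift_N, copy_blocks, check_all_offsets) are made up for illustration, and the target attribute assumes GCC or Clang. It generates one loop per nonzero source offset (15 of them, with offset 0 taking the plain aligned path), because _mm_alignr_epi8 only accepts a compile-time constant. Head and tail handling is left out: the loops read whole aligned 16-byte blocks, so they touch a few bytes before src and past the end of the copied range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h> /* SSSE3, provides _mm_alignr_epi8 (PALIGNR) */

/* Assumption: GCC/Clang, so the file builds without -mssse3. */
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Offset 0: the source is already 16-byte aligned, no shifting needed. */
SSSE3_FN static void copy_shift_0(uint8_t *dst, const uint8_t *src,
                                  size_t blocks)
{
    for (size_t i = 0; i < blocks; i++)
        _mm_storeu_si128((__m128i *)(dst + 16 * i),
                         _mm_load_si128((const __m128i *)(src + 16 * i)));
}

/* Offsets 1..15: src - N is 16-byte aligned, so every load is aligned.
 * Two neighbouring aligned blocks are stitched together with PALIGNR.
 * N must be a literal, hence one generated function per offset. */
#define DEFINE_SHIFTED_COPY(N)                                             \
    SSSE3_FN static void copy_shift_##N(uint8_t *dst, const uint8_t *src,  \
                                        size_t blocks)                     \
    {                                                                      \
        const __m128i *s = (const __m128i *)(src - N);                     \
        __m128i lo = _mm_load_si128(s);                                    \
        for (size_t i = 0; i < blocks; i++) {                              \
            __m128i hi = _mm_load_si128(s + i + 1);                        \
            _mm_storeu_si128((__m128i *)(dst + 16 * i),                    \
                             _mm_alignr_epi8(hi, lo, N));                  \
            lo = hi;                                                       \
        }                                                                  \
    }

#define FOR_EACH_SHIFT(X) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) \
                          X(9) X(10) X(11) X(12) X(13) X(14) X(15)
FOR_EACH_SHIFT(DEFINE_SHIFTED_COPY)

/* Runtime dispatch on the source misalignment, once, outside the loop. */
#define SHIFT_CASE(N) case N: copy_shift_##N(dst, src, blocks); break;
static void copy_blocks(uint8_t *dst, const uint8_t *src, size_t blocks)
{
    switch ((uintptr_t)src & 15) {
    case 0: copy_shift_0(dst, src, blocks); break;
    FOR_EACH_SHIFT(SHIFT_CASE)
    }
}

/* Copies 64 bytes from every possible source offset and counts how many
 * copies match. The backing array is large enough for the block-wise
 * over-read described above. */
static int check_all_offsets(void)
{
    static __m128i storage[8]; /* 128 bytes, 16-byte aligned */
    uint8_t *base = (uint8_t *)storage;
    uint8_t dst[64];
    int ok = 0;

    assert(((uintptr_t)base & 15) == 0);
    for (size_t i = 0; i < sizeof storage; i++)
        base[i] = (uint8_t)(i * 7 + 1);

    for (size_t off = 0; off < 16; off++) {
        memset(dst, 0, sizeof dst);
        copy_blocks(dst, base + off, 4);
        if (memcmp(dst, base + off, sizeof dst) == 0)
            ok++;
    }
    return ok;
}
```

The switch runs once per call, so the cost of the replication is code size, not per-iteration branching.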

Each of these loops had a complicated prelude that does the first 
iteration, and those are very hard to write by hand.

It was also the only piece of assembly I've seen that 
(apparently) successfully used the "prefetch" instructions.

This was just the SSE version; AVX was different.

I don't know if someone really wrote this code by hand, or if it 
was all generated from intrinsics.