Best interface for memcpy() (and the string.h family of functions)

Thu May 30 00:55:54 UTC 2019

On Wednesday, 29 May 2019 at 23:27:35 UTC, Jonathan Marler wrote:
>
> I haven't benchmarked it yet but here's the changes I've made 
> to my standard library to also take advantage of alignment 
> guarantees from typed pointers and arrays.
>
> https://github.com/dragon-lang/mar/commit/bb096d2d4f489d47177f6a678b1d9bab756e3dc7
>

Good, this week I'm also working on alignment (more specifically,
misalignment).
Since you took the time to play with alignment anyway, you might
find SIMD instructions useful.
Take a look at Mike's memcpyD. The toy SIMD copy I wrote yesterday,
which beat libc memcpy, was as simple as:

static foreach(i; 0 .. T.sizeof/32) {
     // Assuming RDI is 'dst' and RSI 'src'
     asm pure nothrow @nogc {
         vmovdqa YMM0, [RSI+i*32];  // load 32 bytes from src
         vmovdqa [RDI+i*32], YMM0;  // store them to dst
     }
}
/* instead of
static foreach(i; 0 .. T.sizeof/32)
{
     memcpyD((cast(S!32*)dst) + i, (cast(const S!32*)src) + i);
}
*/

Again, really simple and dumb, but effective. A couple of notes,
so that you don't have the headaches I had:
1) You can use `vmovdqu` (notice the 'u' at the end) for
unaligned memory and skip note 2.
2) `vmovdqa` on YMM registers assumes 32-byte aligned memory. Now,
`align()` is kind of buggy, so if you have a normal buffer on the
stack that you want to align, this:
align(32) ubyte[32768] buf;
won't work.
One solution is to allocate memory on the heap and do a bit of
pointer arithmetic to get an aligned address.
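
To make the heap workaround from note 2 concrete, here is a minimal
sketch of what I mean. It assumes plain C malloc/free; the names
`alignment` and `alignUp` are just made up for illustration, they are
not from memcpyD or druntime:

```d
import core.stdc.stdlib : malloc, free;

// Alignment vmovdqa needs for YMM operands (must be a power of two).
enum size_t alignment = 32;

/// Hypothetical helper: rounds `p` up to the next multiple of `alignment`.
ubyte* alignUp(ubyte* p) pure nothrow @nogc
{
    auto addr = cast(size_t) p;
    return cast(ubyte*) ((addr + alignment - 1) & ~(alignment - 1));
}

void main()
{
    enum size_t size = 32768;

    // Over-allocate by alignment-1 bytes so that an aligned block of
    // `size` bytes always fits somewhere inside the allocation.
    auto raw = cast(ubyte*) malloc(size + alignment - 1);
    assert(raw !is null);
    scope(exit) free(raw); // free the original pointer, not the aligned one

    ubyte* buf = alignUp(raw);
    assert((cast(size_t) buf) % alignment == 0);
    assert(buf >= raw && buf + size <= raw + size + alignment - 1);
}
```

The only thing to be careful about is keeping the original pointer
around, since that is the one you have to pass to free().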

Last minute discovery:
Haha, the compiler flags I used were: -mcpu=avx -inline
With these flags, memcpyD is faster.
_Removing_ -inline resulted in faster code for libc memcpy. I'll
have to take a closer look tomorrow.
(Oh, and judging from the disassembly, the libc memcpy achieves
these results with SSE3, so 128-bit instructions. That's
impressive, to say the least.)