Replacing C's memcpy with a D implementation

David Nadlinger code at klickverbot.at
Sun Jun 10 18:16:42 UTC 2018


On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
> I'm not experienced with this kind of programming, so I'm 
> doubting these results.  Have I done something wrong?  Am I 
> overlooking something?

You've just discovered that one can rarely be careful enough 
about what exactly is being benchmarked, nor about gathering 
enough statistics.

For example, check out the following output from running your 
program on macOS 10.12, compiled with LDC 1.8.0:

---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs
$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs
$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs
$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec
$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amount of noise, the only conclusion one 
can draw from this is that memcpyD is the slowest, followed by 
the ASM implementation.
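
To get numbers that can actually be compared, it helps to repeat 
each measurement many times and look at the minimum (or the whole 
distribution) rather than a single wall-clock reading. Here is a 
minimal sketch using Phobos' std.datetime.stopwatch.benchmark; 
copyC and copyNaive are hypothetical stand-ins for the functions 
from your program, and the buffer size and iteration counts are 
arbitrary:

---
import core.time : Duration;
import std.algorithm.comparison : min;
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;

enum size = 4096;
ubyte[size] src, dst;

// Hypothetical stand-ins for the benchmarked copy routines.
void copyC()     { import core.stdc.string : memcpy; memcpy(dst.ptr, src.ptr, size); }
void copyNaive() { foreach (i; 0 .. size) dst[i] = src[i]; }

void main()
{
    // Repeat the whole benchmark run several times and keep the minimum,
    // which is far less sensitive to scheduler and cache noise than a
    // single measurement.
    Duration bestC = Duration.max, bestNaive = Duration.max;
    foreach (run; 0 .. 50)
    {
        auto r = benchmark!(copyC, copyNaive)(10_000);  // total time for 10_000 calls each
        bestC     = min(bestC, r[0]);
        bestNaive = min(bestNaive, r[1]);
    }
    writefln("memcpy: %s  naive loop: %s", bestC, bestNaive);
}
---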

In fact, memcpyC and memcpyNaive produce exactly the same machine 
code (without bounds checking), as LLVM recognizes the loop and 
lowers it into a memcpy. memcpyDstdAlg instead gets turned into a 
vectorized loop, for reasons I didn't investigate any further.
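
For reference, a loop of roughly this shape is what LLVM's 
loop-idiom recognition rewrites into a call to memcpy once 
optimizations are on and bounds checks are disabled (e.g. 
ldc2 -O3 -release -boundscheck=off); this is a from-scratch 
sketch, not your exact code:

---
// A naive byte-copy loop; with bounds checks off, LLVM lowers this
// whole loop into a single memcpy call.
void memcpyNaive(ubyte[] dst, const(ubyte)[] src)
{
    foreach (i; 0 .. src.length)
        dst[i] = src[i];
}
---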

  — David



