Replacing C's memcpy with a D implementation

Sun Jun 10 22:23:08 UTC 2018

On 6/10/2018 11:16 AM, David Nadlinger wrote:
> Because of the large amounts of noise, the only conclusion one can draw from 
> this is that memcpyD is the slowest,

Probably because it does a memory allocation.

> followed by the ASM implementation.

The CPU makers abandoned optimizing the REP instructions decades ago, and just 
left the clunky implementations there for backwards compatibility.

> In fact, memcpyC and memcpyNaive produce exactly the same machine code (without 
> bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. 
> memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't 
> investigate any further.

This amply illustrates my other point: looking at the generated assembly is 
crucial to understanding what's happening.
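
As a rough illustration of that kind of inspection, here is a minimal sketch in C. It is not the benchmark code from this thread: `copy_naive` is a hypothetical stand-in for a byte-by-byte loop in the spirit of memcpyNaive, and the `restrict` qualifiers are an assumption added so the compiler may treat the buffers as non-overlapping.

```c
#include <stddef.h>

/* Hypothetical byte-by-byte copy loop, for illustration only.
 * `restrict` promises the compiler that dst and src do not overlap,
 * which is what lets an optimizer consider replacing the whole loop
 * with a call to memcpy. */
void copy_naive(unsigned char *restrict dst,
                const unsigned char *restrict src,
                size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

Compiling this with something like `clang -O2 -S copy_naive.c -o -` prints the generated assembly. With an LLVM-based compiler the loop body may be gone entirely, replaced by a call or jump to memcpy, but that depends on the compiler version and flags, so the listing is the thing to check rather than the source.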