Replacing C's memcpy with a D implementation

Sun Jun 10 23:39:13 UTC 2018

On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
> On 6/10/2018 11:16 AM, David Nadlinger wrote:
>> Because of the large amounts of noise, the only conclusion one 
>> can draw from this is that memcpyD is the slowest,
>
> Probably because it does a memory allocation.

Of course; that was already pointed out earlier in the thread.

> The CPU makers abandoned optimizing the REP instructions 
> decades ago, and just left the clunky implementations there for 
> backwards compatibility.

That's not entirely true. Intel started optimising some of the
REP string instructions again with Ivy Bridge and later
microarchitectures. There is a CPUID feature bit that indicates
this, ERMS ("Enhanced REP MOVSB/STOSB", CPUID leaf 7, EBX bit 9),
and Intel's Optimization Reference Manual has further details.
From what I remember, `rep movsb` is supposed to beat an AVX loop
on most recent Intel µarchs if the destination is aligned and the
data is longer than a few cache lines. I've never measured that
myself, though.
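If you want to try this on your own machine, here is a minimal C sketch (C because the thread's baseline is C's memcpy). It checks the ERMS bit and copies with a single `rep movsb`. It assumes GCC or Clang on x86/x86-64 with `<cpuid.h>`. The names `has_erms` and `copy_rep_movsb` are hypothetical, and this is neither memcpyD nor any D runtime code. It only shows the mechanism and says nothing about speed, which would need a real benchmark across sizes and alignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Hypothetical helper: returns 1 if the CPU advertises ERMS
   (Enhanced REP MOVSB/STOSB), i.e. CPUID.(EAX=07H, ECX=0):EBX bit 9
   is set; returns 0 otherwise or on non-x86 targets. */
static int has_erms(void)
{
#if HAVE_X86
    unsigned int eax, ebx, ecx, edx;
    /* Leaf 7 only exists if the maximum basic CPUID leaf is >= 7. */
    if (__get_cpuid_max(0, NULL) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;
    return (int)((ebx >> 9) & 1u);
#else
    return 0;
#endif
}

/* Hypothetical helper: copies n bytes from src to dst. On x86 it
   issues a single `rep movsb` (RDI = dst, RSI = src, RCX = n; the
   ABI guarantees the direction flag is clear on entry, so the copy
   runs forward). Elsewhere it falls back to memcpy. As with memcpy,
   the two regions must not overlap. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(n == 0 ||
           (uintptr_t)dst + n <= (uintptr_t)src ||
           (uintptr_t)src + n <= (uintptr_t)dst);
#if HAVE_X86
    void *d = dst; /* rep movsb advances RDI/RSI and counts RCX down */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}
```

If my recollection above is right, `rep movsb` would only be expected to win for aligned copies longer than a few cache lines. A real replacement would therefore probably dispatch on size and use a different strategy for short copies.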

  — David