memcpy vs slice copy
Don
nospam at nospam.com
Tue Mar 17 01:12:55 PDT 2009
Christopher Wright wrote:
> bearophile wrote:
>> Don:
>>> Which means that memcpy probably isn't anywhere near optimal, either.<
>>
>> Some time ago I read an article by AMD showing that modern CPUs can
>> indeed copy much faster, using vector asm instructions, loop
>> unrolling and explicit cache prefetching (but only for longer
>> arrays: that kind of copy is overkill, slow, and too much code for
>> the cache on smaller copies). Profile-driven optimization can tell
>> you whether a particular copy site usually moves a lot of data, and
>> use the fast strategy there. As an alternative the programmer could
>> annotate which copy strategy to use, but that is not nice.
Not necessary. If the length is long enough to benefit from that, the
function call overhead of calling memcpy() is negligible. So you just
need to include all the cases inside memcpy.
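The prefetch-and-unroll strategy bearophile describes can be sketched in C. This is a hedged illustration, not the AMD article's actual code: `__builtin_prefetch` is a GCC/Clang extension, and the 32-byte unroll and 256-byte prefetch distance are placeholder tuning values, not measured ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a large-copy loop: unrolled body plus an explicit cache
   prefetch well ahead of the current position.  Real implementations
   use wide vector stores; the fixed-size memcpy here compiles to wide
   moves on optimizing compilers.  Prefetching past the end of src is
   safe: prefetch is only a hint and never faults. */
static void copy_large(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        /* read hint, low temporal locality, ~256 bytes ahead */
        __builtin_prefetch(src + i + 256, 0, 0);
        memcpy(dst + i, src + i, 32);
    }
    /* tail: whatever is left over */
    memcpy(dst + i, src + i, n - i);
}
```

As the quoted text notes, this only pays off for long arrays; for short copies the extra code just pollutes the instruction cache.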
>> Bye,
>> bearophile
>
> You could probably get good results by seeing how long the array is,
> though.
Yes. If the length is known at compile time, and it's short, inline with
special-case code for the small sizes. If it's long, just put in a call
to memcpy.
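That dispatch can be sketched in C. The threshold of 16 bytes is an assumption for illustration; the point is only the shape: a byte loop the compiler can fully unroll for small sizes, a library call for everything else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the strategy above: special-case small copies inline,
   fall through to memcpy when the length is large (where the call
   overhead is negligible).  A compiler doing this at codegen time
   would branch on a compile-time constant and emit only one side. */
static void copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n <= 16) {
        /* small: simple loop, trivially unrolled when n is known */
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        memcpy(dst, src, n);   /* large: let the library handle it */
    }
}
```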
I looked at the code in the DMD backend, the generation of the memcpy
intrinsic is in cdmemcpy() in cod2.c.
But as far as I can tell, by the time that code is called, the length
isn't a constant any more. So I don't think it'd be easy to fix.
On Core2,
rep movsb transfers 1 byte every 3 clocks -- ie 0.3 bytes per clock (!)
rep movsd/rep movsq transfers 1 dword/qword every 0.63 clocks -- ie best
case is 13 bytes/clock.
On AMD you get one transfer per clock -- max 8 bytes per clock with rep
movsq, but only 1 byte per clock with rep movsb.
So rep movsb is _unbelievably_ slow. A simple D for loop is probably
quicker.
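For comparison, here are the two contenders side by side in C: rep movsb via GNU-style inline asm (x86-64 only; the asm syntax and guard are assumptions about the toolchain) and the simple per-byte loop. Both copy correctly; the measurements above say the plain loop wins on Core2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* rep movsb, the slow case measured above (~0.3 bytes/clock on
   Core2).  Falls back to a loop off x86-64 so the code still builds
   everywhere. */
static void copy_rep_movsb(unsigned char *dst, const unsigned char *src,
                           size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
#endif
}

/* The "simple for loop": one byte per iteration, roughly 1 byte per
   clock on the cores discussed -- already ahead of rep movsb. */
static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```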
More information about the Digitalmars-d
mailing list