memcpy vs slice copy
Don
nospam at nospam.com
Mon Mar 16 06:53:00 PDT 2009
Sergey Gromov wrote:
> Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
>
>> Sergey Gromov wrote:
>>> Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
>>>
>>>> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>>>>
>>>>> While doing some string processing I've seen some unusual timings
>>>>> compared to the C code, so I have written this to see the situation
>>>>> better. When USE_MEMCPY is false this little benchmark runs about 3+
>>>>> times slower:
>>>> I did a little benchmark:
>>>>
>>>> ldc -release -O5
>>>> true: 0.51
>>>> false: 0.63
>>>>
>>>> dmd -release -O
>>>> true: 4.47
>>>> false: 3.58
>>>>
>>>> I don't see a very big difference between slice copying and memcpy (but
>>>> between compilers).
>>>>
>>>> Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
>>> The original benchmark swapped insanely on my 1GB laptop so I've cut the
>>> number of iterations in half, to 50_000_000. Compiled with -O -release
>>> -inline. Results:
>>>
>>> slice: 2.31
>>> memcpy: 0.73
>>>
>>> That's 3 times difference. Disassembly:
>>>
>>> slice:
>>> L31:    mov     ECX,EDX
>>>         mov     EAX,6
>>>         lea     ESI,010h[ESP]
>>>         mov     ECX,EAX
>>>         mov     EDI,EDX
>>>         rep
>>>         movsb
>>>         add     EDX,6
>>>         add     EBX,6
>>>         cmp     EBX,011E1A300h
>>>         jb      L31
>>>
>>> memcpy:
>>> L35:    push    6
>>>         lea     ECX,014h[ESP]
>>>         push    ECX
>>>         push    EBX
>>>         call    near ptr _memcpy
>>>         add     EBX,6
>>>         add     ESI,6
>>>         add     ESP,0Ch
>>>         cmp     ESI,011E1A300h
>>>         jb      L35
>>>
>>> Seems like rep movsb is /way/ sub-optimal for copying data.
>> Definitely! The difference ought to be bigger than a factor of 3. Which
>> means that memcpy probably isn't anywhere near optimal, either.
>> rep movsd is always 4 times quicker than rep movsb. There's a range of
>> lengths for which rep movsd is optimal; outside that range, there are
>> other options which are even faster.
>>
>> So there's a factor of 4-8 speedup available on most memory copies.
>> Low-hanging fruit! <g>
>
> Don't disregard the function call overhead. memcpy is called 50 M
> times, copying only 6 bytes per call.
Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's
six bytes -- it's in the asm. Blimey. It should just be doing that as a
direct sequence of loads and stores, for anything up to at least 8 bytes.
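
To make "a direct sequence of loads and stores" concrete, here is a minimal sketch in C (not D, and not what either compiler emits). The names copy6 and copy_bytes are hypothetical, made up for illustration. copy6 moves exactly 6 bytes as one 4-byte and one 2-byte load/store pair; the fixed-size memcpy calls are only there because that is the portable way to write an unaligned load or store in C, and an optimizing compiler will typically turn each one into a single mov, though that is not guaranteed. copy_bytes is the byte-at-a-time equivalent, roughly what rep movsb does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy exactly 6 bytes as a 4-byte move followed by
   a 2-byte move. Both loads happen before both stores. */
void copy6(void *dst, const void *src)
{
    uint32_t lo;   /* bytes 0..3 */
    uint16_t hi;   /* bytes 4..5 */

    assert(dst != NULL && src != NULL);

    memcpy(&lo, src, sizeof lo);
    memcpy(&hi, (const unsigned char *)src + 4, sizeof hi);
    memcpy(dst, &lo, sizeof lo);
    memcpy((unsigned char *)dst + 4, &hi, sizeof hi);
}

/* Hypothetical reference: copy n bytes one at a time, roughly the work
   that rep movsb performs. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
}
```

In the benchmark loop, the slice copy or memcpy call would be replaced by something like copy6(dst + i, src), with i stepping by 6. Whether that actually beats either version would have to be measured.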