memcpy vs slice copy

Mon Mar 16 06:53:00 PDT 2009

Sergey Gromov wrote:
> Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
> 
>> Sergey Gromov wrote:
>>> Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
>>>
>>>> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>>>>
>>>>> While doing some string processing I've seen some unusual timings
>>>>> compared to the C code, so I have written this to see the situation
>>>>> better. When USE_MEMCPY is false this little benchmark runs about 3+
>>>>> times slower:
>>>> I did a little benchmark:
>>>>
>>>> ldc -release -O5
>>>> true: 0.51
>>>> false: 0.63
>>>>
>>>> dmd -release -O
>>>> true: 4.47
>>>> false: 3.58
>>>>
>>>> I don't see a very big difference between slice copying and memcpy (but
>>>> I do see one between compilers).
>>>>
>>>> Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
>>> The original benchmark swapped insanely on my 1GB laptop so I've cut the
>>> number of iterations in half, to 50_000_000.  Compiled with -O -release
>>> -inline.  Results:
>>>
>>> slice: 2.31
>>> memcpy: 0.73
>>>
>>> That's a factor of 3 difference.  Disassembly:
>>>
>>> slice:
>>> L31:            mov     ECX,EDX
>>>                 mov     EAX,6
>>>                 lea     ESI,010h[ESP]
>>>                 mov     ECX,EAX
>>>                 mov     EDI,EDX
>>>                 rep
>>>                 movsb
>>>                 add     EDX,6
>>>                 add     EBX,6
>>>                 cmp     EBX,011E1A300h
>>>                 jb      L31
>>>
>>> memcpy:
>>> L35:            push    6
>>>                 lea     ECX,014h[ESP]
>>>                 push    ECX
>>>                 push    EBX
>>>                 call    near ptr _memcpy
>>>                 add     EBX,6
>>>                 add     ESI,6
>>>                 add     ESP,0Ch
>>>                 cmp     ESI,011E1A300h
>>>                 jb      L35
>>>
>>> Seems like rep movsb is /way/ sub-optimal for copying data.
>> Definitely! The difference ought to be bigger than a factor of 3. Which 
>> means that memcpy probably isn't anywhere near optimal, either.
>> rep movsd is always 4 times quicker than rep movsb. There's a range of 
>> lengths for which rep movsd is optimal; outside that range, there are
>> other options which are even faster.
>>
>> So there's a factor of 4-8 speedup available on most memory copies. 
>> Low-hanging fruit! <g>
> 
> Don't disregard the function call overhead.  memcpy is called 50 M
> times, copying only 6 bytes per call.

Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
six bytes -- it's in the asm. Blimey. It should just be doing that as a 
direct sequence of loads and stores, for anything up to at least 8 bytes.
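
To make "a direct sequence of loads and stores" concrete, here is a minimal sketch in C (not D, and not what DMD emits). The names copy6 and copy6_matches_memcpy are hypothetical, made up for illustration. The idea is only that a copy whose length is known to be 6 at compile time needs neither a rep movsb loop nor a call to memcpy; a compiler is free to merge these six byte moves into, say, one 4-byte and one 2-byte move.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy exactly 6 bytes with no loop and no call.
   The length is fixed at compile time, so the copy is written out as
   six explicit load/store pairs. */
static void copy6(unsigned char *dst, const unsigned char *src)
{
    assert(dst != NULL && src != NULL);
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
    dst[4] = src[4];
    dst[5] = src[5];
}

/* Hypothetical check: returns 1 if copy6 writes the same bytes
   that memcpy writes for a 6-byte chunk, 0 otherwise. */
static int copy6_matches_memcpy(void)
{
    const unsigned char src[6] = { 'a', 'b', 'c', 'd', 'e', 'f' };
    unsigned char a[6] = { 0 };
    unsigned char b[6] = { 0 };

    copy6(a, src);
    memcpy(b, src, sizeof b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling copy6_matches_memcpy() from a test harness only confirms that the two approaches produce identical bytes; it says nothing about timing, which would have to be measured separately for each compiler.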