[Issue 2313] Poor array ~= append performance

Wed Aug 27 03:06:02 PDT 2008

d-bugmail at puremagic.com wrote:
> http://d.puremagic.com/issues/show_bug.cgi?id=2313
> 
> 
> 
> 
> 
> ------- Comment #4 from bugzilla at digitalmars.com  2008-08-27 01:05 -------
> Why is byte[] = byte[] slower than memcpy? The answer isn't very simple.
> Consider this program:
> 
> import std.c.string;
> 
> long timer()
> {
>     asm
>     {   naked                   ;
>         rdtsc                   ;
>         ret                     ;
>     }
> }
> 
> void test1(byte[] a, byte[] b)
> {
>     a[] = b[];
> }
> 
> void test2(byte[] a, byte[] b)
> {
>     memcpy(a.ptr, b.ptr, a.length);
> }
> 
> void main()
> {
>     for (int i = 4; i < 100_000_000; i *= 2)
>     {
>         auto a = new byte[i];
>         auto b = new byte[i];
> 
>         auto start = timer();
>         test1(a, b);
>         auto end = timer();
>         auto r1 = end - start;
> 
>         start = timer();
>         test2(a, b);
>         end = timer();
>         auto r2 = end - start;
> 
>         printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
>     }
> }
> 
> Running this program produces:
> 
> i:        4,    []=[]:      144,        memcpy:      568
> i:        8,    []=[]:      144,        memcpy:      300
> i:       16,    []=[]:      172,        memcpy:      324
> i:       32,    []=[]:      204,        memcpy:      344
> i:       64,    []=[]:      288,        memcpy:      276
> i:      128,    []=[]:      288,        memcpy:      272
> i:      256,    []=[]:      352,        memcpy:      364
> i:      512,    []=[]:      372,        memcpy:      424
> i:     1024,    []=[]:      552,        memcpy:      564
> i:     2048,    []=[]:      684,        memcpy:     1384
> i:     4096,    []=[]:     1344,        memcpy:     1772
> i:     8192,    []=[]:     2900,        memcpy:     3216
> i:    16384,    []=[]:     5292,        memcpy:     5472
> i:    32768,    []=[]:    11496,        memcpy:    10388
> i:    65536,    []=[]:    29484,        memcpy:    27480
> i:   131072,    []=[]:   110464,        memcpy:    67784
> i:   262144,    []=[]:   655580,        memcpy:   562400
> i:   524288,    []=[]:  1204124,        memcpy:  1107256
> i:  1048576,    []=[]:  2364588,        memcpy:  2272552
> i:  2097152,    []=[]:  4516440,        memcpy:  4417764
> i:  4194304,    []=[]:  8996992,        memcpy:  8817176
> i:  8388608,    []=[]: 20223908,        memcpy: 17717748
> i: 16777216,    []=[]: 35774952,        memcpy: 36094652
> i: 33554432,    []=[]: 71008068,        memcpy: 71246896
> i: 67108864,    []=[]: 142982284,       memcpy: 145473300
> 
> There's not much of a consistent conclusion to be drawn from that.
Except, we can conclude that
(1) Walter's machine has a 64Kb L1 data cache. The penalty for a cache 
miss is 1.5 clocks. It's probably an AMD CPU. Judging by the timing, it 
looks like a K8 (Hammer) <g>
(2) neither a[] = b[], nor memcpy(), attempt to optimise for cache 
misses. Both look like rep movsd; to me.

BTW,
(3) rtdsc doesn't serialise, so the counts for low numbers are pretty 
much garbage. You need to stick a mov EAX, 0; cpuid; in there.
(4) cache effects are giving memcpy a big advantage. If you swap the 
order of test1 and test2, you'll probably find the order reverses.

There's potential to do something about (2). Not easy though.