DMD and GDC are unnecessarily using heap allocations for closures

Tue May 31 02:00:39 UTC 2022

On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:
> On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
>> $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
>> 55836809328
>>
>> real	0m6.520s
>> user	0m6.519s
>> sys	0m0.000s
>>
>> What do you think about all of this?
>
> Out of curiosity, are you linking in phobos statically or 
> dynamically?  You can force either with `-static-libphobos` or 
> `-shared-libphobos`.

It's statically linked with libphobos. Both GDC and LDC can 
inline everything here. One major difference is that LDC is 
also able to eliminate the GC allocation for the closure: 
https://d.godbolt.org/z/x1jK1M149
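
To illustrate what kind of allocation is meant, here is a minimal sketch. It is not the test.d from this thread, and the names `makeScaler`, `sumWith` and `compute` are made up for the example. A delegate that escapes the function capturing a local needs a GC-allocated closure, while a delegate passed as a `scope` parameter does not, which the `@nogc` attribute on `compute` lets the compiler check:

```d
import std.stdio : writeln;

// The returned delegate outlives this call, so the captured `k`
// has to be moved into a closure allocated on the GC heap.
long delegate(long) makeScaler(long k)
{
    return (long x) => x * k;
}

// `scope` promises that `dg` does not escape this function.
long sumWith(scope long delegate(long) @nogc dg, long n) @nogc
{
    long total = 0;
    foreach (i; 0 .. n)
        total += dg(i);
    return total;
}

// Compiles as @nogc: the lambda captures `k`, but because it is only
// passed to a `scope` parameter, no heap closure is required.
long compute(long k) @nogc
{
    return sumWith((long x) => x * k, 10);
}

void main()
{
    writeln(makeScaler(3)(5)); // 15
    writeln(compute(2));       // 90, i.e. (0 + 1 + ... + 9) * 2
}
```

Whether an optimizer can additionally remove the allocation in cases like `makeScaler` after inlining depends on the backend, which seems to be the difference visible in the godbolt link above.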

Another major difference is that LDC applies an extra "cheat": 
it uses a 32-bit division instruction when both the dividend and 
the divisor are small enough. But it only does this trick for 
'-mcpu=x86-64' (the default) and stops doing it for 
'-mcpu=native' (which in my case is nehalem): 
https://d.godbolt.org/z/8ExEqqE41
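
Roughly, the trick amounts to a runtime check in front of the division. The sketch below is written by hand to show the idea and is only my assumption of what the generated code is equivalent to; `divBypass` is a hypothetical name, not something from LDC or Phobos:

```d
import std.stdio : writeln;

// Hand-written equivalent of the "use a narrower division if possible" trick.
ulong divBypass(ulong a, ulong b)
{
    // If neither operand has any of its upper 32 bits set, the operands
    // and the quotient all fit in 32 bits.
    if (((a | b) >> 32) == 0)
        return cast(uint) a / cast(uint) b; // 32-bit division
    return a / b;                           // full 64-bit division
}

void main()
{
    writeln(divBypass(1_000_000, 3));  // takes the 32-bit path
    writeln(divBypass(ulong.max, 3));  // takes the 64-bit path
}
```

The extra compare and branch only pay off on CPUs where 64-bit division is much slower than 32-bit division, which would explain why the choice depends on '-mcpu'.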

If everything is manually inlined into the main function, then 
the benchmarks look like this:
```
$ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2
55836809328

  Performance counter stats for './test2':

           1,920.80 msec task-clock:u              #    0.986 CPUs utilized
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
                245      page-faults:u             #    0.128 K/sec
      5,378,786,526      cycles:u                  #    2.800 GHz
      1,745,747,872      stalled-cycles-frontend:u #   32.46% frontend cycles idle
        636,941,001      stalled-cycles-backend:u  #   11.84% backend cycles idle
      7,218,615,757      instructions:u            #    1.34  insn per cycle
                                                   #    0.24  stalled cycles per insn
      1,371,563,853      branches:u                #  714.057 M/sec
         45,272,029      branch-misses:u           #    3.30% of all branches

        1.947334248 seconds time elapsed

        1.921595000 seconds user
        0.000000000 seconds sys
```

```
$ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2
55836809328

  Performance counter stats for './test2':

           4,599.54 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
                235      page-faults:u             #    0.051 K/sec
     12,892,558,448      cycles:u                  #    2.803 GHz
      4,894,073,820      stalled-cycles-frontend:u #   37.96% frontend cycles idle
      1,550,118,424      stalled-cycles-backend:u  #   12.02% backend cycles idle
      4,995,853,241      instructions:u            #    0.39  insn per cycle
                                                   #    0.98  stalled cycles per insn
        804,146,718      branches:u                #  174.832 M/sec
         44,490,815      branch-misses:u           #    5.53% of all branches

        4.600090630 seconds time elapsed

        4.599885000 seconds user
        0.000000000 seconds sys
```

```
$ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out
55836809328

  Performance counter stats for './a.out':

           4,604.69 msec task-clock:u              #    0.995 CPUs utilized
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
                172      page-faults:u             #    0.037 K/sec
     12,909,554,223      cycles:u                  #    2.804 GHz
      4,693,546,651      stalled-cycles-frontend:u #   36.36% frontend cycles idle
      1,132,407,891      stalled-cycles-backend:u  #    8.77% backend cycles idle
      5,313,064,245      instructions:u            #    0.41  insn per cycle
                                                   #    0.88  stalled cycles per insn
      1,042,903,599      branches:u                #  226.487 M/sec
         41,603,467      branch-misses:u           #    3.99% of all branches

        4.626366827 seconds time elapsed

        4.605163000 seconds user
        0.000000000 seconds sys
```

GDC and LDC become equally fast once the closure allocation 
overhead is negligible and LDC does not substitute a 32-bit 
division for the 64-bit one.