DMD and GDC are unnecessarily using heap allocations for closures
Siarhei Siamashka
siarhei.siamashka at gmail.com
Tue May 31 02:00:39 UTC 2022
On Monday, 30 May 2022 at 23:19:09 UTC, Iain Buclaw wrote:
> On Monday, 30 May 2022 at 06:47:24 UTC, Siarhei Siamashka wrote:
>> $ gdc-11.2.0 -O3 -g -frelease -flto test.d && time ./a.out
>> 55836809328
>>
>> real 0m6.520s
>> user 0m6.519s
>> sys 0m0.000s
>>
>> What do you think about all of this?
>
> Out of curiosity, are you linking in phobos statically or
> dynamically? You can force either with `-static-libphobos` or
> `-shared-libphobos`.
It's statically linked with libphobos. Both GDC and LDC can
inline everything here. One major difference is that LDC is
also able to eliminate the GC allocation for the closure:
https://d.godbolt.org/z/x1jK1M149
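For readers without the original test.d at hand, here is a minimal sketch of when D needs a heap-allocated closure and when it does not. This is not the benchmark code from this thread; `sumWith`, `scaledSum` and `makeAdder` are hypothetical names made up for illustration. A delegate passed to a `scope` parameter cannot escape, so its context can stay on the stack (the `@nogc` attribute makes the compiler check that), while a delegate that is returned must keep its captured variables alive on the GC heap.

```d
import std.stdio : writeln;

// Hypothetical helper: `scope` promises that `dg` does not escape this call.
ulong sumWith(scope ulong delegate(ulong) @nogc dg, ulong n) @nogc
{
    ulong total = 0;
    foreach (i; 0 .. n)
        total += dg(i);
    return total;
}

// The lambda captures `k`, but it is only passed to a `scope` parameter,
// so no GC closure is needed and this function can be marked @nogc.
ulong scaledSum(ulong k, ulong n) @nogc
{
    return sumWith((ulong x) => x * k, n);
}

// Here the delegate outlives the call, so `k` has to be moved into a
// GC-allocated closure. This function cannot be marked @nogc.
ulong delegate(ulong) makeAdder(ulong k)
{
    return (ulong x) => x + k;
}

void main()
{
    writeln(scaledSum(3, 5)); // 0 + 3 + 6 + 9 + 12
    auto add = makeAdder(10);
    writeln(add(5));
}
```

Whether a compiler can also drop the allocation in the second kind of case after inlining is an optimizer matter, which is the difference the godbolt link above shows.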
Another major difference is that LDC "cheats" by using a 32-bit
division instruction when both the dividend and the divisor are
small enough. But it only does this trick with '-mcpu=x86-64'
(the default) and stops doing it with '-mcpu=native' (which in my
case is nehalem): https://d.godbolt.org/z/8ExEqqE41
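The division trick can be sketched at source level roughly as follows. This is only an illustration of the idea, not LDC's actual code generation; `divFast` is a hypothetical name. If neither operand has any bit set above bit 31, the 32-bit quotient is identical to the 64-bit one, so the cheaper instruction can be used after one extra check.

```d
// Hypothetical source-level equivalent of the 32-bit division fast path.
ulong divFast(ulong a, ulong b)
{
    // Both operands fit in 32 bits: a 32-bit divide gives the same result.
    if (((a | b) >> 32) == 0)
        return cast(uint) a / cast(uint) b;
    // Otherwise fall back to the full 64-bit division.
    return a / b;
}

void main()
{
    // Small operands take the 32-bit path, large ones the 64-bit path;
    // the results must always agree with plain 64-bit division.
    assert(divFast(100, 7) == 100UL / 7);
    assert(divFast(1UL << 40, 3) == (1UL << 40) / 3);
    assert(divFast(uint.max, 1) == uint.max);
}
```

The extra check costs a few instructions per division, which would be consistent with the higher instruction count but lower cycle count in the '-mcpu=x86-64' run below.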
If everything is manually inlined into the main function, then
the benchmarks look like this:
```
$ ldc2 -O -g -release -mcpu=x86-64 test2.d && perf stat ./test2
55836809328
 Performance counter stats for './test2':
          1,920.80 msec task-clock:u              #    0.986 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               245      page-faults:u             #    0.128 K/sec
     5,378,786,526      cycles:u                  #    2.800 GHz
     1,745,747,872      stalled-cycles-frontend:u #   32.46% frontend cycles idle
       636,941,001      stalled-cycles-backend:u  #   11.84% backend cycles idle
     7,218,615,757      instructions:u            #    1.34  insn per cycle
                                                  #    0.24  stalled cycles per insn
     1,371,563,853      branches:u                #  714.057 M/sec
        45,272,029      branch-misses:u           #    3.30% of all branches
       1.947334248 seconds time elapsed
       1.921595000 seconds user
       0.000000000 seconds sys
```
```
$ ldc2 -O -g -release -mcpu=nehalem test2.d && perf stat ./test2
55836809328
 Performance counter stats for './test2':
          4,599.54 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               235      page-faults:u             #    0.051 K/sec
    12,892,558,448      cycles:u                  #    2.803 GHz
     4,894,073,820      stalled-cycles-frontend:u #   37.96% frontend cycles idle
     1,550,118,424      stalled-cycles-backend:u  #   12.02% backend cycles idle
     4,995,853,241      instructions:u            #    0.39  insn per cycle
                                                  #    0.98  stalled cycles per insn
       804,146,718      branches:u                #  174.832 M/sec
        44,490,815      branch-misses:u           #    5.53% of all branches
       4.600090630 seconds time elapsed
       4.599885000 seconds user
       0.000000000 seconds sys
```
```
$ gdc-11.2.0 -O3 -g -frelease -flto test2.d && perf stat ./a.out
55836809328
 Performance counter stats for './a.out':
          4,604.69 msec task-clock:u              #    0.995 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               172      page-faults:u             #    0.037 K/sec
    12,909,554,223      cycles:u                  #    2.804 GHz
     4,693,546,651      stalled-cycles-frontend:u #   36.36% frontend cycles idle
     1,132,407,891      stalled-cycles-backend:u  #    8.77% backend cycles idle
     5,313,064,245      instructions:u            #    0.41  insn per cycle
                                                  #    0.88  stalled cycles per insn
     1,042,903,599      branches:u                #  226.487 M/sec
        41,603,467      branch-misses:u           #    3.99% of all branches
       4.626366827 seconds time elapsed
       4.605163000 seconds user
       0.000000000 seconds sys
```
GDC and LDC become equally fast (about 4.6 seconds each) if the
closure allocation overhead is negligible and if LDC does not use
a 32-bit division instead of the 64-bit one.