[GSoC 2025] Templated Hooks - Weekly Update #11
AlbertG
albert.guiman at protonmail.com
Tue Aug 19 19:55:58 UTC 2025
This week I worked on the infrastructure for benchmarking
the hooks with GDC [1].
Initially, I built GDC with:
```bash
./configure --enable-languages=d --disable-multilib \
    --disable-bootstrap --prefix="$GDC_INSTALL_DIR"
make
make install
```
However, when running the benchmarks I observed inconsistent
behavior: some tests within a benchmark ran faster, while others
ran slower. On another machine (AMD Ryzen 7 6800HS), the same
benchmark showed much larger improvements than on the first
machine. Going back to the first machine (Intel i5-12400), I ran
`perf` on the `_d_arrayappendT` benchmark with the following
options:
```bash
perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses \
    -ddd ./array_benchmark
```
The results are as follows:
**non-templated commit**
```bash
Performance counter stats for './array_benchmark':

    179,067,785,612  cpu_core/cycles/                    (38.60%)
    553,915,262,669  cpu_core/instructions/              (46.30%)
            164,123  cpu_core/cache-misses/              (53.80%)
     10,027,757,572  cpu_core/cache-references/          (61.26%)
        111,181,094  cpu_core/branch-misses/             (68.87%)
     83,540,647,021  cpu_core/L1-dcache-loads/           (68.88%)
      1,563,709,177  cpu_core/L1-dcache-load-misses/     (69.10%)
        369,937,906  cpu_core/LLC-loads/                 (68.99%)
             11,051  cpu_core/LLC-load-misses/           (69.25%)
    <not supported>  cpu_core/L1-icache-loads/
      2,476,263,699  cpu_core/L1-icache-load-misses/     (31.22%)
     83,750,596,340  cpu_core/dTLB-loads/                (31.02%)
             57,519  cpu_core/dTLB-load-misses/          (31.11%)
    <not supported>  cpu_core/iTLB-loads/
            419,216  cpu_core/iTLB-load-misses/          (30.81%)
    <not supported>  cpu_core/L1-dcache-prefetches/
    <not supported>  cpu_core/L1-dcache-prefetch-misses/

       30.677581787 seconds time elapsed

       36.219778000 seconds user
       34.687502000 seconds sys
```
**templated commit**
```bash
Performance counter stats for './array_benchmark':

    202,305,491,720  cpu_core/cycles/                    (38.11%)
    618,915,431,209  cpu_core/instructions/              (45.51%)
            165,746  cpu_core/cache-misses/              (52.90%)
     10,410,277,386  cpu_core/cache-references/          (60.48%)
        127,861,576  cpu_core/branch-misses/             (68.34%)
    109,082,378,698  cpu_core/L1-dcache-loads/           (68.62%)
      1,405,100,969  cpu_core/L1-dcache-load-misses/     (68.76%)
        357,831,953  cpu_core/LLC-loads/                 (69.63%)
              9,834  cpu_core/LLC-load-misses/           (69.72%)
    <not supported>  cpu_core/L1-icache-loads/
      2,458,129,753  cpu_core/L1-icache-load-misses/     (31.41%)
    107,762,981,594  cpu_core/dTLB-loads/                (31.27%)
            102,346  cpu_core/dTLB-load-misses/          (30.41%)
    <not supported>  cpu_core/iTLB-loads/
            759,112  cpu_core/iTLB-load-misses/          (30.33%)
    <not supported>  cpu_core/L1-dcache-prefetches/
    <not supported>  cpu_core/L1-dcache-prefetch-misses/

       35.941396452 seconds time elapsed

       41.386275000 seconds user
       35.076185000 seconds sys
```
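As a rough way to compare the two runs, the raw counters can be turned into a few derived figures (instructions per cycle and relative change). The sketch below is only illustrative: the helper names `ipc` and `pct_change` are made up for this example and are not part of the benchmark repository. Keep in mind that the percentages in parentheses in the perf output mean the counters were multiplexed, so the counts themselves are scaled estimates.

```bash
#!/bin/sh
# Hypothetical helpers for comparing two perf stat runs.
# The numbers are copied by hand from the perf output, with commas removed.

# ipc INSTRUCTIONS CYCLES -> instructions per cycle, two decimals
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# pct_change OLD NEW -> relative change from OLD to NEW in percent, one decimal
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

echo "IPC, non-templated:   $(ipc 553915262669 179067785612)"
echo "IPC, templated:       $(ipc 618915431209 202305491720)"
echo "cycles change:        $(pct_change 179067785612 202305491720)%"
echo "instructions change:  $(pct_change 553915262669 618915431209)%"
echo "dTLB-load-misses:     $(pct_change 57519 102346)%"
echo "iTLB-load-misses:     $(pct_change 419216 759112)%"
```

With the counters above this gives roughly 3.09 vs. 3.06 instructions per cycle, about 13% more cycles and about 12% more instructions for the templated commit. If the scaled counts are representative, that would suggest looking at the extra instructions executed in addition to the TLB misses, but this is only one possible reading of the data.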
The results show clearly more TLB misses for the templated commit
(dTLB load misses: 57,519 vs. 102,346; iTLB load misses: 419,216
vs. 759,112), which made me suspect the binary size. Indeed, the
binary was huge, around 11M. I then checked the sizes of
`libgphobos.a` and `libgdruntime.a` and they were also larger
than expected, so I went back to the GDC build configuration and
ended up with the following commands:
```bash
./configure --disable-checking --disable-libphobos-checking \
    --disable-shared --enable-static --disable-libgomp \
    --disable-libmudflap --disable-libquadmath --disable-libssp \
    --disable-nls --enable-lto --enable-languages=d \
    --disable-multilib --disable-bootstrap --prefix="$GDC_INSTALL_DIR"
make
make install-strip
```
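A size check like the one described above can be scripted so that it is easy to repeat after each rebuild. This is a minimal sketch, not the exact commands I ran: the helper names `file_bytes` and `report_sizes` are hypothetical, the `/opt/gdc` fallback is a placeholder, and the `lib64` location of the libraries under the install prefix is an assumption that may differ on your system.

```bash
#!/bin/sh
# Hypothetical helpers for comparing artifact sizes between GDC builds.

# file_bytes FILE -> size of FILE in bytes
# (wc -c is portable; the arithmetic expansion strips any padding it prints)
file_bytes() {
    echo $(( $(wc -c < "$1") ))
}

# report_sizes FILE... -> one "bytes<TAB>path" line per file,
# or "missing<TAB>path" if the file does not exist
report_sizes() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s\t%s\n' "$(file_bytes "$f")" "$f"
        else
            printf 'missing\t%s\n' "$f"
        fi
    done
}

# Example call. The lib64 directory is an assumption; adjust as needed.
prefix="${GDC_INSTALL_DIR:-/opt/gdc}"
report_sizes ./array_benchmark \
    "$prefix/lib64/libgphobos.a" \
    "$prefix/lib64/libgdruntime.a"
```

Running this once per build configuration and diffing the output makes it easy to see whether a configure flag actually changed the sizes.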
Now the binary sizes are much more reasonable and the run time
has improved a bit, but the differences between the templated and
non-templated versions are still there, and I am not sure how to
explain them. There may be some odd interaction between the
GDC-generated code and this Intel CPU, perhaps related to the TLB
or the L1 cache (which is 12-way associative with 64 sets, vs.
8-way associative with 64 sets on the Ryzen). If
anyone has any ideas on this, please let me know =).
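For reference, the L1 geometries mentioned above translate into capacities as ways × sets × line size. The sketch below assumes a 64-byte cache line on both CPUs (the usual value on x86, but not something I measured), and `cache_bytes` is a made-up helper name.

```bash
#!/bin/sh
# cache_bytes WAYS SETS LINE_BYTES -> total cache capacity in bytes
cache_bytes() {
    echo $(( $1 * $2 * $3 ))
}

LINE=64   # assumed cache line size in bytes

# Intel i5-12400: 12-way, 64 sets
echo "Intel L1: $(( $(cache_bytes 12 64 "$LINE") / 1024 )) KiB"

# AMD Ryzen 7 6800HS: 8-way, 64 sets
echo "Ryzen L1: $(( $(cache_bytes 8 64 "$LINE") / 1024 )) KiB"

# Addresses this many bytes apart map to the same set on both CPUs,
# since both have 64 sets of 64-byte lines.
echo "Set stride: $(( 64 * LINE )) bytes"
```

Under that assumption the two caches differ only in the number of ways (48 KiB vs. 32 KiB), while the set indexing is the same. On Linux, the actual values can usually be read from the `ways_of_associativity`, `number_of_sets` and `coherency_line_size` files under `/sys/devices/system/cpu/cpu0/cache/index*/`.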
For now, I will re-run the benchmarks on the Intel machine, since
the LDC benchmarks were run on it as well, but in the
future it would be great to have all hooks benchmarked on a more
powerful and consistent machine.
On top of this, I also refactored `_d_cast` [2], after
stumbling upon one of my previous PRs and noticing
that the code could be cleaned up a bit.
[1] https://github.com/teodutu/druntime-hooks-benchmarks/pull/14
[2] https://github.com/dlang/dmd/pull/21727