[GSoC 2025] Templated Hooks - Weekly Update #11

Tue Aug 19 19:55:58 UTC 2025

This week I worked on the infrastructure for benchmarking
the hooks using GDC [1].

Initially, I built GDC with:

```bash
./configure --enable-languages=d --disable-multilib \
    --disable-bootstrap --prefix="$GDC_INSTALL_DIR"
make
make install
```

But when running the benchmarks, I observed inconsistent
behavior: some tests within a benchmark ran faster, while others
ran slower. After testing on another machine (AMD Ryzen 7
6800HS), I got some interesting results: the same benchmark
showed much larger improvements than on the first machine. Going
back to the first machine (Intel i5-12400), I ran **perf** on the
`_d_arrayappendT` benchmark with the following options:

```bash
perf stat \
    -e cycles,instructions,cache-misses,cache-references,branch-misses \
    -ddd ./array_benchmark
```

The results are as follows:

**non-templated commit**
```bash
 Performance counter stats for './array_benchmark':

    179,067,785,612      cpu_core/cycles/                 (38.60%)
    553,915,262,669      cpu_core/instructions/           (46.30%)
            164,123      cpu_core/cache-misses/           (53.80%)
     10,027,757,572      cpu_core/cache-references/       (61.26%)
        111,181,094      cpu_core/branch-misses/          (68.87%)
     83,540,647,021      cpu_core/L1-dcache-loads/        (68.88%)
      1,563,709,177      cpu_core/L1-dcache-load-misses/  (69.10%)
        369,937,906      cpu_core/LLC-loads/              (68.99%)
             11,051      cpu_core/LLC-load-misses/        (69.25%)
    <not supported>      cpu_core/L1-icache-loads/
      2,476,263,699      cpu_core/L1-icache-load-misses/  (31.22%)
     83,750,596,340      cpu_core/dTLB-loads/             (31.02%)
             57,519      cpu_core/dTLB-load-misses/       (31.11%)
    <not supported>      cpu_core/iTLB-loads/
            419,216      cpu_core/iTLB-load-misses/       (30.81%)
    <not supported>      cpu_core/L1-dcache-prefetches/
    <not supported>      cpu_core/L1-dcache-prefetch-misses/

       30.677581787 seconds time elapsed

       36.219778000 seconds user
       34.687502000 seconds sys
```

**templated commit**
```bash
 Performance counter stats for './array_benchmark':

    202,305,491,720      cpu_core/cycles/                 (38.11%)
    618,915,431,209      cpu_core/instructions/           (45.51%)
            165,746      cpu_core/cache-misses/           (52.90%)
     10,410,277,386      cpu_core/cache-references/       (60.48%)
        127,861,576      cpu_core/branch-misses/          (68.34%)
    109,082,378,698      cpu_core/L1-dcache-loads/        (68.62%)
      1,405,100,969      cpu_core/L1-dcache-load-misses/  (68.76%)
        357,831,953      cpu_core/LLC-loads/              (69.63%)
              9,834      cpu_core/LLC-load-misses/        (69.72%)
    <not supported>      cpu_core/L1-icache-loads/
      2,458,129,753      cpu_core/L1-icache-load-misses/  (31.41%)
    107,762,981,594      cpu_core/dTLB-loads/             (31.27%)
            102,346      cpu_core/dTLB-load-misses/       (30.41%)
    <not supported>      cpu_core/iTLB-loads/
            759,112      cpu_core/iTLB-load-misses/       (30.33%)
    <not supported>      cpu_core/L1-dcache-prefetches/
    <not supported>      cpu_core/L1-dcache-prefetch-misses/

       35.941396452 seconds time elapsed

       41.386275000 seconds user
       35.076185000 seconds sys
```
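
To make the two runs easier to compare, the raw counters can be
reduced to a few ratios. The following is only a rough sketch:
the `ratio` helper is a hypothetical name, the numbers are copied
by hand from the output above, and since the percentages in
parentheses mean the counters were multiplexed, the counts are
scaled estimates rather than exact values.

```bash
#!/bin/sh
# ratio A B: print A / B with two decimals (hypothetical helper).
ratio() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Instructions per cycle for each run.
echo "IPC, non-templated:  $(ratio 553915262669 179067785612)"
echo "IPC, templated:      $(ratio 618915431209 202305491720)"

# Growth of selected counters, templated relative to non-templated.
echo "cycles:              x$(ratio 202305491720 179067785612)"
echo "instructions:        x$(ratio 618915431209 553915262669)"
echo "L1-dcache-loads:     x$(ratio 109082378698 83540647021)"
echo "dTLB-load-misses:    x$(ratio 102346 57519)"
echo "iTLB-load-misses:    x$(ratio 759112 419216)"
```

On these numbers, both TLB load-miss counters grow by roughly
1.8x, while cycles and instructions grow by a bit over 10%.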

The results show that the templated commit incurs noticeably more
TLB misses (dTLB load misses go from 57,519 to 102,346 and iTLB
load misses from 419,216 to 759,112), which got me thinking it
was because of the binary size. Indeed, the binary was huge,
around 11M. I then checked the sizes of `libgphobos.a` and
`libgdruntime.a`, and they were also larger than expected, so I
went back to configuring the GDC build and ended up with the
following commands:

```bash
./configure --disable-checking --disable-libphobos-checking \
    --disable-shared --enable-static --disable-libgomp \
    --disable-libmudflap --disable-libquadmath --disable-libssp \
    --disable-nls --enable-lto --enable-languages=d \
    --disable-multilib --disable-bootstrap --prefix="$GDC_INSTALL_DIR"
make
make install-strip
```
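
To compare sizes before and after such a rebuild, something along
these lines could be used. This is a minimal sketch under stated
assumptions: `size_bytes` is a hypothetical helper, and the
library paths (in particular the `lib64` directory and the
`/usr/local` fallback) are guesses that depend on how and where
GDC was installed.

```bash
#!/bin/sh
# size_bytes FILE: print the size of FILE in bytes (hypothetical helper).
size_bytes() {
    # The arithmetic expansion strips the padding some wc versions add.
    echo $(( $(wc -c < "$1") ))
}

# Assumed install prefix; falls back to /usr/local if the variable is unset.
prefix="${GDC_INSTALL_DIR:-/usr/local}"

# Assumed file locations; files that do not exist are skipped.
for f in ./array_benchmark \
         "$prefix/lib64/libgphobos.a" \
         "$prefix/lib64/libgdruntime.a"; do
    if [ -f "$f" ]; then
        printf '%12d  %s\n' "$(size_bytes "$f")" "$f"
    fi
done
```

For the benchmark binary itself, `size ./array_benchmark` from
binutils additionally breaks the total down into text, data, and
bss.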

Now the binary sizes are much more reasonable, and the run time
has improved a bit, but the differences between the templated and
non-templated versions are still there, and I am not sure how to
explain them. It is possible that there is some odd interaction
between the GDC-generated code and this Intel CPU, perhaps even
related to the TLB or the L1 data cache (which is 12-way
associative with 64 sets, vs. 8-way associative with 64 sets on
the Ryzen). If anyone has any ideas on this, please let me know
=).
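
For reference, the cache geometry mentioned above translates
into total size as ways x sets x line size. The sketch below
assumes a 64-byte cache line on both CPUs, which is not stated
above, and `cache_kib` is a hypothetical helper name.

```bash
#!/bin/sh
# cache_kib WAYS SETS LINE_BYTES: total cache size in KiB (hypothetical helper).
cache_kib() {
    echo $(( $1 * $2 * $3 / 1024 ))
}

# Assumption: both L1 data caches use 64-byte lines.
echo "Intel i5-12400 L1d: $(cache_kib 12 64 64) KiB"
echo "Ryzen 7 6800HS L1d: $(cache_kib 8 64 64) KiB"
```

Since both caches have 64 sets, the same address bits select the
set on both machines; the Intel cache simply holds more lines per
set. On Linux, the actual values can be read from
`ways_of_associativity`, `number_of_sets`, and
`coherency_line_size` under
`/sys/devices/system/cpu/cpu0/cache/index*/`.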

For now, I will re-run the benchmarks on the Intel machine, simply
because the LDC benchmarks were run on it as well, but in the
future it would be great to have all hooks benchmarked on a more
powerful and consistent machine.

On top of this, I have also refactored `_d_cast` [2], after
stumbling upon one of my previous PRs and noticing that the code
could be cleaned up a bit.

[1] https://github.com/teodutu/druntime-hooks-benchmarks/pull/14
[2] https://github.com/dlang/dmd/pull/21727