Choosing an Approach for the Template Lowering of _d_arrayctor

Thu Nov 25 13:28:18 UTC 2021

Hi,

As part of SAoC 2021, I changed the lowering of `_d_arrayctor` 
from using the [runtime 
hook](https://github.com/dlang/druntime/blob/595707b1ac8a439b2a7243b0abf95d4bc56239ff/src/rt/arrayassign.d#L170-L201) to a template, via theses PRs:
- https://github.com/dlang/dmd/pull/13116
- https://github.com/dlang/druntime/pull/3627

We'll call the former approach the **hook** approach and the 
latter, the **template** approach.

This template approach required me to declare the returned array 
inside a union. This was needed because by declaring this array 
directly, the compiler was inserting a call to `__ArrayDtor` 
before exiting the function. This call was in contradiction with 
the error handling already performed by `_d_arrayctor`. I gave a 
few more details about this issue in [this 
post](https://forum.dlang.org/post/ukhovyoowgzjeaapyuvd@forum.dlang.org). However, this came with the drawback of not using NRVO, because of the [cast at end of the function](https://github.com/dlang/druntime/blob/595707b1ac8a439b2a7243b0abf95d4bc56239ff/src/core/internal/array/construction.d#L88). This additional copy causes a performance decrease, compared to the hook approach, as I'm going to show later in this post.

A third approach, which I call the **hack** approach, is very 
close to the one using the template. The difference is that the 
hack adds a third, unused, pointer-type parameter to 
`_d_arrayctor`, in order to change the function's strong purity 
to weak purity. The reasons for this unorthodox approach are 
better explained in [this 
post](https://forum.dlang.org/post/simesvkancmscrtsciwq@forum.dlang.org), and the PR for it is [here](https://github.com/dlang/druntime/pull/3587). Due to this hack, there is no need to declare the returned array inside the scope of `_d_arrayctor`, which also removes the need for the union trick. As a result, no additional copying is performed and this approach has increased performance when compared to the hook approach.

In order to measure the performance of each lowering, I designed 
[this](https://gist.github.com/teodutu/c6b01561e8eb0d12f9fc7ce11b519d9a) benchmark, which I compiled with dmd and ran, using all 3 approaches. It measures the time taken by the call to `_d_arrayctor` for 3 struct sizes: small (empty), medium (64 bytes) and large (256 bytes), as well as 3 array lengths: small (1 element), medium (64 elements) and large (256 elements). 256 elements is a large enough array, because the lowering to `_d_arrayctor` is only performed for static arrays, which are unlikely to be much larger. The numerical results can be seen below:
```
Running benchmark on branch master (_d_arrayctor as a hook):
_0B_Struct_1Elem @ 1000000 runs: average time = 13ms; std dev = 
9.53674e-06
_0B_Struct_64Elems @ 1000000 runs: average time = 280.4ms; std 
dev = 0.489898
_0B_Struct_256Elems @ 1000000 runs: average time = 1062.18ms; std 
dev = 0.433128
_64B_Struct_1Elem @ 1000000 runs: average time = 13.01ms; std dev 
= 0.0994988
_64B_Struct_64Elems @ 1000000 runs: average time = 309.99ms; std 
dev = 0.0994992
_64B_Struct_256Elems @ 1000000 runs: average time = 1194.11ms; 
std dev = 0.31289
_256B_Struct_1Elem @ 1000000 runs: average time = 15ms; std dev = 
1.43051e-05
_256B_Struct_64Elems @ 1000000 runs: average time = 482.23ms; std 
dev = 1.73698
_256B_Struct_256Elems @ 1000000 runs: average time = 2960.72ms; 
std dev = 1.20067

Running benchmark with _d_arrayctor as a template:
0B_Struct_1Elem @ 1000000 runs: average time = 9.6934ms; std dev 
= 0.019658
0B_Struct_64Elems @ 1000000 runs: average time = 219.086ms; std 
dev = 0.467551
0B_Struct_256Elems @ 1000000 runs: average time = 820.951ms; std 
dev = 1.88878
64B_Struct_1Elem @ 1000000 runs: average time = 15.6967ms; std 
dev = 0.046066
64B_Struct_64Elems @ 1000000 runs: average time = 287.54ms; std 
dev = 0.979795
64B_Struct_256Elems @ 1000000 runs: average time = 1177.63ms; std 
dev = 1.06447
256B_Struct_1Elem @ 1000000 runs: average time = 20.8763ms; std 
dev = 0.259867
256B_Struct_64Elems @ 1000000 runs: average time = 981.405ms; std 
dev = 0.67177
256B_Struct_256Elems @ 1000000 runs: average time = 4029.59ms; 
std dev = 0.511771

Running benchmark with _d_arrayctor as a template (with the 
hack-y 3rd parameter):
_0B_Struct_1Elem @ 1000000 runs: average time = 9.00001ms; std 
dev = 6.67572e-06
_0B_Struct_64Elems @ 1000000 runs: average time = 183.94ms; std 
dev = 0.525738
_0B_Struct_256Elems @ 1000000 runs: average time = 689.091ms; std 
dev = 0.286183
_64B_Struct_1Elem @ 1000000 runs: average time = 10.01ms; std dev 
= 0.0994987
_64B_Struct_64Elems @ 1000000 runs: average time = 227.23ms; std 
dev = 0.645833
_64B_Struct_256Elems @ 1000000 runs: average time = 925.25ms; std 
dev = 0.766486
_256B_Struct_1Elem @ 1000000 runs: average time = 11.07ms; std 
dev = 0.255147
_256B_Struct_64Elems @ 1000000 runs: average time = 417.92ms; std 
dev = 0.271293
_256B_Struct_256Elems @ 1000000 runs: average time = 2576.1ms; 
std dev = 8.36003
```

In order to better visualise the numbers above, I also plotted 
the bar charts below:

![_d_arrayctor Performance 
Comparison](https://i.imgur.com/s9JjXSi.png)

As you can see, the hack is faster than the hook. Its running 
times are about 15-25% lower than those of the hook for larger 
structs (64 and 256 bytes) and as much as 35% lower for the empty 
struct. On the other hand, the template approach, while being 
20-25% faster than the hook when called on the empty struct, is 
as much as 35% slower than the hook, on larger (256-byte) structs.

Due to this discrepancy in performance, I think the best approach 
is the hack, despite adding an unused parameter to 
`_d_arrayctor`. What do you think?

Thanks,
Teodor