Help optimize D solution to phone encoding problem: extremely slow performance.

Renato renato at athaydes.com
Sat Jan 13 11:03:42 UTC 2024


I like to use a phone encoding problem to probe the strengths and 
weaknesses of programming languages, because this problem is easy 
enough that I can write solutions in any language in a few hours, 
but complex enough to exercise lots of interesting parts of the 
language.

You can check [my initial blog 
post](https://renato.athaydes.com/posts/revisiting-prechelt-paper-comparing-languages) about this, which was an analysis of the original study by Prechelt about the productivity differences between programmers using different languages.

The original problem has a flaw on modern computers: programs 
find solutions so quickly that most of the benchmark time was 
actually being spent printing solutions to stdout. So I modified 
the problem to add an option to either print the solutions or 
just count them; the programs still need to do all the work, but 
are not required to print anything other than how many solutions 
were found.

Anyway, I ported the Common Lisp solution to D because, like CL, 
D has built-in data structures like associative arrays, and 
`BigInt` is at least in the stdlib... I thought this would 
actually give D an edge! But it turns out D performs very badly 
on larger input sets: it starts quite well on smaller inputs, 
but scales very poorly.

My initial Rust solution also performed very poorly, so I'm 
afraid the same is happening with my initial D solution, despite 
my best effort to write something "decent".

You can find my D solution [in this Pull 
Request](https://github.com/renatoathaydes/prechelt-phone-number-encoding/pull/16).

The solution is almost identical, to the extent possible, to the 
Common Lisp solution, which can be [found 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/lisp/main.lisp).

The secret to high performance with this algorithm is a very 
efficient `BigInt` implementation and a fast hash function for 
the hash table (an associative array in D, with `BigInt` keys). 
Hence, I suppose D's hash function or `BigInt` is not that fast 
(Rust's default hash is also not great, due to HashDoS-resistance 
concerns, but it's very easy to swap in a much faster one by 
changing a single line of code).
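To illustrate the kind of one-line hasher swap I mean in Rust: the sketch below builds a map type on a hand-rolled FNV-1a hasher using only the standard library traits. The `FnvHasher` and `FastMap` names are illustrative, not taken from my actual solution:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// FNV-1a: a simple, fast, non-cryptographic hash. Fine for
// benchmark-style workloads where HashDoS resistance doesn't matter.
struct FnvHasher(u64);

impl Default for FnvHasher {
    fn default() -> Self {
        FnvHasher(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for FnvHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

// The "single line" change: alias the map type with a custom BuildHasher.
// Everything else in the program keeps using the map as before.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FnvHasher>>;

fn main() {
    let mut solutions: FastMap<Vec<u8>, usize> = FastMap::default();
    *solutions.entry(vec![5, 6, 2]).or_insert(0) += 1;
    *solutions.entry(vec![5, 6, 2]).or_insert(0) += 1;
    assert_eq!(solutions[&vec![5u8, 6, 2]], 2);
}
```

In practice one would use a crate for this, but the point is that only the type alias differs from the default `HashMap<K, V>`.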

Anyway, here's the current relative performance. The other 
languages are pretty heavily optimised. The Java solution uses a 
different algorithm, so it's not directly comparable. [The Rust 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/rust/phone_encoder/src/main.rs) uses approximately the same algorithm as CL and D, but instead of `BigInt` it uses a `Vec<u8>` as the key, as that turned out to be faster (I may try that technique in D as well); notice that even when Rust was using its slower BigInt, it was orders of magnitude faster than D.

```
Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,109895680,377
java-Main,179634176,1025
java-Main,167149568,1621
java-Main,180203520,2493
java-Main,96481280,6112
java-Main,95526912,7989
===> sbcl --script src/lisp/main.fasl
sbcl,31780864,74
sbcl,79437824,888
sbcl,79613952,3991
sbcl,80654336,7622
sbcl,80420864,18623
sbcl,83402752,29503
===> ./rust
./rust,23257088,58
./rust,23437312,260
./rust,23433216,1077
./rust,23416832,2017
./rust,7106560,6757
./rust,7110656,10165
===> src/d/dencoder
src/d/dencoder,38748160,223
src/d/dencoder,75800576,3154
src/d/dencoder,75788288,14905
src/d/dencoder,75751424,30271
```

I had to abort D as it was taking too long.
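For context, the `Vec<u8>`-as-key technique mentioned above can be sketched roughly like this in Rust; the `digit`/`build_index` helpers and the tiny word list are illustrative stand-ins, not the actual solution (the keypad mapping is the one from the Prechelt problem):

```rust
use std::collections::HashMap;

// Keypad mapping from the Prechelt problem:
// e=0, jnq=1, rwx=2, dsy=3, ft=4, am=5, civ=6, bku=7, lop=8, ghz=9.
fn digit(c: char) -> Option<u8> {
    Some(match c.to_ascii_lowercase() {
        'e' => 0,
        'j' | 'n' | 'q' => 1,
        'r' | 'w' | 'x' => 2,
        'd' | 's' | 'y' => 3,
        'f' | 't' => 4,
        'a' | 'm' => 5,
        'c' | 'i' | 'v' => 6,
        'b' | 'k' | 'u' => 7,
        'l' | 'o' | 'p' => 8,
        'g' | 'h' | 'z' => 9,
        _ => return None, // non-letters (quotes, dashes) are ignored
    })
}

// Index dictionary words by their digit encoding, using the raw digit
// bytes themselves as the key instead of accumulating them into a BigInt.
fn build_index(words: &[&str]) -> HashMap<Vec<u8>, Vec<String>> {
    let mut index: HashMap<Vec<u8>, Vec<String>> = HashMap::new();
    for w in words {
        let key: Vec<u8> = w.chars().filter_map(digit).collect();
        index.entry(key).or_default().push((*w).to_string());
    }
    index
}

fn main() {
    let index = build_index(&["mir", "Mix"]);
    // "mir" and "Mix" both encode to the digit key [5, 6, 2].
    assert_eq!(index[&vec![5u8, 6, 2]].len(), 2);
}
```

Hashing a short byte vector avoids the arbitrary-precision arithmetic entirely, which is presumably why it beat the `BigInt` keys in Rust.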

The above is with the `print` option (that's normally slow when 
stdout is not buffered, but I did buffer D's writes and it's 
still very slow).
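By "buffer the writes" I mean the usual trick of batching output lines in memory so each solution doesn't cost a write syscall; in Rust terms the sketch looks like this (the loop body is a placeholder for printing real solutions):

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Lock stdout once and wrap it in a BufWriter: each writeln!
    // appends to an in-memory buffer instead of issuing a syscall.
    let mut out = BufWriter::new(stdout.lock());
    for n in 0..3 {
        writeln!(out, "solution {}", n)?;
    }
    out.flush()?; // drain the buffer before exiting
    Ok(())
}
```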

With the `count` option, which does not print anything except a 
number at the end, D took so long I couldn't even collect its 
results (I waited several minutes)... the other languages 
finished in much less than a minute:

```
Benchmarking...
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,124112896,7883
java-Main,107487232,9273
===> sbcl --script src/lisp/main.fasl
sbcl,82669568,25638
sbcl,83759104,33501
===> ./rust
./rust,7061504,9488
./rust,7127040,11441
===> src/d/dencoder
^C
(ldc-1.36.0)
```

I tried to profile the D code but the profiler seems to be broken 
on my OS (Mac):

```
▶ dub build -b profile
     Starting Performing "profile" build using /Users/renato/dlang/ldc-1.36.0/bin/ldc2 for x86_64.
     Building prechelt ~master: building configuration [application]
      Linking dencoder
▶ cd ../..
▶ src/d/dencoder
[1]    17877 illegal hardware instruction  src/d/dencoder
```

It's a bit hard to do any optimisation while "blind".

So, I decided to post here to collect some hints about what to 
try, and to find out whether I've hit some common performance 
bottlenecks in my current solution.

I would appreciate anything you can suggest, bearing in mind that 
I'm still learning D, so you can't expect me to know everything 
about it.


More information about the Digitalmars-d-learn mailing list