Help optimize D solution to phone encoding problem: extremely slow performance.

Renato renato at athaydes.com
Sun Jan 14 17:11:27 UTC 2024


On Sunday, 14 January 2024 at 10:02:58 UTC, Jordan Wilson wrote:
> On Saturday, 13 January 2024 at 11:03:42 UTC, Renato wrote:
>> I like to use a phone encoding problem to determine the 
>> strengths and weaknesses of programming languages because 
>> this problem is easy enough that I can write solutions in any 
>> language in a few hours, but complex enough to exercise lots 
>> of interesting parts of the language.
>>
>> [...]
>
> Hello Renato,
>
> This seems to be quite a lot of calls:
> ```
> ======== Timer frequency unknown, Times are in Megaticks 
> ========
>
>   Num          Tree        Func        Per
>   Calls        Time        Time        Call
>
> 19204964        3761        3756           0     pure nothrow 
> ref @trusted immutable(char)[][] 
> core.internal.array.appending._d_arrayappendcTX!(immutable(char)[][], immutable(char)[])._d_arrayappendcTX(scope return ref immutable(char)[][], ulong)
>
> 19204924        8957        3474           0     @safe void 
> dencoder.printTranslations(immutable(char)[][][dencoder.Key], 
> dencoder.ISolutionHandler, immutable(char)[], 
> immutable(char)[], immutable(char)[][])
> ```
>
> This is when using the `words-quarter.txt` input (the 
> `dictionary.txt` input seems to finish much faster, although 
> still slower than `java`/`rust`).
>
> I also used only 100 phone numbers as input.
>
> My final observation is that `words-quarter.txt` contains some 
> 1-letter inputs, (for example, `i` or `m`)...this may result in 
> a large number of encoding permutations, which may explain the 
> high number of recursion calls?
>
> Jordan

The characteristics of the dictionary impact the number of 
solutions greatly. I explored this in my blog post about [Common 
Lisp, Part 
2](https://renato.athaydes.com/posts/revenge_of_lisp-part-2).

The fact that there's a ridiculous number of function calls is why 
this problem can take minutes even without having to allocate 
much memory or print anything.
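To see why the call counts explode with short dictionary words, here's a toy model (my own illustration, not part of the solution) that counts recursive calls when every digit position has the same number of matching words:

```d
// Toy model of the blow-up: if each of n digit positions can be
// covered by k matching dictionary words, the recursion makes on the
// order of k^n calls. One-letter words like `i` or `m` push k up
// sharply, which matches Jordan's observation.
ulong countCalls(uint positions, uint choices)
{
    if (positions == 0) return 1;
    ulong total = 1; // count this call itself
    foreach (_; 0 .. choices)
        total += countCalls(positions - 1, choices);
    return total;
}

void main()
{
    assert(countCalls(10, 1) == 11);     // linear when only one word matches
    assert(countCalls(10, 3) == 88_573); // explodes with three matches per digit
}
```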

**I've managed to improve the D code enough that it is now 
faster than Common Lisp and the equivalent algorithm in Java.**

It took some profiling to get there, though... thanks to 
@Anonymouse for the suggestion to use Valgrind. With that, I 
was able to profile the code nicely (Valgrind works well with 
D; it doesn't even show mangled names!).

Here's what I did.

First: the solution using `Int128` was much faster than the `BigInt` 
solution, as I had already mentioned. But profiling it made 
clear that, for the problems with a very large number of 
solutions, the GC became a bottleneck:

```
--------------------------------------------------------------------------------
23,044,096,944 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
7,079,878,197 (30.72%)  
???:core.internal.gc.impl.conservative.gc.Gcx.mark!(false, true, 
true).mark(core.internal.gc.impl.conservative.gc.Gcx.ScanRange!(false).ScanRange) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
2,375,100,857 (10.31%)  
???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], immutable(char)[][])'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,971,210,820 ( 8.55%)  ???:_aaInX 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,922,961,924 ( 8.34%)  ???:_d_arraysetlengthT 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,298,747,622 ( 5.64%)  ???:core.int128.mul(core.int128.Cent, 
core.int128.Cent) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
1,134,644,706 ( 4.92%)  
???:core.internal.gc.bits.GCBits.setLocked(ulong) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
   849,587,834 ( 3.69%)  
???:core.internal.gc.impl.conservative.gc.Gcx.smallAlloc(ulong, 
ref ulong, uint, const(TypeInfo)) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
   827,407,988 ( 3.59%)  
./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
   688,845,027 ( 2.99%)  
./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
   575,415,884 ( 2.50%)  
???:_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
   562,146,592 ( 2.44%)  
???:core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) [/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
   526,067,586 ( 2.28%)  
???:core.internal.spinlock.SpinLock.lock() shared 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/dencoder-int128]
```

So, I [decided to use 
`std.container.Array`](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba457d8abc3bccbc17c7f1e) (I had tried `Appender` before, but that didn't really help) to make sure no allocation happens when adding/removing words from the `words` list that is carried through the `printTranslations` calls (that's the hot path).

As you can see in the commit, the code is only slightly less 
convenient... and this made it run considerably faster for the 
very long runs!
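The hot-path pattern from that commit boils down to a push/pop on a malloc-backed stack; a simplified sketch (the `demoPushPop` function is just illustrative, not from the actual code):

```d
import std.container.array : Array;

// The partial solution travels in an Array!string used as a stack,
// so push/pop reuses one malloc-backed buffer instead of appending
// to a GC slice on every recursive call.
size_t demoPushPop()
{
    Array!string words;
    words.insertBack("hello");   // push before recursing into the next digits
    words.insertBack("world");
    words.removeBack();          // pop when unwinding
    return words.length;
}

void main()
{
    assert(demoPushPop() == 1);
}
```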

Here's the Callgrind profiling data AFTER this change:

```
6,547,316,944 (32.42%)  
???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
5,229,076,596 (25.90%)  ???:_aaInX 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
3,871,644,402 (19.17%)  ???:core.int128.mul(core.int128.Cent, 
core.int128.Cent) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   677,533,800 ( 3.36%)  ???:std.int128.Int128.toHash() const 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   543,388,688 ( 2.69%)  
???:std.int128.Int128.this(core.int128.Cent) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   543,388,688 ( 2.69%)  ???:std.int128.Int128.this(long, long) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   542,027,045 ( 2.68%)  ???:object.TypeInfo_Struct.getHash(scope 
const(void*)) const 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   424,849,937 ( 2.10%)  ???:object.TypeInfo_Struct.equals(in 
void, in void) const 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
```

No more GC! As expected, most of the time is now spent 
calculating hashes for each candidate solution (as it should be). 
I am not entirely sure what the `_aaInX` call is, but I believe 
it's the associative array's membership check, i.e. the runtime 
hook behind the `in` operator.

The D profiler seems to show similar data:

```
======== Timer frequency unknown, Times are in Megaticks ========

   Num          Tree        Func        Per
   Calls        Time        Time        Call

104001600         158         158           0     const pure 
nothrow @nogc @safe std.int128.Int128 
std.int128.Int128.opBinary!("*").opBinary(std.int128.Int128)
402933936          63          63           0     const pure 
nothrow @property @nogc @safe bool 
std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
183744166          87          59           0     inout pure 
nothrow ref @property @nogc return @safe 
inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
103784880          68          38           0     const pure 
nothrow @nogc @safe std.int128.Int128 
std.int128.Int128.opBinary!("+", int).opBinary(const(int))
103784880          99          31           0     pure nothrow 
ref @nogc @safe std.int128.Int128 
std.int128.Int128.opOpAssign!("+", int).opOpAssign(int)
104001600          30          30           0     const pure 
nothrow @nogc @safe std.int128.Int128 
std.int128.Int128.opBinary!("+").opBinary(std.int128.Int128)
61167377          67          29           0     pure nothrow 
@nogc void 
std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
43444575          61          27           0     pure nothrow 
@nogc @safe void 
core.internal.lifetime.emplaceRef!(immutable(char)[], 
immutable(char)[], immutable(char)[]).emplaceRef(ref 
immutable(char)[], ref immutable(char)[])
48427530          79          27           0     const pure 
nothrow @property @nogc @safe bool 
std.container.array.Array!(immutable(char)[]).Array.empty()
61167377          38          27           0     pure nothrow 
@nogc void 
std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__dtor()
43444575         130          24           0     pure nothrow 
@nogc ulong 
std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
61167376          32          23           0     pure nothrow 
@nogc @safe void 
std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.__postblit()
```

Anyway, after this change, the code scaled much better.

But there's not much left to be optimised... except the 
arithmetic (notice how the `opBinary!("*")` call is near the top).

I know how to make that faster using a strategy similar to the one I 
used in Rust (you can check my [journey to optimise the Rust code 
on my blog 
post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code); specifically, see the `Using packed bytes for more efficient storage` section). Knowing how the encoding works, I knew that multiplication is not really needed. So, instead of doing this:

```d
n *= 10;
n += c - '0';
```

I could do this:

```d
n <<= 4;
n += c;
```

This changes the actual number, but that doesn't matter: the key 
is still unique for each distinct input, which is all we need.
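As a sanity check on why the shift trick stays collision-free: each digit takes one hex nibble, and since `'0'..'9'` span fewer than 16 consecutive ASCII values, two distinct digit strings can never pack to the same key. A minimal sketch (the `packedKey` helper is hypothetical, not the actual code):

```d
import std.int128 : Int128;

// Pack a digit string into an Int128 key, one nibble per digit.
// Adding the raw ASCII value (instead of c - '0') carries into the
// nibble above, but since the digits span fewer than 16 values the
// mapping is still injective.
Int128 packedKey(const(char)[] digits)
{
    Int128 n = Int128(0);
    foreach (c; digits)
    {
        n <<= 4;
        n += c; // raw ASCII value; the constant offset is harmless
    }
    return n;
}

void main()
{
    assert(packedKey("12") != packedKey("21"));
    assert(packedKey("1") != packedKey("11"));
}
```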

You can see [the full commit 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c).

This improved the performance for sure, even if not by much.

The profiling data after this arithmetic trick looks like this:

Valgrind:

```
4,821,217,799 (35.75%)  
???:dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)'2 [/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
4,241,404,850 (31.45%)  ???:_aaInX 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
1,038,733,782 ( 7.70%)  ???:core.int128.shl(core.int128.Cent, 
uint) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   575,372,270 ( 4.27%)  ???:std.int128.Int128.toHash() const 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   461,659,456 ( 3.42%)  
???:std.int128.Int128.this(core.int128.Cent) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   460,297,816 ( 3.41%)  ???:object.TypeInfo_Struct.getHash(scope 
const(void*)) const 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   347,198,989 ( 2.57%)  ???:object.TypeInfo_Struct.equals(in 
void, in void) const 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
   288,537,160 ( 2.14%)  ???:core.int128.add(core.int128.Cent, 
core.int128.Cent) 
[/home/renato/programming/experiments/prechelt-phone-number-encoding/src/d/dencoder]
```

D Profiler:

```
======== Timer frequency unknown, Times are in Megaticks ========

   Num          Tree        Func        Per
   Calls        Time        Time        Call

29879388       44037       11267           0     void 
dencoder.printTranslations(immutable(char)[][][std.int128.Int128], dencoder.ISolutionHandler, immutable(char)[], immutable(char)[], std.container.array.Array!(immutable(char)[]).Array)
126827950        4306        3599           0     inout pure 
nothrow ref @property @nogc return @safe 
inout(std.container.array.Array!(immutable(char)[]).Array.Payload) std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.refCountedPayload()
29879268        7293        3100           0     pure nothrow 
@nogc ulong 
std.container.array.Array!(immutable(char)[]).Array.Payload.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
33534720        4896        2780           0     const pure 
nothrow @property @nogc @safe bool 
std.container.array.Array!(immutable(char)[]).Array.empty()
29879268       11834        2479           0     pure nothrow 
@nogc ulong 
std.container.array.Array!(immutable(char)[]).Array.insertBack!(immutable(char)[]).insertBack(immutable(char)[])
29879268        8702        2279           0     pure nothrow 
@nogc @safe void 
std.container.array.Array!(immutable(char)[]).Array.removeBack()
282919518        2045        2045           0     const pure 
nothrow @property @nogc @safe bool 
std.typecons.RefCounted!(std.container.array.Array!(immutable(char)[]).Array.Payload, 0).RefCounted.RefCountedStore.isInitialized()
29879268        2534        1859           0     pure nothrow 
@nogc @safe void 
core.internal.lifetime.emplaceRef!(immutable(char)[], 
immutable(char)[], immutable(char)[]).emplaceRef(ref 
immutable(char)[], ref immutable(char)[])
44511077        3264        1505           0     pure nothrow 
@nogc void 
std.container.array.Array!(immutable(char)[]).Array.__fieldDtor()
44511076        2588        1254           0     pure nothrow 
@nogc scope void 
std.container.array.Array!(immutable(char)[]).Array.__fieldPostblit()
49758574        1684        1243           0     const pure 
nothrow @nogc @safe std.int128.Int128 
std.int128.Int128.opBinary!("+", char).opBinary(const(char))
49758574        2907        1223           0     pure nothrow ref 
@nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("+", 
immutable(char)).opOpAssign(ref immutable(char))
49975294        1796        1208           0     pure nothrow ref 
@nogc @safe std.int128.Int128 std.int128.Int128.opOpAssign!("<<", 
int).opOpAssign(int)
```

As far as I can tell, the only two bottlenecks now are:

* `std.typecons.RefCounted!(Array.Payload, 
0).RefCounted.refCountedPayload()`
* the `Int128` implementation!

To fix the former, I would need to implement my own `Array`, I 
suppose... using ref-counting here seems unnecessary?

But that is a bit beyond what I am willing to do!

The latter could probably be fixed by not using a numeric key at 
all, just bytes... but my previous attempt at doing that didn't 
really give better results. Or perhaps by overriding the hashing 
for `Int128` (I didn't try that because I assumed the default 
would already be close to optimal)?
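For anyone who wants to try the custom-hash route, here is a hypothetical sketch (the `FastKey` wrapper and its mixing constant are my own, untested against the benchmark): wrap the `Int128` and hash its two 64-bit halves directly, bypassing the `TypeInfo_Struct.getHash` path seen in the profile.

```d
import std.int128 : Int128;

// Hypothetical AA key wrapper with a hand-rolled hash: XOR the low
// half with the multiplied high half instead of going through the
// generic TypeInfo_Struct.getHash machinery.
struct FastKey
{
    Int128 value;

    size_t toHash() const @safe pure nothrow
    {
        // Fibonacci-style mixing of the high half into the low half.
        return cast(size_t)(value.data.lo ^ (value.data.hi * 0x9e3779b97f4a7c15UL));
    }

    bool opEquals(const FastKey other) const
    {
        return value == other.value;
    }
}

void main()
{
    string[][FastKey] dict;
    dict[FastKey(Int128(42))] = ["example"];
    assert(FastKey(Int128(42)) in dict);
}
```

Whether this actually beats the default hash is an open question; it would need to be measured against the benchmark.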

So, I ran out of options and am going to call it done.

[Here's my final D 
solution](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/0cbfd41a072718bfb0c0d0af8bb7266471e7e94c/src/d/src/dencoder.d).

The code is still readable (I think everyone can agree D is one 
of the most readable languages of all)!

You can see how it performs in the last comparison run I did 
([all data, including a nice chart, in this 
gist](https://gist.github.com/renatoathaydes/ab5c86b0ea59152693a7236c333ac334)):

```
Proc,Run,Memory(bytes),Time(ms)
===> java -Xms20M -Xmx100M -cp build/java Main
java-Main,3190730752,441
java-Main,3194789888,1440
java-Main,3193692160,2316
java-Main,3194720256,3604
java-Main,3192963072,12861
java-Main,3261128704,31282
===> sbcl --script src/lisp/main.fasl
sbcl,1275994112,83
sbcl,1275998208,1396
sbcl,1275998208,5925
sbcl,1275998208,9752
sbcl,1275998208,64811
sbcl,1276006400,153799
===> ./rust
./rust,23138304,56
./rust,23138304,234
./rust,23138304,1288
./rust,23138304,2444
./rust,9027584,15867
./rust,9027584,36985
===> src/d/dencoder
src/d/dencoder,219041792,67
src/d/dencoder,229904384,1114
src/d/dencoder,229904384,4421
src/d/dencoder,229904384,9087
src/d/dencoder,219041792,51315
src/d/dencoder,219041792,120818
```

## Conclusion

Using `Int128` as a key instead of `BigInt` made the code much 
faster (around 4x).

Using `Array` to avoid repeated reallocations with primitive 
dynamic arrays made the code run much faster for very large 
numbers of calls (very little difference when running for less 
than 10s, but at least 2x faster at around 1-minute runs, where 
the tiny overhead adds up).

As you can see [in the chart I 
posted](https://gist.github.com/renatoathaydes/ab5c86b0ea59152693a7236c333ac334), D's performance is close to that of Common Lisp for all runs, though between 10 and 20% faster. It's nowhere near Rust, unfortunately, with Rust being almost 4x faster (please ignore the Java solution, as that's using a Trie instead of the "numeric" solution, so it's not a fair comparison). I had expected to get very close to Rust, but that didn't happen... I just can't see in the profiling data what's causing D to fall so far behind!

On the memory data: D uses much less memory than Java and Common 
Lisp, but still a lot higher than Rust.

If anyone can find a flaw in my methodology or optimise my code 
so that it gets a couple of times faster, approaching 
Rust's performance, I would greatly appreciate that! But for now, 
my understanding is that the most promising way to get there 
would be to write D in `betterC` style?!


More information about the Digitalmars-d-learn mailing list