Help optimize D solution to phone encoding problem: extremely slow performance.

Renato renato at athaydes.com
Sat Jan 20 13:07:39 UTC 2024


Wow, fantastic feedback from lots of people, thanks so much!


I will try to answer all points raised in the several answers 
I've got here since yesterday.

On Saturday, 20 January 2024 at 01:27:50 UTC, H. S. Teoh wrote:
>
> I'm confused by the chained hashing of the digits. Why is that 
> necessary?  I would have thought it'd be faster to hash the 
> entire key instead of computing the hash of each digit and 
> chaining them together.
>
> I looked up Rust's ahash algorithm. Apparently they leverage 
> the CPU's hardware AES instruction to compute a 
> collision-resistant hash very quickly.
>
> Somebody should file a bug on druntime to implement this where 
> the hardware supports it, instead of the current hashOf. For 
> relatively small keys this would be a big performance boost.
>
>
> T

[This is the 
commit](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/e757f34c3ddd16240e3515bc91564f2a134659ff) I added "incremental hashing". What you're suggesting I also tried but it was much slower.

The reason incremental hashing is much faster than computing the 
hash of the whole key on each iteration is that you're doing a 
lot less work. To compute the hash of an array fully, you need to 
do something like this (it's just how hashing works):

```d
auto currentHash = <something>; // the seed
foreach (item; array) {
    // chain: mix each item into the running hash
    currentHash = hashOf(item, currentHash);
}
```

But by the structure of the problem, we know that we keep 
appending items to the `array` and re-computing the hash (in the 
`printTranslations` function, which is the hot path... so ignore 
the dictionary loading for now). What you're suggesting is to do 
the above for the full array on each iteration... what I did was 
optimise that so we only compute the hash of the newly added 
item on each iteration - which is the only work that is actually 
required.
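A minimal sketch of the incremental idea (illustrative names, not the actual solution's; the seeded `hashOf` overload is druntime's):

```d
import std.stdio : writeln;

// Incremental hashing: carry the running hash forward and only mix in
// the newly appended digit, instead of rehashing the whole array
// after every append.
size_t hashAppend(size_t currentHash, ubyte digit)
{
    // druntime's seeded overload: hashOf(value, seed)
    return hashOf(digit, currentHash);
}

void main()
{
    ubyte[] digits = [5, 6, 2, 4, 8, 2];

    // Rehashing the full prefix on every append is O(n) per append,
    // O(n^2) overall; the incremental update is O(1) per append.
    size_t incremental = 0;
    foreach (d; digits)
        incremental = hashAppend(incremental, d);

    writeln(incremental);
}
```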

As Siarhei Siamashka mentioned: you could keep optimising this 
using heuristics and knowledge about the problem, at which point 
you might converge to a Trie implementation! But for the purpose 
of this comparison, I'm happy to stop at a good enough, kind of 
general purpose algorithm which is comparable in multiple 
languages... I might compare the D Trie impl with a Rust Trie 
impl in the future, but for now, I've had enough of that.

Hope that makes sense now... try to undo it and I believe it will 
run around 30% slower.

> One of the most important thing I found is that every call to 
> printTranslations allocates a new array (`auto keyValue = new 
> ubyte[...];`).  Given the large number of recursions involved 
> in this algorithm, this will add up to quite a lot.

Bingo! I found that before reading your suggestion :) You can see 
the commit where I changed that 
[here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/4e8508c5d16a2b623a4673d9c056a68e03b3bd87). This one-line change was the most effective one, making the D impl not only consume a lot less memory, but also become quite a bit faster (I will double-check later, but I believe this made the code run almost 50% faster on the longest runs)!

> Secondly, your hash function looks suspicious. Why are you 
> chaining your hash per digit?

I've answered that above... but I went further and, instead of 
using the built-in `hashOf` function, I decided to use [my own 
hashing 
function](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3) which takes advantage of the input characteristics... I mentioned before the trick I used to make the int128 solution faster (by the way, the int128 solution was not acceptable, as @ssvb pointed out, which is why I had to remove it). This is a very simple hashing function that works well enough for this particular problem, and it also boosted performance noticeably.
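The idea can be sketched as follows (an illustration of a digit-aware hash, not the committed function):

```d
// Keys are sequences of digits 0-9, so interpreting the sequence as a
// base-10 number already spreads keys well for this problem (wrapping
// on overflow is fine for a hash).
size_t digitHash(const(ubyte)[] digits)
{
    size_t h = 0;
    foreach (d; digits)
        h = h * 10 + d; // each digit gets its own decimal "slot"
    return h;
}

void main()
{
    assert(digitHash([1, 2, 3]) == 123);
    assert(digitHash([0, 0, 7]) == 7);
}
```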

> Then a minor point: I wouldn't use Array in printTranslations. 
> It's overkill for what you need to do; a built-in array would 
> work just fine.

Hm, sorry to break it to you, but [changing 
that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b927bc9cc4e33f9f6ba457d8abc3bccbc17c7f1e) to an Array was the single greatest improvement I made. IIRC it made the code 2x faster.

Maybe I was using native arrays wrongly before... but it's 
unclear to me how I could've done better (and the docs 
themselves suggest using Array if performance matters, which I 
seem to have confirmed is good advice).
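For reference, the usage pattern is roughly this (a minimal sketch, not the actual hot-path code): `std.container.Array` keeps its storage outside the GC heap and reuses capacity across push/pop cycles, which suits the recursive backtracking.

```d
import std.container.array : Array;

void main()
{
    Array!ubyte buf;
    buf.reserve(64);              // pay for capacity once
    buf.insertBack(cast(ubyte) 5);
    buf.insertBack(cast(ubyte) 6);
    assert(buf.length == 2);
    buf.removeBack();             // backtrack without freeing storage
    assert(buf.length == 1 && buf[0] == 5);
}
```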

> Next, what stood out is ISolutionHandler.  If I were to write 
> this, I wouldn't use OO for this at all, and especially not 
> interfaces, because they involve a double indirection.

As mentioned by Daniel Kozak, this doesn't matter. I tried 
rewriting it so that there was no indirection to the extent 
possible (I used delegate fields within a final class so the 
functions to use are selected at startup, so there's no overhead 
choosing which functions to call at runtime) and it was actually 
slightly slower (but too close to call really). My solution with 
interfaces had two final classes used, so I think there was very 
little overhead as no vtable was needed, right? You can check [my 
change 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d96a73f95b8e204d9325ed90e2ad19cece5e0c18), let me know if you can find a flaw in it.
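The shape of that experiment, sketched with illustrative names (not the actual code): select the behaviour once at startup by storing delegates in a final class, so no interface dispatch happens in the hot loop.

```d
// Devirtualization sketch: the delegate fields are assigned once in
// the constructor, so the hot path just calls through a field.
final class SolutionHandler
{
    void delegate(const(char)[]) onSolution;
    size_t count;

    this(bool countOnly)
    {
        onSolution = countOnly ? &this.countIt : &this.printIt;
    }

    void countIt(const(char)[] s) { ++count; }
    void printIt(const(char)[] s) { /* write the solution out */ }
}

void main()
{
    auto h = new SolutionHandler(true);
    h.onSolution("5624-82: mir Tor");
    h.onSolution("5624-82: Mix Tor");
    assert(h.count == 2);
}
```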

> Finally, a very a small point: in loadDictionary, you do an AA 
> lookup with `n in result`, and then if that returns null, you 
> create a new entry. This does two AA lookups, once 
> unsuccessfully, and the second time to insert the missing key.  
> You could use the .require operation with a delegate instead of 
> `in` followed by `if (... is null)`, which only requires a 
> single lookup.

Thanks for pointing that out. I [changed 
that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/1ee3d858466904a74468eb139675ec7cf2c690d0) to use `require`, but it made no perceptible difference... still I kept this in my "final" implementation because it was much cleaner!
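The single-lookup pattern looks like this (`require` lives in druntime's `object` module, so no import is needed; it returns a ref to the existing value, inserting the lazily-evaluated default first if the key is missing):

```d
void main()
{
    string[][string] dict;
    dict.require("562482") ~= "mir";  // one lookup: inserts [], appends
    dict.require("562482") ~= "Tor";  // one lookup: finds it, appends
    assert(dict["562482"] == ["mir", "Tor"]);
}
```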

> ryuukk_:
> You do hash map lookup for every character in D, it's slow, 
> whereas in Rust you do it via pattern matching, java does the 
> same, pattern matching

I [rewrote 
that](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/b8c072f113f5a36b545516051181ef877a8821e8) to use a switch. The speed seems to have increased very slightly for the very small inputs, it's difficult to tell.
But I kept this change anyway because it was simpler and closer 
to the Rust implementation.

Notice that this table lookup is not in the hot path, so as 
expected, it only seems to speed up loading the dictionary 
slightly (which is good because D was behind for small inputs - 
and even this tiny improvement could help D catch up).

> To save from having to write a old-school verbose switch, i 
> suspect he went with a hashmap, wich is slower in that case, 
> hence why i keep advocate for that feature, or i should say, 
> that upgrade to switch, wich java has adopted, as well as rust:

The reason I used a hash table was that I believed that would be 
taking advantage of D's compile-time programming and would be 
even faster than a switch. To be completely sure that's not the 
case, I would need to benchmark that function individually, but 
I'm not so interested in knowing that as that made almost no 
difference in the implementation's performance anyway.

> Yet another reason to advocate for pattern matching in D and 
> switch as expression

Hm... not sure your argument holds here, given the above :/

> evilrat:
> There is another important difference, i quickly looked up D 
> associative array implementation and the search looks like 
> nlog(n) complexity with plain loop iteration of hashes, whereas 
> rust's implementation is using "swisstable" algorithm 
> implementation that has packed SIMD optimized lookups, this is 
> likely where the 3x speed difference comes from.

I am not using the default hasher in Rust (the std `HashMap` is 
built on hashbrown, as you've found out, but it defaults to a 
DoS-resistant SipHash hasher). That's too slow (Java's default 
hash is also DoS-resistant, yet still faster). I had to replace 
it with a library called `ahash`.

If you're interested in knowing more about that, please [read my 
blog 
post](https://renato.athaydes.com/posts/how-to-write-fast-rust-code) about optimising the Rust code.

So, no, that's not where the difference is coming from... in 
fact, I am using a faster hashing function in D.

You can [see my custom hashing function 
here](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/d1fa1b77ad928f7d536728fba11f2d69b4afe7e3). This function is not good for the general purpose case, but it's probably as good as it gets for this problem (see my int128 trick in a previous post on this thread to understand why). I did try a few variations of hashing before settling on this one... Rust, if anything, is at a disadvantage here.

One more thing I did was to avoid allocations in the print 
function, so that D would have fewer memory allocations (which 
had already proved to be a big slowdown) and perform better on 
the print problem.

This [change did make the code 
faster](https://github.com/renatoathaydes/prechelt-phone-number-encoding/commit/8937b80ec9c851ca19678b0a708c2ef84c499f8c) and it actually caught up with Rust for the smallest input.
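The allocation-free print idea can be sketched like this (illustrative, not the committed code): build each output line in a reusable `Appender` and `clear()` it between lines, so the buffer's capacity is recycled instead of reallocated.

```d
import std.array : appender;
import std.stdio : writeln;

void main()
{
    auto line = appender!(char[])();
    foreach (word; ["mir", "Tor"])
    {
        line.clear();            // rewind, keep the capacity
        line ~= "5624-82: ";
        line ~= word;
        writeln(line.data);
    }
}
```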

> Jordan Wilson:
> I agree! Thanks for posting your benchmarks, I thought your 
> whole benching setup was pretty good, and learnt alot from your 
> code and the resulting contributions in the thread and others.

Thanks for that :) I am used to being mercilessly roasted in 
threads like this, because no one likes to be told their 
favourite language is not so fast :D so any support is highly 
appreciated!

Anyway, here's the fastest implementation I came up with based 
on all the feedback here that still resembles the Common Lisp 
algorithm and the Rust implementation:

https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/1ee3d858466904a74468eb139675ec7cf2c690d0/src/d/src/dencoder.d

(branch name is `dlang-key-hash-incremental-avoid-gc`)

Performance on Mac M1:

```
Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,23920640,36
./rust,24100864,144
./rust,23986176,581
./rust,24100864,1152
./rust,7815168,6644
./rust,7766016,11752
===> src/d/dencoder
src/d/dencoder,13385728,31
src/d/dencoder,24477696,262
src/d/dencoder,24477696,1173
src/d/dencoder,24395776,2313
src/d/dencoder,5701632,7108
src/d/dencoder,5767168,13692
```

Performance on Linux x86_64 (compiled with LDC 1.36.0 and dub `-b 
release-nobounds`):

```
Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,23138304,62
./rust,23138304,236
./rust,23138304,932
./rust,23138304,1828
./rust,9027584,6108
./rust,9027584,13759
===> src/d/dencoder
src/d/dencoder,229851136,62
src/d/dencoder,229851136,403
src/d/dencoder,229851136,1789
src/d/dencoder,229851136,3573
src/d/dencoder,218996736,8254
src/d/dencoder,218996736,21412
```

Performance on Linux x86_64 using GDC:

```
Proc,Memory(bytes),Time(ms)
===> ./rust
./rust,25178112,57
./rust,25178112,262
./rust,25178112,1082
./rust,25178112,2084
./rust,11067392,6328
./rust,11067392,35849
===> src/d/dencoder
src/d/dencoder,231968768,72
src/d/dencoder,231968768,635
src/d/dencoder,231968768,2875
src/d/dencoder,231968768,5739
src/d/dencoder,221110272,18195
src/d/dencoder,221110272,128987
```

D overhead (time ratio relative to Rust, Linux x86_64) with the 
fastest compiler, LDC:

```
1.0
1.707627119
1.919527897
1.954595186
1.351342502
1.556217748
```

If anyone ever asks me, I will say that D can be as fast as 
Rust/C, but only if you're very careful to use the D features 
that perform well... if you just write plain, clean D code (use 
the GC, native arrays etc.), you're likely to end up with code 
that's at least 50% slower.

Thanks everyone for all the input.


More information about the Digitalmars-d-learn mailing list