Help optimize D solution to phone encoding problem: extremely slow performace.

Fri Jan 19 15:43:44 UTC 2024

On Fri, Jan 19, 2024 at 01:40:39PM +0000, Renato via Digitalmars-d-learn wrote:
> On Friday, 19 January 2024 at 10:15:57 UTC, evilrat wrote:
[...]
> > Additionally if you comparing D by measuring DMD performance -
> > don't.  It is valuable in developing for fast iterations, but it
> > lacks many modern optimization techniques, for that we have LDC and
> > GDC.
> 
> I tried with DMD again, and yeah, it's much slower.

For anything where performance is even remotely important, I wouldn't
even consider DMD.  It's a well-known fact that it produces suboptimal
executables.  Its only redeeming factor is really only its fast
turnaround time.  If fast turnaround is not important, I would always
use LDC or GDC instead.

> Here's the [current implementation in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/d/src/dencoder.d),
> and the roughly [equivalent Rust
> implementation](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-key-hash-incremental/src/rust/phone_encoder/src/main.rs).

Taking a look at this code:

One of the most important thing I found is that every call to
printTranslations allocates a new array (`auto keyValue = new
ubyte[...];`).  Given the large number of recursions involved in this
algorithm, this will add up to quite a lot.  If I were optimizing this
code, I'd look into ways of reducing, if not eliminating, this
allocation.  Observe that this allocation is needed each time
printTranslations recurses, so instead of making separate allocations,
you could put it on a stack. Either with alloca, or with my appendPath()
trick in my version of the code: preallocate a reasonably large buffer
and take slices of it each time you need a new keyValue array.

Secondly, your hash function looks suspicious. Why are you chaining your
hash per digit?  That's a lot of hash computations.  Shouldn't you just
hash the entire key each time?  That would eliminate the need to store a
custom hash in your key, you could just lookup the entire key at once.

Next, what stood out is ISolutionHandler.  If I were to write this, I
wouldn't use OO for this at all, and especially not interfaces, because
they involve a double indirection.  I'd just return a delegate instead
(single indirection, no object lookup).  This is a relatively small
matter, but when it's being used inside a hot inner loop, it could be
important.

Then a minor point: I wouldn't use Array in printTranslations. It's
overkill for what you need to do; a built-in array would work just fine.
Take a look at the implementation of Array and you'll see lots of
function calls and locks and GC root-adding and all that stuff. Most of
it doesn't apply here, of course, and is compiled out. Nevertheless, it
uses wrapped integer operations and range checks, etc.. Again, these are
all minor issues, but in a hot inner loop they do add up. Built-in
arrays let you literally just bump the pointer when adding an element.
Just a couple of instructions as opposed to several function calls.
Important difference when you're on the hot path.  Now, as I mentioned
earlier w.r.t. my own code, appending to built-in arrays comes with a
cost. So here's where you'd optimize by creating your own buffer and
custom push/pop operations. Something like appendPath() in my version of
the code would do the job.

Finally, a very a small point: in loadDictionary, you do an AA lookup
with `n in result`, and then if that returns null, you create a new
entry. This does two AA lookups, once unsuccessfully, and the second
time to insert the missing key.  You could use the .require operation
with a delegate instead of `in` followed by `if (... is null)`, which
only requires a single lookup.  Probably not an important point, but for
a large dictionary this might make a difference.

> The only "significant" difference is that in Rust, an enum
> `WordOrDigit` is used to represent currently known "words"... I [did
> try using that in
> D](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/dlang-int128-word-and-digit/src/d/src/dencoder.d),
> but it made the algorithm slower.
> 
> If you see anything in D that's not as efficient as it should be, or
> somehow "inferior" to what the Rust version is doing , please let me
> know.

Couldn't tell you, I don't know Rust. :-D

> Notice that almost all of the time is spent in the for-loop inside
> `printTranslations` (which is misnamed as it doesn't necessarily
> "print" anything, like it did earlier) - the rest of the code almost
> doesn't matter.
[...]

Of course, that's where your hot path is.  And that loop makes recursive
calls to printTranslations, so the entire body of the function could use
some optimization. ;-)

Try addressing the points I wrote above and see if it makes a
difference.

T

-- 
The two rules of success: 1. Don't tell everything you know. -- YHL