Help optimize D solution to phone encoding problem: extremely slow performance.

H. S. Teoh hsteoh at qfbox.info
Fri Jan 19 05:17:51 UTC 2024


On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via Digitalmars-d-learn wrote:
[...]
> Ok, last time I'm running this for someone else :D
> 
> ```
> Proc,Run,Memory(bytes),Time(ms)
> ===> ./rust
> ./rust,23920640,30
> ./rust,24018944,147
> ./rust,24068096,592
> ./rust,24150016,1187
> ./rust,7766016,4972
> ./rust,8011776,46101
> ===> src/d/dencoder
> src/d/dencoder,44154880,42
> src/d/dencoder,51347456,87
> src/d/dencoder,51380224,273
> src/d/dencoder,51462144,441
> src/d/dencoder,18644992,4414
> src/d/dencoder,18710528,43548
> ```

OK, this piqued my interest enough that I decided to install rust using
rustup instead of my distro's package manager.  Here are the numbers I
got for my machine:

===> ./rust
./rust,22896640,35
./rust,22896640,137
./rust,22384640,542
./rust,22896640,1034
./rust,8785920,2489
./rust,8785920,12157
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,72
src/d/dencoder,1066799104,198
src/d/dencoder,1066799104,344
src/d/dencoder,1035292672,2372
src/d/dencoder,1035292672,13867

Looks like we lost out to Rust for larger inputs. :-D  Probably due to
environmental factors (and the fact that std.stdio is slow).  I re-ran
the benchmark and got this:

===> ./rust
./rust,22896640,30
./rust,22896640,131
./rust,22896640,511
./rust,22896640,983
./rust,8785920,3102
./rust,8785920,9912
===> src/d/dencoder
src/d/dencoder,1066799104,36
src/d/dencoder,1066799104,71
src/d/dencoder,1066799104,197
src/d/dencoder,1066799104,355
src/d/dencoder,1035292672,3441
src/d/dencoder,1035292672,9471

Notice the significant discrepancy between the two runs; this suggests
that the benchmark is accurate only to within about ±1.5 seconds.

Anyway, oddly enough, Java seems to beat Rust on larger inputs.  Maybe
my Java compiler has a better JIT implementation? :-P


> Congratulations on beating Rust :D but remember: you're using a much
> more efficient algorithm! I must conclude that the Rust translation of
> the Trie algorithm would be much faster still, unfortunately (you may
> have noticed that I am on D's side here!).

At this point, it's not really about the difference between languages
anymore; it's about the programmer's skill at optimizing his code.

Traditionally Java is thought to be the slowest, because it runs in a VM
and generally tends to use more heap allocations.  In recent times,
however, JIT and advanced GC implementations have significantly levelled
that out, so you're probably not going to see the difference unless you
hand-tweak your code down to the bare metal.

Surprisingly, at least on my machine, Lisp actually performed the worst.
I'd have thought it would at least beat Java, but I was quite wrong. :-D
Perhaps the Lisp implementation I'm using is suboptimal, I don't know.
Or perhaps modern JVMs have really overtaken Lisp.

Now I'm really curious how a Rust version of the trie algorithm would
perform.  Unfortunately I don't know Rust so I wouldn't be able to write
it myself. (Hint, hint, nudge, nudge ;-)).

As far as the performance of my D version is concerned, I haven't
squeezed out all the performance I could yet.  Going into this, my
intention was to take the lazy way of optimizing only what the profiler
points out to me, with the slight ulterior motive of proving that a
relatively small amount of targeted optimization can go a long way
toward making the GC a non-problem in your typical D code. ;-)  I
haven't pulled out all the optimization guns at my disposal yet.

If I were to go the next step, I'd split up the impl() function so that
I get a better profile of where it's spending most of its time, and then
optimize that.  My current suspicion is that the traversal of the trie
could be improved by caching intermediate results to eliminate a good
proportion of recursive calls in impl().
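
To illustrate the kind of caching I have in mind, here's a minimal
sketch; the Node struct, searchFrom(), and the cache key are
hypothetical stand-ins, not the actual dencoder code.  The idea is to
memoize partial results keyed on (trie node, position in the digit
string), so that a state that has already been expanded returns
immediately instead of recursing again:

```
// Hypothetical sketch of memoizing the trie traversal; none of these
// names come from the actual dencoder code.
import std.typecons : Tuple, tuple;

struct Node
{
    Node*[10] children;  // one child per digit
    string[] words;      // dictionary words ending at this node
}

// Partial results, keyed on (current trie node, position in the digit
// string), so that each state is expanded at most once.
string[][Tuple!(Node*, size_t)] cache;

string[] searchFrom(Node* node, const(char)[] digits, size_t pos)
{
    auto key = tuple(node, pos);
    if (auto hit = key in cache)
        return *hit;  // state already expanded; skip the recursion

    string[] results;
    // ... recurse into node.children[digits[pos] - '0'] as impl() does,
    // appending any completed encodings to results ...
    cache[key] = results;
    return results;
}
```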

Also, the `print` mode of operation is quite slow, probably because
writefln() allocates. (It allocates less than if I had used .format like
I did before, but it still allocates.)  To alleviate this cost, I'd
allocate an output buffer and write to that, flushing it only once it
fills up.
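
Something along these lines (a minimal sketch, not the actual dencoder
code; the 64 KiB threshold is an arbitrary assumption):

```
import std.array : appender;
import std.format : formattedWrite;
import std.stdio : stdout;

void main()
{
    enum flushThreshold = 64 * 1024;  // assumption: flush every 64 KiB
    auto buf = appender!(char[]);

    foreach (i; 0 .. 100_000)
    {
        // format directly into the buffer instead of calling writefln
        buf.formattedWrite("%d: some encoding\n", i);
        if (buf.data.length >= flushThreshold)
        {
            stdout.rawWrite(buf.data);  // write out one big chunk
            buf.clear();                // reuse the same allocation
        }
    }
    stdout.rawWrite(buf.data);  // flush whatever is left over
}
```

This way the formatting reuses a single buffer instead of allocating
per line, and the number of write calls drops drastically.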

Another thing I could do is to use std.parallelism.parallel to run
searches on batches of phone numbers in parallel.  This is kinda
cheating, though, since it's the same algorithm with the same cost;
we're just putting more CPU cores to work. :-P  But in D this is quite
easy to do, often as easy as simply adding .parallel to your outer
foreach loop.  In this particular case it will need some additional
refactoring because the input is currently read line by line.  But it's
relatively easy to load the input into a buffer by chunks instead, and
just run the searches on all the numbers found in the buffer in
parallel.
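
A rough sketch of what I mean (hypothetical names; for brevity this
loads all the lines up front rather than chunk by chunk):

```
import std.array : array;
import std.parallelism : parallel;
import std.stdio : stdin;

void main()
{
    // byLineCopy duplicates each line, so the lines remain valid inside
    // the parallel loop (byLine reuses its internal buffer, which would
    // be a race here).
    auto numbers = stdin.byLineCopy.array;

    foreach (number; parallel(numbers))
    {
        // hypothetical per-number search; encode() is a stand-in for
        // the real trie lookup:
        //auto encodings = encode(number);
    }
}
```

The catch is that the output order is no longer deterministic, so in
`print` mode the results would have to be collected per batch and
printed in order afterwards.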


On Thu, Jan 18, 2024 at 04:25:45PM +0000, Renato via Digitalmars-d-learn wrote:
[...]
> BTW here's your main function so it can run on the benchmark:
[...]

Thanks, I've adapted my code accordingly and pushed to my github repo.


T

-- 
This is a tpyo.

