Help optimize D solution to phone encoding problem: extremely slow performance.

Renato renato at athaydes.com
Fri Jan 19 08:57:40 UTC 2024


On Friday, 19 January 2024 at 05:17:51 UTC, H. S. Teoh wrote:
> On Thu, Jan 18, 2024 at 04:23:16PM +0000, Renato via 
> Digitalmars-d-learn wrote: [...]
>> Ok, last time I'm running this for someone else :D
>> 
>> ```
>> Proc,Memory(bytes),Time(ms)
>> ===> ./rust
>> ./rust,23920640,30
>> ./rust,24018944,147
>> ./rust,24068096,592
>> ./rust,24150016,1187
>> ./rust,7766016,4972
>> ./rust,8011776,46101
>> ===> src/d/dencoder
>> src/d/dencoder,44154880,42
>> src/d/dencoder,51347456,87
>> src/d/dencoder,51380224,273
>> src/d/dencoder,51462144,441
>> src/d/dencoder,18644992,4414
>> src/d/dencoder,18710528,43548
>> ```
>
> OK, this piqued my interest enough that I decided to install 
> rust using rustup instead of my distro's package manager.  Here 
> are the numbers I got for my machine:
>
> ===> ./rust
> ./rust,22896640,35
> ./rust,22896640,137
> ./rust,22384640,542
> ./rust,22896640,1034
> ./rust,8785920,2489
> ./rust,8785920,12157
> ===> src/d/dencoder
> src/d/dencoder,1066799104,36
> src/d/dencoder,1066799104,72
> src/d/dencoder,1066799104,198
> src/d/dencoder,1066799104,344
> src/d/dencoder,1035292672,2372
> src/d/dencoder,1035292672,13867
>
> Looks like we lost out to Rust for larger inputs. :-D  Probably 
> due to environmental factors (and the fact that std.stdio is 
> slow).  I re-ran it again and got this:
>
> ===> ./rust
> ./rust,22896640,30
> ./rust,22896640,131
> ./rust,22896640,511
> ./rust,22896640,983
> ./rust,8785920,3102
> ./rust,8785920,9912
> ===> src/d/dencoder
> src/d/dencoder,1066799104,36
> src/d/dencoder,1066799104,71
> src/d/dencoder,1066799104,197
> src/d/dencoder,1066799104,355
> src/d/dencoder,1035292672,3441
> src/d/dencoder,1035292672,9471
>
> Notice the significant discrepancy between the two runs; this 
> seems to show that the benchmark is only accurate up to about 
> ±1.5 seconds.


That's not correct. The discrepancy is due to the benchmark 
generating different input on each run - and the characteristics 
of the input affect the runs significantly. This hits the last two 
runs much harder because they use the more challenging dictionary. 
The important thing is that the relative performance between the 
languages is reliable.

If you stop re-generating the phone numbers and just run the 
benchmark multiple times using the same input, you'll see it's 
very close between runs.

>
> Anyway, oddly enough, Java seems to beat Rust on larger inputs. 
>  Maybe my Java compiler has a better JIT implementation? :-P

Again, in case I haven't made it clear despite repeating it 
multiple times: the Java code is using a Trie, while the Rust code 
is using a numeric solution. They are completely different 
algorithms. The algorithm matters much more than which language is 
being used, as has been shown again and again.

>
>> Congratulations on beating Rust :D but remember: you're using 
>> a much more efficient algorithm! I must conclude that the Rust 
>> translation of the Trie algorithm would be much faster still, 
>> unfortunately (you may have noticed that I am on D's side 
>> here!).
>
> At this point, it's not really about the difference between 
> languages anymore; it's about the programmer's skill at 
> optimizing his code.
>
> Traditionally Java is thought to be the slowest, because it 
> runs in a VM and generally tends to use more heap allocations.  
> In recent times, however, JIT and advanced GC implementations 
> have significantly levelled that out, so you're probably not 
> going to see the difference unless you hand-tweak your code 
> down to the bare metal.

Java has been very fast for over a decade now; this is not new at 
all.

>
> Surprisingly, at least on my machine, Lisp actually performed 
> the worst. I'd have thought it would at least beat Java, but I 
> was quite wrong. :-D Perhaps the Lisp implementation I'm using 
> is suboptimal, I don't know. Or perhaps modern JVMs have really 
> overtaken Lisp.

Please don't compare different algorithms in different languages 
and draw conclusions about each language's speed.

>
> Now I'm really curious how a Rust version of the trie algorithm 
> would perform.  Unfortunately I don't know Rust so I wouldn't 
> be able to write it myself. (Hint, hint, nudge, nudge ;-)).
>
> As far as the performance of my D version is concerned, I still 
> haven't squeezed out all the performance I could yet.  Going 
> into this, my intention was to take the lazy way of optimizing 
> only what the profiler points out to me, with the slight 
> ulterior motive of proving that a relatively small amount of 
> targeted optimizations can go a long way at making the GC a 
> non-problem in your typical D code. ;-)  I haven't pulled out 
> all the optimization guns at my disposal yet.


You don't really have to... @ssvb's solution is incredibly fast 
at the "count" problem and I really don't think anyone can beat 
that implementation. The only problem is that it is very C-like 
and nothing like the D I would write.

>
> If I were to go the next step, I'd split up the impl() function 
> so that I get a better profile of where it's spending most of 
> its time, and then optimize that.  My current suspicion is that 
> the traversal of the trie could be improved by caching 
> intermediate results to eliminate a good proportion of 
> recursive calls in impl().
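
A rough sketch of what that caching could look like for the 
"count" mode (the `Trie` type below is made up, the digits are 
assumed to be '0'-'9' only, and the problem's single-digit 
fallback rule is left out - an illustration, not the actual 
dencoder code):

```
class Trie
{
    Trie[10] next;   // one child per digit 0-9
    bool isWord;     // true if a dictionary word ends on this node
}

size_t countEncodings(const(char)[] digits, size_t from, Trie root,
                      size_t[] cache, bool[] seen)
{
    if (from == digits.length)
        return 1;                // consumed every digit: one encoding
    if (seen[from])
        return cache[from];      // this suffix was already computed
    size_t total = 0;
    auto node = root;
    foreach (i; from .. digits.length)
    {
        node = node.next[digits[i] - '0'];
        if (node is null)
            break;               // no dictionary word has this prefix
        if (node.isWord)         // a word ends here: encode the rest
            total += countEncodings(digits, i + 1, root, cache, seen);
    }
    seen[from] = true;
    cache[from] = total;
    return total;
}
```

The caller just allocates `cache` and `seen` with the same length 
as the digit string before the first call.
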
>
> Also, the `print` mode of operation is quite slow, probably 
> because writefln() allocates. (It allocates less than if I had 
> used .format like I did before, but it nevertheless still 
> allocates.) To alleviate this cost, I'd allocate an output 
> buffer and write to that, flushing only once it filled up.
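
A minimal sketch of that buffered-output idea (again illustrative 
only, not the actual code; it assumes every single write fits in 
the buffer):

```
import std.stdio : stdout;

struct BufferedOut
{
    char[] buf;
    size_t len;

    this(size_t capacity) { buf = new char[capacity]; }

    void put(const(char)[] s)
    {
        if (len + s.length > buf.length)
            flush();                  // buffer is full: write it out
        buf[len .. len + s.length] = s[];
        len += s.length;
    }

    void flush()
    {
        if (len)
            stdout.rawWrite(buf[0 .. len]);
        len = 0;
    }
}
```

Create it once with a generous capacity (say 64 KB), call `put` 
for each solution, and `flush` once at the end.
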
>
> Another thing I could do is to use std.parallelism.parallel to 
> run searches on batches of phone numbers in parallel. This is 
> kinda cheating, though, since it's the same algorithm with the 
> same cost, we're just putting more CPU cores to work. :-P  But 
> in D this is quite easy to do, often as easy as simply adding 
> .parallel to your outer foreach loop. In this particular case 
> it will need some additional refactoring due to the fact that 
> the input is being read line by line. But it's relatively easy 
> to load the input into a buffer by chunks instead, and just run 
> the searches on all the numbers found in the buffer in parallel.
>
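
In D that is indeed about as easy as it sounds - roughly something 
like this (a sketch only; `encode` is a made-up stand-in for the 
actual search):

```
import std.parallelism : parallel;
import std.stdio : File;

string[] encode(string phone)    // stand-in for the real trie search
{
    return null;
}

void encodeAll(string path)
{
    // load all phone numbers up front instead of streaming line by line
    string[] phones;
    foreach (line; File(path).byLineCopy)
        phones ~= line;

    // each index is written by exactly one task, so no locking is needed
    auto results = new string[][](phones.length);
    foreach (i, phone; parallel(phones))
        results[i] = encode(phone);

    // print the results sequentially afterwards so the output order
    // stays deterministic
}
```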

But I'm not interested in doing that, because what I was trying 
to accomplish was to compare the exact same algorithm in different 
languages, to the extent possible. This is why I ported the Common 
Lisp code to D and then tried to optimize from there, just like I 
did for Rust. I find the numeric implementation much more 
"valuable" for comparing performance than the Trie solution, 
because it exercises quite a few things in a language that are 
common to most programs, like hashing, recursion, and searching in 
general.

Do you know why the whole thread seems to have disappeared?? 
There's a lot of good stuff in it; it would be a huge shame to 
lose all that!


