Help optimize D solution to phone encoding problem: extremely slow performace.

Tue Jan 16 20:28:49 UTC 2024

On Tue, Jan 16, 2024 at 06:54:56PM +0000, Renato via Digitalmars-d-learn wrote:
> On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote:
[...]
> > You are not allowed to emit "1" as the first token in the output as
> > long as there are any dictionary word matches at that position. The
> > relevant paragraph from the problem statement:

Ohhh now I get it.  Initially I misunderstood that as saying that if the
rest of the phone number has at least one match, then a digit is not
allowed.  Now I see that what it's actually saying is that even if some
random dictionary word matches at that position, even if it does not
lead to any full matches, then a digit is excluded.

[...]
> > I also spent a bit of time trying to figure out this nuance when
> > implementing my solution. It doesn't make much sense visually (no
> > back-to-back digits in the output either way), but that's how it is.
> 
> Exactly, this is one of the things that make this problem a bit
> annoying to solve :)

It's a strange requirement, for sure, but I don't think it's annoying.
It makes the problem more Interesting(tm). ;-)

Anyway, I've fixed the problem, now my program produces the exact same
output as Renato's repo. Code is posted below. Interestingly enough, the
running time has now halved to about 0.9 seconds for 1 million phone
numbers. I guess that's caused by the more stringent requirement
excluding many more matching possibilities, effectively pruning away
large parts of the search tree.

> @"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's
> also what I did when I "participated" in the study. Here's [my Trie
> solution in
> Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java).
> 
> These are basically the two common approaches to the problem: a Trie
> or a numeric-based table. According to the study, people who use
> scripting languages almost always go with the numeric approach, while
> people coming from lower level languages tend to use a data structure
> like Trie (if they don't know Trie, they come up with something
> similar which is fascinating), which is harder to implement but more
> efficient in general.

Interesting.  I guess my C/C++ background is showing. ;-)

I'm not sure what exactly motivated me to go this route; I guess it was
just my default preference of choosing the path of least work as far as
the algorithm is concerned: I chose the algorithm that did the least
amount of work needed to produce the right answer.  Scanning through
sections of the dictionary to find a match was therefore excluded; so my
first thought was an AA. But then how much of the initial prefix to look
up an in AA?  Since it can't be known beforehand, I'd have to gradually
lengthen the prefix to search for, which does a lot of repetitive work
(we keep looking up the first few digits repeatedly each time we search
for a longer prefix). Plus, multiple consecutive AA lookups is not
cache-friendly.  So my next thought was, what should I do such that I
don't have to look at the initial digits anymore once I already
processed it?  This line of thought naturally led to a trie structure.

Once I arrived at a trie structure, the next question was how exactly
dictionary entries would be stored in it.  Again, in the vein of doing
the least amount of work I could get away with, I thought, if I stored
words in the trie directly, with each edge encoding a letter, then
during the search I'd have to repeatedly convert letters to the
corresponding phone number digit and vice versa.  So why not do this
conversion beforehand, and store only phone digits in the trie?  This
also had the additional benefit of letting me effectively search
multiple letters simultaneously, since multiple letters map to the same
digit, so scanning a digit is equivalent to searching multiple letters
at the same time.  The output, of course, required the original form of
the words -- so the obvious solution was to attach the original words as
a list of words attached to the trie node representing the end of that
word.

Once this was all decided, the only remaining question was the search
algorithm. This turned out to take the most time in solving this
problem, due to the recursive nature of the search, I had to grapple
with where and how to make the recursive calls, and how to propagate
return values correctly.  The initial implementation only found word
matches, and did not allow the single digits.  Happily, the recursive
algorithm turned out to have enough structure to encode the single digit
requirements as well, although it took a bit of trial and error to find
the correct implementation.

> Can I ask you why didn't you use the [D stdlib
> Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure
> that would've worked, but did you consider that?

Haha, I didn't even think of that. :-D  I wouldn't have wanted to use it
anyway, because it was optimized for single codepoint lookups, and
wouldn't have matched what I wanted to do. Besides, I also understood it
as an internal helper structure used by std.uni, so it isn't really
designed as a trie implementation for general use.

Also, since performance was an issue in your OP, I wanted to avoid
off-the-bat any hidden costs that may be present in a pre-existing trie
implementation that I did not fully understand.  Maybe I'm just affected
with NIH syndrome, but IME, when performance is important you usually
want to write a custom data structure for your inner loop, one that's
fine-tuned to your specific algorithm.  Otherwise the impedance mismatch
may give you suboptimal results.

T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce