The Case Against Autodecode

Tue May 31 12:32:30 PDT 2016

On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
[...]
> Does walkLength yield the same number for all representations?

Let's put the question this way. Given the following string, what do
*you* think walkLength should return?

	şŭt̥ḛ́k̠

I think any reasonable person would have to say it should return 5,
because there are 5 visual "characters" here. Otherwise, what is even
the meaning of walkLength?! For it to return anything other than 5 means
that it's a leaky abstraction, because it's leaking low-level
"implementation details" of the Unicode representation of this string.

However, with the current implementation of autodecoding, walkLength
returns 11.  Can anyone reasonably argue that it's reasonable for
"şŭt̥ḛ́k̠".walkLength to equal 11?  What difference does this make if we
get rid of autodecoding, and walkLength returns 17 instead? *Both* are
wrong.

17 is actually the right answer if you're looking to allocate a buffer
large enough to hold this string, because that's the number of bytes
(UTF-8 code units) it occupies.

5 is the right answer for an end user who knows nothing about Unicode.

11 is an answer to a question that only makes sense to a Unicode
specialist, and that no layperson understands.

11 is the answer we currently give. And that, at the cost of
across-the-board performance degradation.  Yet you're seriously arguing
that 11 should be the right answer, by insisting that the current
implementation of autodecoding is "correct".  It boggles the mind.
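
For anyone who wants to check the three numbers, here is a minimal
sketch. It assumes the string is spelled in fully decomposed form (each
base letter followed by its combining marks), which is the spelling
consistent with the counts above; a precomposed spelling would change
the byte and code point counts, but not the 5. The helper names
codeUnits, codePoints and graphemes are made up for illustration, and
the middle one relies on Phobos autodecoding strings the way it
currently does.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Assumption: fully decomposed spelling of the sample string
// (base letter + combining marks), written with escapes so that the
// source file's own normalization cannot change the counts.
enum sample = "s\u0327u\u0306t\u0325e\u0330\u0301k\u0320";

// Hypothetical helpers, named for illustration only.

// Number of UTF-8 code units (bytes): no decoding at all.
size_t codeUnits(string s) { return s.byCodeUnit.walkLength; }

// Number of code points: what autodecoding makes walkLength count.
size_t codePoints(string s) { return s.walkLength; }

// Number of grapheme clusters: the user-perceived "characters".
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    import std.stdio : writeln;
    writeln(codeUnits(sample));   // 17
    writeln(codePoints(sample));  // 11
    writeln(graphemes(sample));   // 5
}
```

Each call answers a different question about the same string, and only
the explicit byGrapheme walk gives the number a user would expect.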

T

-- 
Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.