The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 01:44:20 PDT 2016


On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d 
wrote:
> The intent of autodecoding was to make std.algorithm work meaningfully
> with strings. As it's easy to see I just went through
> std.algorithm.searching alphabetically and found issues literally with
> every primitive in there. It's an easy exercise to go forth with the others.

It comes down to the question of whether it's better to fail quickly when
Unicode is handled incorrectly, so that it's obvious that you're doing it
wrong, or whether it's better for it to work in a large number of cases, so
that for a lot of code it "just works" even though it's still wrong in the
general case. In the latter situation, it's a lot less obvious that anything
is wrong, so many folks won't realize that they need to do more in order for
their string handling to be Unicode-correct.

With code units - especially UTF-8 - it becomes obvious very quickly that
treating each element of the string/range as a character is wrong. With code
points, you have to work far harder to find examples where the results are
incorrect. So, it's not at all obvious (especially to the lay programmer)
that the Unicode handling is incorrect and that their code is wrong - but
their code will end up working a large percentage of the time in spite of
being wrong in the general case.
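
For illustration, here's a quick sketch of how the three levels disagree on
something as simple as counting "characters" once a combining mark shows up
(byCodeUnit is in std.utf, byGrapheme is in std.uni, walkLength is in
std.range):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
        // one user-perceived character, two code points, three UTF-8 code units.
        string s = "e\u0301";

        writeln(s.byCodeUnit.walkLength); // 3 - UTF-8 code units
        writeln(s.walkLength);            // 2 - auto-decoded code points
        writeln(s.byGrapheme.walkLength); // 1 - graphemes
    }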

So, yes, it's trivial to show how operating on ranges of code units as if
they were characters gives incorrect results far more easily than operating
on ranges of code points does. But operating on code points as if they were
characters is still going to give incorrect results in the general case.
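
For example, here's a rough sketch of the sort of case where code points
aren't enough - searching for the precomposed "é" in a string that spells it
with a combining character (using std.algorithm.searching.canFind and
std.uni.normalize):

    import std.algorithm.searching : canFind;
    import std.stdio : writeln;
    import std.uni : NFC, normalize;

    void main()
    {
        string haystack = "cafe\u0301"; // "café" with a combining acute accent
        string needle   = "caf\u00E9";  // "café" precomposed

        // At the code point level the two spellings don't match, even though
        // they're canonically equivalent, so the search finds nothing.
        writeln(haystack.canFind(needle)); // false

        // Normalizing both sides first (or comparing graphemes) is what a
        // Unicode-correct search actually requires.
        writeln(normalize!NFC(haystack).canFind(normalize!NFC(needle))); // true
    }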

Regardless of auto-decoding, the answer is that the programmer needs to
understand the Unicode issues and use ranges of code units or code points
where appropriate and use ranges of graphemes where appropriate. It's just
that if we default to handling code points, then a lot of code will be
written which treats those as characters, and it will provide the correct
result more often than it would if it treated code units as characters.
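
In practice, that just means explicitly picking the level you iterate at. A
rough sketch (the function names here are made up for illustration;
byCodeUnit is in std.utf, byGrapheme is in std.uni):

    import std.algorithm.searching : startsWith;
    import std.range : take, walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    // Code units are fine (and avoid decoding entirely) when only ASCII
    // matters, e.g. checking for an ASCII prefix.
    bool hasShebang(string s)
    {
        return s.byCodeUnit.startsWith("#!".byCodeUnit);
    }

    // Graphemes are what's needed when "character" means what the user sees,
    // e.g. looking at the first few user-perceived characters.
    size_t displayedLength(string s, size_t maxChars)
    {
        return s.byGrapheme.take(maxChars).walkLength;
    }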

In any case, I've probably posted too much in this thread already. It's
clear that the first step to solving this problem is to improve Phobos so
that it handles ranges of code units, code points, and graphemes correctly
whether auto-decoding is involved or not, and only then can we consider the
possibility of removing auto-decoding (and even then, the answer may still
be that we're stuck, because we consider the resulting code breakage to be
too great). But whether Phobos retains auto-decoding or not, the Unicode
handling stuff in general is the same, and what we need to do to improve the
situation is the same. So, clearly, I need to do a much better job of
finding time to work on D so that I can create some PRs to help the
situation.  Unfortunately, it's far easier to find a few minutes here and
there while waiting on other stuff to shoot off a post or two in the
newsgroup than it is to find time to substantively work on code. :|

- Jonathan M Davis


