The Case Against Autodecode

Marc Schütz via Digitalmars-d digitalmars-d at puremagic.com
Sat May 28 03:59:56 PDT 2016


On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:
> On 5/27/16 6:56 AM, Marc Schütz wrote:
>> It is not, which has been shown by various posts in this 
>> thread.
>
> Couldn't quite find strong arguments. Could you please be more 
> explicit on which you found most convincing? -- Andrei

There are several possibilities of what iteration over a char 
range can mean. (For the sake of simplicity, let's ignore special 
cases like `find` and `split`; instead, let's look at 
`walkLength`, `retro` and similar.)

BEFORE the introduction of auto decoding, it used to iterate over 
UTF-8 code _units_, which is wrong for any non-ASCII data (except 
for the unlikely case where you really want code units).

AFTER the introduction of auto decoding, it iterates over Unicode 
code _points_, which is wrong for characters built from combining 
marks, e.g. äöüéòàñ in decomposed form on Mac OS X, and more 
"exotic" ones everywhere (except for the even more unlikely case 
where you really want code points).

That is, both the BEFORE and AFTER behaviours are wrong; they 
just break for different kinds of input in different ways.
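
To make the two failure modes concrete, here is a minimal sketch. 
It assumes a Phobos version that provides `std.utf.byCodeUnit`; 
the helper names `countCodeUnits` and `countCodePoints` are made 
up for this example and are not part of Phobos.

```d
import std.algorithm.comparison : equal;
import std.range : retro, walkLength;
import std.utf : byCodeUnit;

// Hypothetical helper: counts UTF-8 code units (the BEFORE view).
size_t countCodeUnits(string s) { return s.byCodeUnit.walkLength; }

// Hypothetical helper: counts code points (the AFTER view, i.e. what
// auto decoding yields when a string is used as a range).
size_t countCodePoints(string s) { return s.walkLength; }

void main()
{
    string precomposed = "\u00E4";  // "ä" as a single code point
    string decomposed  = "a\u0308"; // "ä" as 'a' + COMBINING DIAERESIS

    // BEFORE: the precomposed "ä" already counts as 2.
    assert(countCodeUnits(precomposed) == 2);
    // AFTER: the precomposed form is fine ...
    assert(countCodePoints(precomposed) == 1);
    // ... but the decomposed form of the same character counts as 2.
    assert(countCodeUnits(decomposed) == 3);
    assert(countCodePoints(decomposed) == 2);

    // retro reverses code points, so the combining mark ends up in
    // front of the 'a' and no longer belongs to it.
    assert(equal("xa\u0308".retro, "\u0308ax"));
}
```

Both strings display as a single "ä", yet neither view reports a 
length of 1 for both of them.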

So, is AFTER an improvement over BEFORE? The set of inputs where 
auto decoding produces wrong output is likely smaller, making it 
slightly less likely to encounter problems in practice; on the 
other hand, it's still wrong, and it's harder to find these 
problems during testing. That's like "improving" a bicycle so 
that it only breaks down after riding it for 30 minutes instead 
of just after 10 minutes, so you won't notice it during a test 
ride.

But there are even more possibilities. It could iterate over 
graphemes, which is expensive, but more likely to produce the 
results that the user wants. Or it could iterate by lines, or 
words (and there are different ways to define what a word is), 
and so on.

The fundamental problem is choosing one of those possibilities 
over the others without knowing what the user actually wants, 
which is what both BEFORE and AFTER do.
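
As a rough sketch of what making the choice explicit could look 
like, the following uses adapters that, as far as I know, already 
exist in Phobos (`byCodeUnit` and `byDchar` in `std.utf`, 
`byGrapheme` in `std.uni`, `lineSplitter` in `std.string`). It 
illustrates the idea only and is not a proposal for a concrete 
API.

```d
import std.range : walkLength;
import std.string : lineSplitter;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "a\u0308"; // "ä" as 'a' + COMBINING DIAERESIS

    // The caller states which unit of iteration is meant.
    assert(s.byCodeUnit.walkLength == 3); // UTF-8 code units
    assert(s.byDchar.walkLength == 2);    // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes (more expensive)

    // Iterating by lines is yet another valid interpretation.
    assert("one\ntwo".lineSplitter.walkLength == 2);
}
```

The same string yields three different lengths, and each of them 
is the right answer for some use case.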

So, what was the original goal when introducing auto decoding? To 
improve correctness, right? I would argue that this goal has not 
been achieved. Have a look at the article [1], which IMO gives 
good criteria for how a _correct_ string type should behave. Both 
BEFORE and AFTER fail most of them.

[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/

