The Case Against Autodecode
Marc Schütz via Digitalmars-d
digitalmars-d at puremagic.com
Sat May 28 03:59:56 PDT 2016
On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:
> On 5/27/16 6:56 AM, Marc Schütz wrote:
>> It is not, which has been shown by various posts in this
>> thread.
>
> Couldn't quite find strong arguments. Could you please be more
> explicit on which you found most convincing? -- Andrei
There are several things that iterating over a char range could
mean. (For the sake of simplicity, let's ignore special cases
like `find` and `split`; instead, let's look at `walkLength`,
`retro` and similar.)
BEFORE the introduction of auto decoding, iteration went over
UTF-8 code _units_, which is wrong for any non-ASCII data (except
for the unlikely case where you really want code units).
AFTER the introduction of auto decoding, iteration goes over
UTF-8 code _points_, which is wrong for characters built from
combining sequences, e.g. äöüéòàñ in decomposed form, as is
common on Mac OS X, and more "exotic" ones everywhere (except for
the even more unlikely case where you really want code points).
That is, both the BEFORE and the AFTER behaviour are wrong; each
breaks for different kinds of input in different ways.
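To make the two failure modes concrete, here is a minimal sketch. It
assumes a current Phobos in which auto decoding is still the default
for `string`; the helper names `countCodeUnits` and `countCodePoints`
are made up for illustration and are not part of any library.

```d
import std.algorithm.comparison : equal;
import std.range : retro, walkLength;
import std.utf : byCodeUnit;

// 'a' followed by U+0308 COMBINING DIAERESIS: one user-perceived character.
enum decomposed = "a\u0308";

// BEFORE-style view (hypothetical helper): count UTF-8 code units.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

// AFTER-style view (hypothetical helper): plain walkLength auto-decodes
// the string and therefore counts code points.
size_t countCodePoints(string s)
{
    return s.walkLength;
}

void main()
{
    assert(countCodeUnits(decomposed) == 3);  // 1 byte for 'a', 2 for U+0308
    assert(countCodePoints(decomposed) == 2); // closer, but still not 1

    // Reversing by code point detaches the mark from its base letter.
    assert(equal(decomposed.retro, "\u0308a"));
}
```

Neither count matches the single character a user would see, and
`retro` yields the combining mark before its base letter.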
So, is AFTER an improvement over BEFORE? The set of inputs where
auto decoding produces wrong output is likely smaller, making it
slightly less likely to encounter problems in practice; on the
other hand, it's still wrong, and it's harder to find these
problems during testing. That's like "improving" a bicycle so
that it breaks down only after 30 minutes of riding instead of
after just 10, which means you won't notice the defect during a
test ride.
But there are even more possibilities. It could iterate over
graphemes, which is expensive, but more likely to produce the
results that the user wants. Or it could iterate by lines, or
words (and there are different ways to define what a word is),
and so on.
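As a rough sketch of the grapheme option, `std.uni.byGrapheme` can be
requested explicitly. The helper name `countGraphemes` is an
assumption made for this example; the cost mentioned above comes from
the segmentation work that `byGrapheme` has to do.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: count user-perceived characters (graphemes).
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // Decomposed and precomposed forms of 'ä' both count as one grapheme.
    assert(countGraphemes("a\u0308") == 1);
    assert(countGraphemes("\u00E4") == 1);
}
```

Here the caller states the unit of iteration instead of having the
library pick one.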
The fundamental problem is choosing one of those possibilities
over the others without knowing what the user actually wants,
which is what both BEFORE and AFTER do.
So, what was the original goal when introducing auto decoding? To
improve correctness, right? I would argue that this goal has not
been achieved. Have a look at the article [1], which IMO gives
good criteria for how a _correct_ string type should behave. Both
BEFORE and AFTER fail most of them.
[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/