The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 10:15:07 PDT 2016


On Friday, May 27, 2016 09:40:21 H. S. Teoh via Digitalmars-d wrote:
> On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
> > On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
> > > > > However the following do require autodecoding:
> > > > >
> > > > > s.walkLength
> > > > > s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
> > > > > s.count!(c => c >= 32) // non-control characters
> > > > >
> > > > > Currently the standard library operates at code point level even
> > > > > though inside it may choose to use code units when admissible.
> > > > > Leaving such a decision to the library seems like a wise thing
> > > > > to do.
> > > >
> > > > But how is the user supposed to know without being a core
> > > > contributor to Phobos?
> > >
> > > Misunderstanding. All examples work properly today because of
> > > autodecoding. -- Andrei
> >
> > They only work "properly" if you define "properly" as "in terms of
> > code points". But working in terms of code points is usually wrong. If
> > you want to count "characters", you need to work with graphemes.
> >
> > https://dpaste.dzfl.pl/817dec505fd2
>
> Exactly. And we just keep getting stuck on this point. It seems that the
> message just isn't getting through. The unfounded assumption continues
> to be made that iterating by code point is somehow "correct" by
> definition and nobody can challenge it.
>
> String handling, especially in the standard library, ought to be (1)
> efficient where possible, and (2) be as correct as possible (meaning,
> most corresponding to user expectations -- principle of least surprise).
> If we can't have both, we should at least have one, right? However, the
> way autodecoding is currently implemented, we have neither.

Exactly. Saying that operating at the code point level - UTF-32 - is correct
is like saying that operating at UTF-16 instead of UTF-8 is correct. Sure,
more full characters fit in a single code unit, but they still don't all fit.
You have to go to the grapheme level for that.
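
To make the distinction concrete, here is a minimal illustration (in Python,
since the three levels are language-independent) of the three "lengths" of
the same visible character. The grapheme count is a simplification that only
treats combining marks as attached to the preceding code point; a full
implementation would follow UAX #29:

```python
# Counting "length" at three Unicode levels for the same visible character:
# U+0065 (e) followed by U+0301 (combining acute accent) renders as one é.
import unicodedata

s = "e\u0301"  # one visible character, two code points, three UTF-8 bytes

code_units_utf8 = len(s.encode("utf-8"))  # 3: one byte for 'e', two for U+0301
code_points = len(s)                      # 2: Python strings index code points
# Simplified grapheme count: a new grapheme starts at each non-combining
# code point. (Real grapheme segmentation follows UAX #29.)
graphemes = sum(1 for c in s if unicodedata.combining(c) == 0)

print(code_units_utf8, code_points, graphemes)  # 3 2 1
```

Autodecoding gives you the middle number, which is neither the storage size
nor what a user would call the number of characters.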

IIRC, Andrei argued in TDPL that UTF-8 was better than UTF-16 because you
find out more quickly when you've screwed up your Unicode handling: very few
Unicode characters fit in a single UTF-8 code unit, whereas many more fit in
a single UTF-16 code unit, making errors harder to catch with UTF-16. Well,
autodecoding makes the same mistake, just with UTF-32 instead of UTF-16. The
code is still wrong, but it's that much harder to catch that it's wrong.
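
A small sketch of that effect (Python used for illustration): a character
like 'é' already needs two UTF-8 code units, while everything in the BMP
fits in one UTF-16 code unit, so a "one code unit per character" bug only
surfaces under UTF-16 once a surrogate pair shows up:

```python
# "é" (U+00E9) plus an emoji outside the BMP (U+1F600): 2 code points total.
s = "\u00e9\U0001F600"

utf8_units = len(s.encode("utf-8"))        # 2 + 4 = 6 code units
utf16_units = len(s.encode("utf-16-le")) // 2  # 1 + 2 (surrogate pair) = 3

# Under UTF-8 the mismatch with the code point count (2) is glaring;
# under UTF-16 it only appears for the rarer non-BMP characters - and
# under UTF-32 it never appears at the code point level at all, so the
# grapheme-level bugs hide even longer.
print(utf8_units, utf16_units)  # 6 3
```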

> Firstly, it is beyond clear that autodecoding adds a significant amount
> of overhead, and because it's automatic, it applies to ALL string
> processing in D.  The only way around it is to fight against the
> standard library and use workarounds to bypass all that
> meticulously-crafted autodecoding code, begging the question of why
> we're even spending the effort on said code in the first place.

The standard library has to fight against itself because of autodecoding!
The vast majority of the algorithms in Phobos are special-cased on strings
in an attempt to get around autodecoding. That alone should highlight the
fact that autodecoding is problematic.

> The fact of the matter is that if you're going to write Unicode string
> processing code, you're gonna hafta know the dirty nitty gritty of
> Unicode strings, including the fine distinctions between code units,
> code points, grapheme clusters, etc.. Since this is required knowledge
> anyway, why not just let the user worry about how to iterate over the
> string? Let the user choose what best suits his application, whether
> it's working directly with code units for speed, or iterating over
> grapheme clusters for correctness (in terms of visual "characters"),
> instead of choosing the pessimal middle ground that's neither efficient
> nor correct?

There is no solution here that's going to be both correct and efficient. We
either need to provide a fully correct solution that's dog slow, or an
efficient solution that requires the programmer to understand Unicode in
order to write correct code. Right now, we have a slow solution that's also
incorrect.
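
One way to frame that choice is to make the iteration level explicit at the
call site, which is the direction Phobos already exposes via std.utf's
byCodeUnit and std.uni's byGrapheme. A rough sketch of the idea (in Python
for illustration; the count helper and its level names are hypothetical, and
the grapheme branch again handles only combining marks, not full UAX #29
segmentation):

```python
# Hypothetical API where the caller picks the level explicitly, instead of
# a library silently decoding to code points on their behalf.
import unicodedata

def count(s: str, level: str) -> int:
    if level == "code_units_utf8":
        return len(s.encode("utf-8"))      # fast, storage-oriented
    if level == "code_points":
        return len(s)                      # what autodecoding gives you
    if level == "graphemes":
        # correct for user-visible characters, but the most expensive;
        # simplified here to "combining marks join the previous grapheme"
        return sum(1 for c in s if unicodedata.combining(c) == 0)
    raise ValueError(f"unknown level: {level}")

print(count("e\u0301", "code_units_utf8"))  # 3
print(count("e\u0301", "code_points"))      # 2
print(count("e\u0301", "graphemes"))        # 1
```

The point isn't this particular helper, but that whichever level is chosen,
the programmer chose it knowingly rather than getting code points by default.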

- Jonathan M Davis


