Dicebot on leaving D: It is anarchy driven development in all its glory.

H. S. Teoh hsteoh at quickfur.ath.cx
Mon Aug 27 15:36:31 UTC 2018


On Sun, Aug 26, 2018 at 11:12:10PM +0000, FeepingCreature via Digitalmars-d wrote:
[...]
> Can I just throw in here that I like autodecoding and I think it's
> good?  If you want ranges that iterate over bytes, then just use
> arrays of bytes.  If you want Latin1 text, use Latin1 strings. If you
> want Unicode, you get Unicode iteration. This seems right and proper
> to me. Hell I'd love if the language was *more* aggressive about
> validating casts to strings.

Actually, this is exactly the point that makes autodecoding so bad,
because it *looks like* correct Unicode iteration over characters, but
it actually isn't.  It's iteration over Unicode *code points*, which is
not the same thing as iterating over what people think of as
"characters", which Unicode calls *graphemes* (cf. byGrapheme).

So iterating over a string like "a\u0301" ('a' followed by the
combining acute accent U+0301) will give you two code points, even
though it renders as a single grapheme.  Unfortunately, most of the
time the iteration will look correct -- most European-language text
happens to use precomposed characters -- so the programmer suspects
nothing is wrong, until the code is handed a non-European Unicode
string (or one full of combining characters) and starts producing
wrong behaviour.
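
A minimal sketch of the mismatch (assuming current Phobos semantics,
where range iteration over string autodecodes to dchar):

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "a\u0301";  // 'a' + U+0301 combining acute accent
        // Autodecoded range iteration counts code points:
        assert(s.walkLength == 2);
        // Grapheme iteration counts user-perceived characters:
        assert(s.byGrapheme.walkLength == 1);
    }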

Not to mention that this incomplete solution imposes an
across-the-board performance hit on all string-processing code (unless
it is explicitly written to bypass autodecoding with something like
byCodeUnit), even when the code in question doesn't care about Unicode
at all and treats strings as opaque byte sequences.
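
For code that really does treat strings as opaque byte sequences, the
escape hatch looks something like this (a sketch; hasSemicolon is just
a hypothetical helper):

    import std.algorithm.searching : canFind;
    import std.utf : byCodeUnit;

    // Scans raw UTF-8 code units directly, so no decoding to dchar
    // ever happens on the hot path.
    bool hasSemicolon(string s)
    {
        return s.byCodeUnit.canFind(';');
    }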

The illusion of simplicity and correctness that autodecoding gives is
misleading, and makes programmers think their code is OK, when the fact
of the matter is that to handle Unicode correctly, you *have* to
actually know what Unicode is and how it works.  You simply cannot
pretend that it bears any resemblance to the ASCII days of one code unit
per character (no, not even with UTF-32) and expect your code to behave
correctly with all valid Unicode input strings.
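
Even UTF-32, where one code unit is exactly one code point, doesn't
restore the one-unit-per-character illusion; a quick sketch:

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        dstring s = "a\u0301"d;  // UTF-32: one code unit per code point
        assert(s.length == 2);                 // still two code units...
        assert(s.byGrapheme.walkLength == 1);  // ...but one visible character
    }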

In fact, this very illusion was what made Andrei choose to go with
autodecoding in the first place, thinking that it would default to
correct behaviour. Unfortunately, the reality didn't match up with that
expectation.

The ideal solution would have been to make strings non-iterable by
default, and only iterable once the programmer explicitly chooses the
mode of iteration (byCodeUnit, byCodePoint, or byGrapheme).
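
All three modes already exist in Phobos, so the explicit version might
look something like:

    import std.range : walkLength;
    import std.utf : byCodeUnit;
    import std.uni : byCodePoint, byGrapheme;

    void main()
    {
        auto s = "a\u0301";
        assert(s.byCodeUnit.walkLength == 3);             // UTF-8 code units
        assert(s.byGrapheme.byCodePoint.walkLength == 2); // code points
        assert(s.byGrapheme.walkLength == 1);             // graphemes
    }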


T

-- 
What do you call optometrist jokes? Vitreous humor.

