Fix Phobos dependencies on autodecoding

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Aug 13 18:11:02 UTC 2019


On Tue, Aug 13, 2019 at 05:43:19PM +0000, Gregor Mückl via Digitalmars-d wrote:
[...]
> We must be seeing different things then. I've taken a screenshot of
> how the post looks to me:
> 
> http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png

Did you copy-n-paste the code and run it?  If you did, the browser may
have done some Unicode processing on the string literal and munged the
results.  Maybe spelling out the second string literal might help:

	writeln("приве\u0301т".retro);

Basically, the issue here is that "е\u0301" should be processed as a
single grapheme, but since it's two separate code points, auto-decoding
splits the grapheme, and when .retro is applied to it, the \u0301 is now
attached to the wrong code point.
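
To make the difference concrete, here is a minimal sketch that reverses the same string both ways.  The helper names reverseByCodePoint and reverseByGrapheme are made up for illustration; byGrapheme and byCodePoint are from std.uni, and the grapheme version follows the pattern shown in the std.uni documentation.

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.stdio : writeln;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverse using the default (auto-decoded)
// iteration, i.e. one code point at a time.
string reverseByCodePoint(string s)
{
    return s.retro.text;
}

// Hypothetical helper: split into graphemes first, reverse those,
// then flatten back into code points.
string reverseByGrapheme(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    string s = "приве\u0301т";
    writeln(reverseByCodePoint(s)); // U+0301 ends up after 'т'
    writeln(reverseByGrapheme(s));  // U+0301 stays with 'е'
}
```

Note that the grapheme version has to buffer the graphemes into an array before reversing, so it is not free either.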

This is probably not the best example, since е\u0301 isn't really how
Russian is normally written (it could be used in some learner
dictionaries to indicate stress, but it's non-standard and most printed
material doesn't do that).  Perhaps a better example might be Hangul Jamo
or Arabic ligatures, but I'm unfamiliar with those languages so I don't
know how to come up with a realistic example.

But the point is that according to Unicode, a grapheme (roughly
speaking) consists of a base character followed by zero or more
combining diacritics.
Auto-decoding treats the base character separately from any combining
diacritics, because it iterates over code points rather than graphemes,
thus when the application is logically dealing with graphemes, you'll
get incorrect results.  But if you're working only with code points,
then auto-decoding works.
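
As a rough illustration of the three levels, the sketch below counts the same string as code units, code points, and graphemes.  The Counts struct and countLevels function are hypothetical names used only for this example; walkLength and byGrapheme are the Phobos functions.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

// Hypothetical container for the three counts.
struct Counts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

// Hypothetical helper: measure one string at each level.
Counts countLevels(string s)
{
    return Counts(
        s.length,                // UTF-8 code units, no decoding
        s.walkLength,            // code points, via auto-decoding
        s.byGrapheme.walkLength  // graphemes, via std.uni
    );
}

void main()
{
    auto c = countLevels("приве\u0301т");
    // 14 code units, 7 code points, 6 graphemes
    writefln("%s code units, %s code points, %s graphemes",
             c.codeUnits, c.codePoints, c.graphemes);
}
```

Only the last number matches what a reader would call the number of "characters" in the word.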

The problem is that most of the time, either (1) you're working with
"characters" ("visual" characters, i.e. graphemes), or (2) you don't
actually care about the string contents but just need to copy / move /
erase a substring.  For (1), auto-decoding gives the wrong results.  For
(2), auto-decoding wastes time decoding code units: you could have just
used a straight memcpy / memcmp / etc.
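
For case (2), something along these lines sidesteps decoding entirely.  This is only a sketch: containsRaw and dropPrefix are hypothetical helpers, and it assumes both strings are valid UTF-8 (a byte-level match of a valid needle then always falls on code point boundaries, though not necessarily on grapheme boundaries).

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;
import std.string : representation;

// Hypothetical helper: substring test on the raw UTF-8 bytes.
// representation() exposes the string as immutable(ubyte)[], so the
// search never decodes anything.
bool containsRaw(string haystack, string needle)
{
    return haystack.representation.canFind(needle.representation);
}

// Hypothetical helper: drop a known prefix by slicing code units.
string dropPrefix(string s, string prefix)
{
    // == on two strings compares the code units directly.
    if (s.length >= prefix.length && s[0 .. prefix.length] == prefix)
        return s[prefix.length .. $];
    return s;
}

void main()
{
    string s = "приве\u0301т";
    writeln(containsRaw(s, "ив"));
    writeln(dropPrefix(s, "при"));
}
```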

Unless you're implementing Unicode algorithms, you rarely need to work
with code points directly. And if you're implementing Unicode
algorithms, you already know (or should already know) at which level you
need to be working (code units, code points, or graphemes), so you
hardly need the default iteration to be code points (just write
.byCodePoint for clarity).

It doesn't make sense to have Phobos iterate over code points *by
default* when it's not the common use case, represents a hidden
performance hit, and in spite of that is still not 100% correct anyway.


T

-- 
Live for a century, learn for a century. And you'll still die a fool.

