Fix Phobos dependencies on autodecoding

Gregor Mückl gregormueckl at gmx.de
Wed Aug 14 09:29:30 UTC 2019


On Wednesday, 14 August 2019 at 07:15:54 UTC, Argolis wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
>
>> But we can't make that the default because it's a big 
>> performance hit, and many string algorithms don't actually 
>> need grapheme segmentation.
>
> Can you provide example of algorithms and use cases that don't 
> need grapheme segmentation?
> Are they really SO common that the correct default is go for 
> code points?
>
> Is it not better to have as a default the grapheme 
> segmentation, the correct way of handling a string, instead?

There is no single universally correct way to segment a string. 
Grapheme segmentation requires knowing the text encoding of the 
string and assumes that the encoding is flawless. Neither is 
guaranteed in general; there are many ways to corrupt a UTF-8 
string, for example. And then there is the question of the length 
of a grapheme: a grapheme cluster can consist of many code points 
(combining marks and ZWJ sequences have no hard upper bound), with 
each code point encoded in a varying number of code units in UTF-8 
or UTF-16. So what data type do you use to represent a grapheme 
that is neither wasteful nor requires dynamic memory management?

Then there are other nasty quirks around graphemes: their encoding 
is not unique. This Unicode TR on normalization forms gives a good 
impression of how complex this single aspect is: 
https://unicode.org/reports/tr15/

So if you want to use graphemes, do you keep the original encoding, 
or do you implicitly convert to NFC or NFD? NFC tends to be better 
for language processing; NFD tends to be better for text rendering 
(with exceptions). If you don't normalize, semantically equivalent 
graphemes may compare unequal.
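Here is the non-unique-encoding problem in a short Python sketch 
(again illustrative, not D): the same grapheme "é" exists as one 
precomposed code point and as a two-code-point sequence, and naive 
comparison treats them as different strings until you normalize.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'é' as 'e' + combining acute (U+0301)

# Same grapheme, different code point sequences: unequal as-is.
print(precomposed == decomposed)  # False

# NFC composes, NFD decomposes; either way, both spellings converge.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Whichever form a library picks as its default silently changes the 
results of equality tests, hashing, and sorting.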

At this point you're probably approaching the complexity of 
libraries like ICU. You can take a look at it if you want a good 
scare. ;)
