Fix Phobos dependencies on autodecoding
Gregor Mückl
gregormueckl at gmx.de
Wed Aug 14 09:29:30 UTC 2019
On Wednesday, 14 August 2019 at 07:15:54 UTC, Argolis wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
>
>> But we can't make that the default because it's a big
>> performance hit, and many string algorithms don't actually
>> need grapheme segmentation.
>
> Can you provide example of algorithms and use cases that don't
> need grapheme segmentation?
> Are they really SO common that the correct default is go for
> code points?
>
> Is it not better to have as a default the grapheme
> segmentation, the correct way of handling a string, instead?
There is no single universally correct way to segment a string.
Grapheme segmentation requires a correct assumption about the text
encoding of the string, and also the assumption that the encoding
is flawless. Neither is guaranteed in general: there are many ways
to corrupt UTF-8 strings, for example. And then there is the
question of the length of a grapheme: IIRC they can consist of
up to 6 or 7 code points, with each of them encoded in a varying
number of bytes in UTF-8, UTF-16 or UCS-2. So what data type do
you use for representing graphemes then that is both not wasteful
and doesn't require dynamic memory management?
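To make that concrete, here is a small sketch (in Python purely for
illustration; the same holds for D's char/wchar/dchar ranges): one
user-perceived character can span several code points, its byte
length varies per encoding, and decoding cannot assume the input is
well-formed:

```python
# A single user-perceived character ("e" with an acute accent built
# from a combining sequence) spans multiple code points and a varying
# number of bytes depending on the encoding.
s = "e\u0301"                        # 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(s))                        # 2 code points, one grapheme on screen
print(len(s.encode("utf-8")))        # 3 bytes in UTF-8
print(len(s.encode("utf-16-le")))    # 4 bytes in UTF-16

# Decoding cannot assume well-formed input: 0xC3 starts a two-byte
# UTF-8 sequence, but 0x28 is not a valid continuation byte.
try:
    b"\xc3\x28".decode("utf-8")
except UnicodeDecodeError as e:
    print("corrupt UTF-8:", e.reason)
```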
Then there are other nasty quirks around graphemes: their
encoding is not unique. This Unicode TR gives a good impression
of how complex this single aspect is:
https://unicode.org/reports/tr15/
So if you want to use graphemes, do you want to keep the original
encoding or do you implicitly convert them to NFC or NFD? NFC
tends to be better for language processing, NFD tends to be
better for text rendering (with exceptions). If you don't
normalize, semantically equivalent graphemes may not compare
equal.
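For example (again a Python sketch, using the stdlib unicodedata
module; D's std.uni offers analogous normalization routines), the
precomposed and decomposed encodings of the same grapheme compare
unequal until both are brought to a common normalization form:

```python
import unicodedata

precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT

# Same grapheme, different encodings: plain comparison fails.
print(precomposed == decomposed)  # False

# Normalizing both sides to NFC (or both to NFD) restores equality.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)             # True

# NFD goes the other way: one precomposed code point becomes two.
print(len(unicodedata.normalize("NFD", precomposed)))  # 2
```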
At this point you're probably approaching the complexity of
libraries like ICU. You can take a look at it if you want a good
scare. ;)