Fix Phobos dependencies on autodecoding

Argolis argolis at gmail.com
Thu Aug 15 10:26:12 UTC 2019


On Wednesday, 14 August 2019 at 09:29:30 UTC, Gregor Mückl wrote:

> There is no single universally correct way to segment a string. 
> Grapheme segmentation requires a correct assumption of the text 
> encoding in the string and also the assumption that the 
> encoding is flawless. Neither may be guaranteed in general. 
> There is a lot of ways to corrupt UTF-8 strings, for example.

Do you mean that there's no way to verify those assumptions?
Sorting algorithms in Phobos return a SortedRange, so a checked 
assumption can be carried in the type.
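
A minimal sketch of that idea, in Python rather than D for brevity: 
validate the encoding once, then hand out a wrapper type whose mere 
existence records that the check passed. `ValidUtf8` and 
`try_validate` are hypothetical names for illustration, not Phobos 
or standard library APIs.

```python
class ValidUtf8:
    """Hypothetical wrapper: an instance exists only if decoding succeeded."""

    def __init__(self, raw: bytes):
        # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
        self.text = raw.decode("utf-8", errors="strict")


def try_validate(raw: bytes):
    """Return a ValidUtf8 on success, or None if the bytes are corrupt."""
    try:
        return ValidUtf8(raw)
    except UnicodeDecodeError:
        return None


good = try_validate("Mückl".encode("utf-8"))  # well-formed input
bad = try_validate(b"\xc3\x28")               # lead byte without a continuation byte
```

Code that receives a `ValidUtf8` does not need to re-check the 
encoding, much as code that receives a SortedRange does not need to 
re-sort.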

> And then there is a question of the length of a grapheme: IIRC 
> they can consist of up to 6 or 7 code points with each of them 
> encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. 
> So what data type do you use for representing graphemes then 
> that is both not wasteful and doesn't require dynamic memory 
> management?

Is performance the rationale for avoiding dynamic memory 
management, even if it is unavoidable for correct behaviour?

> Then there are other nasty quirks around graphemes: their 
> encoding is not unique. This Unicode TR gives a good impression 
> of how complex this single aspect is: 
> https://unicode.org/reports/tr15/
> So if you want to use graphemes, do you want to keep the 
> original encoding or do you implicitly convert them to NFC or 
> NFD? NFC tends to be better for language processing, NFD tends 
> to be better for text rendering (with exceptions). If you don't 
> normalize, semantically equivalent graphemes may not be equal 
> under comparison.

Is performance the rationale for not normalising, when 
normalisation solves all the problems you have mentioned above?
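
To illustrate the comparison problem from the quoted text, here is 
a small sketch, again in Python rather than D, using the standard 
`unicodedata` module. The helper `equal_normalized` is a 
hypothetical name; the two strings are the usual precomposed and 
decomposed spellings of "é".

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent


def equal_normalized(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


raw_equal = composed == decomposed                    # code point comparison
norm_equal = equal_normalized(composed, decomposed)   # comparison after NFC
```

Without normalisation the two spellings compare unequal; after 
converting both to the same form (NFC here, NFD works as well) they 
compare equal.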

> At this point you're probably approaching the complexity of 
> libraries like ICU. You can take a look at it if you want a 
> good scare. ;)

The original question is still unanswered: can you provide 
examples of algorithms and use cases that don't need grapheme 
segmentation?


