Fix Phobos dependencies on autodecoding
Argolis
argolis at gmail.com
Thu Aug 15 10:26:12 UTC 2019
On Wednesday, 14 August 2019 at 09:29:30 UTC, Gregor Mückl wrote:
> There is no single universally correct way to segment a string.
> Grapheme segmentation requires a correct assumption of the text
> encoding in the string and also the assumption that the
> encoding is flawless. Neither may be guaranteed in general.
> There is a lot of ways to corrupt UTF-8 strings, for example.
Do you mean that there is no way to verify those assumptions? Sorting algorithms in Phobos return a SortedRange, so a verified property can be encoded in the type.
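As a concrete illustration of checking the "encoding is flawless" assumption (a Python sketch, since the point is language-agnostic; Phobos has std.utf.validate for the same purpose):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check whether a byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A well-formed string passes.
print(is_valid_utf8("héllo".encode("utf-8")))  # True

# 0xC3 starts a two-byte sequence, but 0x28 ('(') is not a
# valid continuation byte, so this input is corrupt UTF-8.
print(is_valid_utf8(b"\xc3\x28"))  # False
```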
> And then there is a question of the length of a grapheme: IIRC
> they can consist of up to 6 or 7 code points with each of them
> encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2.
> So what data type do you use for representing graphemes then
> that is both not wasteful and doesn't require dynamic memory
> management?
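The size variability mentioned above is easy to demonstrate. A ZWJ "family" emoji is a single user-perceived character built from seven code points (a Python sketch; the counts are defined by Unicode, not by any one language):

```python
# One grapheme cluster: 4 emoji code points joined by 3
# zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family))                      # 7 code points
print(len(family.encode("utf-8")))      # 25 bytes in UTF-8
print(len(family.encode("utf-16-le")))  # 22 bytes in UTF-16 (surrogate pairs)
```

So a fixed-size grapheme type would have to reserve dozens of bytes for the worst case, which is the wastefulness trade-off raised above.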
Is performance the rationale for not using dynamic memory
management, even if it is unavoidable for correct behaviour?
> Then there are other nasty quirks around graphemes: their
> encoding is not unique. This Unicode TR gives a good impression
> of how complex this single aspect is:
> https://unicode.org/reports/tr15/
> So if you want to use graphemes, do you want to keep the
> original encoding or do you implicitly convert them to NFC or
> NFD? NFC tends to be better for language processing, NFD tends
> to be better for text rendering (with exceptions). If you don't
> normalize, semantically equivalent graphemes may not be equal
> under comparison.
Is performance the rationale for not using normalisation, which
would solve all the problems you mentioned above?
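To make the non-unique-encoding point concrete (a Python sketch using the standard unicodedata module; Phobos offers normalisation in std.uni):

```python
import unicodedata

nfc = "\u00E9"   # 'é' as one precomposed code point (NFC form)
nfd = "e\u0301"  # 'e' followed by U+0301 combining acute accent (NFD form)

# Semantically the same grapheme, but unequal without normalisation.
print(nfc == nfd)                                # False

# Normalising both sides to the same form restores equality.
print(unicodedata.normalize("NFD", nfc) == nfd)  # True
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```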
> At this point you're probably approaching the complexity of
> libraries like ICU. You can take a look at it if you want a
> good scare. ;)
The original question still hasn't been answered: can you provide
examples of algorithms and use cases that don't need grapheme
segmentation?
More information about the Digitalmars-d
mailing list