Fix Phobos dependencies on autodecoding
Gregor Mückl
gregormueckl at gmx.de
Thu Aug 15 19:05:32 UTC 2019
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
> From the examples above, most of the time you simply need
> opaque memory management, so decaying the
> string/dstring/wstring to a binary blob, but that's not string
> processing
This is the point we're trying to get across to you: this isn't
sufficient. Depending on the context and the script/language, you
need access to the string at various levels. E.g. a font renderer
sometimes needs to iterate over code points rather than graphemes
in order to compose the correct glyphs.
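To make the levels concrete, here is a minimal sketch in D
(byGrapheme lives in std.uni, walkLength in std.range; the counts
assume well-formed UTF-8):

    import std.uni;
    import std.range;

    void main()
    {
        // "o" followed by a combining diaeresis
        string s = "o\u0308";
        assert(s.length == 3);                // UTF-8 code units
        assert(s.walkLength == 2);            // code points (via autodecoding)
        assert(s.byGrapheme.walkLength == 1); // one grapheme
    }

The same buffer has three different lengths depending on which
level you ask at, and each level is the right answer for some task.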
Binary blob comparisons for comparing strings are *also* not
sufficient, again depending on both script/language of the text
in the string and the context in which the comparison is
performed. If the comparison is to be purely semantic, the
following strings should be equal: "\u00f6" and "\u006f\u0308".
They both represent the same "Latin Small Letter O with
Diaeresis". Their in-memory representations clearly aren't equal,
so a memcmp won't yield the correct result. The same applies to
sorting.
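In D terms, a semantic comparison has to run both sides through
std.uni's normalize first; a small sketch (NFC chosen here, either
canonical form would do):

    import std.uni;

    void main()
    {
        string precomposed = "\u00F6";       // ö as a single code point
        string decomposed  = "\u006F\u0308"; // o + combining diaeresis
        assert(precomposed != decomposed);   // raw bytes differ
        assert(normalize!NFC(precomposed) ==
               normalize!NFC(decomposed));   // semantically equal
    }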
If you decide to force a specific string normalization
internally, you put the burden on the user to explicitly select a
different normalization whenever they require one. Worse, there is
no way to perfectly reconstruct the binary representation of the
input string if it was given in a non-normalized form (say, a mix
of NFC and NFD). Once such a string has been through a
normalization algorithm, the exact input is unrecoverable. This
makes interfacing with other code that has idiosyncrasies around
all of this difficult or outright impossible.
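A small sketch of that lossiness, using a made-up input that mixes
NFC and NFD forms of the same character:

    import std.uni;

    void main()
    {
        // First ö precomposed (NFC), second one decomposed (NFD)
        string original = "\u00F6\u006F\u0308";
        string canon = normalize!NFC(original);
        assert(canon == "\u00F6\u00F6"); // both precomposed now
        assert(canon != original);       // the original byte mix is gone for good
    }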
One such system that I worked on in the past was a small embedded,
microcontroller-driven HCI module with very limited capabilities,
but with the requirement to be multilingual. I carefully worked
out that for the languages that were required, a UTF-8 encoding
with a very specific normalization would just about work. This
choice was only viable because the user interface was created in a
custom tool where I could control the code and data generation
just enough to make it work.
Another case where normalization is troublesome is ligatures.
Purely stylistic ligatures like "ff", "fi", "ffi", "ffl" and "st"
have their own code points, yet whether to use them is entirely a
stylistic choice. So in terms of the contained text, the ligature
\ufb00 is equal to the string "ff", but it is not the same
grapheme. Whether you can normalize this away depends on the
context: the user may have selected the ligature representation
deliberately to have it appear as such on screen. If, on the other
hand, you want to do spell checking, you need to resolve the
ligature to its individual letters.
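Unicode models exactly this split as canonical vs. compatibility
normalization; roughly, in D:

    import std.uni;

    void main()
    {
        string lig = "\uFB00"; // LATIN SMALL LIGATURE FF
        // Canonical normalization preserves the user's ligature choice
        assert(normalize!NFC(lig) == "\uFB00");
        // Compatibility normalization folds it to plain letters,
        // which is what a spell checker wants
        assert(normalize!NFKC(lig) == "ff");
    }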
And then there is Hangul: this is a prime example of a writing
system that is "weird" to westerners. It is based on 40 symbols
(19 consonants, 21 vowels) which aren't written individually, but
merged syllable by syllable into rectangular blocks of two or
three such symbols. These symbols are arranged in different
layouts depending on which symbols make up the syllable. As far as
I understand, this follows a clear algorithm. The result is
thousands of distinct syllable blocks that are actually written
(Unicode encodes 11,172 precomposed ones). Yet each of these is a
group of two or three letters and is read as such. So depending on
whether you're interested in individual letters or in syllables,
you need a different string representation for processing that
language.
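A sketch of that duality in D (the jamo code points below are the
standard canonical decomposition of U+D55C; whether byGrapheme
applies the full Hangul clustering rules is worth verifying
against your Phobos version):

    import std.uni;
    import std.range;

    void main()
    {
        string han = "\uD55C"; // 한, one precomposed syllable block
        // NFD splits the block into its individual jamo letters
        assert(normalize!NFD(han) == "\u1112\u1161\u11AB");
        // Grapheme segmentation still sees a single syllable either way
        assert(han.byGrapheme.walkLength == 1);
        assert(normalize!NFD(han).byGrapheme.walkLength == 1);
    }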
OK, these are all just examples that came to my mind while
brainstorming the question a little bit. However, none of us are
experts in language processing, so whatever examples we can come
up with are very likely just the very tip of the iceberg.
There is a reason why libraries like ICU give the user a lot of
control over string handling and expose many variants of their
functions depending on user intent and context. This design rests
on a body of expert knowledge that we don't have, but we know that
it is sound. Going against that wisdom is inviting trouble.
Autodecoding is an example of doing just that.