The Case Against Autodecode

Fri Jun 3 03:10:18 PDT 2016

On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> However, this meant that some precomposed characters were
>> "redundant": they represented character + diacritic combinations
>> that could equally well be expressed separately. Normalization was
>> the inevitable consequence.
>
> It is not inevitable. Simply disallow the 2 codepoint sequences 
> - the single one has to be used instead.
>
> There is precedent. Some characters can be encoded with more 
> than one UTF-8 sequence, and the longer sequences were declared 
> invalid. Simple.
>
> I.e. have the normalization up front when the text is created 
> rather than everywhere else.

I don't think that would work (or at least, the analogy doesn't 
hold). It would mean that you could never add new precomposed 
characters, because each addition would make a previously valid 
base + combining sequence invalid.
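
To make the difference concrete, here is a rough Python sketch (not D, only because the standard library makes it short). The helper names same_text and is_valid_utf8 are made up for illustration. It shows that the precomposed and decomposed spellings of "é" are both valid today and only compare equal after normalization, whereas an overlong UTF-8 sequence is simply rejected by the decoder.

```python
import unicodedata

# Two spellings of the same user-perceived character.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Both spellings are valid strings, but they differ code point by code point.
print(precomposed == decomposed)           # raw comparison
print(same_text(precomposed, decomposed))  # comparison after normalization

# The UTF-8 precedent: b"\xc0\xaf" is an overlong encoding of "/" and is
# rejected outright, while the one-byte form is accepted.
print(is_valid_utf8(b"\xc0\xaf"))
print(is_valid_utf8(b"/"))
```

The overlong forms were a fixed, closed set that could be banned once and for all. Combining sequences are open-ended: under the "disallow the 2 codepoint sequence" rule, every newly added precomposed character would retroactively invalidate text that was valid the day before.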