The Case Against Autodecode
Chris via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 3 05:04:39 PDT 2016
On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:
> On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via
> Digitalmars-d wrote:
>> On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
>> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> >> However, this meant that some precomposed characters were
>> >> "redundant": they represented character + diacritic
>> >> combinations that could equally well be expressed separately.
>> >> Normalization was the inevitable consequence.
>> >
>> > It is not inevitable. Simply disallow the 2 codepoint
>> > sequences - the single one has to be used instead.
>> >
>> > There is precedent. Some characters can be encoded with more
>> > than one UTF-8 sequence, and the longer sequences were
>> > declared invalid. Simple.
>> >
>> > I.e. have the normalization up front when the text is
>> > created rather than everywhere else.
>>
>> I don't think it would work (or at least, the analogy doesn't
>> hold). It would mean that you can't add new precomposed
>> characters, because that means that previously valid sequences
>> are now invalid.
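The redundancy under discussion is easy to see concretely: "é" exists both as a single precomposed code point and as a base letter plus combining accent, and normalization maps between the two. A short sketch, using Python's unicodedata module for illustration:

```python
import unicodedata

# "é" as one precomposed code point vs. "e" + combining acute accent.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The two are distinct code point sequences, yet render identically...
assert precomposed != combining

# ...and the canonical normalization forms convert between them:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

If Unicode later added a new precomposed character, previously valid combining sequences for it would retroactively become "the disallowed spelling" under Walter's scheme, which is Vladimir's objection.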
>
> I would have argued that no precomposed characters should ever
> have existed, regardless of what was done in previous encodings,
> since they're redundant. You need the decomposed, combining
> characters to avoid a combinatorial explosion of characters, so
> you can't have characters that exist only in precomposed form
> and still be consistent. However, the Unicode folks obviously
> didn't go that route. But given where we sit now, even though
> we're stuck with some precomposed characters, I'd argue that we
> should at least never add any new ones. But who knows what the
> Unicode folks will actually do.
>
> As it is, you should probably normalize strings in many cases
> when they enter the program, just as, ideally, you'd validate
> them on entry. But regardless, you have to
> deal with the fact that multiple normalization schemes exist
> and that there's no guarantee that you're even going to get
> valid Unicode, let alone Unicode that's normalized the way you
> want.
>
> - Jonathan M Davis
I do exactly this. Validate and normalize.
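The validate-then-normalize boundary that Jonathan recommends and Chris practices can be sketched in a few lines. This is an illustrative Python sketch (function name `ingest` and the choice of NFC are assumptions, not anything from the thread):

```python
import unicodedata

def ingest(raw: bytes) -> str:
    # Validate: strict UTF-8 decoding (Python's default) raises
    # UnicodeDecodeError on malformed input, including overlong sequences.
    text = raw.decode("utf-8")
    # Normalize once at the boundary so the rest of the program can assume
    # a single canonical form. NFC is used here; NFD, NFKC, and NFKD are
    # the other standard forms, and which one you want is app-specific.
    return unicodedata.normalize("NFC", text)

# "e" + combining acute arrives on the wire; after ingest it is the
# single precomposed code point U+00E9.
print(ingest("e\u0301".encode("utf-8")))  # é
```

Doing this once at the edges is cheaper than defending against mixed normalization forms throughout the program.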