The Case Against Autodecode

Thu Jun 2 15:27:16 PDT 2016

On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
> On 6/2/2016 12:34 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with
>>> possibly different encodings (UTF-8/UTF-16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. Without it,
>>> it always returns false.
>>>
>>
>> False. Many characters can be represented by different sequences of
>> codepoints. For instance, ê can be ê as one codepoint or e followed
>> by ^ as a combining modifier. ö is one such character.
>
> There are 3 levels of Unicode support. What Andrei is talking 
> about is Level 1.
>
> http://unicode.org/reports/tr18/tr18-5.1.html
>
> I wonder what rationale there is for Unicode to have two 
> different sequences of codepoints be treated as the same. It's 
> madness.

There are languages that make heavy use of diacritics, often 
several on a single "character". Hebrew is a good example. Should 
there be only one valid ordering of any given set of diacritics 
on any given character? It's an interesting idea, but it's not 
how things are.
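
To make the point concrete, here is a minimal sketch, assuming Phobos' std.uni normalization (normalize, NFC, NFD). The helper canonicallyEqual is a hypothetical name I made up for illustration, not something in Phobos. It shows that a codepoint-level comparison, which is what autodecoding gives you, only matches the precomposed ö, and that two diacritics applied in different orders compare equal only after normalization.

```d
import std.algorithm.searching : all;
import std.uni : normalize, NFC, NFD;

/// Hypothetical helper: true if the two strings are canonically
/// equivalent, checked by comparing their NFD forms.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

void main()
{
    string precomposed = "\u00F6";  // ö as the single codepoint U+00F6
    string decomposed  = "o\u0308"; // o followed by U+0308 COMBINING DIAERESIS

    // Codepoint-by-codepoint comparison matches only the precomposed form.
    assert(precomposed.all!(c => c == '\u00F6'));
    assert(!decomposed.all!(c => c == '\u00F6'));

    // After normalization the two spellings compare equal.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(canonicallyEqual(precomposed, decomposed));

    // Two diacritics on one base letter, written in either order:
    // U+0302 COMBINING CIRCUMFLEX ACCENT and U+0323 COMBINING DOT BELOW.
    string circumflexFirst = "e\u0302\u0323";
    string dotFirst        = "e\u0323\u0302";
    assert(circumflexFirst != dotFirst);                 // different codepoint sequences
    assert(canonicallyEqual(circumflexFirst, dotFirst)); // same character
}
```

So Unicode does not forbid the alternative orderings. It defines a canonical ordering instead, and normalization applies it before comparison.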