The Case Against Autodecode

John Colvin via Digitalmars-d digitalmars-d at puremagic.com
Thu Jun 2 15:27:16 PDT 2016


On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
> On 6/2/2016 12:34 PM, deadalnix wrote:
>> On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
>>> Pretty much everything. Consider s and s1 string variables with
>>> possibly different encodings (UTF8/UTF16).
>>>
>>> * s.all!(c => c == 'ö') works only with autodecoding. It returns
>>> always false without.
>>>
>>
>> False. Many characters can be represented by different sequences of
>> codepoints. For instance, ê can be ê as one codepoint or ^ as a
>> modifier followed by e. ö is one such character.
>
> There are 3 levels of Unicode support. What Andrei is talking 
> about is Level 1.
>
> http://unicode.org/reports/tr18/tr18-5.1.html
>
> I wonder what rationale there is for Unicode to have two 
> different sequences of codepoints be treated as the same. It's 
> madness.

There are languages that make heavy use of diacritics, often 
several on a single "character". Hebrew is a good example. Should 
there be only one valid ordering of any given set of diacritics 
on any given character? It's an interesting idea, but it's not 
how things are.
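
To make the point concrete, here is a minimal sketch, assuming a Phobos where string ranges still autodecode. The helper name allOUmlaut is hypothetical, and the choice of "a" with an acute accent plus a dot below is only an illustration. It shows the check from the quoted example passing for precomposed ö, failing for o followed by a combining diaeresis, and passing again after std.uni.normalize. It also shows two orderings of the same marks comparing unequal until they are normalized.

```d
import std.algorithm.searching : all;
import std.uni : normalize, NFC, NFD;
import std.utf : byCodeUnit;

// Hypothetical helper: the check from the quoted example. With
// autodecoding, the lambda sees whole code points (dchar).
bool allOUmlaut(string s)
{
    return s.all!(c => c == '\u00F6');
}

void main()
{
    string precomposed = "\u00F6";  // ö as a single code point
    string decomposed  = "o\u0308"; // o followed by a combining diaeresis

    // Autodecoding matches only the precomposed spelling.
    assert(allOUmlaut(precomposed));
    assert(!allOUmlaut(decomposed));

    // After NFC normalization both spellings pass the check.
    assert(allOUmlaut(normalize!NFC(decomposed)));

    // Without decoding, the lambda sees UTF-8 code units and never matches.
    assert(!precomposed.byCodeUnit.all!(c => c == '\u00F6'));

    // Two orderings of the same marks on one base letter:
    // acute (U+0301) and dot below (U+0323).
    string acuteFirst = "a\u0301\u0323";
    string dotFirst   = "a\u0323\u0301";
    assert(acuteFirst != dotFirst);
    assert(normalize!NFD(acuteFirst) == normalize!NFD(dotFirst));
}
```

So several orderings are valid as input, and normalization is what maps them to a single canonical one before comparison.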

