The Case Against Autodecode

Marc Schütz via Digitalmars-d digitalmars-d at puremagic.com
Thu Jun 2 04:24:53 PDT 2016


On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu 
wrote:
> On 06/01/2016 06:25 AM, Marc Schütz wrote:
>> On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
>> wrote:
>>> The point is to operate on representation-independent entities
>>> (Unicode code points) instead of low-level 
>>> representation-specific
>>> artifacts (code units).
>>
>> _Both_ are low-level representation-specific artifacts.
>
> Maybe this is a misunderstanding. Representation = how things 
> are laid out in memory. What does associating numbers with 
> various Unicode symbols have to do with representation?

Ok, if you define it that way, sure. I was thinking in terms of 
the actual text: Unicode offers a variety of low-level 
representations for that text: UTF8/NFC, UTF8/NFD, unnormalized 
UTF8, UTF16 (big or little endian) times each normalization form, 
UTF32 times each normalization form, and some more obscure ones. 
From that viewpoint, auto-decoded char[] (= UTF8) is equivalent 
to dchar[] (= UTF32). Neither of them is the actual text.
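
Here is a quick (untested) sketch of what I mean, using Phobos's 
std.uni.normalize; the string literal is of course arbitrary:

import std.algorithm.comparison : equal;
import std.stdio : writeln;
import std.uni : NFC, NFD, normalize;

void main()
{
    string s = "\u00E9"; // "é", precomposed

    string nfc = s.normalize!NFC; // stays U+00E9: one code point
    string nfd = s.normalize!NFD; // U+0065 U+0301: two code points

    // Auto-decoding iterates both as ranges of dchar, yet they
    // compare unequal, even though they encode the same text.
    writeln(nfc.equal(nfd)); // false
}

If code points were the text, two strings denoting the same text 
would compare equal under auto-decoding. They don't.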

Both writing and the memory representation consist of fundamental 
units. But there is no 1:1 relationship between the units of 
char[] (UTF8 code units) or auto-decoded strings (Unicode code 
points) on the one hand, and the units of writing (graphemes) on 
the other.
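
A sketch counting the same string at all three levels (byCodeUnit 
and byGrapheme are Phobos's std.utf and std.uni respectively; the 
string is just an example):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "e\u0301"; // 'e' plus combining acute accent: "é"

    writeln(s.byCodeUnit.walkLength); // 3 UTF8 code units
    writeln(s.walkLength);            // 2 code points (auto-decoded)
    writeln(s.byGrapheme.walkLength); // 1 grapheme: the unit of writing
}

Whether you count code units or auto-decoded code points, you are 
not counting units of writing.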
