The Case Against Autodecode
Marc Schütz via Digitalmars-d
digitalmars-d at puremagic.com
Thu Jun 2 04:24:53 PDT 2016
On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 06:25 AM, Marc Schütz wrote:
>> On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
>>> The point is to operate on representation-independent entities
>>> (Unicode code points) instead of low-level representation-specific
>>> artifacts (code units).
>>
>> _Both_ are low-level representation-specific artifacts.
>
> Maybe this is a misunderstanding. Representation = how things
> are laid out in memory. What does associating numbers with
> various Unicode symbols have to do with representation? --
Ok, if you define it that way, sure. I was thinking in terms of
the actual text: Unicode is a way to represent that text using a
variety of low-level representations: UTF8/NFC, UTF8/NFD,
unnormalized UTF8, UTF16 (big or little endian, times the
normalization forms), UTF32 (likewise), and some other more
obscure ones. From that viewpoint, auto-decoded char[] (= UTF8)
is equivalent to dchar[] (= UTF32). Neither of them is the actual
text.
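A minimal sketch of that equivalence (assuming a Phobos with
auto-decoding enabled, as it is today): iterating the char[] and
the dstring yields the same code points, and neither length tells
you how many units of writing there are.

import std.algorithm.comparison : equal;

void main()
{
    string  s = "caf\u00E9";   // UTF8:  5 code units
    dstring d = "caf\u00E9"d;  // UTF32: 4 code units

    // Auto-decoding turns the char[] into a range of dchar,
    // so both ranges yield the same sequence of code points:
    assert(equal(s, d));

    assert(s.length == 5); // code units, not code points
    assert(d.length == 4); // code points, still not graphemes
}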
Both writing and the memory representation consist of fundamental
units. But there is no 1:1 relationship between the units of
char[] (UTF8 code units) or auto-decoded strings (Unicode code
points) on the one hand, and the units of writing (graphemes) on
the other.
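To see the mismatch concretely (a small sketch using
std.uni.byGrapheme; the literal chosen here is just one example
of a combining sequence):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0308 COMBINING DIAERESIS: one unit of
    // writing, two code points, three UTF8 code units.
    string s = "e\u0308";

    assert(s.length == 3);                // UTF8 code units
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // graphemes
}

So whichever of the two levels you decode to, you still haven't
reached the text the user sees.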