Making all strings UTF ranges has some risk of WTF

Thu Feb 4 17:15:29 PST 2010

Andrei Alexandrescu wrote:
> Rainer Deyke wrote:
>> Don wrote:
>>> I suspect that string, wstring should have been the primary types and
>>> had a .codepoints property, which returned a ubyte[] resp. ushort[]
>>> reference to the data. It's too late, of course. The extra value you get
>>> by having a specific type for 'this is a code point for a UTF8 string'
>>> seems to be very minor, compared to just using a ubyte.
>>
>> If it's not too late to completely change the semantics of char[], then
>> it's also not too late to dump 'char' completely.  If it /is/ too late
>> to remove 'char', then 'char[]' should retain the current semantics and
>> a new string type should be added for the new semantics.
> 
> One idea I've had for a while was to have a universal string type:
> 
> struct UString {
>     union {
>         char[] utf8;
>         wchar[] utf16;
>         dchar[] utf32;
>     }
>     enum Discriminator { utf8, utf16, utf32 };
>     Discriminator kind;
>     IntervalTree!(size_t) skip;
>     ...
> }

You mean like this?
http://www.dprogramming.com/mtext.php
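
For readers following along, here is a minimal sketch of the tagged-union idea quoted above. It is not MText's actual API and not Andrei's full design: it assumes immutable payloads (string/wstring/dstring rather than the mutable arrays in the quote), leaves out the IntervalTree skip index entirely, and the names Kind, codePoints and asUTF8 are made up for illustration.

```d
import std.range : walkLength;
static import std.utf;

struct UString
{
    // Hypothetical discriminator; the quoted design calls it Discriminator.
    enum Kind { utf8, utf16, utf32 }

    // Exactly one of these is valid at a time, selected by `kind`.
    private union
    {
        immutable(char)[]  utf8;
        immutable(wchar)[] utf16;
        immutable(dchar)[] utf32;
    }
    private Kind kind;

    this(string s)  { utf8  = s; kind = Kind.utf8;  }
    this(wstring s) { utf16 = s; kind = Kind.utf16; }
    this(dstring s) { utf32 = s; kind = Kind.utf32; }

    // Number of code points, independent of the stored encoding.
    // O(n) for UTF-8 and UTF-16 because there is no skip index here.
    size_t codePoints() const
    {
        final switch (kind)
        {
            case Kind.utf8:  return utf8.walkLength;   // decodes while walking
            case Kind.utf16: return utf16.walkLength;  // decodes while walking
            case Kind.utf32: return utf32.length;      // one unit per code point
        }
    }

    // Transcode to UTF-8 whatever the stored encoding is.
    string asUTF8() const
    {
        final switch (kind)
        {
            case Kind.utf8:  return utf8;
            case Kind.utf16: return std.utf.toUTF8(utf16);
            case Kind.utf32: return std.utf.toUTF8(utf32);
        }
    }
}
```

With this, UString("héllo"), UString("héllo"w) and UString("héllo"d) all report the same codePoints even though their code unit counts differ. The skip index in the quoted design would presumably exist to make indexing by code point cheaper than the linear walk used here.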