Making all strings UTF ranges has some risk of WTF

Thu Feb 4 17:54:06 PST 2010

Rainer Deyke wrote:
> Andrei Alexandrescu wrote:
>> One idea I've had for a while was to have a universal string type:
>>
>> struct UString {
>>     union {
>>         char[] utf8;
>>         wchar[] utf16;
>>         dchar[] utf32;
>>     }
>>     enum Discriminator { utf8, utf16, utf32 };
>>     Discriminator kind;
>>     IntervalTree!(size_t) skip;
>>     ...
>> }
>>
>> The IntervalTree stores the skip amounts that must be added for a given
>> index in the string. For ASCII strings that would be null. Then its size
>> grows with the number of multibyte characters. Beyond a threshold,
>> representation is transparently switched to utf16 or utf32 as needed and
>> the tree becomes smaller or null again.
> 
> Although I see some potential in a universal string type, I don't think
> this is the right implementation strategy.  I'd rather have my short
> strings in utf-32 (optimized for speed) and my long strings in
> utf-8/utf-16 (optimized for memory usage).

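The definition I outlined neither specifies nor constrains the strategy
for changing the discriminator. A policy that keeps short strings in
UTF-32 and long strings in UTF-8/UTF-16 is compatible with it.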
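
To illustrate that the representation and the switching strategy can be
kept separate, here is a minimal sketch, not a worked-out design. The
names Kind, ShortWidePolicy, SkipGrowthPolicy, the threshold of 16 code
points and the one-in-four ratio are all assumptions made for the
example. It also takes an immutable string rather than char[] and leaves
out the IntervalTree skip index entirely.

```d
import std.utf : toUTF16, toUTF32;

enum Kind { utf8, utf16, utf32 }

// Hypothetical policy: short strings in UTF-32 (speed),
// long strings in UTF-8 (memory). The cutoff is an assumed value.
struct ShortWidePolicy
{
    enum size_t threshold = 16; // in code points

    static Kind choose(size_t codePoints, size_t nonAscii)
    {
        return codePoints <= threshold ? Kind.utf32 : Kind.utf8;
    }
}

// Hypothetical policy: stay in UTF-8 while non-ASCII code points are
// rare, switch to UTF-32 once they make up a quarter or more.
struct SkipGrowthPolicy
{
    static Kind choose(size_t codePoints, size_t nonAscii)
    {
        if (nonAscii == 0)
            return Kind.utf8;
        return nonAscii * 4 < codePoints ? Kind.utf8 : Kind.utf32;
    }
}

// The storage layout is the same for every policy; only the decision
// about which union member to populate is delegated.
struct UString(Policy)
{
    union
    {
        immutable(char)[]  utf8;
        immutable(wchar)[] utf16;
        immutable(dchar)[] utf32;
    }
    Kind kind;

    this(string s)
    {
        // Count code points and how many of them are non-ASCII.
        size_t total, nonAscii;
        foreach (dchar c; s)
        {
            ++total;
            if (c >= 0x80)
                ++nonAscii;
        }

        kind = Policy.choose(total, nonAscii);
        final switch (kind)
        {
            case Kind.utf8:  utf8  = s;          break;
            case Kind.utf16: utf16 = toUTF16(s); break;
            case Kind.utf32: utf32 = toUTF32(s); break;
        }
    }
}
```

With this split, UString!ShortWidePolicy("hi") ends up stored as UTF-32,
while UString!SkipGrowthPolicy("hi") stays in UTF-8. The struct itself
is unchanged in both cases.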

Andrei