Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Feb 4 15:16:55 PST 2010


Rainer Deyke wrote:
> Don wrote:
>> I suspect that string, wstring should have been the primary types and
>> had a .codepoints property, which returned a ubyte[] resp. ushort[]
>> reference to the data. It's too late, of course. The extra value you get
>> by having a specific type for 'this is a code point for a UTF8 string'
>> seems to be very minor, compared to just using a ubyte.
> 
> If it's not too late to completely change the semantics of char[], then
> it's also not too late to dump 'char' completely.  If it /is/ too late
> to remove 'char', then 'char[]' should retain the current semantics and
> a new string type should be added for the new semantics.

One idea I've had for a while is to have a universal string type:

struct UString {
    // Discriminated union: only the member selected by 'kind' is valid.
    union {
        char[] utf8;
        wchar[] utf16;
        dchar[] utf32;
    }
    enum Discriminator { utf8, utf16, utf32 }
    Discriminator kind;
    // Skip amounts for translating code-point indices into code-unit
    // indices; empty for ASCII-only content.
    IntervalTree!(size_t) skip;
    ...
}

The IntervalTree stores the skip amounts that must be added to translate 
a code-point index into a code-unit index. For ASCII-only strings the 
tree is null, and its size grows with the number of multibyte 
characters. Beyond a threshold, the representation is transparently 
switched to utf16 or utf32 as needed, and the tree becomes smaller or 
null again.
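A minimal sketch of the indexing idea, using a flat sorted array where 
the real thing would use a (compressed) interval tree; the names here 
are illustrative, not a worked-out design:

struct SkipEntry {
    size_t index; // code-point index of a multibyte character
    size_t extra; // continuation bytes it contributes (1 to 3 in UTF-8)
}

// Translate a code-point index into a UTF-8 byte offset by adding the
// skip amounts of all multibyte characters that precede it.
size_t byteOffset(const(SkipEntry)[] skips, size_t cpIndex)
{
    size_t off = cpIndex; // one byte per code point if all ASCII
    foreach (e; skips)
    {
        if (e.index >= cpIndex) break;
        off += e.extra;
    }
    return off;
}

unittest
{
    // "aé b": 'é' is code point 1 and occupies 2 bytes in UTF-8.
    auto skips = [SkipEntry(1, 1)];
    assert(byteOffset(skips, 0) == 0); // 'a' at byte 0
    assert(byteOffset(skips, 1) == 1); // 'é' starts at byte 1
    assert(byteOffset(skips, 2) == 3); // ' ' starts at byte 3
}

For ASCII-only content the array is empty and indexing degenerates to 
off == cpIndex, so the cost is paid only where multibyte characters 
actually occur.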

In an advanced implementation, the discriminator and the tree could be 
stored at a negative offset from the string data, and the tree could be 
compressed by taking advantage of its limited size. That would make 
UString quite low-overhead while offering a staunchly dchar-based 
interface.
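A rough sketch of the negative-offset trick, assuming the metadata and 
the payload share a single allocation (Header, allocUtf8, and kindOf 
are hypothetical helpers, not part of any actual design):

enum Kind : ubyte { utf8, utf16, utf32 }

struct Header {
    Kind kind; // discriminator stored just before the character data
    // a compressed skip tree could live here as well
}

// Allocate header and payload together; hand out only the payload slice.
char[] allocUtf8(size_t len)
{
    auto buf = new ubyte[Header.sizeof + len];
    (cast(Header*) buf.ptr).kind = Kind.utf8;
    return cast(char[]) buf[Header.sizeof .. $];
}

// Recover the discriminator at a negative offset from the payload pointer.
Kind kindOf(const(char)[] s)
{
    return (cast(const(Header)*) (s.ptr - Header.sizeof)).kind;
}

With that layout UString itself shrinks to a single slice; the price is 
that such strings must come from the dedicated allocator.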

I don't mind using string at all, but I also think UString would be a 
good extra abstraction.


Andrei


