Making all strings UTF ranges has some risk of WTF
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Thu Feb 4 15:16:55 PST 2010
Rainer Deyke wrote:
> Don wrote:
>> I suspect that string, wstring should have been the primary types and
>> had a .codepoints property, which returned a ubyte[] resp. ushort[]
>> reference to the data. It's too late, of course. The extra value you get
>> by having a specific type for 'this is a code point for a UTF8 string'
>> seems to be very minor, compared to just using a ubyte.
>
> If it's not too late to completely change the semantics of char[], then
> it's also not too late to dump 'char' completely. If it /is/ too late
> to remove 'char', then 'char[]' should retain the current semantics and
> a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type:
struct UString {
    union {
        char[] utf8;
        wchar[] utf16;
        dchar[] utf32;
    }
    enum Discriminator { utf8, utf16, utf32 }
    Discriminator kind;
    IntervalTree!(size_t) skip;
    ...
}
The IntervalTree stores the skip amounts that must be added to a given
code-point index to reach the corresponding byte offset in the string.
For ASCII strings the tree would be null; its size grows with the number
of multibyte characters. Beyond a threshold, the representation is
transparently switched to utf16 or utf32 as needed, and the tree becomes
smaller or null again.
In an advanced implementation the discriminator and the tree could be
stored at a negative offset, and the tree could be compressed by taking
advantage of its limited size. That would make UString quite
low-overhead while offering a staunchly dchar-based interface.
I don't mind at all using string, but I also think UString would be a
good extra abstraction.
Andrei