Making all strings UTF ranges has some risk of WTF
Michael Rynn
michaelrynn at optusnet.com.au
Fri Feb 5 04:27:12 PST 2010
On Thu, 04 Feb 2010 18:41:48 -0700, Rainer Deyke wrote:
> Andrei Alexandrescu wrote:
>> One idea I've had for a while was to have a universal string type:
>>
>> struct UString {
>>     union {
>>         char[] utf8;
>>         wchar[] utf16;
>>         dchar[] utf32;
>>     }
>>     enum Discriminator { utf8, utf16, utf32 };
>>     Discriminator kind;
>>     IntervalTree!(size_t) skip;
>>     ...
>> }
>>
Firstly, for such "augmented types" in D, such as strings, bignums, or any
future ideas, it is great to have the facilities for creating them as
structs, so that they can be used elsewhere without regard to whether
they are built in as compiler specials or live in the library.
What is there for structs now is good and getting better in D2, but I
still feel a little insecure about how to make a really optimal
implementation that is as good as a built-in type that the compiler
understands. The DPL has been a help with this.
Programmers will want to use raw char[], wchar[] and dchar[] for whatever
reasons, with their "simple" behaviours, so these should not be made
unavailable just because more sophisticated types can be created purely
for Unicode strings.
I have made a UString implementation similar to the above, but I played a
different trick. I also wanted it to maintain a terminating null
character, so that it can be passed to Windows API functions, in
particular the 16-bit W interfaces.

struct UString_char {
    char[] str_;
    /// ... lots of good D type stuff: constructor and assign
    /// conversions, access
    size_t length() {
        // hide terminating null; an uninitialised string has length 0
        return str_.length ? str_.length - 1 : 0;
    }
}
struct UString_wchar {
    wchar[] str_;
    /// ditto D type stuff
}
struct UString_dchar {
    dchar[] str_;
}
// throw in void[] for charity. (although no one will need it)
struct UString_void {
    void[] ptr_;
}
enum UStringType { UC_CHAR, UC_WCHAR, UC_DCHAR }
struct UString {
    union {
        UString_void vstr;
        UString_char cstr;
        UString_wchar wstr;
        UString_dchar dstr;
    }
    UStringType ztype;
    // type things to track what we are.
}

I could then use the individual components by themselves where
appropriate, and even get them working.
In D2 the str_ array cannot be immutable, because it is modified when
appending or adjusting the null terminator.
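
To illustrate the terminator handling, here is a minimal sketch of how one of
the component structs could keep its trailing null while appending. It is not
my actual implementation: the constructor, append(), ptr and text members are
names assumed for this example, and only std.utf.toUTF16 comes from Phobos.

```d
import std.utf : toUTF16;

// Hypothetical sketch: a wchar component that always keeps a trailing '\0'.
struct UString_wchar
{
    private wchar[] str_;   // mutable, because the terminator gets moved

    this(const(char)[] s)
    {
        str_ = toUTF16(s).dup;      // own a mutable UTF-16 copy
        str_ ~= cast(wchar) 0;      // add the hidden terminator
    }

    // Number of UTF-16 code units, not counting the terminator.
    @property size_t length() const
    {
        return str_.length ? str_.length - 1 : 0;
    }

    void append(const(char)[] s)
    {
        if (str_.length)
            str_.length = str_.length - 1;  // drop the old terminator
        str_ ~= toUTF16(s);
        str_ ~= cast(wchar) 0;              // put a new one at the end
    }

    // Null-terminated pointer, suitable for a C or Windows W function.
    @property const(wchar)* ptr() const
    {
        return str_.ptr;
    }

    // The text without the terminator.
    @property const(wchar)[] text() const
    {
        return str_.length ? str_[0 .. $ - 1] : null;
    }
}

unittest
{
    auto u = UString_wchar("ab");
    assert(u.length == 2 && u.ptr[2] == 0);
    u.append("cd");
    assert(u.text == "abcd"w && u.ptr[4] == 0);
}
```

Compile with -unittest -main to run the check. An immutable str_ would force
a fresh allocation on every append, which is why the array stays mutable here.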
I did not get UString working as an associative array key, and have not
tried since.
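
For what it is worth, a struct normally needs a matching opEquals and toHash
pair before it behaves as an associative array key by content. The sketch
below is an assumption about how that could look for the char component, not
something I have working in UString itself; the text helper and the FNV-1a
style hash are made up for the example.

```d
// Hypothetical sketch: making a component struct usable as an AA key.
struct UString_char
{
    char[] str_;    // includes the trailing '\0', as in the struct above

    // The text without the terminator.
    @property const(char)[] text() const pure nothrow @safe
    {
        return str_.length ? str_[0 .. $ - 1] : null;
    }

    // Keys compare by content, not by array identity.
    bool opEquals(const UString_char rhs) const pure nothrow @safe
    {
        return text == rhs.text;
    }

    // Must agree with opEquals: equal text gives an equal hash.
    // A simple FNV-1a style hash over the code units.
    size_t toHash() const pure nothrow @safe
    {
        size_t h = 2166136261u;
        foreach (char c; text)
            h = (h ^ c) * 16777619u;
        return h;
    }
}

unittest
{
    int[UString_char] counts;
    counts[UString_char("key\0".dup)] = 1;

    // A different array holding the same text finds the same entry.
    assert(UString_char("key\0".dup) in counts);
    assert(UString_char("other\0".dup) !in counts);
}
```

Because str_ is mutable, a key must not be modified after insertion, or the
stored hash no longer matches its content.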
I also made a class version called VString containing the union.
There are a lot of issues.
I must also acknowledge the prior art of the mtext code and its MString
structure type. I was partly inspired by seeing it, and by how complex it
was to do nearly everything.
When I last checked, mtext was kind of broken for recent D1 and D2
compilers, and I did not want to fix it. I admit I did not like the
complexity of the direct union { char[], wchar[], dchar[] }. Splitting it
up into separately usable structs seems to me to give 3 times the
potential for the same price.
The advantage of using structs for such types is that it may help bring
about the perfection of such a POD-based "type creation" facility. I note
from looking at some of the Phobos D2 code, e.g. std.array, that this
seems to be attempted in places.
Nearly all the more interesting D types, such as arrays and maps, are
equivalent to smallish POD types of at least 2-3 times the machine word
size (32/64 bit).
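
The size point can be checked at compile time for the array case. This is
only a small sketch; the Tagged struct is a hypothetical stand-in for an
array plus a type tag, not one of the types above.

```d
// A dynamic array is a length plus a pointer: two machine words.
static assert((char[]).sizeof == 2 * size_t.sizeof);

// The element type does not change the size of the array value itself.
static assert((wchar[]).sizeof == (dchar[]).sizeof);

// Hypothetical wrapper: an array plus a one-byte tag.
struct Tagged
{
    wchar[] str_;
    ubyte kind;
}

// The tag is padded out to a full word, giving three words in total.
static assert(Tagged.sizeof == 3 * size_t.sizeof);
```

These hold on both 32-bit and 64-bit targets, since everything is expressed
in terms of size_t.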
Making it all work, keeping it understandable and avoiding WTF bugs is a
big challenge.