Making all strings UTF ranges has some risk of WTF

Michael Rynn michaelrynn at optusnet.com.au
Fri Feb 5 04:27:12 PST 2010


On Thu, 04 Feb 2010 18:41:48 -0700, Rainer Deyke wrote:

> Andrei Alexandrescu wrote:
>> One idea I've had for a while was to have a universal string type:
>> 
>> struct UString {
>>     union {
>>         char[] utf8;
>>         wchar[] utf16;
>>         dchar[] utf32;
>>     }
>>     enum Discriminator { utf8, utf16, utf32 }; Discriminator kind;
>>     IntervalTree!(size_t) skip;
>>     ...
>> }
>> 


Firstly, for such "augmented types" in D, such as strings, bignums or any 
future ideas, it is great to have the facilities for creating them with 
struct, so that they can be used elsewhere regardless of whether they are 
built in as compiler specials or live in the library.

What is there for struct now is good and getting better in D2, but I 
still feel a little insecure about how to make a really optimal 
implementation, one that is as good as a built-in type that the compiler 
understands. The DPL has been a help with this.

Programmers will want to use raw char[], wchar[] and dchar[] for whatever 
reasons, with their "simple" behaviours, so these should not be made 
unavailable just because more sophisticated types can be created purely 
for Unicode strings.

I have made a UString implementation, similar to the above, but I played 
a different trick. I also wanted it to maintain a terminating null char, 
so it can be passed to Windows API functions without conversion, in 
particular the 16-bit W interfaces.

struct UString_char {
	char[]   str_;
	/// ... lots of good D type stuff: constructor and assign conversions, access

	size_t length() {
		return str_.length - 1; // hide terminating null
	}
}

struct UString_wchar {
	wchar[]   str_;
	/// ditto D type stuff
}

struct UString_dchar {
	dchar[]   str_;
}

// throw in void[] for charity. (although no one will need it)

struct UString_void {
	void[]  ptr_;
}
enum UStringType { UC_CHAR, UC_WCHAR, UC_DCHAR }
	
struct UString {

	
	union {
		UString_void vstr;
		UString_char cstr;
		UString_wchar wstr;
		UString_dchar dstr;
	}
	UStringType	ztype;
	
	// type things to track what we are.
}

I could then use the individual component structs by themselves where 
appropriate, and even get them working.
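
To make the tagged union idea concrete, here is a minimal sketch in D2. 
It is not my actual UString code: the names UStr, Kind, codeUnits and 
codePoints are made up for illustration, and it overlays the raw arrays 
directly rather than the component structs. The only library call used 
is std.utf.count.

```d
import std.utf : count;

// Hypothetical sketch of a tagged union over the three UTF encodings.
struct UStr
{
    enum Kind { utf8, utf16, utf32 }

    private union
    {
        char[]  c;   // valid when kind_ == Kind.utf8
        wchar[] w;   // valid when kind_ == Kind.utf16
        dchar[] d;   // valid when kind_ == Kind.utf32
    }
    private Kind kind_;

    this(char[] s)  { c = s; kind_ = Kind.utf8; }
    this(wchar[] s) { w = s; kind_ = Kind.utf16; }
    this(dchar[] s) { d = s; kind_ = Kind.utf32; }

    @property Kind kind() const { return kind_; }

    // Number of code units in whichever encoding is stored.
    size_t codeUnits() const
    {
        final switch (kind_)
        {
            case Kind.utf8:  return c.length;
            case Kind.utf16: return w.length;
            case Kind.utf32: return d.length;
        }
    }

    // Number of code points, independent of the encoding.
    size_t codePoints() const
    {
        final switch (kind_)
        {
            case Kind.utf8:  return count(c);
            case Kind.utf16: return count(w);
            case Kind.utf32: return d.length;
        }
    }
}
```

Every operation has to switch on the tag before touching a union member, 
which is where much of the complexity comes from.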

In D2 the str_ array cannot be immutable, since appending to it or 
fiddling with the null terminator modifies it in place.
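
A rough sketch of the terminator bookkeeping, assuming a mutable wchar[] 
buffer. The names ZWString, append and data are hypothetical and this is 
not the real code; toUTF16 is from std.utf. Note that a default 
initialised value has no terminator at all, which is one of the corner 
cases that has to be handled everywhere.

```d
import std.utf : toUTF16;

// Hypothetical sketch: a UTF-16 string that always keeps a trailing 0.
struct ZWString
{
    private wchar[] str_;   // mutable; last element is 0 once constructed

    this(const(char)[] s)
    {
        str_ = toUTF16(s).dup;   // convert, then copy into mutable storage
        str_ ~= cast(wchar) 0;   // add the terminator
    }

    // Code units, hiding the terminating null (0 for ZWString.init).
    @property size_t length() const
    {
        return str_.length ? str_.length - 1 : 0;
    }

    // Pointer that could be handed to a W function expecting a
    // null-terminated UTF-16 string (null for ZWString.init).
    @property const(wchar)* ptr() const { return str_.ptr; }

    // View of the text without the terminator.
    const(wchar)[] data() const
    {
        return str_.length ? str_[0 .. $ - 1] : null;
    }

    void append(const(char)[] s)
    {
        if (str_.length)
            str_.length = str_.length - 1;   // drop the old terminator
        str_ ~= toUTF16(s);
        str_ ~= cast(wchar) 0;               // put it back at the new end
    }
}
```

Every mutating operation has to remove and restore the terminator, so 
none of this works on an immutable array without copying.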

I did not get UString working as an associative array key, and have not 
tried since.
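
For reference, a struct used as an associative array key needs toHash 
and opEquals that agree with each other. The following is only a sketch 
for a single dchar[] component, with a made-up name UKey and a simple 
FNV-style hash chosen for illustration. It does not solve the harder 
problem of hashing the same text identically across the three encodings.

```d
// Hypothetical sketch: a dchar[] wrapper usable as an AA key.
struct UKey
{
    dchar[] str_;

    // Hash the contents, not the pointer, so equal text hashes equally.
    size_t toHash() const @safe pure nothrow
    {
        size_t h = 2166136261u;
        foreach (dchar ch; str_)
            h = (h ^ ch) * 16777619u;
        return h;
    }

    // Must be consistent with toHash: equal contents compare equal.
    bool opEquals(const UKey other) const @safe pure nothrow
    {
        return str_ == other.str_;
    }
}
```

The key's contents must not be changed while it sits in the table, which 
is awkward for a string type whose buffer is deliberately mutable.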

I also made a class version called VString containing the union. 
There are a lot of issues.

I also must acknowledge the prior art of the mtext code, and its MString 
structure type.  I was partly inspired by seeing this, and how complex it 
was to do nearly everything.  

When I last checked, mtext was kind of broken for recent D1 and D2 
compilers, and I did not want to fix it.  I admit I did not like the 
complexity of the direct union { char[], wchar[], dchar[] }.  Splitting 
it up into separately usable structs seems to me to give 3 times the 
potential for the same price.

The advantage of using struct for such types is that it may help bring 
about the perfection of such a POD-based "type creation" facility. I note 
from looking at some of the Phobos D2 code, e.g. std.array, that this 
seems to be attempted in places.

Nearly all of the more interesting D types (arrays, maps) are equivalent 
to smallish POD types, at least 2-3 times the machine word size 
(32/64 bit).

Making it all work, keeping it understandable and avoiding WTF bugs is a 
big challenge.







