Making all strings UTF ranges has some risk of WTF

Thu Feb 4 14:03:15 PST 2010

Don Wrote:
> We seem to be approaching the point where char[], wchar[] and dchar[] 
> are all arrays of dchar, but with different levels of compression.
> It makes me wonder if the char, wchar types actually make any sense.
> If char[] is actually a UTF string, then char[] ~ char should be 
> permitted ONLY if char can be implicitly converted to dchar. Otherwise, 
> you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will 
> not necessarily result in a valid unicode string.

Well, if you're working with a LOT of text, you may be mmapping gigabytes of UTF-8 text.  Yes, this does happen.  You had better be able to handle it in a sane manner, i.e. without reallocating memory just to read the data in.  So there is a definite need for casting to an array of char, and for dealing with the inevitable stray non-Unicode byte in that mess.
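To make the no-copy point concrete, here is a rough sketch, assuming Phobos' std.mmfile and std.utf are available. The helper names asText and isValidUtf8 are made up for illustration, and so is the file name in the usage note; none of this is a proposed library API.

```d
import std.mmfile : MmFile;
import std.utf : validate, UTFException;

/// Hypothetical helper: view a mapped file as UTF-8 code units without copying.
const(char)[] asText(MmFile mm)
{
    // mm[] yields a void[] over the whole mapping; the cast only
    // reinterprets the bytes, nothing is allocated or copied.
    return cast(const(char)[]) mm[];
}

/// Hypothetical helper: true if the whole view is well-formed UTF-8.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on the first malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

Usage would be something like `auto mm = new MmFile("corpus.txt"); auto text = asText(mm);`.  The cast is free, but note that validating the whole thing up front still touches every page of the mapping, which is one reason to validate lazily instead.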

Real-world text processing can be a messy affair.  It probably requires walking such an array and returning slices cast to char[] only after they've been validated.
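A minimal sketch of that walk, assuming std.utf.decode is used to check one code point at a time.  The function name validSlices is hypothetical, and skipping exactly one byte per error is just one possible policy.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: split raw bytes into maximal runs of valid UTF-8.
/// The returned slices point into the original buffer; nothing is copied.
const(char)[][] validSlices(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw; // reinterpret only, no copy
    const(char)[][] result;
    size_t start = 0; // beginning of the current valid run
    size_t i = 0;     // current position

    while (i < text.length)
    {
        size_t next = i;
        try
        {
            decode(text, next); // advances next past one code point
            i = next;
        }
        catch (UTFException)
        {
            // Close the current run, then skip one stray byte.
            if (i > start)
                result ~= text[start .. i];
            ++i;
            start = i;
        }
    }
    if (start < text.length)
        result ~= text[start .. $];
    return result;
}
```

For the bytes 'a', 'b', 0xFF, 'c' this yields the two slices "ab" and "c".  Only the array of slices is allocated; the text itself stays where it was mapped.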