Making all strings UTF ranges has some risk of WTF
Jerry Quinn
jlquinn at optonline.net
Thu Feb 4 14:03:15 PST 2010
Don wrote:
> We seem to be approaching the point where char[], wchar[] and dchar[]
> are all arrays of dchar, but with different levels of compression.
> It makes me wonder if the char, wchar types actually make any sense.
> If char[] is actually a UTF string, then char[] ~ char should be
> permitted ONLY if char can be implicitly converted to dchar. Otherwise,
> you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
> not necessarily result in a valid unicode string.
Well, if you're working with a LOT of text, you may be mmapping gigabytes of UTF-8 text. Yes, this does happen. You'd better be able to handle it in a sane manner, i.e. without reallocating memory just to read the data in. So there is a definite need for casting to an array of char, and for dealing with the inevitable stray non-Unicode bytes in that mess.
Real-world text processing can be a messy affair. It probably requires walking such an array and returning slices cast to char[] once they've been validated.
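
To make that concrete, here is a rough sketch of such a walk in D. It is only an illustration under my own assumptions: validSlices is a hypothetical helper name, and skipping one stray byte at a time is just one possible recovery policy. The only library call it relies on is std.utf.decode, which throws UTFException on malformed input. Nothing is copied; the returned slices point into the original (possibly mmapped) buffer.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper (not from Phobos): split raw bytes into runs of
/// valid UTF-8. Stray bytes are skipped one at a time; no text is copied.
const(char)[][] validSlices(const(ubyte)[] raw)
{
    // Reinterpret the bytes as char without allocating. Only the ranges
    // that decode() accepts below are handed back to the caller.
    auto text = cast(const(char)[]) raw;
    const(char)[][] result;
    size_t start = 0; // beginning of the current valid run
    size_t i = 0;     // current decode position

    while (i < text.length)
    {
        size_t next = i;
        try
        {
            // Advances next past one code point, or throws on bad input.
            decode(text, next);
            i = next;
        }
        catch (UTFException e)
        {
            // Close the valid run collected so far, if any.
            if (i > start)
                result ~= text[start .. i];
            ++i;       // assumption: drop a single stray byte and resync
            start = i;
        }
    }
    // Emit the trailing valid run, if any.
    if (text.length > start)
        result ~= text[start .. $];
    return result;
}
```

For example, feeding it the bytes of "hi", a stray 0xFF, a valid two-byte "é" (0xC3 0xA9), a stray 0x80 and then "ok" should give back three slices: "hi", "é" and "ok". Whether to drop, replace, or report the stray bytes is a policy decision the sketch does not try to settle.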