Making all strings UTF ranges has some risk of WTF
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Feb 3 22:26:38 PST 2010
Rainer Deyke wrote:
> Andrei Alexandrescu wrote:
>> Arrays of char and wchar are not quite generic - they are definitely UTF
>> strings.
>
> A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a
> generic array of utf-8 code units. Sometimes these code units line up
> to form valid unicode code points, sometimes they don't.
>
> If you want a data type that always contains a valid utf-8 string, don't
> call it 'char[]'. It's misleading, it breaks generic code, and it
> renders built-in arrays useless for when you actually want an array of
> utf-8 code units. It's the same mistake as std::vector<bool> in C++,
> but much worse.
I agree up to the assessment of the size of the problem and a couple of
other points. I've had a great time writing UTF code in D with string.
Getting back to C++'s std::string really put things in perspective.

If your purpose is to store some disparate utf-8 code units (a need that
I've never had), I see no problem with storing them as ubyte[].
Andrei