String implementations

Sat Jan 19 12:21:49 PST 2008

On 1/19/08, James Dennett <jdennett at acm.org> wrote:
> >>>     char[] a = "abcd";
> >>>     char[] b = "\u20AC";
> >>>     a[0] = b[0];
>
> I think we are *disagreeing* here.  I claim that this causes
> problems _with the current design of D_, which would be
> resolved if char[] (or however we denote mutable UTF8 strings)
> were really a UTF8 type.

So you're saying that in your new design, after that assignment, a
would equal "\u20ACbcd". The problem is that the assignment would have
to allocate two extra bytes and then move the rest of the array up to
make room. That strikes me as kinda slow, which is not something I'd
want in a char array.
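
To make the cost concrete, here is a rough sketch of what such an
assignment would have to do behind the scenes. replaceCodePoint is a
made-up helper name, not anything in the language or in Phobos, and the
code assumes D2 with std.utf (hence the .dup on the literal).

```d
import std.utf : encode, stride;

// Hypothetical helper: replace the code point starting at byte offset i
// with c, growing or shrinking the array as needed.
char[] replaceCodePoint(char[] s, size_t i, dchar c)
{
    char[4] buf;
    immutable newLen = encode(buf, c);  // UTF-8 code units needed for c
    immutable oldLen = stride(s, i);    // code units used by the old code point
    // Concatenation allocates a fresh array and copies the tail to its
    // new offset: this is the "allocate and move" cost.
    return s[0 .. i] ~ buf[0 .. newLen] ~ s[i + oldLen .. $];
}

// Checked when compiled with -unittest.
unittest
{
    char[] a = "abcd".dup;
    a = replaceCodePoint(a, 0, '\u20AC');
    assert(a == "\u20ACbcd");
    assert(a.length == 6);  // 3 code units for the euro sign, plus "bcd"
}
```

So a one-element store turns into an allocation plus a copy that is
linear in the length of the string.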

> That's the problem.  char[] can hold non-UTF8 strings.

Yes, that is possible. But only in buggy code, of course. That really
raises the question: is it the compiler's job, or the programmer's, to
ensure that the contract is maintained? I don't really have any
problem taking responsibility for maintaining UTF-8 correctness. (It's
not hard.)
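
For what it's worth, a check along these lines is about all it takes
for the programmer to catch the buggy case. This is only a sketch,
assuming D2 and std.utf.validate; isValidUtf8 is a name I made up for
the example.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);  // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Checked when compiled with -unittest.
unittest
{
    char[] a = "abcd".dup;
    char[] b = "\u20AC".dup;  // three code units: 0xE2 0x82 0xAC

    assert(isValidUtf8(a));
    a[0] = b[0];              // copies only the lead byte 0xE2
    assert(!isValidUtf8(a));  // lead byte with no continuation bytes
}
```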

But if you want to be completely protected from those kinds of errors,
I still don't see the problem with using dchar.
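
Here is the same assignment with dchar, as a small sketch (D2 syntax
assumed, hence the d suffix on the literals and the .dup):

```d
// Checked when compiled with -unittest.
unittest
{
    dchar[] a = "abcd"d.dup;
    dchar[] b = "\u20AC"d.dup;

    a[0] = b[0];  // one array element holds one whole code point
    assert(a == "\u20ACbcd"d);
    assert(a.length == 4);                   // still four elements
    assert(a.length * dchar.sizeof == 16);   // at four bytes apiece
}
```

Nothing gets reallocated or moved; the price is four bytes per
character.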

> No, it does not.  It's precisely the difference that is why
> D's char[] is a poor man's UTF8 string.

I suppose a library class could be written whose interface behaved
like a dchar array, but whose implementation was UTF-8. But when would
you ever use it?
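
Just to illustrate what I have in mind, a bare-bones version might look
like the sketch below. Utf8String is purely hypothetical, it is
read-only, and it assumes D2 with std.utf.

```d
import std.utf : decode, stride;

// Hypothetical wrapper: stores UTF-8 code units, indexes by code point.
struct Utf8String
{
    char[] data;  // UTF-8 code units

    // Number of code points; has to walk the whole buffer.
    size_t length()
    {
        size_t n = 0;
        for (size_t i = 0; i < data.length; i += stride(data, i))
            ++n;
        return n;
    }

    // The n-th code point; O(n), since every earlier code point
    // must be skipped first.
    dchar opIndex(size_t n)
    {
        size_t i = 0;
        foreach (_; 0 .. n)
            i += stride(data, i);
        return decode(data, i);
    }
}

// Checked when compiled with -unittest.
unittest
{
    auto s = Utf8String("\u20ACbcd".dup);
    assert(s.data.length == 6);  // six code units in storage
    assert(s.length == 4);       // four code points through the interface
    assert(s[0] == '\u20AC');
    assert(s[1] == 'b');
}
```

Indexing and length both become linear-time, which is a lot to pay for
something that looks like an array.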