String implementations

Sat Jan 19 01:24:29 PST 2008

On 1/19/08, James Dennett <jdennett at acm.org> wrote:
> > Because, think about this:
> >
> >     char[] a = new char[8];
> >
> > If a char array were indexed by character instead of codeunit, as you
> > suggest, how many bytes would the compiler need to allocate?
>
> 8.  That's independent of how they are indexed.

My *rhetorical* question was in response to a poster who suggested
that char arrays should be indexed by CHARACTER. I know perfectly well
that in reality they are indexed by UTF-8 code unit, and therefore
that the correct answer is eight. I was giving additional reasons
/why/ indexing by character was not sensible.

> > It can't
> > know in advance.
>
> Yup, it can.  8.

Of course. However, /if/ they were indexed by character, /then/ it
would be impossible to know in advance how many UTF-8 code units it
would take to construct eight characters. Which is a good reason why
they should /not/ be indexed by character.
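
To make that concrete, here is a minimal sketch, assuming present-day D2 and
Phobos' std.utf. The helper names codeUnits and codePoints are mine, purely for
illustration. Two strings of eight characters each need very different amounts
of storage:

```d
import std.stdio : writefln;
import std.utf : count;

// Number of UTF-8 code units (bytes) needed to store s.
// .length on a char array counts code units, not characters.
size_t codeUnits(string s)
{
    return s.length;
}

// Number of code points encoded in s (illustrative helper).
size_t codePoints(string s)
{
    return count(s);
}

void main()
{
    string ascii = "abcdefgh";                 // eight ASCII characters
    string euros = "\u20AC\u20AC\u20AC\u20AC"
                 ~ "\u20AC\u20AC\u20AC\u20AC"; // eight euro signs

    writefln("%s code points, %s code units", codePoints(ascii), codeUnits(ascii));
    writefln("%s code points, %s code units", codePoints(euros), codeUnits(euros));
}
```

The first string fits in 8 code units and the second needs 24, so "eight
characters" does not determine an allocation size.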

As far as I am concerned, D has got it right.

> > Also:
> >
> >     char[] a = "abcd";
> >     char[] b = "\u20AC";
> >     a[0] = b[0];
> >
> > would cause big problems. (Would a[1] get overwritten? Would a have to
> > be resized and everything shifted up one byte?)
>
> It causes big problems with byte-wise addressing, because you
> can end up with a char[] (which is specified to hold a UTF8
> string) which does not contain a UTF8 string.

That's what I said. Thank you for reiterating it.
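
For anyone who wants to see it happen, a rough sketch of the a[0] = b[0]
example follows. It assumes D2, where string literals are immutable and so
need .dup to get a mutable char[]. isValidUtf8 and overwriteFirstUnit are
hypothetical helpers of my own, built on std.utf.validate:

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

// True if the code units in s form well-formed UTF-8 (illustrative helper).
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

// Copy dst and overwrite its first code unit with src's first code unit.
char[] overwriteFirstUnit(string dst, string src)
{
    char[] a = dst.dup;   // mutable copy of the literal
    a[0] = src[0];        // one code unit is copied, nothing more
    return a;
}

void main()
{
    char[] a = overwriteFirstUnit("abcd", "\u20AC");
    writeln(a.length);        // 4: nothing was resized or shifted
    writeln(isValidUtf8(a));  // false: lone lead byte 0xE2 followed by 'b'
}
```

With code unit indexing the assignment itself is well defined: a[1] is left
alone and nothing is resized. The price is that the result is no longer valid
UTF-8.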

> > I think D has got it right. Use wchar or dchar when you need character
> > based indexing.
>
> If you have UTF8, you should not be allowed to access the bytes
> that make it up, only its characters.

I absolutely /should/ be (and am) allowed to access the UTF-8 code
units stored within an array of UTF-8 code units. That is exactly
as it should be, thank you very much.

> If you want a bunch of
> bytes, use an array of bytes.

Of course. And if you want an array of UTF-8 code units, use a char[]
or a string.

> (On the other hand, D is weak here
> because it identifies UTF8 strings with arrays of char, but char
> doesn't hold a UTF8 character.

I believe you are wrong. I would say that D is strong here: it
identifies UTF-8 strings with arrays of char, and char /does/ hold
a UTF-8 code unit.

The mistake is assuming that code unit == character. The two are not
the same thing. (However, they coincide in the ASCII range, 0x00 to
0x7F, so the assumption still holds if you are certain that your
strings are ASCII.)
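
A small sketch of the distinction, again assuming D2; codeUnitsOf and
codePointsOf are names I made up for illustration. It relies only on foreach
over a string, which walks code units when the loop variable is char and
decodes code points when it is dchar:

```d
import std.stdio : writefln;

// Collect the UTF-8 code units of s, one element per char.
uint[] codeUnitsOf(string s)
{
    uint[] result;
    foreach (char c; s)        // iterate code unit by code unit
        result ~= cast(uint) c;
    return result;
}

// Collect the code points of s; foreach decodes the UTF-8 on the fly.
uint[] codePointsOf(string s)
{
    uint[] result;
    foreach (dchar c; s)       // iterate code point by code point
        result ~= cast(uint) c;
    return result;
}

void main()
{
    string s = "a\u20ACb";

    writefln("%(%02X %)", codeUnitsOf(s));     // 61 E2 82 AC 62
    writefln("%(U+%04X %)", codePointsOf(s));  // U+0061 U+20AC U+0062
}
```

The 'a' and 'b' are one code unit each, which is the ASCII overlap; the euro
sign is one character spread over three code units.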

Once you've grokked that char[] == array of UTF-8 code units,
everything else falls into place and makes sense.