String implementations
James Dennett
jdennett at acm.org
Fri Jan 18 21:38:36 PST 2008
Janice Caron wrote:
> On 1/16/08, Jarrod <qwerty at ytre.wq> wrote:
>> On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:
>>
>> So if this is the case, then why can't the language itself manage multi-
>> byte characters for us? It would make things a hell of a lot easier and
>> more efficient than having to convert /potentially/ foreign strings to
>> utf-32 for a simple manipulation operation, then converting them back.
>> The only reason I can think of for char arrays being treated as fixed
>> length is for faster indexing, which is hardly useful in most cases since
>> a lot of the time we don't even know if we're dealing with multi-byte
>> characters when handling strings, so we have to convert and traverse the
>> strings anyway.
>
> Because, think about this:
>
> char[] a = new char[8];
>
> If a char array were indexed by character instead of codeunit, as you
> suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
> It can't
> know in advance.
Yup, it can. 8.
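To make that concrete (a small sketch, assuming current D/Phobos semantics, where a string literal's .length counts UTF-8 code units):

```d
import std.stdio;

void main()
{
    // Allocation is by code unit: 8 chars means 8 bytes,
    // regardless of how many *characters* they end up holding.
    char[] a = new char[8];
    assert(a.length == 8);

    // The euro sign alone would consume 3 of those 8 code units.
    string euro = "\u20AC";
    assert(euro.length == 3);

    writeln("8 code units allocated; euro sign takes ", euro.length);
}
```

So the allocation question has a fixed answer either way; only the indexing semantics would change.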
> Also:
>
> char[] a = "abcd";
> char[] b = "\u20AC";
> a[0] = b[0];
>
> would cause big problems. (Would a[1] get overwritten? Would a have to
> be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you
can end up with a char[] (which is specified to hold a UTF8
string) that no longer contains valid UTF8.
> I think D has got it right. Use wchar or dchar when you need character
> based indexing.
If you have UTF8, you should not be allowed to access the bytes
that make it up, only its characters. If you want a bunch of
bytes, use an array of bytes. (On the other hand, D is weak here
because it identifies UTF8 strings with arrays of char, even
though a single char cannot hold an arbitrary UTF8 character.
I can't imagine, though, that persuading Walter this is a
horrible error is going to work.)
-- James