String implementations
James Dennett
jdennett at acm.org
Fri Jan 18 21:38:36 PST 2008
Janice Caron wrote:
> On 1/16/08, Jarrod <qwerty at ytre.wq> wrote:
>> On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:
>>
>> So if this is the case, then why can't the language itself manage multi-
>> byte characters for us? It would make things a hell of a lot easier and
>> more efficient than having to convert /potentially/ foreign strings to
>> utf-32 for a simple manipulation operation, then converting them back.
>> The only reason I can think of for char arrays being treated as fixed
>> length is for faster indexing, which is hardly useful in most cases since
>> a lot of the time we don't even know if we're dealing with multi-byte
>> characters when handling strings, so we have to convert and traverse the
>> strings anyway.
>
> Because, think about this:
>
> char[] a = new char[8];
>
> If a char array were indexed by character instead of codeunit, as you
> suggest, how many bytes would the compiler need to allocate?
8. That's independent of how they are indexed.
> It can't
> know in advance.
Yup, it can. 8.
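To make that concrete (a small sketch, assuming current D/Phobos semantics, where a string literal's .length counts UTF-8 code units):

```d
import std.stdio;

void main()
{
    // Allocation is by code unit: 8 chars means 8 bytes,
    // regardless of how many *characters* they end up holding.
    char[] a = new char[8];
    assert(a.length == 8);

    // The euro sign alone would consume 3 of those 8 code units.
    string euro = "\u20AC";
    assert(euro.length == 3);

    writeln("8 code units allocated; euro sign takes ", euro.length);
}
```

So the allocation question has a fixed answer either way; only the indexing semantics would change.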
> Also:
>
> char[] a = "abcd";
> char[] b = "\u20AC";
> a[0] = b[0];
>
> would cause big problems. (Would a[1] get overwritten? Would a have to
> be resized and everything shifted up one byte?)
It causes big problems with byte-wise addressing, because you
can end up with a char[] (which is specified to hold a UTF8
string) that no longer contains valid UTF8.
> I think D has got it right. Use wchar or dchar when you need character
> based indexing.
If you have UTF8, you should not be allowed to access the bytes
that make it up, only its characters. If you want a bunch of
bytes, use an array of bytes. (On the other hand, D is weak here
because it identifies UTF8 strings with arrays of char, even
though a single char cannot hold an arbitrary UTF8 character.
I can't imagine, though, that persuading Walter this is a
horrible error is going to work.)
-- James