String implementations

Sat Jan 19 10:14:19 PST 2008

Janice Caron wrote:
> On 1/19/08, James Dennett <jdennett at acm.org> wrote:
>>> Because, think about this:
>>>
>>>     char[] a = new char[8];
>>>
>>> If a char array were indexed by character instead of codeunit, as you
>>> suggest, how many bytes would the compiler need to allocate?
>> 8.  That's independent of how they are indexed.
> 
> My *rhetorical* question was in response to a poster who suggested
> that char arrays should be indexed by CHARACTER. 

Yes, I know.  If you don't see that my post answered yours,
feel free to ask me to explain more clearly.

Indexing by character doesn't automatically tell us anything
about how the size specified for new[] works.  It would be a
little quirky for the size to be in bytes while the indices
are in characters, but it's quirky for char[] to pretend to
be a UTF8 type without enforcing that.
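
To make the size point concrete, here is a minimal sketch in
present-day D, assuming a current compiler. The helper name
allocatedBytes is hypothetical, introduced only for this
illustration. It shows that the count passed to new is a count
of code units (one byte each), and that the fresh elements are
char.init (0xFF), which is not a valid UTF8 byte in the first
place.

```d
// Hypothetical helper: bytes occupied by the elements of a
// freshly allocated char[] with n elements.
size_t allocatedBytes(size_t n)
{
    char[] a = new char[n];
    return a.length * char.sizeof;
}

void main()
{
    // A char is exactly one UTF-8 code unit, i.e. one byte.
    assert(char.sizeof == 1);

    // Asking for 8 elements yields 8 bytes of element storage,
    // regardless of how the elements are later indexed.
    assert(allocatedBytes(8) == 8);

    // New elements are initialised to char.init, which is 0xFF.
    char[] a = new char[8];
    assert(a.length == 8);
    assert(a[0] == 0xFF);
}
```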

> I know perfectly well
> that in reality they are indexed by UTF-8 code unit, and therefore
> that the correct answer is eight. I was giving additional reasons
> /why/ indexing by character was not sensible.

I was refuting your claimed reason.

> 
>>> It can't
>>> know in advance.
>> Yup, it can.  8.
> 
> Of course. However, /if/ they were indexed by character, /then/ it
> would be impossible to know in advance how many UTF-8 code units it
> would take to construct eight characters.

Sure.  So it wouldn't be a request for 8 characters.

> Which is a good reason why
> they should /not/ be indexed by character.

I *still* disagree with that.

> As far as I am concerned, D has got it right.

Enjoy ;)

>>> Also:
>>>
>>>     char[] a = "abcd";
>>>     char[] b = "\u20AC";
>>>     a[0] = b[0];
>>>
>>> would cause big problems. (Would a[1] get overwritten? Would a have to
>>> be resized and everything shifted up one byte?)
>> It causes big problems with byte-wise addressing, because you
>> can end up with a char[] (which is specified to hold a UTF8
>> string) which does not contain a UTF8 string.
> 
> That's what I said. Thank you for reiterating it.

I think we are *disagreeing* here.  I claim that this causes
problems _with the current design of D_, which would be
resolved if char[] (or however we denote mutable UTF8 strings)
were really a UTF8 type.
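
A minimal sketch of the quoted example, adapted so that it
compiles with a current D compiler (the .dup is my addition,
since string literals are immutable there). The helper name
isValidUtf8 is hypothetical; std.utf.validate is the standard
library check, which throws UTFException on malformed input.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char[] a = "abcd".dup;   // mutable copy of the literal
    string b = "\u20AC";     // the euro sign: 3 code units

    assert(b.length == 3);
    assert(isValidUtf8(a));

    // Copies a single code unit (0xE2), not a character.
    a[0] = b[0];

    // Nothing was resized or shifted, but the array no longer
    // holds well-formed UTF-8: 0xE2 must be followed by two
    // continuation bytes, and 'b' is not one.
    assert(a.length == 4);
    assert(!isValidUtf8(a));
}
```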

>>> I think D has got it right. Use wchar or dchar when you need character
>>> based indexing.
>> If you have UTF8, you should not be allowed to access the bytes
>> that make it up, only its characters.
> 
> I absolutely /should/ be (and am) allowed to access the UTF-8 code
> units stored within an array of UTF-8 code units. This is absolutely
> as it should be, thank you very much.

But you already illustrated why it should not: you can break
the invariant that there is valid UTF8 there, so char[] is
lying when it says that it is a UTF8 type.

> 
>> If you want a bunch of
>> bytes, use an array of bytes.
> 
> Of course. And if you want an array of UTF-8 code units, use a char[]
> or a string.

If you're at the level of code units, you can't be sure that
they're part of UTF8 characters unless there is a higher-level
invariant enforced on them.
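
As a rough sketch of what such a higher-level invariant could
look like, assuming a current D compiler: the type Utf8Text and
its members are hypothetical names of my own, not anything in
D or its library. The constructor validates once, and the code
units are then kept private and immutable so that the check
cannot be undone by a later element assignment.

```d
import std.utf : validate, UTFException;

// Hypothetical wrapper: holds code units that are known to be
// well-formed UTF-8 because they were checked on construction.
struct Utf8Text
{
    private immutable(char)[] data;

    this(const(char)[] s)
    {
        validate(s);     // throws UTFException if malformed
        data = s.idup;   // private immutable copy
    }

    // Number of characters (code points), found by decoding.
    size_t codePoints() const
    {
        size_t n = 0;
        foreach (dchar c; data)
            ++n;
        return n;
    }

    // Read-only view of the underlying code units.
    string toString() const
    {
        return data;
    }
}

void main()
{
    auto euro = Utf8Text("\u20AC");
    assert(euro.toString().length == 3);   // three code units
    assert(euro.codePoints() == 1);        // one character

    char[] broken = "abcd".dup;
    broken[0] = 0xE2;   // lead byte with no continuation bytes

    bool rejected = false;
    try
    {
        auto t = Utf8Text(broken);
    }
    catch (UTFException)
    {
        rejected = true;
    }
    assert(rejected);
}
```

This deliberately offers no mutable indexing at all; whether
indexing should then be by character is the separate question
being argued above.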

>> (On the other hand, D is weak here
>> because it identifies UTF8 strings with arrays of char, but char
>> doesn't hold a UTF8 character.
> 
> I believe you are wrong. I would say that D is strong here, because it
> identifies UTF-8 strings with arrays of char, because char /does/ hold
> a UTF-8 code unit.

That's the problem.  char[] can hold non-UTF8 strings.

> The mistake is assuming that code unit == character.

If you're telling me that I made that mistake, you're
sorely mistaken.

> It does not.
> (However, there is overlap in the ASCII range, 0x00 to 0x7F, so the
> assumption is still valid if you are certain that your strings are
> ASCII).
> 
> Once you've grokked that char[] == array of UTF-8 code unit,
> everything else falls into place and makes sense.

No, it does not.  It's precisely that difference which makes
D's char[] a poor man's UTF8 string.

-- James