String implementations

Walter Bright newshound1 at digitalmars.com
Sun Jan 20 12:20:02 PST 2008


James Dennett wrote:
> I've given specific problems with it.  I've heard no refutation
> of them.

It's hard to describe, but after working with UTF-8 for a while, they 
are just non-problems. Code isn't written that way.

If you want, you can create a String class which wraps a char[] and 
treats it at the level you wish.

> D uses essentially a model of UTF8 which is really just
> a bunch-of-bytes with smart iteration.

That's what UTF-8 is.
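A minimal sketch in C of what "a bunch-of-bytes with smart iteration" looks like (this helper is hypothetical, not code from D's runtime or the thread): the string stays a plain byte array, and a small decoder steps through it one code point at a time, assuming well-formed UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical helper: decode the code point starting at s[*i] and
   advance *i past however many bytes it occupied. The lead byte's
   high bits tell us the sequence length; continuation bytes each
   contribute 6 payload bits. Assumes well-formed UTF-8 input. */
unsigned decode_utf8(const unsigned char *s, size_t *i)
{
    unsigned char b = s[*i];
    if (b < 0x80) {                       /* 1 byte: plain ASCII */
        *i += 1;
        return b;
    }
    if (b < 0xE0) {                       /* 2-byte sequence */
        unsigned cp = (unsigned)(b & 0x1F) << 6 | (s[*i + 1] & 0x3F);
        *i += 2;
        return cp;
    }
    if (b < 0xF0) {                       /* 3-byte sequence */
        unsigned cp = (unsigned)(b & 0x0F) << 12
                    | (unsigned)(s[*i + 1] & 0x3F) << 6
                    | (s[*i + 2] & 0x3F);
        *i += 3;
        return cp;
    }
    /* 4-byte sequence */
    unsigned cp = (unsigned)(b & 0x07) << 18
                | (unsigned)(s[*i + 1] & 0x3F) << 12
                | (unsigned)(s[*i + 2] & 0x3F) << 6
                | (s[*i + 3] & 0x3F);
    *i += 4;
    return cp;
}
```

Nothing about the storage changes: the "string" is still just bytes, and code that needs code points layers this iteration on top, which is exactly the sort of thing a wrapper type can package up.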

> C-based projects on which
> I worked in the 90's did similarly, but with coding conventions
> that banned direct access to the bytes.

Coding conventions are one thing, but banning things in a systems 
language is quite another. Copying a UTF-8 string by decoding and 
re-encoding the characters one by one is unacceptably inefficient 
compared with a simple memcpy, for example. Searching a UTF-8 string for 
a substring is another operation for which treating it like a bag of 
bytes works best.

>> This is why you'll have a hard time persuading me otherwise <g>.
> Because you assert that there's not a problem? ;)

Because I know it works based on experience.

>> Note that C++0x is doing things similarly:
>>
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
> 
> Looks very different to me.  There's no conflation of char with a
> code unit of UTF8 (and indeed C++ deliberately supports use of
> varied encodings for multi-byte characters).  Yes, C++ is adding
> 16- and 32-bit character types which are more akin to D's, but that
> has little bearing on how differently it handles multi-byte (as
> opposed to wide-character) strings.

Since, in the C++ proposal, indexing and length are done by 
byte/word/dword, not by code point, it's semantically equivalent. I 
don't see any banning of getting at the underlying representation, nor 
any attempt to hide it.



More information about the Digitalmars-d mailing list