String implementations

Sun Jan 20 14:55:26 PST 2008

Janice Caron wrote:
> On 1/20/08, James Dennett <jdennett at acm.org> wrote:
>> Looks very different to me.
> 
> I thought it looked very similar indeed to D, but there you go. Funny
> how two different people can read the same document and interpret it
> in two different ways.

The core issue here, to me, is D's half-hearted attempt to paint
char[] as a Unicode string type.  C++ has nothing analogous.

>> There's no conflation of char with a
>> code unit of UTF8
> 
> C has no ubyte type. Since time immemorial, C programmers have been
> using the char type to store every 8-bit wide data type under the sun
> simply because there's been no alternative (until recently, when
> int8_t showed up as a typedef for char).

int8_t is necessarily signed, a la "signed char"; it is not a
typedef for "char", whose signedness varies (though, unfortunately,
plain char is often signed in C and C++ implementations).
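
To make the distinction concrete, here is a minimal sketch.  It
assumes a C++0x compiler that provides <type_traits> and <cstdint>;
the helper names plain_char_is_signed and byte_value are made up
for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// char, signed char and unsigned char are three distinct types in C++,
// even though char shares its representation with one of the other two.
static_assert(!std::is_same<char, signed char>::value,
              "char is not signed char");
static_assert(!std::is_same<char, unsigned char>::value,
              "char is not unsigned char");

// Where int8_t is provided, it is a signed 8-bit type.
static_assert(std::is_signed<std::int8_t>::value, "int8_t is signed");
static_assert(std::numeric_limits<std::int8_t>::min() == -128,
              "int8_t is 8-bit two's complement");

// Hypothetical helper: the signedness of plain char is
// implementation-defined, so query it rather than assume it.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Hypothetical helper: a byte >= 0x80 read through plain char may come
// out negative; converting through unsigned char gives 0..255 either way.
inline int byte_value(char c) {
    int v = static_cast<unsigned char>(c);
    assert(v >= 0 && v <= 255);
    return v;
}
```

The last helper is the usual workaround when char is used to hold
raw bytes: convert through unsigned char before doing arithmetic or
table lookups on the value.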

> That's not a big deal.
> 
> 
>> (and indeed C++ deliberately supports use of
>> varied encodings for multi-byte characters).
> 
> I must have misread the heading that says "Require UTF", and whose
> text reads "The C TR makes the encoding of char16_t and char32_t
> implementation-defined. It also provides macros to indicate whether or
> not the encoding is UTF. In contrast, this proposal requires UTF
> encoding."
> 
> Oh, I see what you're saying - C++ would require UTF for wchar and
> dchar, but not for char. Well, that's historical legacy for you.

And it's the real world; computer systems need to interface
with existing systems which use diverse encodings.

>> Yes, C++ is adding
>> 16- and 32-bit character types which are more akin to D's, but that
>> has little bearing on how differently it handles multi-byte (as
>> opposed to wide-character) strings.
> 
> So it has a bunch of procedural functions instead of foreach. Apart
> from that, the approach seems the same as D. Where's the difference?

Philosophy: D pushes char[] as if it were a proper UTF8 facility,
and goes a small step towards adding language support for that.

C++ recognizes diversity in multi-byte character encodings, and
doesn't make the language promote one over any other.  It admits
up-front that you're dealing with code units if you want to work
with multi-byte characters.
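
As a rough illustration of what "dealing with code units" means in
practice, here is a sketch for the UTF-8 case only.  It assumes
well-formed UTF-8 input and does no validation; count_code_points
is a made-up name, not a standard library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points in a std::string.
// std::string::size() reports code units (bytes); a code point starts
// at every byte that is NOT a continuation byte of the form 10xxxxxx.
// Assumes well-formed UTF-8; malformed input is not detected.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    // There can never be more code points than code units.
    assert(n <= s.size());
    return n;
}
```

For a string such as "na\xC3\xAFve" (the word "naive" with a
diaeresis on the i, encoded as UTF-8), size() reports six code units
while this helper counts five code points.  The language itself
gives you only the former; anything beyond that is up to the
program, which is the point above.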

C++ is a long, long way from perfect when it comes to Unicode
support, and even C++0x will be.  But I'm hoping for more from D,
and what I see so far can stand some improvement.

-- James