String implementations

Sun Jan 20 15:01:40 PST 2008

Walter Bright wrote:
> James Dennett wrote:
>> I've given specific problems with it.  I've heard no refutation
>> of them.
> 
> It's hard to describe, but after working with UTF-8 for a while, they 
> are just non-problems. Code isn't written that way.
> 
> If you want, you can create a String class which wraps a char[] and 
> treats it at the level you wish.

Indeed, but such a thing should be standard, not reinvented
over and over.

>> D uses essentially a model of UTF8 which is really just
>> a bunch-of-bytes with smart iteration.
> 
> That's what UTF-8 is.

That view has led to many security issues, where different
software reacts differently to byte strings which are not
valid UTF-8 in places where UTF-8 is expected.
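
To illustrate the kind of divergence meant here, a minimal sketch
(in Python rather than D, purely for brevity). naive_decode is a
hypothetical lenient decoder invented for this example, not code
from any real library; strict_decode just wraps Python's built-in
UTF-8 decoder. The payload uses 0xC0 0xAF, an overlong (and
therefore invalid) encoding of "/".

```python
def naive_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: handles 1- and 2-byte forms only
    and performs no validity checks (assumption for illustration)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or i + 1 >= len(data):
            out.append(chr(b))
            i += 1
        else:
            # Blindly combine the low bits of lead and continuation bytes.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
    return "".join(out)


def strict_decode(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# 0xC0 0xAF is an overlong encoding of "/", which valid UTF-8 forbids.
payload = b"..\xc0\xaf..\xc0\xafetc"

print(b"/" in payload)         # a byte-level filter for "/" sees nothing
print(naive_decode(payload))   # the lenient decoder produces "../../etc"
print(strict_decode(payload))  # the strict decoder rejects the input
```

A byte-level check for "/" passes this input, the lenient decoder
then turns it into a path with separators, and the strict decoder
refuses it outright: three components, three different answers.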

>> C-based projects on which
>> I worked in the 90's did similarly, but with coding conventions
>> that banned direct access to the bytes.
> 
> Coding conventions are one thing, but banning things in a systems 
> language are quite another. Copying a UTF-8 string by decoding and 
> encoding the characters one-by-one is unacceptably inefficient, for 
> example, compared with just memcpy. Searching a UTF-8 string for a 
> substring is another operation for which treating it like a bag of bytes 
> works best.

There are alternatives: explicit notation to access the bytes,
which *doesn't* look like it's accessing characters, would be
better.  (char doesn't represent a character in D.  Not great
naming?  But then D almost follows C in this, where char did
double duty as a limited character type and a small integral
type.)
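
As a rough sketch of what "explicit notation" could look like, here
is the distinction in Python, used only for illustration since its
str and bytes types happen to separate the two views. Nothing here
is a proposal for D syntax; is_valid_utf8 is a helper name made up
for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "na\u00efve"          # five code points; U+00EF is i with diaeresis
raw = text.encode("utf-8")   # the explicit step that exposes the code units

print(len(text), len(raw))   # code points vs. code units: 5 and 6

# Byte-level substring search works, but the index is a byte offset.
print(text.find("ve"), raw.find("ve".encode("utf-8")))   # 3 and 4

# Slicing by byte offset can cut a multi-byte sequence in half.
print(is_valid_utf8(raw[:3]))   # False: ends in a lone lead byte 0xC3
print(is_valid_utf8(raw[:4]))   # True: the two-byte sequence is complete
```

The byte-level operations stay available and cheap; the point is
only that reaching them takes a visible step, so an index into the
bytes cannot be mistaken for an index into the characters.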

>>> This is why you'll have a hard time persuading me otherwise <g>.
>> Because you assert that there's not a problem? ;)
> 
> Because I know it works based on experience.

And I know, based on experience, of problems with it.  So how
do we get past this to discuss things more objectively?

(Of course, we don't have to.  You're the BDFL, and you get to
make the call, and try to keep D coherent in the face of a hundred
people pushing inconsistent views for how it should evolve.  I get
the easy job of being just one of those voices.)

>>> Note that C++0x is doing things similarly:
>>>
>>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
>>
>> Looks very different to me.  There's no conflation of char with a
>> code unit of UTF8 (and indeed C++ deliberately supports use of
>> varied encodings for multi-byte characters).  Yes, C++ is adding
>> 16- and 32-bit character types which are more akin to D's, but that
>> has little bearing on how differently it handles multi-byte (as
>> opposed to wide-character) strings.
> 
> Since, in the C++ proposal, indexing and length is done by 
> byte/word/dword, not by code point, it's semantically equivalent. I 
> don't see any banning of getting at the underlying representation, nor 
> any attempt to hide it.

Whereas D partly attempts to hide it; the mathematician in me
hates this kind of fence-sitting.  But let's get more concrete:
suppose D code finds that an alleged char[] passed to it is, in
fact, broken (i.e., violates the UTF-8 invariants).  What should
it do -- abort, throw an exception, offer a policy for handling
such bugs, other?
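
For concreteness, the "offer a policy" option might look something
like the following sketch. It is in Python for illustration only,
leaning on the error handlers of its built-in decoder; the function
name decode_with_policy and the policy names "throw", "replace" and
"skip" are assumptions made up here, not anything D provides.

```python
def decode_with_policy(data: bytes, policy: str = "throw") -> str:
    """Decode UTF-8, handling invalid input per the chosen policy.

    Policy names are hypothetical; they map onto Python's standard
    decoder error handlers.
    """
    handlers = {
        "throw": "strict",     # raise UnicodeDecodeError on bad input
        "replace": "replace",  # substitute U+FFFD for each bad sequence
        "skip": "ignore",      # silently drop the bad bytes
    }
    return data.decode("utf-8", errors=handlers[policy])


bad = b"ab\xffcd"   # 0xFF can never appear in valid UTF-8

print(repr(decode_with_policy(bad, "replace")))   # 'ab\ufffdcd'
print(repr(decode_with_policy(bad, "skip")))      # 'abcd'

try:
    decode_with_policy(bad, "throw")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

Silently skipping is the riskiest of the three, since it can turn
rejected input into something that passes a later check; aborting
would simply be a stricter variant of "throw".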

-- James