String implementations
Walter Bright
newshound1 at digitalmars.com
Sun Jan 20 12:20:02 PST 2008
James Dennett wrote:
> I've given specific problems with it. I've heard no refutation
> of them.
It's hard to describe, but after working with UTF-8 for a while, they
are just non-problems. Code isn't written that way.
If you want, you can create a String class which wraps a char[] and
treats it at the level you wish.
> D uses essentially a model of UTF8 which is really just
> a bunch-of-bytes with smart iteration.
That's what UTF-8 is.
> C-based projects on which
> I worked in the 90's did similarly, but with coding conventions
> that banned direct access to the bytes.
Coding conventions are one thing, but banning things in a systems
language is quite another. Copying a UTF-8 string by decoding and
encoding the characters one-by-one is unacceptably inefficient, for
example, compared with just memcpy. Searching a UTF-8 string for a
substring is another operation for which treating it like a bag of bytes
works best.
>> This is why you'll have a hard time persuading me otherwise <g>.
> Because you assert that there's not a problem? ;)
Because I know it works based on experience.
>> Note that C++0x is doing things similarly:
>>
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html
>
> Looks very different to me. There's no conflation of char with a
> code unit of UTF8 (and indeed C++ deliberately supports use of
> varied encodings for multi-byte characters). Yes, C++ is adding
> 16- and 32-bit character types which are more akin to D's, but that
> has little bearing on how differently it handles multi-byte (as
> opposed to wide-character) strings.
Since, in the C++ proposal, indexing and length are done by
byte/word/dword, not by code point, it's semantically equivalent. I
don't see any banning of getting at the underlying representation, nor
any attempt to hide it.