String implementations

Thu Jan 17 13:40:12 PST 2008

Jarrod wrote:
> While the topic is at hand, I guess I could rant a little;
> Why does D practically *require* the coder to use different forms of UTF 
> encoding.

Because I've worked with internationalized code in C/C++ where the 
encoding isn't specified, and it's very bad.

> D can tell if a code unit spans multiple bytes, as evidenced in 
> converting a utf-8 string to utf-32 (D knows where to split the blocks 
> apart), yet we can't index char[] arrays by code units. Instead, D will 
> index char arrays by fixed length bytes, which is almost nonsensical

It is impractical (i.e. very inefficient) to index arrays any other way, 
especially when getting array lengths, doing slicing, etc. In fact, it 
is rather rare to need to index by code point. The times you might want 
to do it are easily handled by foreach (dchar c; string).
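
As a minimal sketch of that foreach form (written against current D; the 
helper names countCodePoints and codePointStarts are made up for this 
illustration), the loop decodes one code point per iteration while 
.length and plain indexing keep working on UTF-8 code units:

```d
import std.stdio : writeln;

// Hypothetical helper: count code points by letting foreach decode them.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // one iteration per decoded code point
        ++n;
    return n;
}

// Hypothetical helper: code-unit offsets at which each code point starts.
size_t[] codePointStarts(string s)
{
    size_t[] starts;
    foreach (size_t i, dchar c; s)   // i is a code-unit offset, not a count
        starts ~= i;
    return starts;
}

void main()
{
    string s = "h\u00E9llo";         // U+00E9 takes two UTF-8 code units
    writeln(s.length);               // length in code units
    writeln(countCodePoints(s));     // length in code points
    writeln(codePointStarts(s));
}
```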

> since the D spec asserts that char[] arrays are designed specifically for 
> unicode characters, and that other single byte arrays should instead be 
> made as a byte[].
> So if this is the case, then why can't the language itself manage multi-
> byte characters for us?

It does; see foreach. In general, I don't think it's a good idea for the 
language to try to completely hide the multibyte nature of UTF. For 
example, when you're allocating and copying strings around, you need the 
byte length, not the number of code points.
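
A small sketch of why the byte length is the one that matters for 
allocation and copying. It assumes current D and std.utf.count; 
copyString is a hypothetical helper, not a library function:

```d
import std.stdio : writeln;
import std.utf : count;

// Hypothetical helper: copy a UTF-8 string into a freshly allocated buffer.
char[] copyString(const(char)[] src)
{
    // The buffer must hold src.length code units (bytes); sizing it by
    // the number of code points would make it too small for non-ASCII text.
    auto buf = new char[src.length];
    buf[] = src[];
    return buf;
}

void main()
{
    string s = "na\u00EFve \u20AC5";  // U+00EF is 2 bytes, U+20AC is 3 bytes
    auto copy = copyString(s);
    writeln(copy);
    writeln(s.length);   // code units, what allocation and copying need
    writeln(count(s));   // code points, obtained by walking the string
}
```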

> It would make things a hell of a lot easier and 
> more efficient than having to convert /potentially/ foreign strings to 
> utf-32 for a simple manipulation operation, then converting them back.
> The only reason I can think of for char arrays being treated as fixed 
> length is for faster indexing, which is hardly useful in most cases since 
> a lot of the time we don't even know if we're dealing with multi-byte 
> characters when handling strings, so we have to convert and traverse the 
> strings anyway.
> Arg.

I was surprised to discover that most indexing work on strings, such as 
searching, is more efficient when you do *not* try to index by code 
points. There are standard library functions in std.utf to index by code 
points, if you do need that.
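
To illustrate both halves of that, here is a hedged sketch using 
std.string.indexOf and std.utf as they exist in current Phobos (the 
2008 library differed in details). The helper names findAndSlice and 
nthCodePoint are invented for the example. Searching yields a code-unit 
offset that can be sliced with directly, and std.utf converts to and 
from code-point indices only when that is really wanted:

```d
import std.stdio : writeln;
import std.string : indexOf;
import std.utf : decode, toUCSindex, toUTFindex;

// Hypothetical helper: return the slice of s starting at the first
// occurrence of sub, or null if sub does not occur.
string findAndSlice(string s, string sub)
{
    auto pos = s.indexOf(sub);   // code-unit offset, or -1 if not found
    return pos < 0 ? null : s[pos .. $];
}

// Hypothetical helper: fetch the n-th code point (0-based) of s.
dchar nthCodePoint(string s, size_t n)
{
    size_t i = toUTFindex(s, n); // code-point index -> code-unit offset
    return decode(s, i);         // decodes at i and advances i past it
}

void main()
{
    string s = "caf\u00E9 au lait";   // U+00E9 occupies code units 3 and 4

    writeln(findAndSlice(s, "au"));   // no conversion to UTF-32 needed
    writeln(s.indexOf("au"));         // offset in code units
    writeln(toUCSindex(s, 6));        // the same position in code points
    writeln(nthCodePoint(s, 3));
}
```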

> I know this may probably be a pain to implement, but it would really give 
> D a huge leg-up if it could properly and automatically handle strings for 
> us. Without requiring a bloated string class.

I believe D has already found the right approach to handling UTF 
strings. All I can say is try it out for a while.