String implementations

Jarrod qwerty at ytre.wq
Wed Jan 16 03:08:29 PST 2008


On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:

> Defining how an ASCII string is best managed by a language is already
> complex (ropes or not? Mutable or not? With shared parts or not? Etc),
> but today ASCII isn't enough and when you add Unicode matters then
> string management becomes an hairy topic, this may be interesting for D
> developers:
> 
> http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
> 
> Something curious: sometimes I need mutable strings, but I cope with the
> immutable ones when necessary. This author says that even stringAt isn't
> much useful! :-)
> 
> Bye,
> bearophile

The article is largely correct.

While we're on the topic, I guess I could rant a little: why does D 
practically *require* the coder to juggle different forms of UTF 
encoding?

D can tell when a character (code point) spans multiple bytes, as 
evidenced by converting a UTF-8 string to UTF-32 (D knows where to split 
the blocks apart), yet we can't index char[] arrays by character. 
Instead, D indexes char arrays by individual bytes, i.e. UTF-8 code 
units, which is almost nonsensical since the D spec asserts that char[] 
arrays are designed specifically for Unicode characters, and that other 
single-byte arrays should instead be declared as byte[].
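
To make the complaint concrete, here is a minimal sketch of the byte-wise 
indexing. It assumes a D2 compiler and the Phobos std.utf module; 
countCodePoints is a made-up helper name, and older compiler or library 
versions may differ in details.

```d
import std.utf : decode, stride;

// Hypothetical helper: count characters (code points) in a UTF-8 string.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar cp; s)   // a dchar loop variable makes foreach decode UTF-8
        ++n;
    return n;
}

void main()
{
    // "naive" with a diaeresis: U+00EF is encoded as two UTF-8 bytes.
    string s = "na\u00EFve";

    // .length and s[i] work on bytes, not on characters.
    assert(s.length == 6);
    assert(s[2] == 0xC3);             // first byte of U+00EF only

    // Yet the language clearly knows where the characters are.
    assert(countCodePoints(s) == 5);

    // stride: number of bytes used by the sequence starting at index 2.
    assert(stride(s, 2) == 2);

    // decode: read the character at index i and advance i past it.
    size_t i = 2;
    dchar ch = decode(s, i);
    assert(ch == '\u00EF' && i == 4);
}
```

So the decoding machinery is there (foreach, stride, decode); it just 
isn't what the index operator uses.
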
So if this is the case, why can't the language itself manage multi-byte 
characters for us? It would make things a hell of a lot easier and more 
efficient than having to convert /potentially/ foreign strings to UTF-32 
for a simple manipulation operation and then convert them back.
The only reason I can think of for indexing char arrays byte by byte is 
faster indexing, which is hardly useful in most cases: a lot of the time 
we don't know whether a string contains multi-byte characters, so we 
have to convert and traverse it anyway.
Arg.
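
For illustration, this is roughly the round trip I mean, sketched under 
the assumption of a D2 compiler with Phobos std.utf. replaceNth is a 
hypothetical helper, not something from the library.

```d
import std.utf : toUTF8, toUTF32;

// Hypothetical helper: replace the n-th character of a UTF-8 string by
// converting to UTF-32, editing there, and converting back.
string replaceNth(string s, size_t n, dchar replacement)
{
    dchar[] wide = toUTF32(s).dup;  // one dchar per code point, mutable copy
    wide[n] = replacement;          // here the index really means "character"
    return toUTF8(wide);            // re-encode the whole string as UTF-8
}

void main()
{
    // Swapping one character for another costs two full conversions.
    assert(replaceNth("na\u00EFve", 2, 'i') == "naive");
    assert(replaceNth("naive", 2, '\u00EF') == "na\u00EFve");
}
```

Two full passes over the string, plus an allocation each way, just to 
change one character.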

I know this would probably be a pain to implement, but it would really 
give D a huge leg up if it could handle strings properly and 
automatically for us, without requiring a bloated string class.


