String implementations

Wed Jan 16 07:27:53 PST 2008

"Jarrod" wrote
> On Tue, 15 Jan 2008 21:23:31 -0500, bearophile wrote:
>
>> Defining how an ASCII string is best managed by a language is already
>> complex (ropes or not? Mutable or not? With shared parts or not? Etc),
>> but today ASCII isn't enough, and when you add Unicode matters
>> string management becomes a hairy topic. This may be interesting for D
>> developers:
>>
>> http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
>>
>> Something curious: sometimes I need mutable strings, but I cope with the
>> immutable ones when necessary. This author says that even stringAt isn't
>> much useful! :-)
>>
>> Bye,
>> bearophile
>
> This article is pretty correct.
>
> While the topic is at hand, I guess I could rant a little:
> why does D practically *require* the coder to use different forms of UTF
> encoding?
>
> D can tell when a code point spans multiple bytes, as evidenced in
> converting a UTF-8 string to UTF-32 (D knows where to split the blocks
> apart), yet we can't index char[] arrays by code points. Instead, D will
> index char arrays by fixed-length bytes, which is almost nonsensical
> since the D spec asserts that char[] arrays are designed specifically for
> Unicode characters, and that other single-byte arrays should instead be
> made as a byte[].
> So if this is the case, then why can't the language itself manage multi-
> byte characters for us? It would make things a hell of a lot easier and
> more efficient than having to convert /potentially/ foreign strings to
> utf-32 for a simple manipulation operation, then converting them back.
> The only reason I can think of for char arrays being treated as fixed
> length is for faster indexing, which is hardly useful in most cases since
> a lot of the time we don't even know if we're dealing with multi-byte
> characters when handling strings, so we have to convert and traverse the
> strings anyway.

An indexed lookup would then be O(n) instead of O(1), because finding the
nth character means scanning the string from the start.  I think the way it
is now is the best of all worlds.
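
To make the O(n) cost concrete, here is a rough sketch in Python rather than
D, with a bytes object standing in for a UTF-8 char[]. The helper names
stride and code_point_at are made up for illustration and are not D library
calls, and the sketch assumes well-formed UTF-8 input.

```python
def stride(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2              # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3              # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4              # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def code_point_at(utf8: bytes, n: int) -> str:
    """Return the n-th code point by walking from the start: O(n)."""
    pos = 0
    for _ in range(n):
        if pos >= len(utf8):
            raise IndexError("code point index out of range")
        pos += stride(utf8[pos])   # skip one whole code point
    if pos >= len(utf8):
        raise IndexError("code point index out of range")
    return utf8[pos:pos + stride(utf8[pos])].decode("utf-8")


# "h", e-acute, "l", "l", "o": 5 code points stored in 6 bytes.
data = "h\u00e9llo".encode("utf-8")

byte_at_2 = data[2]                    # O(1), but lands inside the e-acute
code_point_2 = code_point_at(data, 2)  # O(n) walk, yields "l"
```

Indexing the bytes directly is constant time but can land in the middle of a
multi-byte sequence; indexing by code point has to step over every earlier
character.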

I think the correct method in this case is to convert to UTF-32 first, then
index.  That way you pay the O(n) penalty only once.  Or why not just use
dchar[] instead of char[] to begin with?
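
A small sketch of the convert-once idea, again in Python rather than D: the
decoded str plays the role of a dchar[], and the function name
to_code_points is only illustrative. It assumes the input is valid UTF-8.

```python
def to_code_points(utf8: bytes) -> str:
    """Decode once, O(n). The result can be indexed by code point."""
    return utf8.decode("utf-8")


# 10 code points stored in 12 bytes (two of them take 2 bytes each).
data = "na\u00efve caf\u00e9".encode("utf-8")

points = to_code_points(data)   # pay the O(n) conversion a single time

third = points[2]               # the i-diaeresis, no scan needed
last = points[-1]               # the e-acute
```

The trade-off is memory: UTF-32 spends 4 bytes on every code point, including
plain ASCII ones.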

-Steve