D's confusing strings (was Re: D on hackernews)

Wed Sep 21 12:06:25 PDT 2011

On 21/09/11 5:39 PM, Andrei Alexandrescu wrote:
> On 9/21/11 10:16 AM, Christophe wrote:
>> Timon Gehr, in message (digitalmars.D:144889), wrote:
>>> unicode natively. Yet the 'D strings are strange and confusing' argument
>>> comes up quite often on the web.
>>
>> Well, I think they are. The ptr+length stuff is amazing, but the
>> behavior of strings in phobos is weird.
>>
>> mini-quiz: what should std.range.drop(some_string, 1) do ?
>> hint: what it actually does is not what the documentation of phobos
>> suggests*...
>>
>> Strings are arrays of char, but they appear like a lazy range of dchar to
>> phobos. I could cope with the fact that this is a little unexpected for
>> beginners. But well, that creates a lot of exceptions in phobos, like
>> the fact that you can't even copy a char[] to a char[] with
>> std.algorithm.copy. And I don't mention all the optimizations that are
>> not/cannot be performed for those strings. I'll just remember to use
>> ubyte[] wherever I can...
>
> String handling in D is good modulo the oddities you noticed. What would
> make it perfect would be:
>
> * Add property .rep that returns byte[], ushort[], or uint[] for char[],
> wchar[], dchar[] respectively (with the appropriate qualifier).
>
> * Replace .length with .codeUnits.
>
> * Disallow [n] and [m .. n]
>
> This would upgrade D's strings from good to awesome. Really it would be
> a dream come true. Unfortunately it would also break most D code there
> is out there. I don't see how we can improve the current situation while
> staying backward compatible.
>
>
> Andrei

From what I can see, the problem with D strings is that they are a
'magic' special case for arrays.

char[] should be an array of char, just like int[] is an array of int. 
If you have a T[] arr, then typeof(arr.front) should be T. This is what 
everyone would expect. char[] should essentially be the same as byte[], 
although char[] would be more natural for ASCII strings.
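
To make the mismatch concrete, here is a minimal sketch, assuming a
Phobos release in which narrow strings are still auto-decoded by the
range primitives. The variable names are only for illustration.

```d
import std.range : front, ElementType;

void main()
{
    int[] ints = [1, 2, 3];
    char[] chars = "h\u00E9llo".dup;

    // For an ordinary array, the range element type is the array element type.
    static assert(is(typeof(ints.front) == int));
    static assert(is(ElementType!(int[]) == int));

    // Indexing a char[] yields a char (one UTF-8 code unit)...
    static assert(is(typeof(chars[0]) == char));

    // ...but the range primitives decode and yield a dchar (one code point).
    static assert(is(typeof(chars.front) == dchar));
    static assert(is(ElementType!(char[]) == dchar));
}
```

So for T == char, typeof(arr.front) is dchar rather than T, which is
the special case I am complaining about.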

string should be something different, a separate type. As you say,
disallowing [n] and [m..n] would be good, as they make no sense with a
variable-length encoding (VLE). You could have both .length and
.codeUnits, but .length would then have to count characters rather than
code units, which is O(n). That's not ideal, but since string wouldn't
be an array, it doesn't need to have the same complexity guarantees.
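
A rough sketch of why indexing and slicing are awkward with a
variable-length encoding, and of the O(1) versus O(n) distinction. It
uses today's array-based string, not the hypothetical separate type, and
assumes walkLength still auto-decodes narrow strings. I believe
std.string.representation is roughly the .rep property Andrei describes,
but treat that mapping as my assumption.

```d
import std.exception : assertThrown;
import std.range : walkLength;
import std.string : representation;
import std.utf : UTFException, validate;

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    // .length counts code units and is O(1).
    assert(s.length == 6);

    // walkLength decodes as it goes and counts code points, which is O(n).
    assert(s.walkLength == 5);

    // Slicing works on code units, so it can cut a code point in half.
    // Here the slice ends after the first byte of the two-byte sequence.
    assertThrown!UTFException(validate(s[0 .. 2]));

    // The raw code units are available as an array of ubyte.
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));
    assert(s.representation.length == 6);
}
```

With a separate string type, the first assert would presumably become
.codeUnits, the second would become .length, and the slice would not
compile at all.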

Same for wchar[], dchar[], wstring and dstring.

Of course, making that change would break existing code. Maybe D3? :-)