Making all strings UTF ranges has some risk of WTF

Thu Feb 4 12:16:02 PST 2010

Andrei Alexandrescu wrote:
> Michel Fortin wrote:
>> On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu 
>> <SeeWebsiteForEmail at erdani.org> said:
>>
>>> bearophile wrote:
>>>> Simen kjaeraas:
>>>>> Of the above, I feel (b) is the correct solution, and I understand
>>>>> it has already been implemented in svn.
>>>>
>>>> Yes, I presume he was mostly looking for a justification of ideas
>>>> he has already accepted and even partially implemented :-)
>>>
>>> I am ready to throw away the implementation as soon as a better idea 
>>> comes around. As on other occasions, I made the change to see how 
>>> things feel with the new approach.
>>
>> Has any thought been given to foreach? Currently all these work for 
>> strings:
>>
>>     foreach (c; "abc") { } // typeof(c) is 'char'
>>     foreach (char c; "abc") { }
>>     foreach (wchar c; "abc") { }
>>     foreach (dchar c; "abc") { }
>>
>> I'm concerned about the first case where the element type is implicit. 
>> The implicit element type is (currently) the code unit. If the range 
>> uses code points ('dchar') as the element type, then I think foreach 
>> needs to be changed so that the default element type is 'dchar' too 
>> (in the first line of my example). Having ranges and foreach disagree 
>> on this would be very inconsistent. Of course you should be allowed to 
>> iterate using 'char' and 'wchar' too.
>>
>> I think this would fit nicely. When I was first learning D, I was 
>> surprised to notice that foreach didn't do this and that I had to 
>> explicitly ask for it.
> 
> This is a good point. I'm in favor of changing the language to make the 
> implicit type dchar.
> 
> Andrei
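
For reference, the mismatch Michel describes can be checked directly. This 
is only a sketch, assuming a Phobos in which ranges over strings yield 
dchar; countImplicit and countDecoded are hypothetical helper names, not 
library functions.

```d
import std.range.primitives : ElementType;
import std.traits : Unqual;

// Hypothetical helper: iterate with the implicit element type.
// foreach hands out code units here, so this counts UTF-8 code units.
size_t countImplicit(string s)
{
    size_t n;
    foreach (c; s)
    {
        static assert(is(Unqual!(typeof(c)) == char));
        ++n;
    }
    return n;
}

// Hypothetical helper: ask foreach for dchar explicitly.
// The string is decoded, so this counts code points.
size_t countDecoded(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Assumption: the range interface uses dchar as the element type of string.
static assert(is(ElementType!string == dchar));

void main()
{
    string s = "d\u00EDa"; // U+00ED takes two UTF-8 code units
    assert(countImplicit(s) == 4);
    assert(countDecoded(s) == 3);
}
```

The two counts differ for any non-ASCII text, which is exactly the 
inconsistency between foreach and the range interface.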

We seem to be approaching the point where char[], wchar[] and dchar[] 
are all arrays of dchar, but with different levels of compression.
It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be 
permitted ONLY if char can be implicitly converted to dchar. Otherwise, 
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will 
not necessarily result in a valid Unicode string.
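
A small sketch of that failure mode, assuming std.utf.validate as the 
validity check; appendUnit, appendPoint and isValidUtf are hypothetical 
helpers written only for this illustration.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Hypothetical helper: append one raw code unit, as char[] ~ char does.
char[] appendUnit(const(char)[] s, char c)
{
    char[] r = s.dup;
    r ~= c;
    return r;
}

// Hypothetical helper: append a code point; the dchar is encoded as UTF-8.
char[] appendPoint(const(char)[] s, dchar d)
{
    char[] r = s.dup;
    r ~= d;
    return r;
}

void main()
{
    // 0xE9 is a UTF-8 lead byte; on its own it is not a complete sequence.
    char[] broken = appendUnit("abc", cast(char) 0xE9);
    assert(!isValidUtf(broken));

    // The same value appended as a code point (U+00E9) is encoded properly.
    char[] ok = appendPoint("abc", '\u00E9');
    assert(isValidUtf(ok));
    assert(ok.length == 5); // 3 ASCII units + 2 units for U+00E9
}
```

Appending a char below 0x80 is always safe, which is the case where char 
converts to dchar without changing meaning.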

I suspect that string and wstring should have been the primary types, 
with a .codeunits property returning a ubyte[] or ushort[] reference, 
respectively, to the data. It's too late, of course. The extra value you 
get from having a specific type for 'this is a code unit of a UTF-8 
string' seems to be very minor, compared to just using a ubyte.