Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Feb 4 15:17:26 PST 2010


Don wrote:
> Andrei Alexandrescu wrote:
>> Michel Fortin wrote:
>>> On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu 
>>> <SeeWebsiteForEmail at erdani.org> said:
>>>
>>>> bearophile wrote:
>>>>> Simen kjaeraas:
>>>>>> Of the above, I feel (b) is the correct solution, and I understand
>>>>>> it has already been implemented in svn.
>>>>>
>>>>> Yes, I presume he was mostly looking for a justification of his ideas
>>>>> he has already accepted and even partially implemented :-)
>>>>
>>>> I am ready to throw away the implementation as soon as a better idea 
>>>> comes around. As at other times, I made the change to see how things 
>>>> feel with the new approach.
>>>
>>> Has any thought been given to foreach? Currently all these work for 
>>> strings:
>>>
>>>     foreach (c; "abc") { } // typeof(c) is 'char'
>>>     foreach (char c; "abc") { }
>>>     foreach (wchar c; "abc") { }
>>>     foreach (dchar c; "abc") { }
>>>
>>> I'm concerned about the first case, where the element type is 
>>> implicit. The implicit element type is (currently) the code unit. If 
>>> ranges use the code point type 'dchar' as the element type, then I 
>>> think foreach needs to be changed so that its default element type is 
>>> 'dchar' too (as in the first line of my example). Having ranges and 
>>> foreach disagree on this would be very inconsistent. Of course, you 
>>> should still be allowed to iterate using 'char' and 'wchar' too.
>>>
>>> I think this would fit nicely. I was surprised, when first learning 
>>> D, to notice that foreach didn't do this and that I had to 
>>> explicitly ask for it.
>>
>> This is a good point. I'm in favor of changing the language to make 
>> the implicit type dchar.
>>
>> Andrei
> 
> We seem to be approaching the point where char[], wchar[] and dchar[] 
> are all arrays of dchar, but with different levels of compression.

That is a good way to look at things.
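Don's framing can be made concrete with a quick sketch (not from the original thread; the string contents are illustrative). The same five code points can be stored at three "compression levels"; only the UTF-32 version has one unit per code point:

```d
void main()
{
    string  u8  = "héllo";   // UTF-8:  char[]  — 6 code units ('é' takes two)
    wstring u16 = "héllo"w;  // UTF-16: wchar[] — 5 code units
    dstring u32 = "héllo"d;  // UTF-32: dchar[] — 5 code units, one per code point

    assert(u8.length == 6);
    assert(u16.length == 5);
    assert(u32.length == 5);
}
```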

> It makes me wonder whether the char and wchar types actually make any 
> sense. If char[] is actually a UTF string, then char[] ~ char should be 
> permitted ONLY if char can be implicitly converted to dchar. Otherwise, 
> you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c), which will 
> not necessarily result in a valid Unicode string.

Well, as has been mentioned, sometimes you may assemble a string out of 
individual characters. That case is probably rare enough to warrant a 
cast. Note that today char is already convertible to dchar (with no 
checking).
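For illustration (not part of the original message), here is a minimal sketch of the unchecked concatenation Don is worried about; std.utf.validate is used only to demonstrate that the result is no longer well-formed UTF-8:

```d
import std.utf : validate, UTFException;
import std.exception : assertThrown;

void main()
{
    string s = "abc";
    char c = 0xC3;        // a lone UTF-8 lead byte, not a complete character

    string t = s ~ c;     // compiles today: no validity check is performed

    // t is not well-formed UTF-8, so validation throws.
    assertThrown!UTFException(validate(t));
}
```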

> I suspect that string and wstring should have been the primary types, 
> with a .codepoints property returning a ubyte[] resp. ushort[] 
> reference to the data. It's too late, of course. The extra value you get 
> by having a specific type for 'this is a code point for a UTF-8 string' 
> seems to be very minor, compared to just using a ubyte.

What we can do is have to!(const ubyte[]) work for all UTF-8 strings 
and to!(const ushort[]) work for all UTF-16 strings. That view is correct 
and safe. It's also not difficult to add a .codepoints pseudo-property.
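As a sketch of what such a code-unit view looks like in practice (this assumes std.string.representation, which later Phobos provides to expose a string's UTF-8 code units as immutable(ubyte)[]; it is not the to! conversion proposed above):

```d
import std.string : representation;

void main()
{
    string s = "héllo";

    // A raw view of the underlying code units — no decoding, no copying.
    immutable(ubyte)[] bytes = s.representation;

    assert(bytes.length == 6);                    // code units, not code points
    assert(bytes[1] == 0xC3 && bytes[2] == 0xA9); // the UTF-8 encoding of 'é'
}
```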


Andrei



More information about the Digitalmars-d mailing list