Making all strings UTF ranges has some risk of WTF
Don
nospam at nospam.com
Thu Feb 4 12:16:02 PST 2010
Andrei Alexandrescu wrote:
> Michel Fortin wrote:
>> On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> said:
>>
>>> bearophile wrote:
>>>> Simen kjaeraas:
>>>>> Of the above, I feel (b) is the correct solution, and I understand
>>>>> it has already been implemented in svn.
>>>>
>>>> Yes, I presume he was mostly looking for a justification of his ideas
>>>> he has already accepted and even partially implemented :-)
>>>
>>> I am ready to throw away the implementation as soon as a better idea
>>> comes around. As on other occasions, I made the change to see how
>>> things feel with the new approach.
>>
>> Has any thought been given to foreach? Currently all these work for
>> strings:
>>
>> foreach (c; "abc") { } // typeof(c) is 'char'
>> foreach (char c; "abc") { }
>> foreach (wchar c; "abc") { }
>> foreach (dchar c; "abc") { }
>>
>> I'm concerned about the first case, where the element type is implicit.
>> The implicit element type is (currently) the code unit. If the range
>> uses code points ('dchar') as the element type, then I think foreach
>> needs to be changed so that the default element type is 'dchar' too
>> (in the first line of my example). Having ranges and foreach disagree
>> on this would be very inconsistent. Of course you should be allowed to
>> iterate using 'char' and 'wchar' too.
>>
>> I think this would fit nicely. When I was first learning D, I was
>> surprised to notice that foreach didn't do this, and that I had to
>> explicitly ask for it.
>
> This is a good point. I'm in favor of changing the language to make the
> implicit type dchar.
>
> Andrei

We seem to be approaching the point where char[], wchar[] and dchar[]
are all arrays of dchar, but with different levels of compression.
It makes me wonder if the char, wchar types actually make any sense.
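
As a rough illustration of the "different levels of compression" remark, the sketch below stores the same three code points as a string, a wstring and a dstring. The helper names countCodePoints and countImplicit are hypothetical, made up for this example, and the behaviour shown is that of a current D compiler as I understand it, where foreach with an inferred element type walks code units and foreach with an explicit dchar decodes.

```d
// Hypothetical helper: count code points by asking foreach to decode
// each element to dchar.
size_t countCodePoints(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Hypothetical helper: count iterations when foreach infers the element
// type (code units for string and wstring).
size_t countImplicit(S)(S s)
{
    size_t n = 0;
    foreach (c; s)
        ++n;
    return n;
}

void main()
{
    // 'a' (U+0061), euro sign (U+20AC), musical G clef (U+1D11E)
    string  s8  = "a\u20AC\U0001D11E";
    wstring s16 = "a\u20AC\U0001D11E"w;
    dstring s32 = "a\u20AC\U0001D11E"d;

    // Storage differs: 1+3+4 UTF-8 units, 1+1+2 UTF-16 units, 3 UTF-32 units.
    assert(s8.length == 8);
    assert(s16.length == 4);
    assert(s32.length == 3);

    // The decoded content is the same three code points in every case.
    assert(countCodePoints(s8) == 3);
    assert(countCodePoints(s16) == 3);
    assert(countCodePoints(s32) == 3);

    // With an inferred element type, foreach visits code units.
    assert(countImplicit(s8) == s8.length);
    assert(countImplicit(s16) == s16.length);
}
```

Only the storage length changes between the three types; the decoded sequence is identical, which is the sense in which they behave like one array of dchar at different compression levels.
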

If char[] is actually a UTF string, then char[] ~ char should be
permitted ONLY if char can be implicitly converted to dchar. Otherwise,
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
not necessarily result in a valid Unicode string.
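
A minimal sketch of that failure mode, assuming a current D compiler and Phobos: std.utf.validate throws a UTFException on malformed input, and isValidUtf8 is a hypothetical wrapper written only for this example. Appending a char is a raw code-unit append, so a non-ASCII char leaves a dangling lead byte, while appending the same value as a dchar is encoded properly.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "abc";

    char ascii = 'd';
    char lone  = cast(char) 0xE9; // a single code unit, not a full sequence
    dchar cp   = 0xE9;            // the code point U+00E9

    // An ASCII char is also a complete code point, so this is fine.
    assert(isValidUtf8(s ~ ascii));

    // A non-ASCII char is appended as a raw byte: 0xE9 is a lead byte
    // with no continuation bytes, so the result is not valid UTF-8.
    assert(!isValidUtf8(s ~ lone));

    // Appending a dchar encodes it (two code units here), keeping the
    // string valid.
    string t = s;
    t ~= cp;
    assert(isValidUtf8(t));
    assert(t.length == 5);
}
```
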

I suspect that string and wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] or ushort[]
reference to the data, respectively. It's too late, of course. The extra
value you get by having a specific type for 'this is a code unit of a
UTF-8 string' seems to be very minor, compared to just using a ubyte.