Making all strings UTF ranges has some risk of WTF
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Thu Feb 4 15:17:26 PST 2010
Don wrote:
> Andrei Alexandrescu wrote:
>> Michel Fortin wrote:
>>> On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> said:
>>>
>>>> bearophile wrote:
>>>>> Simen kjaeraas:
>>>>>> Of the above, I feel (b) is the correct solution, and I understand
>>>>>> it has already been implemented in svn.
>>>>>
>>>>> Yes, I presume he was mostly looking for a justification of ideas
>>>>> he has already accepted and even partially implemented :-)
>>>>
>>>> I am ready to throw away the implementation as soon as a better idea
>>>> comes around. As at other times, I made the change to see how
>>>> things feel with the new approach.
>>>
>>> Has any thought been given to foreach? Currently all these work for
>>> strings:
>>>
>>> foreach (c; "abc") { } // typeof(c) is 'char'
>>> foreach (char c; "abc") { }
>>> foreach (wchar c; "abc") { }
>>> foreach (dchar c; "abc") { }
>>>
>>> I'm concerned about the first case, where the element type is
>>> implicit. The implicit element type is (currently) the code unit. If
>>> ranges use code points ('dchar') as the element type, then I think
>>> foreach needs to be changed so that the default element type is
>>> 'dchar' too (in the first line of my example). Having ranges and
>>> foreach disagree on this would be very inconsistent. Of course, you
>>> should still be allowed to iterate using 'char' and 'wchar'.
>>>
>>> I think this would fit nicely. When I was learning D, I was surprised
>>> to notice that foreach didn't do this and that I had to explicitly
>>> ask for it.
>>
>> This is a good point. I'm in favor of changing the language to make
>> the implicit type dchar.
>>
>> Andrei
>
> We seem to be approaching the point where char[], wchar[] and dchar[]
> are all arrays of dchar, but with different levels of compression.
That is a good way to look at things.
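To make the "same code points, different compression" view concrete, here is a small sketch. The helper countAs is a hypothetical name introduced only for this illustration; it counts how many elements foreach visits when the loop variable is given an explicit type. The sample text contains U+00E9, which needs two code units in UTF-8 but one in UTF-16 and UTF-32.

```d
// Hypothetical helper (not a library function): counts the elements
// that foreach visits when the loop variable has the explicit type E.
size_t countAs(E, S)(S s)
{
    size_t n = 0;
    foreach (E c; s)
        ++n;
    return n;
}

void main()
{
    string  s8  = "h\u00E9llo";   // UTF-8 code units
    wstring s16 = "h\u00E9llo"w;  // UTF-16 code units
    dstring s32 = "h\u00E9llo"d;  // UTF-32 code units

    // The storage sizes differ: U+00E9 takes two bytes in UTF-8.
    assert(s8.length == 6);
    assert(s16.length == 5);
    assert(s32.length == 5);

    // Iterating as char walks code units; iterating as dchar decodes.
    assert(countAs!char(s8) == 6);
    assert(countAs!dchar(s8) == 5);
    assert(countAs!dchar(s16) == 5);
    assert(countAs!dchar(s32) == 5);
}
```

Whatever the encoding, asking foreach for dchar yields the same five code points, which is the sense in which the three array types hold the same content.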
> It makes me wonder if the char, wchar types actually make any sense.
> If char[] is actually a UTF string, then char[] ~ char should be
> permitted ONLY if char can be implicitly converted to dchar. Otherwise,
> you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
> not necessarily result in a valid unicode string.
Well, as has been mentioned, sometimes you may assemble a string out of
individual characters. That case is probably rare enough to warrant a
cast. Note that today char is already implicitly convertible to dchar
(there's no checking).
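A rough sketch of the hazard under discussion, assuming Phobos' std.utf (validate, encode, UTFException). The helper names isValidUtf8, appendCodeUnit and appendCodePoint are made up for this example. Appending a lone char that is not ASCII leaves the array in a state that fails validation, while encoding a dchar first keeps it valid.

```d
import std.utf : encode, validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

// Hypothetical helper: appends one raw code unit, with no checking.
char[] appendCodeUnit(const(char)[] s, char c)
{
    char[] r = s.dup;
    r ~= c;
    return r;
}

// Hypothetical helper: encodes a code point as UTF-8, then appends it.
char[] appendCodePoint(const(char)[] s, dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);
    char[] r = s.dup;
    r ~= buf[0 .. n];
    return r;
}

void main()
{
    char unit = cast(char) 0xE9; // a lead byte, not valid UTF-8 by itself
    dchar cp  = unit;            // implicit, unchecked widening to U+00E9

    auto bad  = appendCodeUnit("caf", unit);
    auto good = appendCodePoint("caf", cp);

    assert(!isValidUtf8(bad));
    assert(isValidUtf8(good));
    assert(good == "caf\u00E9");
}
```

Note how the unchecked char-to-dchar conversion silently reinterprets the byte 0xE9 as the code point U+00E9, which is exactly the kind of step that has no checking today.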
> I suspect that string, wstring should have been the primary types and
> had a .codepoints property, which returned a ubyte[] resp. ushort[]
> reference to the data. It's too late, of course. The extra value you get
> by having a specific type for 'this is a code point for a UTF8 string'
> seems to be very minor, compared to just using a ubyte.
What we can do is to have to!(const ubyte[]) work for all UTF8 strings
and to!(const ushort[]) work for all UTF16 strings. That view is correct
and safe. Also, it's not difficult to add a .codepoints pseudo-property.
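As an illustration only, such read-only views can be approximated today with plain casts. The names utf8Units and utf16Units below are hypothetical, and this is not how to!(const ubyte[]) is or would be implemented; it just shows what the views would expose.

```d
// Hypothetical helper: read-only view of a UTF-8 string's code units.
const(ubyte)[] utf8Units(const(char)[] s)
{
    return cast(const(ubyte)[]) s;
}

// Hypothetical helper: read-only view of a UTF-16 string's code units.
const(ushort)[] utf16Units(const(wchar)[] s)
{
    return cast(const(ushort)[]) s;
}

void main()
{
    string  s8  = "h\u00E9";
    wstring s16 = "h\u00E9"w;

    const(ubyte)[]  expected8  = [0x68, 0xC3, 0xA9];
    const(ushort)[] expected16 = [0x68, 0xE9];

    // Called property-style; no data is copied, only reinterpreted.
    assert(s8.utf8Units == expected8);
    assert(s16.utf16Units == expected16);
}
```

Because the views are const, code holding one cannot write arbitrary values back into the string, which is what keeps the reinterpretation safe.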
Andrei