Making all strings UTF ranges has some risk of WTF

Justin Johansson no at spam.com
Thu Feb 4 14:15:00 PST 2010


Andrei Alexandrescu wrote:
>> Has any thought been given to foreach? Currently all these work for 
>> strings:
>>
>>     foreach (c; "abc") { } // typeof(c) is 'char'
>>     foreach (char c; "abc") { }
>>     foreach (wchar c; "abc") { }
>>     foreach (dchar c; "abc") { }
>>
>> I'm concerned about the first case where the element type is implicit. 
>> The implicit element type is (currently) the code units. If the range 
>> uses code points ('dchar') as the element type, then I think foreach 
>> needs to be changed so that the default element type is 'dchar' too 
>> (in the first line of my example). Having ranges and foreach disagree 
>> on this would be very inconsistent. Of course you should be allowed to 
>> iterate using 'char' and 'wchar' too.
>>
>> I think this would fit nicely. I was surprised at first when learning 
>> D and I noticed that foreach didn't do this, that I had to explicitly 
>> ask for it.
> 
> This is a good point. I'm in favor of changing the language to make the 
> implicit type dchar.
> 
> Andrei

I concur.  It's great to see consensus moving in this direction.  For
too long Java has suffered from the error of treating a short (i.e. a
UTF-16 code unit) as just about as good as a full Unicode code point
(i.e. a UTF-32 "code unit").  As a result, the near-enough-is-good-enough
16-bit Java APIs mean that programmers either forget (at best) or become
slack (at worst) when dealing with valid Unicode characters that do not
fit in 16 bits.  Part of this also stems from the culture that if it
ain't ASCII or in a Western character set (BMP), who cares.
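
To make the Java point concrete, here is a minimal sketch.  The class
and method names are made up for illustration; only the String and
Character calls come from the standard library.  It shows that the
16-bit view and the code point view disagree as soon as a string
contains a character outside the BMP.

```java
public class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), outside the BMP
        String s = "a\uD834\uDD1E";
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        // Indexing by char yields half of a surrogate pair, not a character.
        System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true
    }
}
```

Code that loops with charAt() and length() quietly processes the two
halves of the pair as if they were separate characters, which is the
kind of slackness I mean.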

As a matter of taste, I'd prefer to see dchar, being a Unicode code
point, officially acknowledged/ordained as "unichar", though I guess
an alias is always available as a last resort for pedants like myself.

Cheers
Justin Johansson


