string is rarely useful as a function argument

Chad J chadjoan at __spam.is.bad__gmail.com
Sun Jan 1 15:16:49 PST 2012


On 01/01/2012 02:25 PM, Timon Gehr wrote:
> On 01/01/2012 08:01 PM, Chad J wrote:
>> On 01/01/2012 10:39 AM, Timon Gehr wrote:
>>> On 01/01/2012 04:13 PM, Chad J wrote:
>>>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>>>
>>>>>> If you haven't been educated about unicode or how D handles it, you
>>>>>> might write this:
>>>>>>
>>>>>> char[] str;
>>>>>> ... load str ...
>>>>>> for ( int i = 0; i < str.length; i++ )
>>>>>> {
>>>>>>        font.render(str[i]); // Ewww.
>>>>>>        ...
>>>>>> }
>>>>>>
>>>>>
>>>>> That actually looks like a bug that might happen in real world code.
>>>>> What is the signature of font.render?
>>>>
>>>> In my mind it's defined something like this:
>>>>
>>>> class Font
>>>> {
>>>>    ...
>>>>
>>>>       /** Render the given code point at
>>>>           the current (x,y) cursor position. */
>>>>       void render( dchar c )
>>>>       {
>>>>           ...
>>>>       }
>>>> }
>>>>
>>>> (Of course I don't know minute details like where the "cursor position"
>>>> comes from, but I figure it doesn't matter.)
>>>>
>>>> I probably wrote some code like that loop a very long time ago, but I
>>>> probably don't have that code around anymore, or at least not easily
>>>> findable.
>>>
>>> I think the main issue here is that char implicitly converts to dchar:
>>> This is an implicit reinterpret-cast that is nonsensical if the
>>> character is outside the ascii-range.
>>
>> I agree.
>>
>> Perhaps the compiler should insert a check on the 8th bit in cases like
>> these?
>>
>> I suppose it's possible someone could declare a bunch of individual
>> char's and then start manipulating code units that way, and such an 8th
>> bit check could thwart those manipulations, but I would also counter
>> that such low manipulations should be done on ubyte's instead.
>>
>> I don't know how much this would help though.  Seems like too little,
>> too late.
> 
> I think the conversion char -> dchar should just require an explicit
> cast. The runtime check is better left to std.conv.to;
> 

What about valid assignments of ASCII characters to dchar?

For ASCII values this really is a widening operation, so I can see how
it would be permissible.
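To make the distinction concrete, here is a small sketch (the string
contents and variable names are mine, purely for illustration): widening
an ASCII char is lossless, but the same implicit conversion applied to
one code unit of a multi-byte sequence fabricates the wrong code point.

```d
import std.stdio;

void main()
{
    // ASCII: code unit and code point coincide, so the implicit
    // widening char -> dchar is lossless.
    char a = 'A';
    dchar d1 = a;
    assert(d1 == 'A');

    // Non-ASCII: 'é' (U+00E9) is two code units in UTF-8: 0xC3 0xA9.
    string s = "é";
    assert(s.length == 2);

    // This compiles today: it reinterprets the lead unit 0xC3 as the
    // code point U+00C3 ('Ã') -- the nonsensical case Timon describes.
    dchar d2 = s[0];
    assert(d2 == 0x00C3);
    assert(d2 != 'é');

    writeln("ok");
}
```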

>>
>> The bigger problem is that a char is being taken from a char[] and
>> thereby loses its context as (potentially) being part of a larger
>> codepoint.
> 
> If it is part of a larger code point, then it has its highest bit set.
> Any individual char that has its highest bit set does not carry a
> character on its own. If it is not set, then it is a single ASCII
> character.

See above.


I think that assigning from a char[i] to another char[j] is probably
safe.  Similarly for slicing.  These calculations tend to occur, I
suspect, when the text is well-anchored.  I believe your balanced
parentheses example falls into this category:
(repasted for reader convenience)

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x;s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

With these observations in hand, I would consider the safety of
operations to go like this:

char[i] = char[j];           // (Reasonably) Safe
char[i1..i2] = char[j1..j2]; // (Reasonably) Safe
char = char;                 // Safe
dchar = char;                // Safe.  Widening.
char = char[i];              // Not safe.  Should error.
dchar = char[i];             // Not safe.  Should error. (Corollary)
dchar = dchar[i];            // Safe.
char = char[i1..i2];         // Nonsensical; already an error.

