string is rarely useful as a function argument

Timon Gehr timon.gehr at gmx.ch
Sun Jan 1 15:36:36 PST 2012


On 01/02/2012 12:16 AM, Chad J wrote:
> On 01/01/2012 02:25 PM, Timon Gehr wrote:
>> On 01/01/2012 08:01 PM, Chad J wrote:
>>> On 01/01/2012 10:39 AM, Timon Gehr wrote:
>>>> On 01/01/2012 04:13 PM, Chad J wrote:
>>>>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>>>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>>>>
>>>>>>> If you haven't been educated about unicode or how D handles it, you
>>>>>>> might write this:
>>>>>>>
>>>>>>> char[] str;
>>>>>>> ... load str ...
>>>>>>> for ( int i = 0; i < str.length; i++ )
>>>>>>> {
>>>>>>>         font.render(str[i]); // Ewww.
>>>>>>>         ...
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> That actually looks like a bug that might happen in real world code.
>>>>>> What is the signature of font.render?
>>>>>
>>>>> In my mind it's defined something like this:
>>>>>
>>>>> class Font
>>>>> {
>>>>>     ...
>>>>>
>>>>>        /** Render the given code point at
>>>>>            the current (x,y) cursor position. */
>>>>>        void render( dchar c )
>>>>>        {
>>>>>            ...
>>>>>        }
>>>>> }
>>>>>
>>>>> (Of course I don't know minute details like where the "cursor position"
>>>>> comes from, but I figure it doesn't matter.)
>>>>>
>>>>> I probably wrote some code like that loop a very long time ago, but I
>>>>> probably don't have that code around anymore, or at least not easily
>>>>> findable.
>>>>
>>>> I think the main issue here is that char implicitly converts to dchar:
>>>> This is an implicit reinterpret-cast that is nonsensical if the
>>>> character is outside the ASCII range.
>>>
>>> I agree.
>>>
>>> Perhaps the compiler should insert a check on the 8th bit in cases like
>>> these?
>>>
>>> I suppose it's possible someone could declare a bunch of individual
>>> chars and then start manipulating code units that way, and such an 8th
>>> bit check could thwart those manipulations, but I would also counter
>>> that such low-level manipulations should be done on ubytes instead.
>>>
>>> I don't know how much this would help though.  Seems like too little,
>>> too late.
>>
>> I think the conversion char -> dchar should just require an explicit
>> cast. The runtime check is better left to std.conv.to.
>>
>
> What of valid transfers of ASCII characters into dchar?
>
> Normally this is a widening operation, so I can see how it is permissible.
>
>>>
>>> The bigger problem is that a char is being taken from a char[] and
>>> thereby loses its context as (potentially) being part of a larger
>>> codepoint.
>>
>> If it is part of a larger code point, then it has its highest bit set.
>> Any individual char that has its highest bit set does not carry a
>> character on its own. If it is not set, then it is a single ASCII
>> character.
>
> See above.
>
>
> I think that assigning from a char[i] to another char[j] is probably
> safe.  Similarly for slicing.  These calculations tend to occur, I
> suspect, when the text is well-anchored.  I believe your balanced
> parentheses example falls into this category:
> (repasted for reader convenience)
>
> void main(){
>      string s = readln();
>      int nest = 0;
>      foreach(x;s){ // iterates by code unit
>          if(x=='(') nest++;
>          else if(x==')' && --nest<0) goto unbalanced;
>      }
>      if(!nest){
>          writeln("balanced parentheses");
>          return;
>      }
> unbalanced:
>      writeln("unbalanced parentheses");
> }
>
> With these observations in hand, I would consider the safety of
> operations to go like this:
>
> char[i] = char[j];           // (Reasonably) Safe
> char[i1..i2] = char[j1..j2]; // (Reasonably) Safe
> char = char;                 // Safe
> dchar = char;                // Safe.  Widening.
> char = char[i];              // Not safe.  Should error.
> dchar = char[i];             // Not safe.  Should error. (Corollary)
> dchar = dchar[i];            // Safe.
> char = char[i1..i2];         // Nonsensical; already an error.

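That is an interesting point of view. So your proposal would be to
constrain char to the ASCII range unless it is embedded in an array?
That would break the balanced parentheses example: foreach(x;s) copies
each element of s into a standalone char x, which is exactly the
"char = char[i]" case that your list marks as an error.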
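
To make the two cases concrete, here is a minimal sketch. The helper
names isBalanced and widen are made up for illustration only. The first
iterates by code unit and is correct even for non-ASCII input; the
second relies on the implicit char -> dchar conversion and silently
produces the wrong code point.

```d
import std.stdio;

// Hypothetical helper: the parentheses check from above, as a function.
bool isBalanced(string s)
{
    int nest = 0;
    foreach (char x; s)   // x is a copy of s[i]: the "char = char[i]" case
    {
        if (x == '(') nest++;
        else if (x == ')' && --nest < 0) return false;
    }
    return nest == 0;
}

// Hypothetical helper: relies on the implicit char -> dchar conversion.
dchar widen(char c)
{
    return c;   // compiles without a cast
}

void main()
{
    string s = "(\u00E9)";   // U+00E9 is encoded as the code units 0xC3 0xA9
    writeln(isBalanced(s));                       // prints "true"
    writefln("U+%04X", cast(uint) widen(s[1]));   // prints "U+00C3", not "U+00E9"
}
```

Under the proposed rules the foreach in isBalanced would be rejected,
although it is the harmless one of the two.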

