char[] annoyance...

Tue Apr 11 13:02:25 PDT 2006

In article <e1ftsj$1spp$1 at digitaldaemon.com>, xs0 says...
>
>Regan Heath wrote:
>> On Mon, 10 Apr 2006 16:28:00 +0000 (UTC), Kevin Bealer 
>> <Kevin_member at pathlink.com> wrote:
>>> In article <ops7rncsol23k2f5 at nrage.netwin.co.nz>, Regan Heath says...
>>>>
>>>> Take this code:
>>>>
>>>> void main()
>>>> {
>>>>     //..open a file, read line for line, on each line:
>>>>
>>>>     for(int i = 0; i < line.length-2; i++) {
>>>>         if (line[i..i+2] != "||") continue;
>>>>         //..etc..
>>>>     }
>>>> }
>>>>
>>>>
>>>> There is a subtle bug. On all lines with a length of 0 or 1 it will give
>>>> the following error:
>>>>
>>>> Error: ArrayBoundsError line_length.d(6)
>>>>
>>>>
>>>> The problem is of course the statement "i < line.length-2". 
>>>> line.length is
>>>> unsigned, and when you - 2 from an unsigned value.. well lets just say
>>>> that it's bigger than the actual length of the line - 2.
>>>>
>>>>
>>>> Of course there are plently of other ways to code this, perhaps using
>>>> foreach, but that's not the point. The point is that this code  _can_ be
>>>> written and on the surface looks fine. Not even -w (warnings) spots the
>>>> signed/unsigned problem. At the very least can we get a warning for 
>>>> this?
>>>>
>>>> Regan
>>>
>>> I see the "gotcha" here, and whether conversions should be done is an
>>> interesting question.  But I wanted to propose a slightly different 
>>> solution.
>>>
>>>> for(int i = 0; i+2 < line.length; i++) {
>>>
>>> Since you are reference "i+2" in the expression, that is really the 
>>> value what you need to be in the [0..length) range.
>> 
>> You're probably right.. now, if only I could train my brain to think of 
>> it that way round :)
>> 
>>> More generally, always do the arithmetic with the signed variables in 
>>> cases like this if there is a possibility of wraparound.
>> 
>> That's good advice, however you have to realise there is a possibility 
>> of wraparound, something I didn't do (or I wouldn't have had the problem 
>> in the first place) :(
>> 
>>> But if the automatic conversions were specified to do uint->int that 
>>> would also be fine with me.
>> 
>> But wouldn't that introduce a bug here:
>>   int a = int.max-1;
>>   uint b = int.max+1;
>>   assert(a < b);
>> 
>> wouldn't b be promoted to a signed value of <some negative number>?

True, hadn't thought of that.  I guess the correct thing (mathematically) is
what is proposed by xs0 below (however, see comments).

>>> One can also imagine a language with the following inequalities:
>>>
>>> +>  unsigned greater-than
>>> +<  unsigned less than
>>> ->  signed greater than
>>> -<  signed less than
>>> +>= unsigned greater-or-equal
>>> etc
>>>
>>> This would not replace the current >, <, but would essentially just be 
>>> shorthand for existing expressions:
>>>
>>> (A +> B) is the same as (cast(uint)A > cast(uint)B)
>>>
>>> .. and so on, except that uint, ulong, ushort, etc would be chosen as 
>>> needed.
>>>
>>> Is this worth doing?  Maybe - it's not a big deal to me, but the 
>>> signed/unsigned question does represent a common gotcha in certain 
>>> expressions.
>> 
>> It's a good idea, but, same problem as before; you have to realise 
>> length is signed and you have to realise there is a chance for 
>> wraparound. A warning about the signed/unsigned comparrison is the very 
>> least we should do. I would be tempted to even make it an outright error 
>> requiring a cast (or one of these new operators above) to handle.
>> 
>> Regan
>
>Wouldn't it be somewhat the best, if signed/unsigned comparison actually 
>worked correctly in mathematical sense?
>
>int a;
>uint b;
>
>(a<b)  becomes (a<0 || cast(uint)a < b)
>(a==b) becomes (a>=0 && cast(uint)a == b)
>(a>b)  becomes (a>=0 && cast(uint)a > b)

I like this conceptually - it seems like definitely the correct approach for a
language like python, but it has the unfortunate effect of adding a conditional
to what is otherwise a one-instruction subtraction (< is a subtract in asm).

>Unfortunately, that doesn't solve the original problem
>
>if (a < b.length-2)
>
>What would solve it is making .length signed. Even though it should 
>puristically be unsigned, and losing one bit could theoretically be a 
>problem (because you couldn't allocate new byte[more than 2GB], the 
>reality is that a 32-bit app can't allocate more than 2GB of RAM anyway 
>(in Windows at least), and for 64-bit apps, losing that one bit is 
>totally insignificant, unless someone expects a computer with 9 exabytes 
>of ram anytime soon :)
>
>And the only place really needing change (and a trivial one) would be 
>the array resizing code..
>
>
>xs0

The unsigned .length would fix this particular issue.  But I think this might be
one of those things where every solution seems to cause problems in some other
area.  For instance, in C, char is (usually) signed, which is good for small
numbers, but when you use char as an array index, it walks off the front of the
array for anything over 128.  Lots and lots of C programs have bugs from this.
It's actually up to the compiler writer whether char is signed in C...

As a result, when writing crypto and hash function code in C, people usually
have to cast the types to get reliable output.  Such code is often peppered by
casts, to force size and signedness to the right kind.  Sometimes you see code
where about half of these are completely unnecessary, but the author simply
decided that *Let's never have to fix this thing again* and did a cast for every
math operation.

I realize this is all tangential, but I think that there are three or four ways
to handle this and all of them probably have a few "gotchas".  You'd almost have
to write a big program with the signed version of .length to know if you got bit
somehow.

I checked a hunch - it looks like .length was changed to size_t in dmd 0.77.
It's also mentioned in the portability section of the documentation:

"The .length, .size, .sizeof, and .alignof  properties will be of type size_t."

Not sure what it was before .77.

Kevin