Treating the abusive unsigned syndrome

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Nov 27 19:34:50 PST 2008


Andrei Alexandrescu wrote:
> KennyTM~ wrote:
>> KennyTM~ wrote:
>>> Andrei Alexandrescu wrote:
>>>> Don wrote:
>>>>> Andrei Alexandrescu wrote:
>>>>>> Don wrote:
>>>>>>> Andrei Alexandrescu wrote:
>>>>>>>> One fear of mine is the reaction of throwing hands in the air: 
>>>>>>>> "how many integral types are enough???". However, if we're to 
>>>>>>>> judge by the addition of long long and a slew of typedefs to C99 
>>>>>>>> and C++0x, the answer is "plenty". I'd be interested in gauging 
>>>>>>>> how people feel about adding two (bits64, bits32) or even four 
>>>>>>>> (bits64, bits32, bits16, and bits8) types as basic types. They'd 
>>>>>>>> be bitbags with undecided sign ready to be converted to their 
>>>>>>>> counterparts of decided sign.
>>>>>>>
>>>>>>> Here I think we have a fundamental disagreement: what is an 
>>>>>>> 'unsigned int'? There are two disparate ideas:
>>>>>>>
>>>>>>> (A) You think that it is an approximation to a natural number, 
>>>>>>> ie, a 'positive int'.
>>>>>>> (B) I think that it is a 'number with NO sign'; that is, the sign 
>>>>>>> depends on context. It may, for example, be part of a larger 
>>>>>>> number. Thus, I largely agree with the C behaviour -- once you 
>>>>>>> have an unsigned in a calculation, it's up to the programmer to 
>>>>>>> provide an interpretation.
>>>>>>>
>>>>>>> Unfortunately, the two concepts are mashed together in C-family 
>>>>>>> languages. (B) is the concept supported by the language typing 
>>>>>>> rules, but usage of (A) is widespread in practice.
>>>>>>
>>>>>> In fact we are in agreement. C tries to make it usable as both, 
>>>>>> and partially succeeds by having very lax conversions in all 
>>>>>> directions. This leads to the occasional puzzling behaviors. I do 
>>>>>> *want* uint to be an approximation of a natural number, while 
>>>>>> acknowledging that today it isn't much of that.
>>>>>>
>>>>>>> If we were going to introduce a slew of new types, I'd want them 
>>>>>>> to be for 'positive int'/'natural int', 'positive byte', etc.
>>>>>>>
>>>>>>> Natural int can always be implicitly converted to either int or 
>>>>>>> uint, with perfect safety. No other conversions are possible 
>>>>>>> without a cast.
>>>>>>> Non-negative literals and manifest constants are naturals.
>>>>>>>
>>>>>>> The rules are:
>>>>>>> 1. Anything involving unsigned is unsigned, (same as C).
>>>>>>> 2. Else if it contains an integer, it is an integer.
>>>>>>> 3. (Now we know all quantities are natural):
>>>>>>> If it contains a subtraction, it is an integer [Probably allow 
>>>>>>> subtraction of compile-time quantities to remain natural, if the 
>>>>>>> values stay in range; flag an error if an overflow occurs].
>>>>>>> 4. Else it is a natural.
>>>>>>>
>>>>>>>
>>>>>>> The reason I think literals and manifest constants are so 
>>>>>>> important is that they are a significant fraction of the natural 
>>>>>>> numbers in a program.
>>>>>>>
>>>>>>> [Just before posting I've discovered that other people have 
>>>>>>> posted some similar ideas].
>>>>>>
>>>>>> That sounds encouraging. One problem is that your approach leaves 
>>>>>> the unsigned mess as it is, so although natural types are a nice 
>>>>>> addition, they don't bring a complete solution to the table.
>>>>>>
>>>>>>
>>>>>> Andrei
>>>>>
>>>>> Well, it does make unsigned numbers (case (B)) quite obscure and 
>>>>> low-level. They could be renamed with uglier names to make this 
>>>>> clearer.
>>>>> But since in this proposal there are no implicit conversions from 
>>>>> uint to anything, it's hard to do any damage with the unsigned type 
>>>>> which results.
>>>>> Basically, with any use of unsigned, the compiler says "I don't 
>>>>> know if this thing even has a meaningful sign!".
>>>>>
>>>>> Alternatively, we could add rule 0: mixing int and unsigned is 
>>>>> illegal. But it's OK to mix natural with int, or natural with 
>>>>> unsigned.
>>>>> I don't like this as much, since it would make most usage of 
>>>>> unsigned ugly; but maybe that's justified.
>>>>
>>>> I think we're heading towards an impasse. We wouldn't want to make 
>>>> things much harder for systems-level programs that mix arithmetic 
>>>> and bit-level operations.
>>>>
>>>> I'm glad there is interest and that quite a few ideas were brought 
>>>> up. Unfortunately, it looks like all have significant disadvantages.
>>>>
>>>> One compromise solution Walter and I discussed in the past is to 
>>>> only sever one of the dangerous implicit conversions: int -> uint. 
>>>> Other than that, it's much like C (everything involving one unsigned 
>>>> is unsigned, and unsigned -> signed is implicit). Let's see where that 
>>>> takes us.
>>>>
>>>> (a) There are fewer situations when a small, reasonable number 
>>>> implicitly becomes a large, weird number.
>>>>
>>>> (b) An exception to (a) is that u1 - u2 is also uint, and that's for 
>>>> the sake of C compatibility. I'd gladly drop it if I could and leave 
>>>> operations such as u1 - u2 return a signed number. That assumes the 
>>>> least and works with small, usual values.
>>>>
>>>> (c) Unlike C, arithmetic and logical operations always return the 
>>>> tightest type possible, not a 32/64 bit value. For example, byte / 
>>>> int yields byte and so on.
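[The intuition behind "tightest type possible" can be checked in plain C++: the quotient of an 8-bit value by a larger unsigned value can never exceed the 8-bit range, so narrowing the result is provably lossless. A sketch (the function name is illustrative):]

```cpp
#include <cassert>
#include <cstdint>

// ubyte / uint can never exceed 255, which is why the "tightest type"
// rule may type the result as ubyte. (Assumes i != 0.)
uint8_t tight_div(uint8_t b, uint32_t i) {
    uint32_t q = b / i;              // C/C++ compute this as a 32-bit value
    assert(q <= 0xFFu);              // always holds: q <= b <= 255
    return static_cast<uint8_t>(q);  // narrowing is provably lossless here
}
```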
>>>>
>>>
>>> So you mean long * int (e.g. 1234567890123L * 2) will return an int 
>>> instead of a long?!
>>>
>>> The opposite sounds more natural to me.
>>>
>>
>> Em, or do you mean the tightest type that can represent all possible 
>> results? (so long*int == cent?)
> 
> The tightest type possible depends on the operation. In that doctrine, 
> long * int yields a long (given the demise of cent). Walter thinks such 
> rules are too complicated, but I'm a big fan of operation-dependent 
> typing. I see no good reason for requiring that int * long have the same type 
> as int / long. They are different operations with different semantics 
> and corner cases and whatnot, so the resulting static type may as well 
> be different.
> 
> By the way, under the tightest type doctrine, uint & ubyte is typed as 
> ubyte. Interesting that one, huh :o).
> 
> 
> Andrei

I just remembered a problem with simplemindedly going with the tightest 
type. Consider:

uint a = ...;
ubyte b = ...;
auto c = a & b;
c <<= 16;
...

The programmer may reasonably expect that the bitwise operation yields 
an unsigned integer because it involved one. However, the zealous 
compiler cleverly notices the operation really never yields something 
larger than a ubyte, and therefore returns that "tightest" type, thus 
making c a ubyte. Subsequent uses of c will be surprising to the 
programmer who thought c had 32 bits.
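The surprise can be simulated in C++ by forcing the two candidate types by hand: a & b indeed never exceeds 0xFF, but if the result is typed as 8 bits, the subsequent 16-bit shift silently discards everything. A sketch:

```cpp
#include <cstdint>

// What the programmer expects: c has 32 bits, so the shift is meaningful.
uint32_t shift_as_uint(uint32_t a, uint8_t b) {
    uint32_t c = a & b;
    c <<= 16;
    return c;
}

// What the "tightest type" rule would produce: c has 8 bits, and the
// shifted value is truncated back to 8 bits on assignment, leaving 0.
uint32_t shift_as_ubyte(uint32_t a, uint8_t b) {
    uint8_t c = a & b;
    c <<= 16;  // promoted, shifted, then truncated: c becomes 0
    return c;
}
```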

It looks like polysemy is the only solution here: return a polysemous 
value with principal type uint and possible type ubyte. That way, c will 
be typed as uint. But at the same time, continuing the example:

ubyte d = a & b;

will go through without a cast. That's pretty cool!

One question I had is: say polysemy will be at work for integral 
arithmetic. Should we provide means in the language for user-defined 
polysemous functions? Or is it ok to leave it as compiler magic that 
saves redundant casts?
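For what user-defined polysemy might look like, here is a purely hypothetical C++ sketch (the type and function names are mine, not a proposal): a value with principal type uint that also converts to ubyte, checked rather than cast:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a polysemous value: principal type uint32_t,
// possible type uint8_t. All names here are illustrative only.
struct Polysemous {
    uint32_t value;                          // principal type: uint
    operator uint32_t() const { return value; }
    operator uint8_t() const {               // possible type: ubyte
        assert(value <= 0xFF);               // only valid when it fits
        return static_cast<uint8_t>(value);
    }
};

// uint & ubyte provably fits in 8 bits, so both conversions are safe.
Polysemous bitand_poly(uint32_t a, uint8_t b) {
    return Polysemous{a & b};
}
```

Usage mirrors the example above: `uint32_t c = bitand_poly(a, b);` and `uint8_t d = bitand_poly(a, b);` both go through without a cast.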


Andrei



More information about the Digitalmars-d mailing list