[OT] The Usual Arithmetic Confusions
Siarhei Siamashka
siarhei.siamashka at gmail.com
Fri Feb 4 22:18:45 UTC 2022
On Friday, 4 February 2022 at 21:13:10 UTC, Walter Bright wrote:
> The integral promotion rules came about because of how the
> PDP-11 instruction set worked, as C was developed on an -11.
> But this has carried over into modern CPUs. Consider:
>
> ```
> void tests(short* a, short* b, short* c) { *c = *a * *b; }
> 0F B7 07 movzx EAX,word ptr [RDI]
> 66 0F AF 06 imul AX,[RSI]
> 66 89 02 mov [RDX],AX
> C3 ret
>
> void testi(int* a, int* b, int* c) { *c = *a * *b; }
> 8B 07 mov EAX,[RDI]
> 0F AF 06 imul EAX,[RSI]
> 89 02 mov [RDX],EAX
> C3 ret
> ```
> You're paying a 3-byte size penalty for using short arithmetic
> rather than int arithmetic. It's slower, too.
Larger code size surely puts more pressure on the instruction
cache, but the resulting slowdown is most likely barely
measurable on modern processors.
> Generally speaking, int should be used for most calculations,
> short and byte for storage.
>
> (Modern CPUs have long been deliberately optimized and tuned
> for C semantics.)
I generally agree, but this only holds for regular scalar code.
Autovectorized code taking advantage of SIMD instructions looks
a bit different. Consider:
```
void tests(short* a, short* b, short* c, int n) { while (n--) *c++ = *a++ * *b++; }

<...>
  50:	f3 0f 6f 04 07 	movdqu (%rdi,%rax,1),%xmm0
  55:	f3 0f 6f 0c 06 	movdqu (%rsi,%rax,1),%xmm1
  5a:	66 0f d5 c1    	pmullw %xmm1,%xmm0
  5e:	0f 11 04 02    	movups %xmm0,(%rdx,%rax,1)
  62:	48 83 c0 10    	add    $0x10,%rax
  66:	4c 39 c0       	cmp    %r8,%rax
  69:	75 e5          	jne    50 <tests+0x50>
<...>
```
7 instructions, performing 8 multiplications per inner loop
iteration.
```
void testi(int* a, int* b, int* c, int n) { while (n--) *c++ = *a++ * *b++; }

<...>
 188:	f3 0f 6f 04 07 	movdqu (%rdi,%rax,1),%xmm0
 18d:	f3 0f 6f 0c 06 	movdqu (%rsi,%rax,1),%xmm1
 192:	66 0f 38 40 c1 	pmulld %xmm1,%xmm0
 197:	0f 11 04 02    	movups %xmm0,(%rdx,%rax,1)
 19b:	48 83 c0 10    	add    $0x10,%rax
 19f:	4c 39 c0       	cmp    %r8,%rax
 1a2:	75 e4          	jne    188 <testi+0x48>
<...>
```
7 instructions, performing only 4 multiplications per inner loop
iteration.
The code size grows considerably, because there are large
prologue and epilogue parts before and after the inner loop. But
the performance improves dramatically when processing large
arrays. And the 16-bit version is roughly twice as fast as the
32-bit version, because each 128-bit XMM register holds either
8 shorts or 4 ints.
If we want the D language to be SIMD friendly, then discouraging
the use of `short` and `byte` types for local variables isn't the
best idea.