[OT] The Usual Arithmetic Confusions
Siarhei Siamashka
siarhei.siamashka at gmail.com
Fri Feb 4 22:18:45 UTC 2022
On Friday, 4 February 2022 at 21:13:10 UTC, Walter Bright wrote:
> The integral promotion rules came about because of how the
> PDP-11 instruction set worked, as C was developed on an -11.
> But this has carried over into modern CPUs. Consider:
>
> ```
> void tests(short* a, short* b, short* c) { *c = *a * *b; }
> 0F B7 07 movzx EAX,word ptr [RDI]
> 66 0F AF 06 imul AX,[RSI]
> 66 89 02 mov [RDX],AX
> C3 ret
>
> void testi(int* a, int* b, int* c) { *c = *a * *b; }
> 8B 07 mov EAX,[RDI]
> 0F AF 06 imul EAX,[RSI]
> 89 02 mov [RDX],EAX
> C3 ret
> ```
> You're paying a 3-byte size penalty for using short arithmetic
> rather than int arithmetic. It's slower, too.
Larger code size surely puts more pressure on the instruction
cache, but the resulting slowdown is most likely barely
measurable on modern processors.
> Generally speaking, int should be used for most calculations,
> short and byte for storage.
>
> (Modern CPUs have long been deliberately optimized and tuned
> for C semantics.)
I generally agree, but this only holds for regular scalar code.
Autovectorized code taking advantage of SIMD instructions looks
a bit different. Consider:
```
void tests(short* a, short* b, short* c, int n) { while (n--) *c++ = *a++ * *b++; }

<...>
  50:	f3 0f 6f 04 07 	movdqu (%rdi,%rax,1),%xmm0
  55:	f3 0f 6f 0c 06 	movdqu (%rsi,%rax,1),%xmm1
  5a:	66 0f d5 c1    	pmullw %xmm1,%xmm0
  5e:	0f 11 04 02    	movups %xmm0,(%rdx,%rax,1)
  62:	48 83 c0 10    	add    $0x10,%rax
  66:	4c 39 c0       	cmp    %r8,%rax
  69:	75 e5          	jne    50 <tests+0x50>
<...>
```
7 instructions, performing 8 multiplications per inner loop
iteration.
```
void testi(int* a, int* b, int* c, int n) { while (n--) *c++ = *a++ * *b++; }

<...>
 188:	f3 0f 6f 04 07 	movdqu (%rdi,%rax,1),%xmm0
 18d:	f3 0f 6f 0c 06 	movdqu (%rsi,%rax,1),%xmm1
 192:	66 0f 38 40 c1 	pmulld %xmm1,%xmm0
 197:	0f 11 04 02    	movups %xmm0,(%rdx,%rax,1)
 19b:	48 83 c0 10    	add    $0x10,%rax
 19f:	4c 39 c0       	cmp    %r8,%rax
 1a2:	75 e4          	jne    188 <testi+0x48>
<...>
```
7 instructions, performing only 4 multiplications per inner loop
iteration.
The code size grows considerably, because there are large
prologue and epilogue parts before and after the inner loop. But
the performance improves dramatically when processing large
arrays. And the 16-bit version is roughly twice as fast as the
32-bit version, because each 128-bit XMM register holds either
8 shorts or 4 ints.
If we want the D language to be SIMD friendly, then discouraging
the use of `short` and `byte` types for local variables isn't the
best idea.