std.math performance (SSE vs. real)

Tue Jul 1 03:26:57 PDT 2014

On Monday, 30 June 2014 at 16:54:17 UTC, Walter Bright wrote:
> On 6/30/2014 12:20 AM, Don wrote:
>> What I think is highly likely is that it will only have legacy
>> support, with such awful performance that it never makes sense to
>> use them. For example, the speed of 80-bit and 64-bit calculations
>> in x87 used to be identical. But on recent Intel CPUs, the 80-bit
>> operations run at half the speed of the 64 bit operations. They are
>> already partially microcoded.
>>
>> For me, a stronger argument is that you can get *higher* precision
>> using doubles, in many cases. The reason is that FMA gives you an
>> intermediate value with 128 bits of precision; it's available in
>> SIMD but not on x87.
>>
>> So, if we want to use the highest precision supported by the
>> hardware, that does *not* mean we should always use 80 bits.
>>
>> I've experienced this in CTFE, where the calculations are currently
>> done in 80 bits, I've seen cases where the 64-bit runtime results
>> were more accurate, because of those 128 bit FMA temporaries. 80
>> bits are not enough!!
>
> I did not know this. It certainly adds another layer of nuance 
> - as the higher level of precision will only apply as long as 
> one can keep the value in a register.

Yes, it's complicated. The interesting thing is that there are no 
128 bit registers. The temporaries exist only while the FMA 
operation is in progress. You cannot even preserve them between 
consecutive FMA operations.
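
To make that concrete, here is a small sketch. It is written in 
Python rather than D only so that the exact product can be modelled 
with fractions.Fraction. The function names are mine, and 
fused_multiply_add is a model of what an FMA instruction computes 
(the exact a*b + c, rounded once), not a call into real hardware. 
It assumes Python floats are IEEE doubles and that a*b is rounded 
to double, as on x86-64.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Model of FMA: form a*b + c exactly, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float rounds to nearest


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Plain double arithmetic: the product is rounded before the add."""
    return a * b + c


a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# The exact product is 1 - 2**-54, which does not fit in a double.
fused = fused_multiply_add(a, b, c)        # -2**-54
separate = separate_multiply_add(a, b, c)  # 0.0

# Once the product has been stored, its extra bits are gone. Feeding
# the stored product back in recovers exactly the rounding error.
stored_product = a * b
lost_bits = fused_multiply_add(a, b, -stored_product)  # -2**-54

print(fused, separate, lost_bits)
```

The last line of the computation is the point: the wide product 
exists only inside the fused operation, so the stored product and 
the fused product disagree by the rounding error.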

An important consequence is that allowing intermediate 
calculations to be performed at higher precision than the 
operands is crucial, and this applies outside of x86 as well. 
This is something we've got right.

But it's not possible to say that "the intermediate calculations 
are done at the precision of 'real'". These are the semantics 
that I think we currently have wrong. Our model is too simplistic.

On modern x86, calculations on float operands may have their 
intermediate results computed at only 32 bits (if using straight 
SSE), 80 bits (if using x87), or 64 bits (if using float FMA). 
And for double operands, they may be 64 bits, 80 bits, or 128 
bits. Yet, in the FMA case, non-FMA operations will be performed 
at lower precision. It's entirely possible for all three 
intermediate precisions to be active at the same time!
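
As a rough illustration, the sketch below evaluates a*b + c on 
float operands under three idealized models. It is Python, not D, 
and everything in it is an assumption of mine: round_to_bits 
ignores exponent range and subnormals, the x87 model rounds every 
intermediate to a 64-bit significand before the final store to 
float, and the FMA model keeps the intermediate exact and rounds 
once. The inputs are hand-picked so that the models disagree.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant bits, ties to even.

    Idealized: exponent range and subnormals are ignored.
    """
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    m = abs(x)
    # Find e such that 2**e <= m < 2**(e + 1).
    e = m.numerator.bit_length() - m.denominator.bit_length()
    if Fraction(2) ** e > m:
        e -= 1
    scale = Fraction(2) ** (bits - 1 - e)
    n = round(m * scale)  # round() on a Fraction is ties-to-even
    return sign * Fraction(n) / scale


def sse_float(a: float, b: float, c: float) -> float:
    """Straight SSE: every operation rounds to 24 bits."""
    p = round_to_bits(Fraction(a) * Fraction(b), 24)
    return float(round_to_bits(p + Fraction(c), 24))


def x87_float(a: float, b: float, c: float) -> float:
    """x87: intermediates rounded to 64 bits, result stored as float."""
    p = round_to_bits(Fraction(a) * Fraction(b), 64)
    s = round_to_bits(p + Fraction(c), 64)
    return float(round_to_bits(s, 24))


def fma_float(a: float, b: float, c: float) -> float:
    """Float FMA: exact a*b + c, rounded once to 24 bits."""
    return float(round_to_bits(Fraction(a) * Fraction(b) + Fraction(c), 24))


# All inputs below are exactly representable as 32-bit floats.
cancel = (1.0 + 2.0**-13, 1.0 - 2.0**-13, -1.0)
tie = ((2**23 - 4095) * 2.0**-35, (2**23 + 4097) * 2.0**-35, 1.0)

for name, args in (("cancel", cancel), ("tie", tie)):
    print(name, sse_float(*args), x87_float(*args), fma_float(*args))
```

In the first case SSE loses the low bits of the product and returns 
zero, while x87 and FMA both return -2**-26. In the second case the 
exact sum lies just above a rounding tie: x87 rounds twice and 
agrees with SSE, and only the fused model rounds up. So no single 
"intermediate precision" describes all three.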

I'm not sure that we need to change anything WRT code generation. 
But I think our style recommendations aren't quite right. And we 
have at least one missing primitive operation (discard all excess 
precision).
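
To show what I mean by that primitive, here is a minimal sketch, 
again in Python rather than D. The name to_float32 is hypothetical 
and is not a proposal for an actual API. It simply rounds a value 
to the nearest 32-bit float, which is one way to model "discard 
all excess precision" for float operands when the intermediates 
are held in double.

```python
import struct


def to_float32(x: float) -> float:
    """Hypothetical 'discard excess precision' step for float.

    Rounds x to the nearest 32-bit float and returns it as a Python
    float (a double that now holds a float32-representable value).
    """
    return struct.unpack("f", struct.pack("f", x))[0]


a = to_float32(1.0 + 2.0**-13)
b = to_float32(1.0 - 2.0**-13)
c = -1.0

# Intermediates kept at higher (double) precision, rounded at the end.
with_excess = to_float32(a * b + c)          # -2**-26

# Excess precision discarded after every operation.
strict = to_float32(to_float32(a * b) + c)   # 0.0

print(with_excess, strict)
```

Without such a step there is no portable way to ask for the second 
result; with it, code that depends on strict per-operation rounding 
can say so explicitly.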