std.math performance (SSE vs. real)

Tue Jul 1 10:00:30 PDT 2014

On 7/1/2014 3:26 AM, Don wrote:
> Yes, it's complicated. The interesting thing is that there are no 128 bit
> registers. The temporaries exist only while the FMA operation is in progress.
> You cannot even preserve them between consecutive FMA operations.
>
> An important consequence is that allowing intermediate calculations to be
> performed at higher precision than the operands is crucial, and this applies
> outside of x86. This is something we've got right.
>
> But it's not possible to say that "the intermediate calculations are done at the
> precision of 'real'". This is the semantics which I think we currently have
> wrong. Our model is too simplistic.
>
> On modern x86, calculations on float operands may have intermediate calculations
> done at only 32 bits (if using straight SSE), 80 bits (if using x87), or 64 bits
> (if using float FMA). And for double operands, they may be 64 bits, 80 bits, or
> 128 bits.
> Yet, in the FMA case, non-FMA operations will be performed at lower precision.
> It's entirely possible for all three intermediate precisions to be active at the
> same time!
>
> I'm not sure that we need to change anything WRT code generation. But I think
> our style recommendations aren't quite right. And we have at least one missing
> primitive operation (discard all excess precision).

What do you recommend?
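
For readers following along, here is a minimal sketch of the effect being discussed: the same float expression a*b + c gives different answers depending on whether the product is rounded to 32 bits or kept at 64 bits before the add. It is written in C rather than D, and is not code from this thread. The function names and the test values are hypothetical, chosen only for illustration. Storing through a volatile float is used as one common C idiom to force rounding to float; that is an assumption of the sketch, not the "discard all excess precision" primitive Don has in mind.

```c
/* Intermediate rounded to float: the product is rounded to 32 bits
   before the add. The volatile store forces that rounding and keeps
   the compiler from fusing the multiply and add. */
static float product_rounded_to_float(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Intermediate kept in double: a float*float product has at most
   48 significant bits, so it is exact in a 53-bit double. Only the
   final conversion back to float rounds. */
static float product_kept_in_double(float a, float b, float c)
{
    double wide = (double)a * (double)b + (double)c;
    return (float)wide;
}
```

With a = b = 0x1.001p0f (1 + 2^-12) and c = -0x1.002p0f (-(1 + 2^-11)), the exact product is 1 + 2^-11 + 2^-24. That is a tie in float and rounds to 1 + 2^-11, so the first function returns 0, while the second returns 2^-24.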