std.math performance (SSE vs. real)

Don via Digitalmars-d digitalmars-d at puremagic.com
Wed Jul 2 01:53:03 PDT 2014


On Tuesday, 1 July 2014 at 17:00:30 UTC, Walter Bright wrote:
> On 7/1/2014 3:26 AM, Don wrote:
>> Yes, it's complicated. The interesting thing is that there are 
>> no 128 bit
>> registers. The temporaries exist only while the FMA operation 
>> is in progress.
>> You cannot even preserve them between consecutive FMA 
>> operations.
>>
>> An important consequence is that allowing intermediate 
>> calculations to be
>> performed at higher precision than the operands, is crucial, 
>> and applies outside
>> of x86. This is something we've got right.
>>
>> But it's not possible to say that "the intermediate 
>> calculations are done at the
>> precision of 'real'". This is the semantics which I think we 
>> currently have
>> wrong. Our model is too simplistic.
>>
>> On modern x86, calculations on float operands may have 
>> intermediate calculations
>> done at only 32 bits (if using straight SSE), 80 bits (if 
>> using x87), or 64 bits
>> (if using float FMA). And for double operands, they may be 64 
>> bits, 80 bits, or
>> 128 bits.
>> Yet, in the FMA case, non-FMA operations will be performed at 
>> lower precision.
>> It's entirely possible for all three intermediate precisions 
>> to be active at the
>> same time!
>>
>> I'm not sure that we need to change anything WRT code 
>> generation. But I think
>> our style recommendations aren't quite right. And we have at 
>> least one missing
>> primitive operation (discard all excess precision).
>
> What do you recommend?

It needs some thought. But some things are clear.

Definitely, discarding excess precision is a crucial operation. C 
and C++ tried to do it implicitly with "sequence points", but 
that kills so many optimisation possibilities that compilers 
don't respect it. I think it's actually quite similar to write 
barriers in multithreaded programming. C got it wrong, but we're 
currently in an even worse situation, because in D the discarding 
doesn't necessarily happen at all.

We need a builtin operation -- and not in std.math: this is as 
crucial as addition, and it's purely a signal to the optimiser. 
It's very similar to a casting operation. I wonder if we could do 
it as an attribute?  .exact_float, .restrict_float, .force_float, 
.spill_float or something similar?

With D's current floating-point semantics, it's actually 
impossible to write correct floating-point code. Everything that 
works right now is technically only working by accident.

But if we get this right, we can have very nice semantics for 
when things like FMA are allowed to happen -- essentially the 
optimiser would have free rein between these explicit 
discard_excess_precision sequence points.



After that, I'm a bit less sure. It does seem to me that we're 
trying to make 'real' do double duty: it means both "x87 80-bit 
floating-point number" and also something like a storage class 
specific to double: "compiler, don't discard excess precision". 
Both are useful concepts, but they aren't identical. The two 
coincided on 32-bit x86, but they differ on x86-64. I think we 
need to distinguish them.

Ideally, I think we'd have a __real80 type. On 32-bit x86 this 
would be the same as 'real', while on x86-64 __real80 would be 
available but 'real' would probably alias to double. But I'm a 
lot less certain about this.

