std.math performance (SSE vs. real)

Iain Buclaw via Digitalmars-d digitalmars-d at puremagic.com
Wed Jul 2 03:21:09 PDT 2014


On 2 July 2014 09:53, Don via Digitalmars-d <digitalmars-d at puremagic.com> wrote:
> On Tuesday, 1 July 2014 at 17:00:30 UTC, Walter Bright wrote:
>>
>> On 7/1/2014 3:26 AM, Don wrote:
>>>
>>> Yes, it's complicated. The interesting thing is that there are no
>>> 128-bit registers. The temporaries exist only while the FMA operation
>>> is in progress. You cannot even preserve them between consecutive FMA
>>> operations.
>>>
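
That property is also what makes the classic error-extraction trick
work: the full-width product exists only inside the operation, yet a
single FMA lets you recover the rounding error of a multiply. A
minimal sketch -- assuming a GDC build where gcc.builtins exposes
__builtin_fma and the target actually fuses it; std.math.fma is
currently a plain x*y + z and would not show this:

---
import std.stdio;
import gcc.builtins : __builtin_fma;  // GDC-specific, assumed available

void main()
{
    // Two doubles whose exact product needs more than 53 bits.
    double a = 1.0 + 0x1p-30;
    double b = 1.0 + 0x1p-29;

    double p = a * b;                    // product rounded to double
    double e = __builtin_fma(a, b, -p);  // exact residual: a*b - p

    // e comes out as 0x1p-59: recoverable only because the FMA holds
    // the 106-bit product internally until its single rounding.
    writefln("p = %a  e = %a", p, e);
}
---
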
>>> An important consequence is that allowing intermediate calculations to
>>> be performed at higher precision than the operands is crucial, and
>>> applies outside of x86. This is something we've got right.
>>>
>>> But it's not possible to say that "the intermediate calculations are
>>> done at the precision of 'real'". This is the semantics which I think
>>> we currently have wrong. Our model is too simplistic.
>>>
>>> On modern x86, calculations on float operands may have intermediate
>>> calculations done at only 32 bits (if using straight SSE), 80 bits (if
>>> using x87), or 64 bits (if using float FMA). And for double operands,
>>> they may be 64 bits, 80 bits, or 128 bits. Yet, in the FMA case,
>>> non-FMA operations will be performed at lower precision. It's entirely
>>> possible for all three intermediate precisions to be active at the
>>> same time!
>>>
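
The 80-bit vs. 64-bit case is easy to demonstrate with a sum that
straddles a rounding boundary. A small sketch -- whether the two
lines below actually differ depends on target and flags (and on
whether the compiler constant-folds at some third precision), which
is rather the point:

---
import std.stdio;

void main()
{
    // The exact sum lies just above the midpoint between two
    // adjacent doubles, so one rounding and two roundings disagree.
    double a = 1.0;
    double b = 0x1p-53 + 0x1p-105;

    double once  = a + b;                    // rounded straight to 53 bits
    real   wide  = cast(real)a + cast(real)b;
    double twice = cast(double)wide;         // 64-bit first, then 53

    writefln("%a", once);   // 0x1.0000000000001p+0 with SSE doubles
    writefln("%a", twice);  // 0x1p+0 when 'real' is x87 80-bit
}
---
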
>>> I'm not sure that we need to change anything WRT code generation. But
>>> I think our style recommendations aren't quite right. And we have at
>>> least one missing primitive operation (discard all excess precision).
>>
>>
>> What do you recommend?
>
>
> It needs some thought. But some things are clear.
>
> Definitely, discarding excess precision is a crucial operation. C and C++
> tried to do it implicitly with "sequence points", but that kills
> optimisation possibilities so much that compilers don't respect it. I think
> it's actually quite similar to write barriers in multithreaded programming.
> C got it wrong, but we're currently in an even worse situation because it
> doesn't necessarily happen at all.
>
> We need a builtin operation -- and not in std.math; this is as crucial as
> addition, and it's purely a signal to the optimiser. It's very similar to a
> casting operation. I wonder if we can do it as an attribute?  .exact_float,
> .restrict_float, .force_float, .spill_float or something similar?
>
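
Until such a primitive exists, about the best available imitation is
an opaque identity function -- the name below is hypothetical, and a
sufficiently aggressive optimiser is free to defeat it:

---
// Hypothetical stand-in for the missing primitive.  If the call is
// not inlined, x has to travel through the double ABI (a 64-bit
// stack slot or XMM register), so any x87 excess precision is
// rounded away at the call boundary.  Nothing guarantees this today.
double forceDouble(double x)
{
    return x;
}

// usage: round the sum to exactly 53 bits before comparing
// if (forceDouble(a + b) == c) { ... }
---
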
> With D's current floating-point semantics, it's actually impossible to write
> correct floating-point code. Everything that works right now is technically
> only working by accident.
>
> But if we get this right, we can have very nice semantics for when things
> like FMA are allowed to happen -- essentially the optimiser would have free
> rein between these explicit discard_excess_precision sequence points.
>

Fixing this is the goal, I assume. :)

---
import std.stdio;

void test(double x, double y)
{
  double y2 = x + 1.0;             // recompute the same sum at runtime
  if (y != y2) writeln("error");   // Prints 'error' under -O2
}

void main()
{
  immutable double x = .012;  // Removing 'immutable' makes it work.
  double y = x + 1.0;

  test(x, y);
}
---
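
My reading of what happens there: with 'immutable', the compiler
knows the initialiser, so 'x + 1.0' in main() gets folded at a higher
intermediate precision, while test() recomputes the same sum at
runtime in plain 64-bit arithmetic, and the two roundings of the
result disagree in the last bit. Drop the 'immutable' and both sides
are computed the same way. (That's inference from the symptom, not a
trace of the generated code.)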


>
> After that, I'm a bit less sure. It does seem to me that we're trying to
> make 'real' do double-duty as meaning both "x87 80-bit floating-point
> number" and also as something like a storage class that is specific to
> double: "compiler, don't discard excess precision". Which are both useful
> concepts, but aren't identical. The two concepts did coincide on x86 32-bit,
> but they're different on x86-64. I think we need to distinguish the two.
>
> Ideally, I think we'd have a __real80 type. On x86 32-bit this would be the
> same as 'real', while on x86-64 __real80 would be available but probably
> 'real' would alias to double. But I'm a lot less certain about this.

There are flags for that in gdc:

-mlong-double-64
-mlong-double-80
-mlong-double-128
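
A quick way to check what each flag gives you -- the numbers in the
comments are what I'd expect on x86-64; other targets differ:

---
import std.stdio;

void main()
{
    // -mlong-double-64  : real.mant_dig == 53  (real is plain double)
    // -mlong-double-80  : real.mant_dig == 64  (x87 extended)
    // -mlong-double-128 : real.mant_dig == 113 (IEEE quad, software)
    writefln("real: %s significand bits, %s bytes of storage",
             real.mant_dig, real.sizeof);
}
---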

