std.math performance (SSE vs. real)

Sat Jun 28 04:30:45 PDT 2014

On 27 June 2014 10:37, hane via Digitalmars-d
<digitalmars-d at puremagic.com> wrote:
> On Friday, 27 June 2014 at 06:48:44 UTC, Iain Buclaw via Digitalmars-d
> wrote:
>>
>> On 27 June 2014 07:14, Iain Buclaw <ibuclaw at gdcproject.org> wrote:
>>>
>>> On 27 June 2014 02:31, David Nadlinger via Digitalmars-d
>>> <digitalmars-d at puremagic.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> right now, the use of std.math over core.stdc.math can cause a huge
>>>> performance problem in typical floating point graphics code. An instance
>>>> of this has recently been discussed here in the "Perlin noise benchmark
>>>> speed" thread [1], where even LDC, which already beat DMD by a factor of
>>>> two, generated code more than twice as slow as that by Clang and GCC.
>>>> Here, the use of floor() causes trouble. [2]
>>>>
>>>> Besides the somewhat slow pure D implementations in std.math, the
>>>> biggest problem is the fact that std.math almost exclusively uses reals
>>>> in its API. When working with single- or double-precision floating point
>>>> numbers, this is not only more data to shuffle around than necessary, but
>>>> on x86_64 requires the caller to transfer the arguments from the SSE
>>>> registers onto the x87 stack and then convert the result back again.
>>>> Needless to say, this is a serious performance hazard. In fact, this
>>>> accounts for a 1.9x slowdown in the above benchmark with LDC.
>>>>
>>>> Because of this, I propose to add float and double overloads (at the
>>>> very least the double ones) for all of the commonly used functions in
>>>> std.math. This is unlikely to break much code, but:
>>>>  a) Somebody could rely on the fact that the calls effectively widen the
>>>> calculation to 80 bits on x86 when using type deduction.
>>>>  b) Additional overloads make e.g. "&floor" ambiguous without context,
>>>> of course.
>>>>
>>>> What do you think?
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>
>>> This is the reason why floor is slow: it has an array copy operation.
>>>
>>> ---
>>>   auto vu = *cast(ushort[real.sizeof/2]*)(&x);
>>> ---
>>>
>>> I didn't like it at the time I wrote it, but at least it prevented the
>>> compiler (gdc) from removing all bit operations that followed.
>>>
>>> If there is an alternative to the above, then I'd imagine that would
>>> speed up floor tenfold.
>>>
>>
>> Can you test with this?
>>
>> https://github.com/D-Programming-Language/phobos/pull/2274
>>
>> float and double implementations of floor/ceil are trivial and I can add
>> them later.
>
>
> Nice! I tested with the Perlin noise benchmark, and it got faster (in my
> environment, 1.030s -> 0.848s).
> But floor still consumes almost half of the execution time.

I've made some further improvements in that PR. I'd imagine you'd see
a little more juice squeezed out.
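
For reference, the float and double overloads proposed at the top of the
thread could look roughly like the sketch below. This is only an
illustration, not what the PR does: the name fastFloor is made up, and
forwarding to the C library's floor/floorf is an assumption chosen to keep
the example short. It also touches on points a) and b): type deduction now
yields float instead of real, and taking the address needs a target type to
select an overload.

```d
import core.stdc.math : cFloor = floor, cFloorf = floorf;

// Hypothetical name, not part of Phobos: an overload set that keeps float
// and double arguments in their own type instead of widening them to real.
double fastFloor(double x) { return cFloor(x); }
float fastFloor(float x) { return cFloorf(x); }

void main()
{
    float f = -1.5f;
    auto r = fastFloor(f);                  // picks the float overload
    static assert(is(typeof(r) == float));  // deduced as float, not real
    assert(r == -2.0f);
    assert(fastFloor(2.7) == 2.0);          // double literal, double overload

    // With several overloads, spelling out the pointer type selects one.
    double function(double) fp = &fastFloor;
    assert(fp(-0.5) == -1.0);
}
```

Code that relied on the implicit widening to 80 bits would have to cast the
argument to real explicitly, and would need a real overload alongside these
two.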