std.math performance (SSE vs. real)
hane via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 27 02:37:53 PDT 2014
On Friday, 27 June 2014 at 06:48:44 UTC, Iain Buclaw via
Digitalmars-d wrote:
> On 27 June 2014 07:14, Iain Buclaw <ibuclaw at gdcproject.org>
> wrote:
>> On 27 June 2014 02:31, David Nadlinger via Digitalmars-d
>> <digitalmars-d at puremagic.com> wrote:
>>> Hi all,
>>>
>>> right now, the use of std.math over core.stdc.math can cause a huge
>>> performance problem in typical floating point graphics code. An
>>> instance of this has recently been discussed here in the "Perlin noise
>>> benchmark speed" thread [1], where even LDC, which already beat DMD by
>>> a factor of two, generated code more than twice as slow as that by
>>> Clang and GCC. Here, the use of floor() causes trouble. [2]
>>>
>>> Besides the somewhat slow pure D implementations in std.math, the
>>> biggest problem is the fact that std.math almost exclusively uses
>>> reals in its API. When working with single- or double-precision
>>> floating point numbers, this is not only more data to shuffle around
>>> than necessary, but on x86_64 requires the caller to transfer the
>>> arguments from the SSE registers onto the x87 stack and then convert
>>> the result back again. Needless to say, this is a serious performance
>>> hazard. In fact, this accounts for a 1.9x slowdown in the above
>>> benchmark with LDC.
>>>
>>> Because of this, I propose to add float and double overloads (at the
>>> very least the double ones) for all of the commonly used functions in
>>> std.math. This is unlikely to break much code, but:
>>> a) Somebody could rely on the fact that the calls effectively widen
>>> the calculation to 80 bits on x86 when using type deduction.
>>> b) Additional overloads make e.g. "&floor" ambiguous without context,
>>> of course.
>>>
>>> What do you think?
>>>
>>> Cheers,
>>> David
>>>
>>
>> This is the reason why floor is slow: it has an array copy operation.
>>
>> ---
>> auto vu = *cast(ushort[real.sizeof/2]*)(&x);
>> ---
>>
>> I didn't like it at the time I wrote it, but at least it prevented the
>> compiler (gdc) from removing all bit operations that followed.
>>
>> If there is an alternative to the above, then I'd imagine that would
>> speed up floor tenfold.
>>
>
> Can you test with this?
>
> https://github.com/D-Programming-Language/phobos/pull/2274
>
> Float and Double implementations of floor/ceil are trivial and
> I can add them later.
Nice! I tested with the Perlin noise benchmark, and it got faster
(in my environment, 1.030s -> 0.848s).
But floor still consumes almost half of the execution time.
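
For reference, here is a minimal sketch of what a double-only floor
could look like if it stays on the 64-bit value and never widens to
real. This is an illustration only, not the code in the pull request
above: the name floorDouble is hypothetical, and it assumes the usual
IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 fraction
bits).

```d
// Hypothetical helper, NOT part of std.math: floor for double done with
// integer bit operations, assuming IEEE 754 binary64 layout.
double floorDouble(double x) @trusted pure nothrow
{
    // Reinterpret the double as a 64-bit integer (a scalar, not an array).
    ulong bits = *cast(ulong*)&x;
    immutable bool negative = (bits >> 63) != 0;
    // Unbiased exponent: the value is 1.fraction * 2^e for normal numbers.
    immutable int e = cast(int)((bits >> 52) & 0x7FF) - 1023;

    if (e < 0)
    {
        // |x| < 1 (including subnormals): result is +-0, 0 or -1.
        if (x == 0.0)
            return x; // keeps the sign of zero
        return negative ? -1.0 : 0.0;
    }
    if (e >= 52)
        return x; // no fraction bits left, or infinity/NaN

    // Mask of the bits that encode the fractional part.
    immutable ulong fracMask = (1UL << (52 - e)) - 1;
    if ((bits & fracMask) == 0)
        return x; // already an integer

    // Negative values round away from zero: bump the magnitude past the
    // next integer before truncating. A carry into the exponent is fine.
    if (negative)
        bits += fracMask;
    bits &= ~fracMask; // drop the fraction
    return *cast(double*)&bits;
}

unittest
{
    assert(floorDouble(2.5) == 2.0);
    assert(floorDouble(-2.5) == -3.0);
    assert(floorDouble(0.75) == 0.0);
    assert(floorDouble(-0.25) == -1.0);
    assert(floorDouble(7.0) == 7.0);
    assert(floorDouble(-1.5) == -2.0);
}
```

Whether something like this actually beats calling core.stdc.math.floor
would have to be measured; I have not benchmarked it.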