SIMD benchmark

Tue Jan 17 02:04:11 PST 2012

On Tue, 17 Jan 2012 09:42:12 +0100, Don Clugston <dac at nospam.com> wrote:

> On 16/01/12 17:51, Martin Nowak wrote:
>> On Mon, 16 Jan 2012 17:17:44 +0100, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> wrote:
>>
>>> On 1/15/12 12:56 AM, Walter Bright wrote:
>>>> I get a 2 to 2.5 speedup with the vector instructions on 64 bit Linux.
>>>> Anyhow, it's good enough now to play around with. Consider it alpha
>>>> quality. Expect bugs - but make bug reports, as there's a serious lack
>>>> of source code to test it with.
>>>> -----------------------
>>>> import core.simd;
>>>>
>>>> void test1a(float[4] a) { }
>>>>
>>>> void test1()
>>>> {
>>>> float[4] a = 1.2;
>>>> a[] = a[] * 3 + 7;
>>>> test1a(a);
>>>> }
>>>>
>>>> void test2a(float4 a) { }
>>>>
>>>> void test2()
>>>> {
>>>> float4 a = 1.2;
>>>> a = a * 3 + 7;
>>>> test2a(a);
>>>> }
>>>
>>> These two functions should have the same speed. The function that
>>> ought to be slower is:
>>>
>>> void test1()
>>> {
>>> float[5] a = 1.2;
>>> float[] b = a[1 .. $];
>>> b[] = b[] * 3 + 7;
>>> test1a(a);
>>> }
>>>
>>>
>>> Andrei
>>
>> Unfortunately druntime's array ops are a mess and fail
>> to speed up anything below 16 floats.
>> Additionally there is overhead for a function call and
>> they have to check alignment at runtime.
>>
>> martin
>
> Yes. The structural problem in the compiler is that array ops get turned  
> into function calls far too early. It happens in the semantic pass, but  
> it shouldn't happen in the front-end at all -- it should be done in the  
> glue layer, at the beginning of code generation.
>
> Incidentally, this is the reason that CTFE doesn't work with array ops.
>
>
>
Oh, I was literally speaking of the runtime implementation.
It should loop with 4 XMM regs, then continue with 1 XMM reg,
and finish up scalar.
Right now it quantizes on 16 floats and does the remaining
ones scalar, which is really bad for very small arrays.
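
A minimal sketch of that tiering, in plain D without intrinsics. The
name mulAddTiered is hypothetical and is not druntime's entry point;
blocks of 16 and 4 floats stand in for 4 XMM registers and 1 XMM
register, and the runtime alignment check is left out.

```d
// Hypothetical helper, for illustration only (not druntime code).
// Computes dst[i] = src[i] * mul + add in three tiers.
void mulAddTiered(float[] dst, const(float)[] src, float mul, float add)
{
    assert(dst.length == src.length);
    immutable n = src.length;
    size_t i = 0;

    // Tier 1: 16 floats per iteration (stand-in for 4 XMM registers).
    for (; i + 16 <= n; i += 16)
        foreach (j; 0 .. 16)
            dst[i + j] = src[i + j] * mul + add;

    // Tier 2: 4 floats per iteration (stand-in for 1 XMM register).
    for (; i + 4 <= n; i += 4)
        foreach (j; 0 .. 4)
            dst[i + j] = src[i + j] * mul + add;

    // Tier 3: scalar tail, at most 3 elements.
    for (; i < n; ++i)
        dst[i] = src[i] * mul + add;
}
```

With this shape a float[4] takes one pass through tier 2 instead of
falling all the way through to the scalar loop.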

I was planning to rewrite it at some point.
https://gist.github.com/1235470

I think having a runtime template is better than
maintaining this massive extern(C) interface that has
to be kept in sync. That would also open up room
for better CTFE integration.
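
To illustrate the CTFE point, here is a sketch under the assumption
that the array op is an ordinary D template. The name mulAdd and its
signature are hypothetical. Because the body is available as D source
rather than hidden behind an extern(C) declaration, the compiler can
evaluate it at compile time.

```d
// Hypothetical template, for illustration only (not a druntime API).
T[] mulAdd(T)(const(T)[] src, T mul, T add)
{
    auto dst = new T[src.length];
    foreach (i, x; src)
        dst[i] = x * mul + add;
    return dst;
}

// Evaluated by the compiler, since the template body is visible to CTFE.
enum float[] folded = mulAdd([1.0f, 2.0f, 3.0f], 3.0f, 7.0f);
static assert(folded == [10.0f, 13.0f, 16.0f]);
```

The same template is callable at run time, so there is a single
definition and no separate interface to keep in sync.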

martin