Vector operations optimization.

Thu Mar 22 23:57:07 PDT 2012

On 23 March 2012 18:57, Comrad <comrad.karlovich at googlemail.com> wrote:
> On Thursday, 22 March 2012 at 10:43:35 UTC, Trass3r wrote:
>>>
>>> What is the status at the moment? Which compiler and which compiler
>>> flags should I use to achieve maximum performance?
>>
>>
>> In general gdc or ldc. Not sure how good vectorization is though, esp.
>> auto-vectorization.
>> On the other hand, the so-called vector operations like a[] = b[] + c[];
>> are lowered to hand-written SSE assembly even in dmd.
>
>
> I had such a snippet to test:
>
> import std.stdio;
> void main()
> {
>   double[2] a=[1.,0.];
>   double[2] a1=[1.,0.];
>   double[2] a2=[1.,0.];
>   double[2] a3=[0.,0.];
>   foreach(i;0..1000000000)
>     a3[]+=a[]+a1[]*a2[];
>   writeln(a3);
> }
>
> And I compared with the following d code:
>
> import std.stdio;
> void main()
> {
>   double[2] a=[1.,0.];
>   double[2] a1=[1.,0.];
>   double[2] a2=[1.,0.];
>   double[2] a3=[0.,0.];
>   foreach(i;0..1000000000)
>   {
>     a3[0]+=a[0]+a1[0]*a2[0];
>     a3[1]+=a[1]+a1[1]*a2[1];
>   }
>   writeln(a3);
> }
>
> And with the following c code:
>
> #include <stdio.h>
> int main()
> {
>   double a[2]={1.,0.};
>   double a1[2]={1.,0.};
>   double a2[2]={1.,0.};
>   double a3[2]={0.,0.}; /* zero-initialized to match the D versions */
>   unsigned i;
>   for(i=0;i<1000000000;++i)
>   {
>     a3[0]+=a[0]+a1[0]*a2[0];
>     a3[1]+=a[1]+a1[1]*a2[1];
>   }
>   printf("%f %f\n",a3[0],a3[1]);
>   return 0;
> }
>
> The last one I compiled with gcc, the previous two with dmd and ldc. The C
> code with -O2 was the fastest, and as fast as the D code without slicing
> compiled with ldc. The D code with slicing was 3 times slower (ldc
> compiler). I tried to compile with different optimization flags, but that
> didn't help. Maybe I used the wrong ones.
> Can someone comment on this?

The flags you want are -O -inline -release for dmd; ldc spells the
optimization level -O2 or -O3 and also accepts -release.

If you don't have those, then that might explain some of the slowdown
on slicing, since -release drops a ton of runtime checks (asserts,
contracts, and most array bounds checks).

Otherwise, I'm not sure why it's so much slower; the druntime array ops
are written using SIMD instructions where available, so they should be
fast.

--
James Miller