Finding large difference between execution time of C++ and D code for same problem

jerro a at a.com
Wed Feb 13 11:07:37 PST 2013


When you are comparing LDC and GDC, you should use either 
-mcpu=generic for LDC or -march=native for GDC, because their 
default targets are different: by default GDC produces code that 
works on most x86_64 CPUs (if you are on an x86_64 system), while 
LDC targets the host CPU. But this does not explain the 
difference in timings you are seeing here.
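
For example, either of these pairs of invocations should compare 
like with like (using the same flags as in the timings below; the 
first pair targets the host CPU, the second a generic x86_64 CPU):

gdc -O3 -finline-functions -frelease -march=native tmp.d -o tmp
ldc2 -O3 -release -singleobj tmp.d -oftmp

gdc -O3 -finline-functions -frelease tmp.d -o tmp
ldc2 -O3 -release -singleobj -mcpu=generic tmp.d -oftmp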

One reason why the code generated by GDC is slower is that 
squarePlusMag isn't inlined. It seems that its parameter being 
const is somehow preventing the inlining - I have no idea why.
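
The fix is just dropping the qualifier from the parameter 
(assuming it was declared as a plain const parameter):

-     TReal squarePlusMag(const ComplexStruct another)
+     TReal squarePlusMag(ComplexStruct another)

Removing the const and adding -march=native to the GDC flags 
gives me: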

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
   using floats Total time: 8.283 [sec]
   using doubles Total time: 6.827 [sec]
   using reals Total time: 6.795 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
   using floats Total time: 3.348 [sec]
   using doubles Total time: 3.08 [sec]
   using reals Total time: 4.174 [sec]

The difference is smaller, but still pretty large.

I have noticed that there are needless conversions in this code 
that are slowing down both the GDC-generated and the LDC-generated 
code.
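
For example, the literal 2.0 is a double in D, so with 
TReal = float an expression like

     TReal i1 = 2.0*i*r + another.i;

is widened to double and narrowed back to float on every 
iteration. Casting the literal, as in the listing below, keeps 
the whole expression in TReal:

     TReal i1 = cast(TReal)2.0*i*r + another.i;
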
This code is a bit faster:

module main;

import std.datetime;
import std.metastrings;
import std.stdio;
import std.typetuple;


enum DIM = 32 * 1024;

int juliaValue;

template Julia(TReal)
{
     struct ComplexStruct
     {
         TReal r;
         TReal i;

         TReal squarePlusMag(ComplexStruct another)
         {
             TReal r1 = r*r - i*i + another.r;
             TReal i1 = cast(TReal)2.0*i*r + another.i;

             r = r1;
             i = i1;

             return (r1*r1 + i1*i1);
         }
     }

     int juliaFunction( int x, int y )
     {
         auto c = ComplexStruct(0.8, 0.156);
         auto a = ComplexStruct(x, y);

         foreach (i; 0 .. 200)
             if (a.squarePlusMag(c) > cast(TReal) 1000)
                 return 0;
         return 1;
     }

     void kernel()
     {
         foreach (x; 0 .. DIM) {
             foreach (y; 0 .. DIM) {
                 juliaValue = juliaFunction( x, y );
             }
         }
     }
}

void main()
{
     writeln("D code serial with dimension " ~ toStringNow!DIM ~ " 
...");
     StopWatch sw;
     foreach (Math; TypeTuple!(float, double, real))
     {
         sw.start();
         Julia!(Math).kernel();
         sw.stop();
         writefln("  using %ss Total time: %s [sec]",
                  Math.stringof, (sw.peek().msecs * 0.001));
         sw.reset();
     }
}

This gives me:

gdc -O3 -finline-functions -frelease tmp.d -o tmp -march=native:
   using floats Total time: 6.746 [sec]
   using doubles Total time: 6.872 [sec]
   using reals Total time: 5.226 [sec]

ldc2 -O3  -release -singleobj tmp.d -oftmp:
   using floats Total time: 2.36 [sec]
   using doubles Total time: 2.535 [sec]
   using reals Total time: 4.106 [sec]

At least part of the remaining difference is due to the fact that 
juliaFunction still isn't getting inlined (but squarePlusMag now 
is). Making juliaFunction a static method of ComplexStruct causes 
it to get inlined (again, I have no idea why). Moving 
juliaFunction inside ComplexStruct does not affect the performance 
of the LDC-generated code, but for GDC it gives me:

   using floats Total time: 4.262 [sec]
   using doubles Total time: 4.251 [sec]
   using reals Total time: 3.512 [sec]
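
Concretely, the change amounts to this (a sketch; DIM, juliaValue, 
main and the imports stay exactly as in the listing above, and 
kernel now has to qualify the call with the struct name):

template Julia(TReal)
{
     struct ComplexStruct
     {
         TReal r;
         TReal i;

         TReal squarePlusMag(ComplexStruct another)
         {
             TReal r1 = r*r - i*i + another.r;
             TReal i1 = cast(TReal)2.0*i*r + another.i;

             r = r1;
             i = i1;

             return (r1*r1 + i1*i1);
         }

         // as a static member, GDC inlines this; as a free
         // function in the template it didn't
         static int juliaFunction( int x, int y )
         {
             auto c = ComplexStruct(0.8, 0.156);
             auto a = ComplexStruct(x, y);

             // loop variable renamed so it doesn't shadow the
             // member i
             foreach (n; 0 .. 200)
                 if (a.squarePlusMag(c) > cast(TReal) 1000)
                     return 0;
             return 1;
         }
     }

     void kernel()
     {
         foreach (x; 0 .. DIM) {
             foreach (y; 0 .. DIM) {
                 juliaValue = ComplexStruct.juliaFunction( x, y );
             }
         }
     }
}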

There is still a large difference between LDC and GDC for floats 
and doubles that I can't explain. But at least it is much smaller 
than it was initially.

I ran all the benchmarks on 64-bit Linux, on a Core i5 2500K.

