A small comparison

Wed Nov 20 14:12:28 PST 2013

I've found a post about a functional library for LuaJIT code:

http://rtsisyk.github.io/luafun/intro.html

This dynamically typed Lua code:

require "fun" ()
  n = 100
  x = sum(map(function(x) return x^2 end, take(n, 
tabulate(math.sin))))
  -- calculate sum(sin(x)^2 for x in 0..n-1)
  print(x)
  50.011981355266

Gets compiled to:

  ->LOOP:
  0bcaffd0  movsd [rsp+0x8], xmm7
  0bcaffd6  addsd xmm4, xmm5
  0bcaffda  ucomisd xmm6, xmm1
  0bcaffde  jnb 0x0bca0028        ->6
  0bcaffe4  addsd xmm6, xmm0
  0bcaffe8  addsd xmm7, xmm0
  0bcaffec  fld qword [rsp+0x8]
  0bcafff0  fsin
  0bcafff2  fstp qword [rsp]
  0bcafff5  movsd xmm5, [rsp]
  0bcafffa  mulsd xmm5, xmm5
  0bcafffe  jmp 0x0bcaffd0        ->LOOP

So I've written a similar D version using Phobos:

void main() {
     import std.stdio, std.algorithm, std.range;
     enum n = 100;
     immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a 
+ b};
     r.writeln;
}

LDC2 compiles it to (32 bit):

LBB0_1:
     fxch
     fstpt   44(%esp)
     fld %st(0)
     fstpt   (%esp)
     fld1
     faddp   %st(1)
     fstpt   32(%esp)
     calll   __D3std4math3sinFNaNbNfeZe
     subl    $12, %esp
     fmul    %st(0), %st(0)
     fldt    44(%esp)
     faddp   %st(1)
     fstpt   44(%esp)
     fldt    32(%esp)
     fldt    44(%esp)
     decl    %esi
     fxch
     jne LBB0_1

As you see the D version doesn't use the fsin instruction as 
LuaJIT, it calls a library function.

If I import the sin from core.stdc.math ldc2 produces (notice the 
usage of xmm registers):

LBB0_1:
     movsd   %xmm1, 32(%esp)
     movsd   %xmm0, 40(%esp)
     movsd   %xmm1, (%esp)
     calll   _sin
     fstpl   48(%esp)
     movsd   48(%esp), %xmm0
     mulsd   %xmm0, %xmm0
     movsd   40(%esp), %xmm1
     addsd   %xmm0, %xmm1
     movsd   %xmm1, 40(%esp)
     movsd   32(%esp), %xmm1
     movsd   40(%esp), %xmm0
     addsd   LCPI0_0, %xmm1
     decl    %esi
     jne LBB0_1

I have also written a normal for loop in both C and D. This is C 
code:

#include "stdio.h"
#include "math.h"

int main() {
     double r = 0.0;
     int i;
     for (i = 0; i < 100; i++) {
         const double aux = sin(i);
         r += aux * aux;
     }
     printf("%f\n", r);
     return 0;
}

GCC compiles it to:

.L3:
     cvtsi2sd    %ebx, %xmm0
     call    sin
.L2:
     mulsd   %xmm0, %xmm0
     addl    $1, %ebx
     cmpl    $100, %ebx
     addsd   8(%rsp), %xmm0
     movsd   %xmm0, 8(%rsp)
     jne .L3

The Intel compiler compiles it to this (this is even vectorized, 
sin2 and mulpd work on two doubles at a time):

..B1.2:                         # Preds ..B1.8 ..B1.7
         cvtdq2pd  %xmm8, %xmm0                                  
#8.32
         call      __svml_sin2                                   
#8.28
         mulpd     %xmm0, %xmm0                                  
#9.20
         addb      $2, %r12b                                     
#7.5
         paddd     %xmm9, %xmm8                                  
#5.14
         addpd     %xmm0, %xmm10                                 
#9.9
         cmpb      $100, %r12b                                   
#7.5
         jb        ..B1.2        # Prob 99%                      
#7.5

In scientific code it's common to have small kernels that take 
most of the run-time of a program, so it's important to shave 
every instructions from such loops.

Do you think Phobos/LDC2 are doing well enough here?

Bye,
bearophile