Performance updates

Bill Baxter wbaxter at gmail.com
Thu Nov 13 12:51:42 PST 2008


On Thu, Nov 13, 2008 at 8:19 PM, bearophile <bearophileHUGS at lycos.com> wrote:
> In this post I show a few things I have found or collected in the last few weeks, related to the performance of code compiled with DMD.
>
> I have added two of them to the list of slow things (as tiny benchmarks) I keep:
> http://www.fantascienza.net/leonardo/js/slow_d.zip
>
> ------------------------
>
> This is D code that just performs many integer divisions (by 7, known at compile time):
>
> int divideBySeven(int x) {
>    return x / 7;
> }
>
> void main() {
>    int i = int.max;
>    int r;
>    while (i--)
>        r = divideBySeven(i);
>    printf("%d\n", r);
> }
>
>
> The same code in C:
>
> #include "limits.h"
> #include "stdio.h"
>
> int divideBySeven(int x) {
>    return x / 7;
> }
>
> int main() {
>    int i = INT_MAX;
>    int r;
>    while (i--)
>        r = divideBySeven(i);
>
>    printf("%d\n", r);
>    return 0;
> }
>
>
> I have compiled the C code with MinGW (based on GCC 4.2.1) using -O3 -s, and the D code with DMD 1.035 using -O -release -inline; the CPU is a 2 GHz Core2.
>
> Timing results:
> intdiv:
>  C:  5.04 s
>  D: 39.32 s
>
>
> The ASM generated by DMD for the function divideBySeven():
>
> mov     ECX,7
> cdq
> idiv    ECX
>
>
> The ASM generated by GCC for the function divideBySeven():
>
> pushl   %ebp
> movl    %esp, %ebp
> movl    8(%ebp), %ecx
> pushl   %ebx
> movl    $-1840700269, %ebx
> movl    %ebx, %eax
> popl    %ebx
> imull   %ecx
> popl    %ebp
> leal    (%edx,%ecx), %eax
> sarl    $2, %eax
> sarl    $31, %ecx
> subl    %ecx, %eax
>
> --------------------------
>
> This is a scanning benchmark, in C:
>
> #include "stdio.h"
>
> int main() {
>    int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
>    int i = 300000000;
>    while (i--) {
>        // 0.63 s
>        if (i % 4 == 0) {
>            counter0++;
>        } else if (i % 4 == 1) {
>            counter1++;
>        } else if (i % 4 == 2) {
>            counter2++;
>        } else {
>            counter3++;
>        }
>    }
>
>    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
>    return 0;
> }
>
>
> The equivalent D code:
>
> import std.stdio: printf;
>
> int main() {
>    int counter0, counter1, counter2, counter3;
>
>    int i = 300_000_000;
>    while (i--) { // 1.23 s
>        if (i % 4 == 0) {
>            counter0++;
>        } else if (i % 4 == 1) {
>            counter1++;
>        } else if (i % 4 == 2) {
>            counter2++;
>        } else {
>            counter3++;
>        }
>    }
>
>    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
>    return 0;
> }
>
>
> Timings:
> Scan (i = 300_000_000):
>  C: 0.63 s
>  D: 1.23 s
>
> I can offer Asm code on request.
>
> ------------------------
>
> The D1 docs strongly suggest using foreach whenever possible, avoiding the less handy for(), so for almost a year I have assumed foreach is as fast as for(). But these two versions show a difference:
>
> 20.66 seconds:
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1
>
> My classless version, 25.91 seconds:
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2
>
> That second version uses the following code in the most critical spot of the program:
>
> void advance(TyPlanets bodies, double dt) {
>    foreach(idx, ref b; bodies) {
>        double bm = b.mass;
>        foreach(ref b2; bodies[idx + 1 .. length]) {
>
> Removing the foreach loops, replacing them with for loops like the following, gives a significant performance increase: the code runs in about 19.5 seconds (estimated timing on the Shootout computer; I can give you exact timings on my PC if you want):
>
> static void advance(TyPlanets bodies, double dt) {
>    for (int idx; idx < bodies.length; idx++) {
>        auto b = &(bodies[idx]);
>        double bm = b.mass;
>        for (int j = idx + 1; j < bodies.length; j++) {
>            auto b2 = &(bodies[j]);
>
> (Note that I haven't just replaced foreach with for: I have also removed the slicing bodies[idx+1..$], which slows the code down a little too. But I have seen that even with only the slice removed, the foreach version is still slower.)
>
> ------------------------
>
> Looking at the assembly the Intel compiler produces from C code containing many floating point operations, I can see how short and slick it is. I have also read a few articles that show various ways to manage the floating point stack as efficiently as possible. Some of those methods are slow or quite slow to run, but produce more efficient code.
> Using those refined methods for all the FP operations of a large program may lead to overly long compilation times. So profile-guided optimization may be used to find the hot parts of the code, optimize the FP operations in them especially well, and compile all the other FP parts with a faster compilation method.
>
> GCC also has the "hot" function attribute to let programmers manually mark which functions are more critical:
> http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html
>
> In D, in theory, something like this could be used:
> optimize { ... }
> (Like static if { ... }, it doesn't create a new scope.)
>
> It tells the compiler that a piece of code that uses lots of FP operations is speed critical, so the compiler can compile it with a slower but better floating point stack allocation algorithm, like the ones I have read about.
>
> Bye,
> bearophile

*subliminal chant: L  D  C   L  D  C   L  D  C .....*

--bb


