Virtual methods

bearophile bearophileHUGS at lycos.com
Wed Feb 3 09:37:56 PST 2010


This post is long; I hope the web interface doesn't cut it down.

This post describes a few simple experiments I've done in D about virtual calls and how to improve their performance, so it probably can't be put to use before D2 is out of its alpha state. For people working at Sun this stuff is probably very old and quite naive, but I think no D compiler currently performs these optimizations. If you are not interested, please ignore this post.

The performance problem caused by virtual calls is not so much the small amount of time they take by themselves, but mostly the inlining they prevent.

I have created a small synthetic benchmark, not realistic but useful to measure this effect. The program creates a binary tree with 932_465 nodes, and then scans it 600 times computing the sum of the values stored in the leaves, incrementing each leaf value on every scan.

The tree contains two different kinds of nodes, Internal and Leaf, so the tree walk is done with a virtual function. HotSpot shows a significant performance improvement over the C++ code, probably caused by de-virtualization of the virtual calls. GCC Profile Guided Optimization doesn't produce speedups on this program.


// sumtree_d1.d
// D code, works in D1 on Tango and Phobos
// not hard to adapt to D2
const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}


struct Random { // portable generator
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


abstract class Node {
    abstract int walkSum();
}


final class Leaf : Node {
    int value;

    this(int x) { this.value = x; }

    override int walkSum() { return this.value++; }
}


final class Internal : Node {
    Node left, right;

    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }

    override int walkSum() {
        // createTree() never generates null pointers, but we test
        // anyway to be more general with other tree generators
        return ((this.left is null) ? 0 : this.left.walkSum()) +
               ((this.right is null) ? 0 : this.right.walkSum());
    }
}


struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1), createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}


void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) { disable(); }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", root.walkSum());
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += root.walkSum();
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) { exit(0); }
}


I have translated that code to Java and C++; ask if you want to see those versions.

I have created more D versions:

- D #2: uses tagged structs; the Node.walkSum() method dispatches according to the struct tag.
- D #3: the same as #2, but walkSum() is now a free function; this probably improves inlining.
- D #4: uses structs that are not tagged. It tells leaves and internal nodes apart using pointers tagged with one bit (so the struct memory is allocated from the C heap). The total memory allocated is smaller (the tags require no extra memory), which speeds up allocation a little. The walk is about as fast.

In the D code I have also added a global boolean constant USE_TRICKS that enables/disables two dirty tricks, which essentially save some of the time the D GC spends allocating and deallocating the tree. Such tricks don't speed up the tree walks.

The profiling of the C++ program shows that most of the time is spent inside Internal::walkSum().

Code compiled with:

LLVM 2.6 gcc version 4.2.1 (Based on Apple Inc. build 5649) (LLVM build)
llvm-g++ -Wall -O3 -s -Wl,--enable-stdcall-fixup -fomit-frame-pointer -march=native -ffast-math

Java build 1.7.0-ea-b76
java -server SumTree

CPU: Celeron 2.33 GHz, on Ubuntu running on VirtualBox running on Windows Vista 32 bit.

The current benchmark with seed=142465964 allocates 466_232 Internal nodes, and 466_234 Leaves.


Timings (just tree building), best of 3, seconds:
  C++:           0.21
  Java:          0.83
  D 1 notricks:  0.49
  D 2 notricks:  0.46
  D 3 notricks:  0.45
  D 4 notricks:  0.23
  D 1 tricks:    0.24
  D 2 tricks:    0.23
  D 3 tricks:    0.23
  D 4 tricks:    0.22
  
Timings (full program), best of 3, seconds:
  C++:          15.64
  Java:          8.74
  D 1 notricks: 10.30
  D 2 notricks: 14.26
  D 3 notricks: 14.06
  D 4 notricks: 13.57
  D 1 tricks:    9.77
  D 2 tricks:   13.58
  D 3 tricks:   13.32
  D 4 tricks:   13.06


As you can see from the timings it's not a matter of GC and memory allocations: the building phase in C++ is actually faster than the allocation in the Java version, yet the full Java program wins. So HotSpot is probably optimizing the virtual calls.


D #1:
Internal.walkSum

	pushl	%edi
	pushl	%esi
	subl	$4, %esp
	movl	8(%eax), %ecx
	testl	%ecx, %ecx
	movl	%eax, %esi
	je	.LBB5_5
	movl	(%ecx), %edx
	movl	%ecx, %eax
	call	*24(%edx)
.LBB5_2:
	movl	%eax, %edi
	movl	12(%esi), %eax
	testl	%eax, %eax
	je	.LBB5_6
	movl	(%eax), %ecx
	call	*24(%ecx)
.LBB5_4:
	addl	%edi, %eax
	addl	$4, %esp
	popl	%esi
	popl	%edi
	ret
.LBB5_5:
	xorl	%eax, %eax
	jmp	.LBB5_2
.LBB5_6:
	xorl	%eax, %eax
	jmp	.LBB5_4

------------------

D #2:
Internal.walkSum:

	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$4, %esp
	movl	4(%eax), %esi
	testl	%esi, %esi
	movl	%eax, %edi
	je	.LBB7_12
	cmpl	$0, (%esi)
	jne	.LBB7_13
	movl	4(%esi), %eax
	testl	%eax, %eax
	je	.LBB7_14
	call	_D2s24Node7walkSumMFZi
.LBB7_4:
	movl	%eax, %ebp
	movl	8(%esi), %eax
	testl	%eax, %eax
	je	.LBB7_15
	cmpl	$0, (%eax)
	jne	.LBB7_16
	call	_D2s28Internal7walkSumMFZi
	movl	%eax, %ebx
.LBB7_7:
	addl	%ebp, %ebx
.LBB7_8:
	movl	8(%edi), %eax
	testl	%eax, %eax
	je	.LBB7_17
	cmpl	$0, (%eax)
	jne	.LBB7_18
	call	_D2s28Internal7walkSumMFZi
	movl	%eax, %ecx
.LBB7_11:
	movl	%ecx, %eax
	addl	%ebx, %eax
	addl	$4, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	ret
.LBB7_12:
	xorl	%ebx, %ebx
	jmp	.LBB7_8
.LBB7_13:
	movl	4(%esi), %ebx
	leal	1(%ebx), %eax
	movl	%eax, 4(%esi)
	jmp	.LBB7_8
.LBB7_14:
	xorl	%eax, %eax
	jmp	.LBB7_4
.LBB7_15:
	xorl	%ebx, %ebx
	jmp	.LBB7_7
.LBB7_16:
	movl	4(%eax), %ebx
	leal	1(%ebx), %ecx
	movl	%ecx, 4(%eax)
	jmp	.LBB7_7
.LBB7_17:
	xorl	%ecx, %ecx
	jmp	.LBB7_11
.LBB7_18:
	movl	4(%eax), %ecx
	leal	1(%ecx), %edx
	movl	%edx, 4(%eax)
	jmp	.LBB7_11

------------------

D #3:

walkSum:

	pushl	%ebx
	pushl	%edi
	pushl	%esi
	testl	%eax, %eax
	jne	.LBB6_3
	xorl	%eax, %eax
.LBB6_2:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret
.LBB6_3:
	movl	%eax, %esi
	cmpl	$0, (%esi)
	jne	.LBB6_8
	movl	4(%esi), %eax
	call	_D2s37walkSumFPS2s34NodeZi
	movl	8(%esi), %esi
	testl	%esi, %esi
	movl	%eax, %edi
	je	.LBB6_9
	cmpl	$0, (%esi)
	jne	.LBB6_10
	movl	4(%esi), %eax
	call	_D2s37walkSumFPS2s34NodeZi
	movl	%eax, %ebx
	movl	8(%esi), %eax
	call	_D2s37walkSumFPS2s34NodeZi
	addl	%ebx, %eax
.LBB6_7:
	addl	%edi, %eax
	jmp	.LBB6_2
.LBB6_8:
	movl	4(%esi), %eax
	leal	1(%eax), %ecx
	movl	%ecx, 4(%esi)
	jmp	.LBB6_2
.LBB6_9:
	xorl	%eax, %eax
	jmp	.LBB6_7
.LBB6_10:
	movl	4(%esi), %eax
	leal	1(%eax), %ecx
	movl	%ecx, 4(%esi)
	jmp	.LBB6_7

------------------

D #4:

walkSum:

	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	%eax, %esi
	andl	$4294967294, %esi
	testl	%esi, %esi
	jne	.LBB13_3
	xorl	%eax, %eax
.LBB13_2:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret
.LBB13_3:
	testb	$1, %al
	movl	(%esi), %eax
	je	.LBB13_8
	movl	%eax, %edi
	andl	$4294967294, %edi
	testl	%edi, %edi
	je	.LBB13_9
	testb	$1, %al
	movl	(%edi), %eax
	je	.LBB13_10
	call	_D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
	movl	%eax, %ebx
	movl	4(%edi), %eax
	call	_D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
	addl	%ebx, %eax
.LBB13_7:
	movl	%eax, %edi
	movl	4(%esi), %eax
	call	_D2s47walkSumFS2s424__T13TaggedPointerTvVi1Z13TaggedPointerZi
	addl	%edi, %eax
	jmp	.LBB13_2
.LBB13_8:
	leal	1(%eax), %ecx
	movl	%ecx, (%esi)
	jmp	.LBB13_2
.LBB13_9:
	xorl	%eax, %eax
	jmp	.LBB13_7
.LBB13_10:
	leal	1(%eax), %ecx
	movl	%ecx, (%edi)
	jmp	.LBB13_7

------------------

st, C++, gcc:
Internal.walkSum:

	subl	$12, %esp
	movl	%ebx, 4(%esp)
	movl	%esi, 8(%esp)
	xorl	%ebx, %ebx
	movl	16(%esp), %esi
	movl	4(%esi), %edx
	testl	%edx, %edx
	je	.L5
	movl	(%edx), %eax
	movl	%edx, (%esp)
	call	*(%eax)
	movl	%eax, %ebx
.L5:
	movl	8(%esi), %edx
	xorl	%eax, %eax
	testl	%edx, %edx
	je	.L7
	movl	(%edx), %eax
	movl	%edx, (%esp)
	call	*(%eax)
.L7:
	addl	%ebx, %eax
	movl	8(%esp), %esi
	movl	4(%esp), %ebx
	addl	$12, %esp
	ret

------------------

st, C++, llvm-gcc:
Internal.walkSum:

	pushl	%edi
	pushl	%esi
	subl	$4, %esp
	movl	16(%esp), %esi
	movl	4(%esi), %eax
	testl	%eax, %eax
	je	LBB2_5
	movl	(%eax), %ecx
	movl	%eax, (%esp)
	call	*(%ecx)
LBB2_2:
	movl	%eax, %edi
	movl	8(%esi), %eax
	testl	%eax, %eax
	je	LBB2_6
	movl	(%eax), %ecx
	movl	%eax, (%esp)
	call	*(%ecx)
LBB2_4:
	addl	%edi, %eax
	addl	$4, %esp
	popl	%esi
	popl	%edi
	ret
LBB2_5:
	xorl	%eax, %eax
	jmp	LBB2_2
LBB2_6:
	xorl	%eax, %eax
	jmp	LBB2_4


I have collected the asm that HotSpot generates for Internal.walkSum, but I am not able to read it; it's too messy for me.

I have created an alternative version of the first D program, sumtree_d1b, that uses the __vptr directly to avoid the virtual table lookup and to allow LDC to inline the code better. The performance is better than sumtree_d1, but not yet good enough to beat the Java version.


// sumtree_d1b.d
const bool USE_TRICKS = false;

version (Tango) {
    import tango.stdc.stdio: printf;
    static if (USE_TRICKS) {
        import tango.stdc.stdlib: exit;
        import tango.core.Memory: GC;
        alias GC.disable disable;
    }
} else {
    import std.c.stdio: printf;
    static if (USE_TRICKS) {
        import std.c.stdlib: exit;
        import std.gc: disable;
    }
}


struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2


abstract class Node {}


final class Leaf : Node {
    int value;

    this(int x) { this.value = x; }
}


final class Internal : Node {
    Node left, right;

    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
}


struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1), createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}


int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}


void main() {
    const int NLOOPS = 600;
    static if (USE_TRICKS) { disable(); }

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    static if (USE_TRICKS) { exit(0); }
}


Timings (full program), best of 3, seconds:
  Java:           8.83 (-server)
  D 1 notricks:  10.44
  D 1b notricks:  9.73
  D 1 tricks:    10.18
  D 1b tricks:    9.45


In this program there are only two possible classes of nodes, so to manually remove the virtual calls I have stored the pointer to the virtual table of one of the two classes, the Leaf. (Here, unfortunately, I can't use conditional compilation to let the compiler pick the right code according to the language version, because the D2 version is not syntactically valid D1.)

//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2


The D docs state that user code shall not use the pointer to the virtual table, probably because it's an implementation detail that can change according to how virtual calls are implemented. For example dispatch trees don't have such a pointer; here I have essentially created a small dispatch tree manually.

At the beginning of main() the leaf__vptr gets initialized (I don't know a way to initialize this value that doesn't require creating a class instance):

{
    scope Leaf l = new Leaf(0);
    leaf__vptr = l.__vptr;
}


Then this free function scans the tree (it can also be written iteratively, but I think the performance is not that different on modern CPUs; this can be tested):

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}

The double cast like:

cast(Leaf)cast(void*)root

is necessary because a single cast in D is done dynamically, so the first cast to void* punches a hole in the type system and the pair becomes a static cast. I think it's similar to this C++ code:

static_cast<Leaf*>(root)


The asm of walkTree() shows no virtual calls and it also shows some inlining:

walkTree of sumtree_d1b.d:
_D8sumtree27walkTreeFC8sumtree24NodeZi:
    pushl   %ebx
    pushl   %edi
    pushl   %esi
    movl    _D8sumtree210leaf__vptrPPv, %ecx
    cmpl    %ecx, (%eax)
    movl    %eax, %esi
    je  .LBB8_12
    movl    8(%esi), %eax
    testl   %eax, %eax
    je  .LBB8_13
    call    _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_3:
    movl    %eax, %edi
    movl    12(%esi), %esi
    testl   %esi, %esi
    je  .LBB8_14
    movl    _D8sumtree210leaf__vptrPPv, %eax
    cmpl    %eax, (%esi)
    je  .LBB8_15
    movl    8(%esi), %eax
    testl   %eax, %eax
    je  .LBB8_16
    call    _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_7:
    movl    %eax, %ebx
    movl    12(%esi), %eax
    testl   %eax, %eax
    je  .LBB8_17
    call    _D8sumtree27walkTreeFC8sumtree24NodeZi
.LBB8_9:
    addl    %ebx, %eax
.LBB8_10:
    addl    %edi, %eax
.LBB8_11:
    popl    %esi
    popl    %edi
    popl    %ebx
    ret
.LBB8_12:
    movl    8(%esi), %eax
    leal    1(%eax), %ecx
    movl    %ecx, 8(%esi)
    jmp .LBB8_11
.LBB8_13:
    xorl    %eax, %eax
    jmp .LBB8_3
.LBB8_14:
    xorl    %eax, %eax
    jmp .LBB8_10
.LBB8_15:
    movl    8(%esi), %eax
    leal    1(%eax), %ecx
    movl    %ecx, 8(%esi)
    jmp .LBB8_10
.LBB8_16:
    xorl    %eax, %eax
    jmp .LBB8_7
.LBB8_17:
    xorl    %eax, %eax
    jmp .LBB8_9


A possible cause of the slower performance of sumtree_d1b compared to the Java version is cache effects. The garbage collector of Java HotSpot can lay objects out in memory in a better way: in the "Eden" part of the memory allocated by the HotSpot GC the cache locality is very high. While the current D GC has to avoid long-term memory fragmentation, the Eden doesn't share this problem: it can be used as a stack.

So I have created a different version of the walkTree() function that prints the memory addresses of the nodes. Then I have used Python to compute the frequencies of the distances between successive node pointers during the walkTree() in sumtree_d1b:

>>> t = file("adds.txt").read()
>>> t2 = map(int, t.split())
>>> t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]
>>> from util import frequency
>>> frequency(t3, mode="freq")
[(-16, 232285), (-32, 115525), (-48, 57657), (-64, 28535), (-80, 14365), (-96, 7266), (-112, 3447), (-128, 1721), (8160, 922), (-144, 917), (8176, 901), (8144, 688), (8128, 460), (-160, 431), (8112, 305), (-176, 218), (8096, 145), (-192, 111), (8080, 99), (-208, 61), (8064, 52), (8048, 28), (-224, 25), (8032, 20), (-240, 10), (8016, 10), (-272, 6), (8000, 5), (-288, 4), (7968, 3), (-256, 3), (-336, 1), (5497120, 1), (7952, 1), (16368, 1), (-1032304, 1), (-320, 1), (7936, 1)]

Those distances are quite jumpy, and they show that the minimal memory block size is 16, so a Leaf object uses 16 bytes just like an Internal object.

So I have overloaded the new() operator in the abstract Node class, allocating a single block of memory from the C heap at the start and then slicing it into pieces during node allocations, using it as a stack, as in the Eden of the Java GC. The pointer that new() returns isn't aligned to 16 bytes as the D specs require, so this program is just for experiments. (I don't know why it has to be aligned to 16 bytes, maybe just for SSE matters; I think I have received no direct answer about this. An alignment to 8 bytes can be useful for tag bits added to pointers by the garbage collector when the GC runs.) It just calls exit() instead of raising the correct OutOfMemoryError. The buffer size was found experimentally and is hard-coded; a real program can't know this value, so it could for example allocate memory in chunks, keeping a dynamic array of pointers to the chunks plus a current position pointer and an end pointer into the last chunk.
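The chunked scheme mentioned above could be sketched like this (a hypothetical C++ bump allocator; names and chunk size are my own):

```cpp
#include <cstdlib>
#include <vector>

// Sketch of a chunked bump allocator: when a real program can't know the
// total size in advance, it grows a list of fixed-size chunks and
// bump-allocates inside the last one. All names are hypothetical.
class Arena {
    std::vector<char*> chunks;   // all chunks allocated so far
    char *pos;                   // current position in the last chunk
    char *end;                   // one past the last byte of the last chunk
    static const size_t CHUNK = 64 * 1024;

    void grow() {
        char *c = static_cast<char*>(std::malloc(CHUNK));
        if (!c) std::exit(1);    // out of memory (a real program would throw)
        chunks.push_back(c);
        pos = c;
        end = c + CHUNK;
    }

public:
    Arena() { grow(); }
    ~Arena() {
        for (size_t i = 0; i < chunks.size(); i++)
            std::free(chunks[i]);
    }

    void *alloc(size_t sz) {
        sz = (sz + 7) & ~size_t(7);  // 8-byte alignment, leaves GC tag bits free
        if (pos + sz > end)
            grow();                  // assumes sz <= CHUNK
        void *p = pos;
        pos += sz;
        return p;
    }
};
```

Freeing the whole arena at once in the destructor plays the same role as the exit(0) trick: no per-node deallocation work.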

The resulting sumtree_d1c.d code is faster than the Java version:


// sumtree_d1c.d
version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.stdlib: exit, cmalloc = malloc, cfree = free;
} else {
    import std.c.stdio: printf;
    import std.c.stdlib: exit, cmalloc = malloc, cfree = free;
}


struct Random {
    private const int m = int.max;
    private const int a = 48271;
    private const int q = m / a;
    private const int r = m % a;
    public int seed;

    public double nextDouble() {
        int k = seed / q;
        seed = a * (seed - k * q) - r * k;
        if (seed < 1)
            seed += m;
        return cast(double)seed / m;
    }

    // returns in [0, max]
    public int nextInt(int max) {
        int n = max + 1;
        int k = cast(int)(n * this.nextDouble());
        return (k == n) ? k - 1 : k;
    }
}


//void** leaf__vptr; // D1
__gshared immutable(void*)* leaf__vptr; // D2


abstract class Node {
    static void[] buffer;
    static int bufindex;
    static const int bufsize = (466_232 * 16) + (466_234 * 12);

    static this() {
        void *p = cmalloc(bufsize);
        if (!p)
            exit(1); // out of memory
        buffer = p[0 .. bufsize];
    }

    static ~this() {
        if (buffer.length) {
            cfree(buffer.ptr);
            buffer = null;
        }
    }

    new(size_t sz) {
        void *p = &buffer[bufindex];
        bufindex += sz;
        if (bufindex <= buffer.length)
            return p;
        else {
            printf("out of memory");
            exit(1);
        }
        assert(0);
    }

    delete(void* p) {
        assert(0);
    }
}


final class Leaf : Node {
    int value;

    this(int x) { this.value = x; }
}


final class Internal : Node {
    Node left, right;

    this(Node l, Node r) {
        this.left = l;
        this.right = r;
    }
}


struct SumTree {
    static int nnodes;
    static Random rnd;

    static Node createTree(int deptMax, int dept) {
        nnodes++;
        if (rnd.nextDouble() > 0.30 && dept <= deptMax)
            return new Internal(createTree(deptMax, dept+1), createTree(deptMax, dept+1));
        else
            return new Leaf(rnd.nextInt(100));
    }
}


int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left is null) ? 0 : walkTree(iroot.left)) +
               ((iroot.right is null) ? 0 : walkTree(iroot.right));
    }
}


void main() {
    const int NLOOPS = 600;

    {
        scope Leaf l = new Leaf(0);
        leaf__vptr = l.__vptr;
    }

    SumTree.rnd = Random(142465964);
    SumTree.nnodes = 0;
    Node root = SumTree.createTree(35, 0); // 35 0  (Internal, Leaves: 466232 466234)
    printf("nnodes: %d\n", SumTree.nnodes);

    if (root !is null) {
        printf("sum: %d\n", walkTree(root));
        int tot = 0;
        for (int i = 0; i < NLOOPS-1; i++)
            tot += walkTree(root);
        printf("sum (%d scans): %d\n", NLOOPS-1, tot);
    }

    exit(0);
}


Timings (full program), best of 3, seconds:
  Java:           8.83 (-server)
  D 1 notricks:  10.44
  D 1b notricks:  9.73
  D 1 tricks:    10.18
  D 1b tricks:    9.45
  D 1c tricks:    8.25
  D 1d tricks:    8.12

The frequencies of the distances between node pointers during the walkTree() in sumtree_d1d show a much more regular memory layout, which probably improves cache locality:

>>> t = file("adds.txt").read()
>>> t2 = map(int, t.split())
>>> t3 = [t2[i+1] - t2[i] for i in xrange(len(t2)-1)]
>>> from util import frequency
>>> frequency(t3, mode="freq")
[(12, 233188), (28, 116447), (44, 58345), (60, 28995), (76, 14672), (92, 7411), (108, 3546), (124, 1773), (140, 945), (156, 451), (172, 228), (188, 115), (204, 61), (220, 28), (236, 11), (268, 6), (284, 4), (252, 4), (316, 1), (332, 1)]


The sumtree_d1d.d version is the same as sumtree_d1c.d, but it swaps the null tests in walkTree(), because the CPU is faster when the "then" clause of the if is the most likely one, and in this tree there are no null pointers:

int walkTree(Node root) {
    if (root.__vptr == leaf__vptr) {
        return (cast(Leaf)cast(void*)root).value++;
    } else {
        Internal iroot = cast(Internal)cast(void*)root;
        return ((iroot.left !is null) ? walkTree(iroot.left) : 0) +
               ((iroot.right !is null) ? walkTree(iroot.right): 0);
    }
}

The Java code doesn't seem to enjoy this change, probably because the runtime profiling performs that optimization already.

With a less cheating program that overloads new() in a correct and flexible way, this code can probably still be faster than the Java code (but I can't be sure before testing it).

I have implemented in D the well known "Richards" benchmark (used in Google's famous V8 JavaScript benchmarks; the D version is about 8 times faster if well written and well compiled with LDC). I have done a similar experiment of manually removing the single virtual call present in the Richards benchmark. That virtual call dispatches among 4 possible classes, and after the removal the code is a little slower. So this manual optimization is probably worthwhile only when a virtual call has just one or two possible classes to choose from. Ask me if you want the Richards benchmark in D, with various translations, variations, timings, etc.

I have read papers saying that devirtualization, if well implemented, can also be useful in other situations. For example, even if a call has many possible targets but one of them is much more frequent than all the others, then at the call site you can put an if that tests for the most frequent class and performs a static call to it, falling back to a dynamic call otherwise. This usually speeds up the code.
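A sketch of this guarded devirtualization in C++ (the class names are hypothetical; a compiler would emit the equivalent of this after profiling shows Square is the hot target):

```cpp
#include <typeinfo>

struct Shape {
    virtual int area() = 0;
    virtual ~Shape() {}
};
struct Square : Shape {              // assumed the most frequent target
    int side;
    Square(int s) : side(s) {}
    int area() { return side * side; }
};
struct Circle : Shape {
    int r;
    Circle(int r_) : r(r_) {}
    int area() { return 3 * r * r; } // integer approximation, just a sketch
};

// Guarded devirtualization: test for the hot class and call it statically
// (which allows inlining); fall back to the normal virtual call otherwise.
int guardedArea(Shape *s) {
    if (typeid(*s) == typeid(Square))
        return static_cast<Square*>(s)->Square::area(); // static, inlinable
    return s->area();                                   // dynamic dispatch
}
```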

There are many ways to improve this situation in D; I think we can start with something simple:
1) Give dmd a function (if not already present, and I think it's not) that, for a call site, returns an array of the linearized inheritance tree. If it contains only one class (besides Object), then the call can become static. In D methods are virtual by default, so this simple optimization can probably remove many useless virtual calls, with no need to mark all classes final, keeping classes more flexible.
2) If not already present, the profiler built into DMD can collect statistics at the virtual call sites. Collecting full statistics, as I have done with an associative array, is probably not that useful; the data structure can be much simpler, something like a static fixed-size array about 4 or 5 items long, where each item is an ID-frequency pair whose count gets incremented by 1. Such an array can use a move-to-front strategy. Beside this little array there can be a variable that counts the total calls for that call site. In D a linear scan of such a small array is simple to implement and faster than an associative array access. If there are more than 4-5 different targets, the extra ones simply don't get recorded; the total call count at the site tells how many calls are not registered in the array. Later the compiler can use those frequencies at each call site: if there are only 2 different targets it can use a tiny dispatch tree; if one target is much more frequent than the others it can use a static call for the frequent one and a dynamic call for the rest; otherwise it keeps the plain dynamic call.
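The fixed-size ID-frequency table with move-to-front described in point 2 could look like this (a hypothetical sketch; the real instrumentation would live inside the DMD profiler, and all names here are my own):

```cpp
// Sketch of a per-call-site profile record: at most 5 (class ID, count)
// pairs kept move-to-front, plus a total call counter. Hypothetical names.
struct CallSiteProfile {
    enum { SLOTS = 5 };
    int ids[SLOTS];
    int counts[SLOTS];
    int used;
    int total;

    CallSiteProfile() : used(0), total(0) {}

    void record(int classId) {
        total++;
        for (int i = 0; i < used; i++) {
            if (ids[i] == classId) {
                counts[i]++;
                // move-to-front so a linear scan finds hot classes quickly
                for (; i > 0; i--) {
                    int tid = ids[i];    ids[i] = ids[i-1];       ids[i-1] = tid;
                    int tc  = counts[i]; counts[i] = counts[i-1]; counts[i-1] = tc;
                }
                return;
            }
        }
        if (used < SLOTS) {          // new class, if a slot is still free
            ids[used] = classId;
            counts[used] = 1;
            used++;
        }
        // if the table is full the call is only counted in 'total', so
        // (total - sum of counts) tells how many calls went unrecorded
    }
};
```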

Later more things can be improved, but I think this is enough to fix most of the situation.

Bye,
bearophile


