Misc questions:- licensing, VC++ IDE compatible, GPGPU, LTCG,

Sun May 16 12:45:01 PDT 2010

Walter Bright:

>This is not true of D. In D, the compiler can<

Thank you for your answers.
At the moment D compilers aren't doing this, I think. (LDC performs an optimization at link time). But it's nice that D leaves this optimization opportunity to future D compilers.

----------------------

As you may have noticed, that HTML page lists three optimizations. The first one is the one you have explained.

The second optimization it talks about is custom calling conventions:

>Normally, all functions are either cdecl, stdcall, or fastcall. With custom calling conventions, the back end has enough knowledge that it can pass more values in registers, and less on the stack. This usually cuts code size and improves performance.<

I have translated his demo code to D:

int foo(int i, int* j, int* k, int l) {
    *j = *k;
    *k = i + l;
    return i + *j + *k + l;
}
int main(char[][] args) {
    int i, j, k, l;
    l = i = args.length;
    int x = foo(i, &j, &k, l);
    return x * args.length;
}

This is how dmd compiles foo() (-O -release):

_D7stdcall3fooFiPiPiiZi	comdat
		push	EAX
		mov	ECX,8[ESP]
		mov	EDX,[ECX]
		push	EBX
		mov	EBX,010h[ESP]
		push	ESI
		mov	ESI,018h[ESP]
		push	EDI
		lea	EDI,[EAX][ESI]
		mov	[EBX],EDX
		mov	[ECX],EDI
		mov	EAX,[EBX]
		add	EAX,ESI
		add	EAX,EDI
		add	EAX,0Ch[ESP]
		pop	EDI
		pop	ESI
		pop	EBX
		pop	ECX
		ret	0Ch

This is how LDC compiles foo() with -O3 -release:

_D4test3fooFiPiPiiZi:
    pushl   %esi
    movl    8(%esp), %ecx
    movl    (%ecx), %edx
    movl    12(%esp), %esi
    movl    %edx, (%esi)
    addl    16(%esp), %eax
    movl    %eax, (%ecx)
    addl    %eax, %eax
    addl    (%esi), %eax
    popl    %esi
    ret $12

This is the asm of foo() shown in that article:

_foo:
    mov         ecx,dword ptr [eax]
    mov         dword ptr [esi],ecx
    lea         ecx,[edi+edx]
    mov         dword ptr [eax],ecx
    mov         eax,dword ptr [esi]    // *j
    add         eax,ecx                // *k sub-expression (from
    add         eax,edi                // l
    add         eax,edx                // i
    ret

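LDC still reads three of the four arguments from the stack (and pops them with ret $12), while the asm from the article receives all of them in registers. So it seems LDC isn't performing this optimization.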

----------------------

The third optimization it talks about is 'Small TLS Encoding':

>When you use __declspec(thread) variables, the code generator stores the variables at a fixed offset in each per-thread data area. Without LTCG, the code generator has no idea of how many __declspec(thread) variables there will be. As such, it must generate code that assumes the worst, and uses a four-byte offset to access the variable. With LTCG, the code generator has the opportunity to examine all __declspec(thread) variables, and note how often they're used. The code generator can put the smaller, more frequently used variables at the beginning of the per-thread data area and use a one-byte offset to access them.<

This is the C++ example code he uses:

__declspec(thread) int i = 1;
int main() {
    i = 4;
    return i;
}

The asm he shows without this optimization:
_main:
    mov         eax,dword ptr [__tls_index]
    mov         ecx,dword ptr fs:[2Ch]
    mov         ecx,dword ptr [ecx+eax*4]
    push        4
    pop         eax
    mov         dword ptr [ecx+4],eax
    ret

The asm he shows with this optimization:
_main:
    mov         eax,dword ptr fs:[0000002Ch]
    mov         ecx,dword ptr [eax]
    mov         eax,4
    mov         dword ptr [ecx+8],eax
    ret

I have translated that last C++ example into this D code:

int i = 1;
int main() {
    i = 4;
    return i;
}

I don't think I can test this with LDC, because it doesn't support TLS/__gshared.
dmd compiles it to:

__Dmain
		mov	ECX,FS:__tls_array
		mov	EDX,[ECX]
		mov	EAX,4
		mov	_D4test1ii[EDX],EAX
		ret

On this little example dmd seems to produce similar asm.
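
To make explicit why the plain `int i` in the D translation above corresponds to the __declspec(thread) variable of the C++ example, here is a small sketch. It assumes a D2 compiler, where module-level variables are thread-local by default and __gshared opts out of that. The names tlsValue, globalValue and readFromNewThread are made up for this illustration, they don't come from the article:

```d
import core.thread : Thread;

int tlsValue = 1;              // thread-local by default (assumes D2)
__gshared int globalValue = 1; // a single instance for the whole process

// Returns [tlsValue, globalValue] as observed from a freshly started thread.
int[2] readFromNewThread() {
    int[2] seen;
    auto t = new Thread({
        // The new thread gets its own copy of tlsValue, initialized to 1,
        // but it sees the one process-wide globalValue.
        seen[0] = tlsValue;
        seen[1] = globalValue;
    });
    t.start();
    t.join();
    return seen;
}

void main() {
    // These writes happen in the main thread only.
    tlsValue = 4;
    globalValue = 4;

    auto seen = readFromNewThread();
    assert(seen[0] == 1);  // the other thread's tlsValue was not touched
    assert(seen[1] == 4);  // the __gshared variable is shared
    assert(tlsValue == 4); // the main thread keeps its own copy
}
```

Compile it without -release, otherwise the asserts are removed. Each access to tlsValue has to go through the per-thread data area, which is the kind of access the 'Small TLS Encoding' optimization tries to shorten.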

Bye,
bearophile