Misc questions:- licensing, VC++ IDE compatible, GPGPU, LTCG,
bearophile
bearophileHUGS at lycos.com
Sun May 16 12:45:01 PDT 2010
Walter Bright:
>This is not true of D. In D, the compiler can<
Thank you for your answers.
At the moment D compilers aren't doing this, I think (LDC performs some optimization at link time). But it's nice that D leaves this optimization opportunity open to future D compilers.
----------------------
As you may have noticed, that HTML page lists three optimizations. The first one is the one you have explained.
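(If I remember the article correctly, that first optimization is cross-module inlining. Here is a minimal sketch in D of the kind of call it is about; the module and function names are invented, just to make the discussion concrete:)

// a.d
module a;
int twice(int x) { return x + x; }

// b.d
module b;
import a;

int main() {
    // Because the import gives the compiler the source of module a,
    // it can inline twice() here at compile time, no LTCG needed.
    return twice(21);
}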
The second optimization it talks about is custom calling conventions:
>Normally, all functions are either cdecl, stdcall, or fastcall. With custom calling conventions, the back end has enough knowledge that it can pass more values in registers, and less on the stack. This usually cuts code size and improves performance.<
I have translated his demo code to D:
int foo(int i, int* j, int* k, int l) {
    *j = *k;
    *k = i + l;
    return i + *j + *k + l;
}

int main(char[][] args) {
    int i, j, k, l;
    l = i = cast(int)args.length;
    int x = foo(i, &j, &k, l);
    return x * cast(int)args.length;
}
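(In case someone wants to reproduce the listings below, commands like these should work; I'm not sure of the exact flag names:)

dmd -O -release -c stdcall.d
obj2asm stdcall.obj

ldc -O3 -release -output-s test.d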
This is how dmd compiles foo() (-O -release):
_D7stdcall3fooFiPiPiiZi comdat
        push    EAX
        mov     ECX,8[ESP]
        mov     EDX,[ECX]
        push    EBX
        mov     EBX,010h[ESP]
        push    ESI
        mov     ESI,018h[ESP]
        push    EDI
        lea     EDI,[EAX][ESI]
        mov     [EBX],EDX
        mov     [ECX],EDI
        mov     EAX,[EBX]
        add     EAX,ESI
        add     EAX,EDI
        add     EAX,0Ch[ESP]
        pop     EDI
        pop     ESI
        pop     EBX
        pop     ECX
        ret     0Ch
This is how LDC compiles foo() with -O3 -release:
_D4test3fooFiPiPiiZi:
        pushl   %esi
        movl    8(%esp), %ecx
        movl    (%ecx), %edx
        movl    12(%esp), %esi
        movl    %edx, (%esi)
        addl    16(%esp), %eax
        movl    %eax, (%ecx)
        addl    %eax, %eax
        addl    (%esi), %eax
        popl    %esi
        ret     $12
This is the asm of foo() shown in that article:
_foo:
        mov     ecx,dword ptr [eax]
        mov     dword ptr [esi],ecx
        lea     ecx,[edi+edx]
        mov     dword ptr [eax],ecx
        mov     eax,dword ptr [esi]   // *j
        add     eax,ecx               // *k sub-expression (from
        add     eax,edi               // l
        add     eax,edx               // i
        ret
It seems LDC isn't performing this optimization.
----------------------
The third optimization it talks about is 'Small TLS Encoding':
>When you use __declspec(thread) variables, the code generator stores the variables at a fixed offset in each per-thread data area. Without LTCG, the code generator has no idea of how many __declspec(thread) variables there will be. As such, it must generate code that assumes the worst, and uses a four-byte offset to access the variable. With LTCG, the code generator has the opportunity to examine all __declspec(thread) variables, and note how often they're used. The code generator can put the smaller, more frequently used variables at the beginning of the per-thread data area and use a one-byte offset to access them.<
This is the C++ example code he uses:
__declspec(thread) int i = 1;

int main() {
    i = 4;
    return i;
}
The asm he shows without this optimization:
_main:
        mov     eax,dword ptr [__tls_index]
        mov     ecx,dword ptr fs:[2Ch]
        mov     ecx,dword ptr [ecx+eax*4]
        push    4
        pop     eax
        mov     dword ptr [ecx+4],eax
        ret
The asm he shows with this optimization:
_main:
        mov     eax,dword ptr fs:[0000002Ch]
        mov     ecx,dword ptr [eax]
        mov     eax,4
        mov     dword ptr [ecx+8],eax
        ret
I have translated that last C++ example into this D code:
int i = 1;

int main() {
    i = 4;
    return i;
}
I don't think I can test this with LDC, because it doesn't support TLS/__gshared yet.
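(For clarity: in D2 a module-level variable is thread-local by default, which is why the plain "int i = 1;" above corresponds to the __declspec(thread) variable of the C++ version, while __gshared opts out of that. A tiny sketch, with invented variable names:)

int perThread = 1;                // thread-local by default in D2: each thread gets its own copy
__gshared int processGlobal = 2;  // __gshared: classic global storage, one copy shared by all threads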
dmd compiles it to:
__Dmain
        mov     ECX,FS:__tls_array
        mov     EDX,[ECX]
        mov     EAX,4
        mov     _D4test1ii[EDX],EAX
        ret
For this little example dmd seems to produce similar asm.
Bye,
bearophile