[Issue 10636] New: Vector calling convention for D?
d-bugmail at puremagic.com
Sat Jul 13 13:41:14 PDT 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10636
Summary: Vector calling convention for D?
Product: D
Version: D2
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: DMD
AssignedTo: nobody at puremagic.com
ReportedBy: bearophile_hugs at eml.cc
--- Comment #0 from bearophile_hugs at eml.cc 2013-07-13 13:41:13 PDT ---
The VS2013 designers have added a new calling convention that allows passing SIMD
values to functions in registers, avoiding the stack in most cases:
http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx
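For reference, the feature looks roughly like this in C++ (a minimal sketch; the
function name and values are mine, but __vectorcall is the keyword the post
introduces):

#include <xmmintrin.h>

// Under the default x64 convention an __m128 argument is passed by address
// of a caller-allocated temporary; with __vectorcall it arrives directly in
// an XMM register and the result is returned in XMM0.
__m128 __vectorcall add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);   // a arrives in XMM0, b in XMM1
}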
An example D program:
import core.stdc.stdio, core.simd;

struct Particle { float4 x, y; }

Particle addParticles(in Particle p1, in Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

void main() {
    auto p1 = Particle([1, 2, 3, 4],
                       [10, 20, 30, 40]);
    printf("%f %f %f %f %f %f %f %f\n",
           p1.x.array[0], p1.x.array[1],
           p1.x.array[2], p1.x.array[3],
           p1.y.array[0], p1.y.array[1],
           p1.y.array[2], p1.y.array[3]);

    auto p2 = Particle([100, 200, 300, 400],
                       [1000, 2000, 3000, 4000]);
    printf("%f %f %f %f %f %f %f %f\n",
           p2.x.array[0], p2.x.array[1],
           p2.x.array[2], p2.x.array[3],
           p2.y.array[0], p2.y.array[1],
           p2.y.array[2], p2.y.array[3]);

    auto p3 = addParticles(p1, p2);
    printf("%f %f %f %f %f %f %f %f\n",
           p3.x.array[0], p3.x.array[1],
           p3.x.array[2], p3.x.array[3],
           p3.y.array[0], p3.y.array[1],
           p3.y.array[2], p3.y.array[3]);
}
Compiling that code with ldc2 v0.11.0 on 32-bit Windows with:

ldc2 -O5 -disable-inlining -release -vectorize-slp -vectorize-slp-aggressive -output-s test.d

it outputs this x86 asm:
__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle:
    pushl %ebp
    movl %esp, %ebp
    andl $-16, %esp
    subl $16, %esp
    movaps 40(%ebp), %xmm0
    movaps 56(%ebp), %xmm1
    addps 8(%ebp), %xmm0
    addps 24(%ebp), %xmm1
    movups %xmm1, 16(%eax)
    movups %xmm0, (%eax)
    movl %ebp, %esp
    popl %ebp
    ret $64

__Dmain:
    ...
    movaps 160(%esp), %xmm0
    movaps 176(%esp), %xmm1
    movaps %xmm1, 48(%esp)
    movaps %xmm0, 32(%esp)
    movaps 128(%esp), %xmm0
    movaps 144(%esp), %xmm1
    movaps %xmm1, 16(%esp)
    movaps %xmm0, (%esp)
    leal 96(%esp), %eax
    calll __D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle
    subl $64, %esp
    movss 96(%esp), %xmm0
    movss 100(%esp), %xmm1
    movss 104(%esp), %xmm2
    movss 108(%esp), %xmm3
    movss 112(%esp), %xmm4
    movss 116(%esp), %xmm5
    movss 120(%esp), %xmm6
    movss 124(%esp), %xmm7
    cvtss2sd %xmm7, %xmm7
    movsd %xmm7, 60(%esp)
    cvtss2sd %xmm6, %xmm6
    movsd %xmm6, 52(%esp)
    cvtss2sd %xmm5, %xmm5
    movsd %xmm5, 44(%esp)
    cvtss2sd %xmm4, %xmm4
    movsd %xmm4, 36(%esp)
    cvtss2sd %xmm3, %xmm3
    movsd %xmm3, 28(%esp)
    cvtss2sd %xmm2, %xmm2
    movsd %xmm2, 20(%esp)
    cvtss2sd %xmm1, %xmm1
    movsd %xmm1, 12(%esp)
    cvtss2sd %xmm0, %xmm0
    movsd %xmm0, 4(%esp)
    movl $_.str3, (%esp)
    calll ___mingw_printf
    xorl %eax, %eax
    movl %ebp, %esp
    popl %ebp
    ret
As shown in that article, with a vector calling convention, setting the arguments
of addParticles needs only four movaps (instead of eight plus the leal). With the
vector calling convention the body of addParticles also gets shorter, because all
the needed operands are already in xmm registers. Probably the code of
addParticles becomes just two addps, a ret, and maybe two movaps to put the
result in the right output registers.
D is meant to be useful for people who write fast video games or other numerical
code, and both use plenty of SIMD code, so I think adding such an optimization
could be useful. But I can't estimate how much of an advantage it would give;
benchmarks are needed. They write:
>Today on AMD64 target, passed by value vector arguments (such as __m128/__m256/) must be turned into a passed by address of a temporary buffer (i.e. $T1, $T2, $T3 in the figure above) allocated in caller's local stack as shown in the figure above. We have been receiving increasing concerns about this inefficiency in past years, especially from game, graphic, video/audio, and codec domains. A concrete example is MS XNA library in which passing vector arguments is a common pattern in many APIs of XNAMath library. The inefficiency will be intensified on upcoming AVX2/AVX3 and future processors with wider vector registers.<
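To make the quoted inefficiency concrete, here is a hedged C++ sketch contrasting
the two conventions (combine is an invented function):

#include <xmmintrin.h>

// Default x64 convention: each __m128 argument is spilled to a
// caller-allocated 16-byte temporary and passed by address, so the vector
// data goes through memory (compare the movaps stores in the __Dmain
// listing above).
__m128 combine(__m128 a, __m128 b, __m128 c, __m128 d);

// __vectorcall: the first six vector arguments arrive directly in
// XMM0..XMM5 and the result is returned in XMM0, so no temporaries are needed.
__m128 __vectorcall combine_vc(__m128 a, __m128 b, __m128 c, __m128 d);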
On the other hand, small functions get inlined, and introducing a new calling
convention has a disadvantage, as noted in a comment by Iain Buclaw:
> I'd vote for not adding more fluff which makes ABI differences
> between compilers greater. But it certainly looks like if would
> be useful if you wish to save the time taken to copy the vector
> from XMM registers onto the stack and back again when passing
> values around.
Maybe such a vector calling convention will become more standard in the future,
as it seems a useful improvement.