core.simd woes
F i L
witte2008 at gmail.com
Mon Aug 6 18:24:07 PDT 2012
On Monday, 6 August 2012 at 15:15:30 UTC, Manu wrote:
> I think core.simd is only designed for the lowest level of access to
> the SIMD hardware. I started writing std.simd some time back; it is
> mostly finished in a fork, but there are some bugs/missing features in
> D's SIMD support preventing me from finishing/releasing it. (incomplete
> dmd implementation, missing intrinsics, no SIMD literals, can't do unit
> testing, etc)
Yes, I found, and have been referring to, your std.simd library
for a while now. Even with your library only supporting GDC at the
moment, it's been a help. Thank you.
> The intention was that std.simd would be a flat C-style API, which
> would be the lowest level required for practical and portable use.
> It's almost done, and it should make it a lot easier for people to
> build their own SIMD libraries on top. It supplies most useful linear
> algebraic operations, and implements them as efficiently as possible
> for architectures other than just SSE.
> Take a look:
> https://github.com/TurkeyMan/phobos/blob/master/std/simd.d
Right now I'm working with DMD on Linux x86_64. LDC doesn't
support SIMD right now, and I haven't built GDC yet, so I can't
do performance comparisons between the two. I really need to get
around to setting up GDC, because I've always planned on using
that as a "release compiler" for my code.
The problem is, as I mentioned above, that SIMD performance
completely gets shot when wrapping a float4 in a struct,
rather than using float4 directly. There are some places
(like matrices) where they do make a big impact, but I'm trying
to find the best solution for general code. For instance, my
current math library looks like:
struct Vector4 { float x, y, z, w; ... }
struct Matrix4 { Vector4 x, y, z, w; ... }
but I was planning on changing over to (something like):
alias float4 Vector4;
alias float4[4] Matrix4;
So I could use the types directly and reap the performance gains.
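For example, with that layout a vector-matrix transform can stay entirely in 4-wide operations (a rough sketch; `transform` is just a hypothetical helper name, and I'm assuming DMD's scalar-broadcast initialization of vector types works as documented):

```d
import core.simd;

alias float4 Vector4;
alias float4[4] Matrix4;   // four row vectors

// Hypothetical row-vector * matrix product: broadcast each component
// of v across a whole register, then accumulate row * component.
// No per-lane scalar arithmetic anywhere.
Vector4 transform(ref const Matrix4 m, Vector4 v)
{
    Vector4 x = v.array[0];   // broadcast to all four lanes
    Vector4 y = v.array[1];
    Vector4 z = v.array[2];
    Vector4 w = v.array[3];
    return m[0] * x + m[1] * y + m[2] * z + m[3] * w;
}
```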
I'm currently doing this to both my D code (still in an early
state) and our C# code for Mono. Both core.simd and Mono.Simd
have "compiler magic" vector types, but Mono's version gives me
access to component channels and simple constructors I can use,
so for user code (and types like the Matrix above, with internal
vectors) it's very convenient and natural. D's simply isn't, and
I'm not sure there's any way around it, since again, at least
with DMD, performance is shot when I put it in a struct.
> On a side note, your example where you're performing a scalar add
> within a vector; this is bad, don't ever do this.
> SSE (ie, x86) is the most tolerant architecture in this regard, but
> it's VERY bad SIMD design. You should never perform any component-wise
> arithmetic when working with SIMD; it's absolutely not portable.
> Basically, a good rule of thumb is: if the keyword 'float' appears
> anywhere that interacts with your SIMD code, you are likely to see
> worse performance than just using float[4] on most architectures.
> Better to factor your code to eliminate any scalar work, and make sure
> 'scalars' are broadcast across all 4 components and continue doing 4d
> operations.
>
> Instead of:
>   @property pure nothrow float x(float4 v) { return v.ptr[0]; }
> Better to use:
>   @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }
Thanks a lot for telling me this; I don't know much about SIMD
stuff. You're actually the exact person I wanted to talk to,
because you do know a lot about this, and I've always respected
your opinions.
I'm not opposed to doing something like:
float4 addX(ref float4 v, float val)
{
    float4 f = 0;       // zero all lanes (default init is NaN)
    f.ptr[0] = val;     // core.simd has no .x property
    v += f;
    return v;
}
to do single-component scalar adds, but it's very inconvenient
for users to remember to use:
vec.addX(scalar);
instead of:
vec.x += scalar;
But that wouldn't be an issue if I could write custom operators
for the components that basically did that. But I can't without
wrapping float, which is why I'm requesting these magic types
get some basic features like that.
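For what it's worth, a mask-based variant that never leaves vector registers might look like this (a sketch; `addXMask` is a hypothetical name, and I'm assuming scalar-to-vector broadcast initialization works as documented):

```d
import core.simd;

// Add a scalar to only the x lane without any per-lane scalar math:
// broadcast the value, then zero out the lanes we don't want.
float4 addXMask(float4 v, float val)
{
    float4 s = val;                // broadcast val into all four lanes
    float4 mask = [1f, 0, 0, 0];   // select only lane 0
    return v + s * mask;
}
```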
I'm wondering if I should be looking at just using inline ASM
and the SIMD instructions directly. I know basic ASM, but
I don't know what the potential pitfalls of doing that are,
especially with portability. Is there a reason not to do this
(short of complexity)? I'm also wondering why wrapping a
core.simd type in a struct completely negates performance. I'm
guessing it's because when I return the struct type, the compiler
has to treat it as a struct, instead of its "magic" type, and
all struct types have a bit more overhead.
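For reference, the wrapped form I've been benchmarking against is essentially this (a minimal hypothetical sketch; the real library has more operators):

```d
import core.simd;

// Minimal wrapper of the kind that loses performance under DMD:
// each operator returns the struct by value, so the vector tends to
// round-trip through memory instead of staying in an XMM register.
struct WrappedVec4
{
    float4 data;

    WrappedVec4 opBinary(string op)(WrappedVec4 rhs)
        if (op == "+" || op == "-")
    {
        return WrappedVec4(mixin("data " ~ op ~ " rhs.data"));
    }
}
```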
On a side note, DMD without SIMD is much faster than C# without
SIMD, usually by a factor of 8x on simple vector types
(micro-benchmarks), and that's not counting the runtimes' startup
times either. However, when I use Mono.Simd, both DMD (with
core.simd) and C# have similar performance (see below). Math code
with Mono C# (with SIMD) actually runs faster on Linux (even
without the SGen GC or LLVM codegen) than it does on Windows 8
with MS .NET, which I find pretty impressive and
encouraging for our future games with Mono on Android (which has
been our biggest performance PITA platform so far).
I've noticed some really odd things with core.simd as well, which
is another reason I'm thinking of trying inline ASM. I'm not sure
what's triggering certain compiler optimizations. For instance,
given the basic test program, when I do:
float rand = ...; // user input value
float4 a, b = [1, 4, -12, 5];
a.ptr[0] = rand;
a.ptr[1] = rand + 1;
a.ptr[2] = rand + 2;
a.ptr[3] = rand + 3;
ulong mil;
StopWatch sw;
foreach (t; 0 .. testCount)
{
    sw.start();
    foreach (i; 0 .. 1_000_000)
    {
        a += b;
        b -= a;
    }
    sw.stop();
    mil += sw.peek().msecs;
    sw.reset();
}
writeln(a.array, ", ", b.array);
writeln(cast(double) mil / testCount);
When I run this on my Phenom II X4 920, it completes in ~9ms. For
comparison, C# Mono.Simd gets almost identical performance with
identical code. However, if I add:
auto vec4(float x, float y, float z, float w)
{
    float4 result;
    result.ptr[0] = x;
    result.ptr[1] = y;
    result.ptr[2] = z;
    result.ptr[3] = w;
    return result;
}
then replace the vector initialization lines:
float4 a, b = [ ... ];
a.ptr[0] = rand;
...
with ones using my factory function:
auto a = vec4(rand, rand+1, rand+2, rand+3);
auto b = vec4(1, 4, -12, 5);
Then the program consistently completes in 2.15ms...
wtf, right? The printed vector output is identical, and there are
no changes to the loop code (a += b, etc.); I just changed the
construction code of the vectors and it runs 4.5x faster. Beats
me, but I'll take it. Btw, for comparison, if I use a struct with
an internal float4 it runs in ~19ms, and a struct with four
floats runs in ~22ms. So you can see my concerns with using
core.simd types directly, especially when my Intel Mac gets even
better improvements with SIMD code.
I haven't done extensive tests on the Intel, but in my original
test (the one above, only in C# using Mono.Simd) the results were
~55ms using a struct with an internal float4, and ~5ms using
float4 directly.
anyways, thanks for the feedback.