core.simd woes

Mon Oct 15 06:34:36 PDT 2012

On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
> On 15 October 2012 02:50, jerro <a at a.com> wrote:
>
>>> Speaking of tests - are they available somewhere? Now that LDC at least
>>> theoretically supports most of the GCC builtins, I'd like to throw some
>>> tests at it to see what happens.
>>>
>>> David
>>
>> I have a fork of std.simd with LDC support at
>> https://github.com/jerro/phobos/tree/std.simd and some tests for it at
>> https://github.com/jerro/std.simd-tests.
>>
>
> Awesome. Pull request plz! :)

I did change the API of a few functions, such as loadUnaligned,
though. Their signatures had to change because they used T or T*
for scalar parameters and return types, and Vector!T for vector
parameters and return types. That only compiles if T is a static
array, which I don't think makes much sense. I changed those
functions to take the vector type as a template parameter. The
vector type cannot be inferred from the scalar type, because you
can use vector registers of different sizes simultaneously (with
AVX, for example). Since the vector type must be passed explicitly
to those functions, I made it the first template parameter, so
that Ver doesn't always need to be specified.
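
To illustrate the signature change, here is a rough C++ analogue built
on GCC/Clang vector extensions. It is not the std.simd code: float4,
float8 and this loadUnaligned are hypothetical names for the sketch, and
the Ver parameter is left out. The only point is that a float* argument
cannot select between a 128-bit and a 256-bit result, so the vector type
comes first and is given explicitly, while the scalar type is deduced.

```cpp
#include <cstring>

// Hypothetical vector types for this sketch (GCC/Clang vector extensions).
typedef float float4 __attribute__((vector_size(16)));  // 128-bit, e.g. SSE
typedef float float8 __attribute__((vector_size(32)));  // 256-bit, e.g. AVX

// V (the vector type) is the first template parameter and must be given
// explicitly; T (the scalar type) is deduced from the pointer argument.
template <class V, class T>
V loadUnaligned(const T* p)
{
    static_assert(sizeof(V) % sizeof(T) == 0,
                  "vector size must be a multiple of the scalar size");
    V v;
    std::memcpy(&v, p, sizeof(V));  // copies sizeof(V) bytes; p may be unaligned
    return v;
}
```

With this shape, loadUnaligned<float4>(p) and loadUnaligned<float8>(p)
both accept the same float* p.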

There is one more issue that I need to solve (and that may be a
problem in some cases with GDC too): the pure, @safe and
@nothrow attributes. Currently the GCC builtin declarations in LDC
have none of those attributes (I have to look into which of them
can be added, and whether that can be done automatically). For now
I've just commented out the attributes in my std.simd fork, but
that isn't a proper solution.

> That said, how did you come up with a lot of these implementations? Some
> don't look particularly efficient, and others don't even look right.
> xor for instance:
> return cast(T) (cast(int4) v1 ^ cast(int4) v2);
>
> This is wrong for float types. x86 has separate instructions for doing this
> to floats, which make sure to do the right thing by the flags registers.
> Most of the LDC blocks assume that it could be any architecture... I don't
> think this will produce good portable code. It needs to be much more
> carefully hand-crafted, but it's a nice working start.

The problem is that LLVM doesn't provide intrinsics for those
operations. The xor function does compile to a single xorps
instruction at -O1 or higher, though. I have looked at the code
generated for many of those LDC blocks (most, I think, though not
for all possible types), and most of them compile to the
appropriate single instruction at -O2 or -O3. That includes the
ones whose D source looks horribly inefficient, such as
loadUnaligned.

By the way, clang does those in a similar way. For example, here 
is what clang emits for a wrapper around _mm_xor_ps when compiled 
with -O1 -emit-llvm:

define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable readnone {
   %1 = bitcast <4 x float> %a to <4 x i32>
   %2 = bitcast <4 x float> %b to <4 x i32>
   %3 = xor <4 x i32> %1, %2
   %4 = bitcast <4 x i32> %3 to <4 x float>
   ret <4 x float> %4
}
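
For reference, a wrapper of that kind could look like the C++ sketch
below. This is an assumed reconstruction, not the exact source that
produced the IR above. The second function, xorBits (a hypothetical
name), writes the same cast-xor-cast pattern by hand with GCC/Clang
vector extensions, which is what the D expression
cast(T) (cast(int4) v1 ^ cast(int4) v2) amounts to.

```cpp
#include <cstdint>

#if defined(__SSE__)
#include <xmmintrin.h>

// Assumed shape of the wrapper: it just forwards to the SSE intrinsic.
__m128 foo(__m128 a, __m128 b)
{
    return _mm_xor_ps(a, b);
}
#endif

// Hypothetical vector types for this sketch.
typedef float        float4 __attribute__((vector_size(16)));
typedef std::int32_t int4   __attribute__((vector_size(16)));

// Same-size vector casts reinterpret the bits, so this is
// bitcast -> integer xor -> bitcast, with no value conversion.
float4 xorBits(float4 a, float4 b)
{
    return (float4)((int4)a ^ (int4)b);
}
```

Xoring with a vector of -0.0f flips only the sign bits, which is an easy
way to check that the casts reinterpret bits rather than convert values.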

AFAICT, the only way to ensure that a certain instruction will be 
used with LDC when there is no LLVM intrinsic for it is to use 
inline assembly expressions. I remember having some problems with 
those in the past, but it could be that I was doing something 
wrong. Maybe we should look into that option too.
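
As a rough illustration of the inline assembly option, here is how
pinning the instruction looks with GCC-style extended asm in C++. This
is only an analogue: LDC's inline assembly expressions use a different
syntax, and float4 and xorForced are hypothetical names for the sketch.
It assumes the default AT&T assembler syntax, and falls back to the
cast-based xor when SSE is not available.

```cpp
// Hypothetical vector type for this sketch (GCC/Clang vector extensions).
typedef float float4 __attribute__((vector_size(16)));

float4 xorForced(float4 a, float4 b)
{
#if defined(__SSE__)
    // AT&T operand order is "xorps source, destination".
    // "+x" keeps a in an SSE register as a read/write operand,
    // "x" keeps b in an SSE register as an input.
    __asm__("xorps %1, %0" : "+x"(a) : "x"(b));
    return a;
#else
    // Fallback without SSE: reinterpret as integers, xor, reinterpret back.
    typedef int int4 __attribute__((vector_size(16)));
    return (float4)((int4)a ^ (int4)b);
#endif
}
```

The trade-off is that the optimizer treats the asm statement as opaque,
so it cannot constant-fold it or combine it with neighbouring
operations the way it can with the plain xor.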