core.simd woes

Mon Oct 15 07:07:45 PDT 2012

On Monday, 15 October 2012 at 13:43:28 UTC, Manu wrote:
> On 15 October 2012 16:34, jerro <a at a.com> wrote:
>
>> On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
>>
>>> On 15 October 2012 02:50, jerro <a at a.com> wrote:
>>>
>>>>> Speaking of tests - are they available somewhere? Now that
>>>>> LDC at least theoretically supports most of the GCC builtins,
>>>>> I'd like to throw some tests at it to see what happens.
>>>>>
>>>>> David
>>>>>
>>>> I have a fork of std.simd with LDC support at
>>>> https://github.com/jerro/phobos/tree/std.simd
>>>> and some tests for it at
>>>> https://github.com/jerro/std.simd-tests.
>>>>
>>> Awesome. Pull request plz! :)
>>>
>>
>> I did change the API for a few functions, like loadUnaligned,
>> though. In those cases the signatures needed to be changed because
>> the functions used T or T* for scalar parameters and return types
>> and Vector!T for the vector parameters and return types. This only
>> compiles if T is a static array, which I don't think makes much
>> sense. I changed those to take the vector type as a template
>> parameter. The vector type cannot be inferred from the scalar type
>> because you can use vector registers of different sizes
>> simultaneously (with AVX, for example). Because of that, the vector
>> type must be passed explicitly for some functions, so I made it the
>> first template parameter in those cases, so that Ver doesn't always
>> need to be specified.
>>
>> There is one more issue that I need to solve (and that may be a
>> problem in some cases with GDC too): the pure, @safe and @nothrow
>> attributes. Currently, GCC builtin declarations in LDC have none of
>> those attributes (I have to look into which of those can be added
>> and whether it can be done automatically). I've just commented out
>> the attributes in my std.simd fork for now, but this isn't a proper
>> solution.
>>
>>> That said, how did you come up with a lot of these
>>> implementations? Some don't look particularly efficient, and
>>> others don't even look right. xor for instance:
>>> return cast(T) (cast(int4) v1 ^ cast(int4) v2);
>>>
>>> This is wrong for float types. x86 has separate instructions for
>>> doing this to floats, which make sure to do the right thing by the
>>> flags registers. Most of the LDC blocks assume that it could be
>>> any architecture... I don't think this will produce good portable
>>> code. It needs to be much more carefully hand-crafted, but it's a
>>> nice working start.
>>>
>>
>> The problem is that LLVM doesn't provide intrinsics for those
>> operations. The xor function does compile to a single xorps
>> instruction when compiling with -O1 or higher, though. I have
>> looked at the code generated for many of those LDC blocks (most, I
>> think, but not for all possible types), and most of them compile to
>> the appropriate single instruction when compiled with -O2 or -O3,
>> even the ones for which the D source code looks horribly
>> inefficient, such as loadUnaligned.
>>
>> By the way, clang does those in a similar way. For example, here is
>> what clang emits for a wrapper around _mm_xor_ps when compiled with
>> -O1 -emit-llvm:
>>
>> define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable readnone {
>>   %1 = bitcast <4 x float> %a to <4 x i32>
>>   %2 = bitcast <4 x float> %b to <4 x i32>
>>   %3 = xor <4 x i32> %1, %2
>>   %4 = bitcast <4 x i32> %3 to <4 x float>
>>   ret <4 x float> %4
>> }
>>
>> AFAICT, the only way to ensure that a certain instruction will be
>> used with LDC when there is no LLVM intrinsic for it is to use
>> inline assembly expressions. I remember having some problems with
>> those in the past, but it could be that I was doing something
>> wrong. Maybe we should look into that option too.
>>
>
> Inline assembly usually ruins optimising (code reordering around
> inline asm blocks is usually considered impossible).

I don't see a reason why the compiler couldn't reorder code 
around GCC-style inline assembly blocks. You are supposed to 
specify the block's inputs and outputs and which registers it 
changes. Doesn't that give the compiler enough information to 
reorder code?
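
To illustrate what I mean, here is a minimal sketch in C (C rather 
than D, because the constraint syntax is GCC's own). The names 
xor_ps_asm and check_xor_ps_asm are made up for this example, and it 
assumes an x86 target with SSE and a compiler that accepts GCC 
extended asm. As far as I understand it, an asm statement that is not 
volatile and has no "memory" clobber is treated as a function of its 
operands, so the compiler should be free to move it, merge duplicates 
or drop it if the result is unused.

```c
#include <assert.h>
#include <string.h>
#include <xmmintrin.h>

/* Hypothetical helper: XOR two float vectors with an explicit xorps.
   Assumes x86 with SSE and a compiler that accepts GCC extended asm. */
static inline __m128 xor_ps_asm(__m128 a, __m128 b)
{
    /* "+x"(a): a lives in an SSE register and is both read and written.
       "x"(b):  b lives in an SSE register and is only read.
       No "memory" clobber and no volatile: the block is described as a
       pure function of its operands. */
    __asm__("xorps %1, %0" : "+x"(a) : "x"(b));
    return a;
}

/* Hypothetical self-check: compare against the _mm_xor_ps intrinsic. */
static void check_xor_ps_asm(void)
{
    __m128 a = _mm_set_ps(1.0f, -2.0f, 3.5f, 0.0f);
    __m128 sign = _mm_set1_ps(-0.0f);   /* only the sign bit set */
    __m128 got = xor_ps_asm(a, sign);
    __m128 want = _mm_xor_ps(a, sign);
    assert(memcmp(&got, &want, sizeof got) == 0);
}
```

Whether LDC's inline assembly expressions give the optimizer the same 
freedom is something I haven't verified.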

> It's interesting that the x86 codegen makes such good sense of
> those sequences, but I'm rather more concerned about other
> platforms. I wonder if other platforms have a similarly incomplete
> subset of intrinsics? :/

It looks to me like LLVM does provide intrinsics for those 
operations that can't be expressed in other ways. So my guess is 
that if some intrinsics are absolutely needed for some platform, 
they will probably be there. If an intrinsic is needed, I also 
don't see a reason why they wouldn't accept a patch that adds it.