core.simd woes

Manu turkeyman at gmail.com
Mon Oct 15 07:45:15 PDT 2012


On 15 October 2012 17:07, jerro <a at a.com> wrote:

> On Monday, 15 October 2012 at 13:43:28 UTC, Manu wrote:
>
>> On 15 October 2012 16:34, jerro <a at a.com> wrote:
>>
>>> On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
>>>
>>>> On 15 October 2012 02:50, jerro <a at a.com> wrote:
>>>>
>>>>>> Speaking of tests – are they available somewhere? Now that LDC at
>>>>>> least theoretically supports most of the GCC builtins, I'd like to
>>>>>> throw some tests at it to see what happens.
>>>>>>
>>>>>> David
>>>>>
>>>>> I have a fork of std.simd with LDC support at
>>>>> https://github.com/jerro/phobos/tree/std.simd and some tests for it
>>>>> at https://github.com/jerro/std.simd-tests.
>>>>
>>>> Awesome. Pull request plz! :)
>>>>
>>>
>>> I did change the API for a few functions like loadUnaligned, though.
>>> In those cases the signatures needed to be changed because the
>>> functions used T or T* for scalar parameters and return types and
>>> Vector!T for the vector parameters and return types. This only
>>> compiles if T is a static array, which I don't think makes much
>>> sense. I changed those to take the vector type as a template
>>> parameter. The vector type cannot be inferred from the scalar type,
>>> because you can use vector registers of different sizes
>>> simultaneously (with AVX, for example). Because of that the vector
>>> type must be passed explicitly for some functions, so I made it the
>>> first template parameter in those cases, so that Ver doesn't always
>>> need to be specified.
>>>
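
If I follow, the reshaped signature looks something like this (a sketch
only, not your actual code; the loop body is a naive stand-in for
whatever unaligned-load builtin each target provides):

  import core.simd;

  // V is an explicit template parameter because it can't be inferred
  // from the scalar element type: with AVX enabled, a float* could just
  // as well be loaded into a float4 or a float8.
  V loadUnaligned(V, T)(in T* ptr)
  {
      V result;
      foreach (i; 0 .. V.sizeof / T.sizeof)
          result.array[i] = ptr[i];
      return result;
  }

  // float4 a = loadUnaligned!float4(p);
  // float8 b = loadUnaligned!float8(p); // both sizes in the same code
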
>>> There is one more issue that I need to solve (and that may be a
>>> problem in some cases with GDC too) - the pure, @safe and nothrow
>>> attributes. Currently GCC builtin declarations in LDC have none of
>>> those attributes (I have to look into which of them can be added, and
>>> whether it can be done automatically). I've just commented out the
>>> attributes in my std.simd fork for now, but this isn't a proper
>>> solution.
>>>
>>>> That said, how did you come up with a lot of these implementations?
>>>> Some don't look particularly efficient, and others don't even look
>>>> right. xor, for instance:
>>>>
>>>>   return cast(T) (cast(int4) v1 ^ cast(int4) v2);
>>>>
>>>> This is wrong for float types. x86 has separate instructions for
>>>> doing this to floats, which make sure to do the right thing by the
>>>> flags registers. Most of the LDC blocks assume that it could be any
>>>> architecture... I don't think this will produce good portable code.
>>>> It needs to be much more carefully hand-crafted, but it's a nice
>>>> working start.
>>>>
>>>>
>>> The problem is that LLVM doesn't provide intrinsics for those
>>> operations. The xor function does compile to a single xorps
>>> instruction when compiling with -O1 or higher, though. I have looked
>>> at the code generated for many (most, I think, but not all possible
>>> types) of those LDC blocks, and most of them compile to the
>>> appropriate single instruction when compiled with -O2 or -O3. Even
>>> the ones for which the D source code looks horribly inefficient,
>>> like loadUnaligned, for example.
>>>
>>> By the way, clang does those in a similar way. For example, here is what
>>> clang emits for a wrapper around _mm_xor_ps when compiled with -O1
>>> -emit-llvm:
>>>
>>> define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable
>>> readnone {
>>>   %1 = bitcast <4 x float> %a to <4 x i32>
>>>   %2 = bitcast <4 x float> %b to <4 x i32>
>>>   %3 = xor <4 x i32> %1, %2
>>>   %4 = bitcast <4 x i32> %3 to <4 x float>
>>>   ret <4 x float> %4
>>> }
>>>
>>> AFAICT, the only way to ensure that a certain instruction will be
>>> used with LDC when there is no LLVM intrinsic for it is to use inline
>>> assembly expressions. I remember having some problems with those in
>>> the past, but it could be that I was doing something wrong. Maybe we
>>> should look into that option too.
>>>
>>>
>> Inline assembly usually ruins optimisation (code reordering around
>> inline asm blocks is usually considered impossible).
>>
>
> I don't see a reason why the compiler couldn't reorder code around GCC
> style inline assembly blocks. You are supposed to specify which registers
> are changed in the block. Doesn't that give the compiler enough information
> to reorder code?


Not necessarily. If you affect various flags registers or whatever, or do
direct memory access, it might violate the compiler's assumptions about
the state of memory/stack.
I don't think I've come in contact with any compilers that aren't
super-conservative about this sort of thing.
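
For example, with GCC-style extended asm (sketching GDC's syntax from
memory; treat the constraint strings as illustrative):

  import core.simd;

  float4 xorps(float4 a, float4 b)
  {
      // Declares only that a is read+written and b is read, both in SSE
      // registers; nothing is said about flags or memory.
      asm { "xorps %1, %0" : "+x" (a) : "x" (b); }
      return a;
  }

As soon as a block has to list "cc" or "memory" in its clobbers, the
compiler must assume the flags and any reachable memory have changed,
and a "memory" clobber in particular acts as a barrier that loads and
stores can't be moved across.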


>> It's interesting that the x86 codegen makes such good sense of those
>> sequences, but I'm rather more concerned about other platforms. I
>> wonder if other platforms have a similarly incomplete subset of
>> intrinsics? :/
>>
>
> It looks to me like LLVM does provide intrinsics for those operations
> that can't be expressed in other ways. So my guess is that if some
> intrinsics are absolutely needed for some platform, they will probably
> be there. If an intrinsic is needed, I also don't see a reason why they
> wouldn't accept a patch that adds it.
>

Fair enough. Interesting to know. This means that cross-platform LDC SIMD
code will need to be thoroughly scrutinised for codegen quality in all
targets.
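
For reference, the pattern in question (specialised to float4 for
concreteness), lifted straight from this thread:

  import core.simd;

  float4 xor(float4 v1, float4 v2)
  {
      // No LLVM intrinsic exists for a float xor, so round-trip through
      // int4 and let the backend pick the instruction; the x86 backend
      // folds this to a single xorps at -O1 and up.
      return cast(float4)(cast(int4)v1 ^ cast(int4)v2);
  }

Whether the other backends fold it as cleanly is exactly the sort of
thing that needs checking.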