core.simd woes

Manu turkeyman at gmail.com
Mon Oct 15 06:43:18 PDT 2012


On 15 October 2012 16:34, jerro <a at a.com> wrote:

> On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
>
>> On 15 October 2012 02:50, jerro <a at a.com> wrote:
>>
>>>> Speaking of tests – are they available somewhere? Now that LDC at least
>>>> theoretically supports most of the GCC builtins, I'd like to throw some
>>>> tests at it to see what happens.
>>>>
>>>> David
>>>>
>>> I have a fork of std.simd with LDC support at
>>> https://github.com/jerro/phobos/tree/std.simd and some tests for it at
>>> https://github.com/jerro/std.simd-tests.
>>>
>> Awesome. Pull request plz! :)
>>
>
> I did change an API for a few functions like loadUnaligned, though. In
> those cases the signatures needed to be changed because the functions used
> T or T* for scalar parameters and return types and Vector!T for the vector
> parameters and return types. This only compiles if T is a static array
> which I don't think makes much sense. I changed those to take the vector
> type as a template parameter. The vector type can not be inferred from the
> scalar type because you can use vector registers of different sizes
> simultaneously (with AVX, for example). Because of that the vector type
> must be passed explicitly for some functions, so I made it the first
> template parameter in those cases, so that Ver doesn't always need to be
> specified.
>
> There is one more issue that I need to solve (and that may be a problem in
> some cases with GDC too) - the pure, @safe and @nothrow attributes.
> Currently gcc builtin declarations in LDC have none of those attributes (I
> have to look into which of those can be added and if it can be done
> automatically). I've just commented out the attributes in my std.simd fork
> for now, but this isn't a proper solution.
>
>
>
>> That said, how did you come up with a lot of these implementations? Some
>> don't look particularly efficient, and others don't even look right.
>> xor for instance:
>> return cast(T) (cast(int4) v1 ^ cast(int4) v2);
>>
>> This is wrong for float types. x86 has separate instructions for doing
>> this to floats, which make sure to do the right thing by the flags
>> registers. Most of the LDC blocks assume that it could be any
>> architecture... I don't think this will produce good portable code. It
>> needs to be much more carefully hand-crafted, but it's a nice working start.
>>
>
> The problem is that LLVM doesn't provide intrinsics for those operations.
> The xor function does compile to a single xorps instruction when compiling
> with -O1 or higher, though. I have looked at the code generated for many
> of those LDC blocks (most, I think, though not for all possible types), and
> most of them compile to the appropriate single instruction when compiled
> with -O2 or -O3, even the ones for which the D source code looks horribly
> inefficient, such as loadUnaligned.
>
> By the way, clang does those in a similar way. For example, here is what
> clang emits for a wrapper around _mm_xor_ps when compiled with -O1
> -emit-llvm:
>
> define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable readnone {
>   %1 = bitcast <4 x float> %a to <4 x i32>
>   %2 = bitcast <4 x float> %b to <4 x i32>
>   %3 = xor <4 x i32> %1, %2
>   %4 = bitcast <4 x i32> %3 to <4 x float>
>   ret <4 x float> %4
> }
>
> AFAICT, the only way to ensure that a certain instruction will be used
> with LDC when there is no LLVM intrinsic for it is to use inline assembly
> expressions. I remember having some problems with those in the past, but it
> could be that I was doing something wrong. Maybe we should look into that
> option too.
>

Inline assembly usually ruins optimising (code reordering around inline asm
blocks is usually considered impossible).
It's interesting that the x86 codegen makes such good sense of those
sequences, but I'm rather more concerned about other platforms. I wonder if
other platforms have a similarly incomplete subset of intrinsics? :/
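
For anyone who wants to try the cast-and-xor pattern on another platform, here is a rough C sketch of the same idea using the GCC/Clang vector extensions rather than x86-specific intrinsics. The names float4, int4, xor_f4 and check_xor_f4 are made up for illustration and are not part of std.simd or any library. Whether this becomes a single instruction on a given target depends on the compiler, the optimisation level and the architecture; that would need checking per platform.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang vector extensions: 16-byte vectors with four 32-bit lanes.
   These type names are hypothetical, chosen to echo the D code above. */
typedef float   float4 __attribute__((vector_size(16)));
typedef int32_t int4   __attribute__((vector_size(16)));

/* Bitwise xor of two float vectors. A cast between vector types of the
   same size reinterprets the bits (like an LLVM bitcast); it does not
   convert the values. */
static float4 xor_f4(float4 a, float4 b)
{
    return (float4)((int4)a ^ (int4)b);
}

/* Small self-check: -0.0f has only the sign bit set, so xor-ing with it
   flips the sign of every lane and leaves the magnitude untouched. */
static void check_xor_f4(void)
{
    float4 v    = { 1.0f, -2.0f, 3.5f, -0.25f };
    float4 sign = { -0.0f, -0.0f, -0.0f, -0.0f };
    float4 r    = xor_f4(v, sign);

    assert(r[0] == -1.0f);
    assert(r[1] ==  2.0f);
    assert(r[2] == -3.5f);
    assert(r[3] ==  0.25f);
}
```

Calling check_xor_f4() from a test program and inspecting the compiler's assembly output for xor_f4 on each target of interest would show how well the codegen copes outside x86.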