LDC inline assembly expressions

Mon Dec 24 01:40:42 UTC 2018

On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
> On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
>> Is there any difference between using this vs the other method 
>> of doing intrinsics?
>
> Assuming there's really no LLVM intrinsic for your desired 
> instruction, the manual variant is what it is, a regular 
> function with an inline asm expression. I guess the LLVM 
> backends lower calls to these instruction-intrinsics directly 
> to inline asm expressions in the caller. With inlining, it 
> might result in equivalent final asm.
>
> My version above with the memory indirection isn't nice, this 
> is better:
>
> extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
>   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", 
> a, b);
> }

So I had this:

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
   return __asm!__m128i("pcmpgtd $2,$1","=x,x,x",a,b);
}

It looked OK at first, but it's actually wrong: pcmpgtd writes 
its result to $1, which is 'a', and never writes to $0, which is 
the return value. So it overwrites one of the inputs and leaves 
the output unwritten. It actually needs to be this:

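__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
     // "=&x" marks the output as early-clobber, so it cannot be
     // given the same register as 'b' before the copy happens.
     return __asm!__m128i("
         movdqu $1,$0
         pcmpgtd $2,$0",
         "=&x,x,x", a,b);
}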

Basically, copy 'a' to the output, then do the compare between 
'b' and the output.
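
For sanity-checking any of these variants, a scalar model of what 
pcmpgtd computes can be handy. This is only a sketch: cmpgtRef is 
a made-up helper name, not part of any library, and it takes plain 
int[4] arrays instead of int4 so it does not depend on LDC.

```d
// Hypothetical reference helper: per-lane signed "greater than",
// producing all bits set (-1) where a[i] > b[i] and 0 elsewhere.
int[4] cmpgtRef(const int[4] a, const int[4] b) {
    int[4] r;
    foreach (i; 0 .. 4)
        r[i] = a[i] > b[i] ? -1 : 0;
    return r;
}
```

The asm version can then be compared against it through the 
.array property of the returned vector.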

I don't think there's any way to get around the temporary copy, 
since avoiding it depends on knowing whether 'a' is ever used 
after the compare. And it doesn't seem like the optimiser can 
cull it away in this case.
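
One thing that might be worth trying is a tied input constraint. 
LLVM's constraint syntax lets an input be written as the number of 
an output, which asks for both to share one register. The sketch 
below assumes LDC passes the constraint string through to LLVM 
unchanged and that __m128i is an alias for int4; cmpgt_tied is a 
made-up name. If it works as expected, the register allocator 
decides whether a copy of 'a' is needed, so the copy should only 
appear when 'a' is still live after the compare.

```d
import core.simd;    // int4
import ldc.llvmasm;  // __asm (LDC only, x86 with SSE2)

alias __m128i = int4; // assumption: same alias as in the snippets above

__m128i cmpgt_tied(__m128i a, __m128i b) {
    // "0" ties the first input ('a') to the register of output $0.
    // $2 is 'b'. AT&T operand order: source first, destination last,
    // so the compare result lands in the output register.
    return __asm!__m128i("pcmpgtd $2,$0", "=x,0,x", a, b);
}
```

I haven't checked the generated asm for this, so treat it as 
something to verify rather than a confirmed fix.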