x86 intrinsics for sale cheap

Thu Jun 1 05:26:56 UTC 2023

On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:
> On Wednesday, 31 May 2023 at 17:09:38 UTC, Cecil Ward wrote:
>> On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
>>> On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
>>>> On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
>>>
>>
>> Ah, just followed that link. No that’s (solely?) SIMD, 
>> something I was aware of and so I’m not duplicating that as I 
>> haven’t gone near SIMD. The pext instruction would be one 
>> instruction that I attacked some time ago, and that would 
>> already be fine with ARM as there’s a pure D fallback, but 
>> maybe I can find some native ARM equivalent if I study AArch64.
>>
>> So no, this would be something new. Non-SIMD insns for general 
>> use. The smallest instructions might be something like andn if 
>> I can keep to zero-overhead obviously, seeing as the benefit 
>> in the instruction is so tiny anyway. But mind you I could 
>> have done with it for graphics bit twiddling manipulation code.
>
> If you tell LDC the right cpu target, and to use optimization, 
> IE..
>
> "-O -mcpu=haswell"
>
> It will use the andn instruction...
>
> uint foo(uint a, uint b)
> {
>     return a & (b ^ 0xFFFFFFFF);
> }
>
> compiles to ---->
>
> uint example.foo(uint, uint):
>         andn    eax, edi, esi
>         ret
>
> So you will probably find the compiler is already doing what 
> you want if you let it know it can target the right cpu 
> architecture.
>
> I've been writing asm for over 30 years, the opportunities for 
> beating modern compilers have gotten vanishingly small for 
> pretty much everything except for SIMD code. And tbh the 
> differences between CPUs, ie different instruction latency on 
> different architectures, means it's pretty much pointless to 
> chase a few percent here or there, since there's a good chance 
> it'll be a few percent the other way on a different CPU.

I couldn’t agree more. I wrote asm full time for about five years 
at an operating systems outfit. But my aim was just to make these 
instructions available with zero overhead and then, if I can 
somehow work out how to do it, make them switch over to fallbacks 
in pure D _still with zero overhead for the test_, which I think 
is damn near impossible. When I originally thought about andn, 
that looked like the ultimate challenge: the benefit to be had is 
so small that I would absolutely have to have zero overhead or 
it’s hopeless. So I wanted to see whether I could get it to 
inline, checking the GDC and LDC compilers’ behaviour, but I 
haven’t been able to test for inlining of a call into an imported 
module from outside, from another .d file. I don’t have the 
tools right now, long story. But I will do something about that 
when I feel better; I am quite unwell right now.
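
For illustration only, a minimal sketch of the kind of wrapper meant here, assuming a hypothetical helper named andn whose first operand is the inverted one (as in the BMI1 instruction). The names andn and clearBits are made up for this example and do not come from any existing library. The body is plain D, so it is its own fallback on any target; whether GDC or LDC actually inline it across modules and select the andn instruction would still have to be checked in the generated assembly.

```d
// Hypothetical helper: returns ~a & b, the operation performed by the
// x86 BMI1 andn instruction (the first operand is the inverted one).
// Written as a template with an empty parameter list so that the body
// is instantiated in whichever module calls it; pragma(inline, true)
// additionally asks the compiler to inline it.
pragma(inline, true)
uint andn()(uint a, uint b) pure nothrow @nogc @safe
{
    return ~a & b;
}

// Hypothetical caller: clear the bits of `mask` in `pixel`.
uint clearBits(uint pixel, uint mask) pure nothrow @nogc @safe
{
    return andn(mask, pixel);
}
```

Because the fallback is the same source expression on every CPU, there is no run-time test at all in this sketch; the choice of instruction is left to the optimiser and the -mcpu setting.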

As for your insight into LDC and andn. Damn, I missed that. Many 
thanks for your help there. It’s not the first time I’ve seen 
this kind of excellent performance. I haven’t been using LDC 
enough because I am stuffed by the lack of support for