x86 intrinsics for sale cheap
Cecil Ward
cecil at cecilward.com
Thu Jun 1 05:26:56 UTC 2023
On Wednesday, 31 May 2023 at 23:18:44 UTC, claptrap wrote:
> On Wednesday, 31 May 2023 at 17:09:38 UTC, Cecil Ward wrote:
>> On Wednesday, 31 May 2023 at 16:51:42 UTC, Cecil Ward wrote:
>>> On Wednesday, 31 May 2023 at 16:45:35 UTC, max haughton wrote:
>>>> On Wednesday, 31 May 2023 at 16:33:47 UTC, Cecil Ward wrote:
>>>
>>
>> Ah, just followed that link. No that’s (solely?) SIMD,
>> something I was aware of and so I’m not duplicating that as I
>> haven’t gone near SIMD. The pext instruction would be one
>> instruction that I attacked some time ago, and that would
>> already be fine with ARM as there’s a pure D fallback, but
>> maybe I can find some native ARM equivalent if I study AArch64.
>>
>> So no, this would be something new: non-SIMD insns for general
>> use. The smallest would be something like andn, provided I can
>> keep it zero-overhead, since the benefit of that instruction is
>> so tiny anyway. Mind you, I could have done with it for graphics
>> bit-twiddling code.
>
> If you tell LDC the right CPU target and enable optimization,
> i.e.
>
> "-O -mcpu=haswell"
>
> It will use the andn instruction...
>
> uint foo(uint a, uint b)
> {
> return a & (b ^ 0xFFFFFFFF);
> }
>
> compiles to ---->
>
> uint example.foo(uint, uint):
> andn eax, edi, esi
> ret
>
> So you will probably find the compiler is already doing what
> you want if you let it know it can target the right cpu
> architecture.
>
> I've been writing asm for over 30 years, the opportunities for
> beating modern compilers have gotten vanishingly small for
> pretty much everything except for SIMD code. And tbh the
> differences between CPUs, ie different instruction latency on
> different architectures, means it's pretty much pointless to
> chase a few percent here or there, since there's a good chance
> it'll be a few percent the other way on a different CPU.
I couldn’t agree more. I wrote asm full time for about five years
at an operating systems outfit. But my aim was to just make these
instructions available with zero overhead and then, if I can
somehow work out how to do it, make them switch over to fallbacks
in pure D _still with zero overhead for the test_, which I think
is damn near impossible. When I originally thought about andn,
that would be the ultimate challenge, because the benefit to be
had is so very small that I would absolutely have to have zero
overhead or it’s hopeless. So I wanted to see if I could get it
to inline, checking the GDC and LDC compilers’ behaviour, but I
haven’t been able to test inlining of calls into an imported
module, i.e. from another .d file. I don’t have the tools right
now, long story, but I will do something about that when I feel
better, as I am quite unwell right now.
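A minimal sketch of the zero-overhead idea in D, assuming the CPU-feature choice is made at compile time rather than tested at run time: static if discards the unused branch, so no test survives into the generated code. The names andn, pext, pextFallback and haveNativePext are hypothetical. The native PEXT path is deliberately left out, and the flag is hard-wired to false. LDC's __traits(targetHasFeature, "bmi2") might be one way to set it, but treat that as an assumption to check. The andn wrapper is just plain D, relying on the optimizer to select ANDN as shown above for LDC.

```d
/// a & ~b, the operation BMI1 ANDN performs.
/// Plain D on purpose: an optimising compiler targeting a BMI1-capable CPU
/// (e.g. LDC with -O -mcpu=haswell) can select ANDN itself; elsewhere it
/// typically becomes a NOT followed by an AND.
pragma(inline, true)
uint andn(uint a, uint b) pure nothrow @nogc @safe
{
    return a & ~b;
}

/// Portable emulation of BMI2 PEXT (parallel bit extract): gathers the bits
/// of `src` selected by `mask` into the low-order bits of the result.
/// Loops once per set bit in `mask`.
uint pextFallback(uint src, uint mask) pure nothrow @nogc @safe
{
    uint result = 0;
    uint outBit = 1;                      // next result bit to fill
    while (mask != 0)
    {
        const uint lowest = mask & (0u - mask); // isolate lowest set mask bit
        if (src & lowest)
            result |= outBit;
        outBit <<= 1;
        mask &= mask - 1;                 // clear that mask bit
    }
    return result;
}

// Hypothetical compile-time switch, hard-wired to false so this sketch
// builds anywhere. Because it is a compile-time constant, the choice below
// costs nothing at run time: there is no CPUID test in the generated code.
enum bool haveNativePext = false;

pragma(inline, true)
uint pext(uint src, uint mask) pure nothrow @nogc @safe
{
    static if (haveNativePext)
    {
        // A native path (intrinsic or inline asm) would go here.
        static assert(0, "native pext path not implemented in this sketch");
    }
    else
    {
        return pextFallback(src, mask);
    }
}

unittest
{
    assert(andn(0b1100, 0b1010) == 0b0100);
    assert(andn(0xFFFF_FFFF, 0x0F0F_0F0F) == 0xF0F0_F0F0);
    assert(pext(0b1011_0110, 0b1111_0000) == 0b1011);
    assert(pext(0x1234_5678, 0xFFFF_FFFF) == 0x1234_5678);
    assert(pext(0xFFFF_FFFF, 0x8000_0001) == 0b11);
    assert(pext(42, 0) == 0);
}
```

Saved as, say, bitops.d, the unittest block can be run with dmd -unittest -main -run bitops.d. The trade-off with this approach is that the choice is baked in per build, so one binary will not adapt to the CPU it lands on. Getting the fallback with no runtime test in the code is exactly what that buys.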
As for your insight into LDC and andn: damn, I missed that. Many
thanks for your help there. It’s not the first time I’ve seen
this kind of excellent performance. I haven’t been using LDC
enough because I am stuffed by the lack of support for
More information about the Digitalmars-d
mailing list