Any usable SIMD implementation?

Tue Apr 12 09:53:16 PDT 2016

On Mon, 11 Apr 2016 14:29:11 -0700,
Walter Bright <newshound2 at digitalmars.com> wrote:

> On 4/11/2016 7:24 AM, Marco Leise wrote:
> > On Mon, 4 Apr 2016 11:43:58 -0700,
> > Walter Bright <newshound2 at digitalmars.com> wrote:
> >  
> >> On 4/4/2016 9:21 AM, Marco Leise wrote:  
> >>>     To put this to good use, we need a reliable way - basically
> >>>     a global variable - to check for SSE4 (or POPCNT, etc.). What
> >>>     we have now does not work across all compilers.  
> >>
> >> http://dlang.org/phobos/core_cpuid.html  
> >
> > That's what I implied in "what we have now":
> >
> > 	import core.cpuid;
> >
> > 	writeln( mmx );  // prints 'false' with GDC
> > 	version(InlineAsm_X86_Any)
> > 		writeln("DMD and LDC support the Dlang inline assembler");
> > 	else
> > 		writeln("GDC has the GCC extended inline assembler");  
> 
> There's no reason core.cpuid, which has a platform-independent API, cannot be 
> made to work with GDC and LDC. Adding more global variables to do the same thing 
> would add no value and would not be easier to implement.

LDC implements InlineAsm_X86_Any (DMD-style asm), so
core.cpuid works there. GDC is the only compiler that does not
implement it. We agree that core.cpuid should provide this
information, but what we have now, core.cpuid combined with
GDC's lack of DMD-style asm, does not work in practice and
will not for years to come.
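
For illustration (not code from this thread), the kind of check
under discussion might look like the sketch below. It assumes
hasPopcnt from core.cpuid and popcnt from core.bitop; the names
countBits and popcountFallback are hypothetical. With a compiler
where core.cpuid reports false for every feature, it would simply
always take the fallback path.

```d
import core.bitop : popcnt;
import core.cpuid : hasPopcnt;

/// Hypothetical portable bit count, used when the feature test fails.
uint popcountFallback(ulong x)
{
    uint n = 0;
    while (x != 0)
    {
        x &= x - 1; // clear the lowest set bit
        ++n;
    }
    return n;
}

/// Hypothetical dispatcher: asks core.cpuid at run time.
/// Whether popcnt() becomes a single instruction is up to the
/// compiler and its flags; that part is an assumption.
uint countBits(ulong x)
{
    if (hasPopcnt)
        return popcnt(x);
    return popcountFallback(x);
}
```

Both paths return the same value, so the feature test only decides
how the result is computed.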

> > Both LLVM and GCC have moved to "extended inline assemblers"
> > that require you to provide information about input, output
> > and scratch registers as well as memory locations, so the
> > compiler can see through the asm-block for register allocation
> > and inlining purposes. It's more difficult to get right, but
> > also more rewarding, as it enables you to write no-overhead
> > "one-liners" and "intrinsics" while having calling conventions
> > still handled by the compiler.  
> 
> I know, but "more difficult" is a bit of an understatement. For example, 
> core.cpuid has not been implemented using those assemblers.

Yep, and that makes it unavailable in GDC. All feature tests
return false, even for MMX or SSE2 on amd64, where both are
always present.

> BTW, dmd's inline assembler does know about which instructions read/write which 
> registers, and makes use of that when inserting the code so it will work with 
> the rest of the code generator's register usage tracking.

That is a pleasant surprise. :)

> I find needing to tell gcc which registers are read/written by a particular 
> instruction to be a step BACKWARDS in usability. This is what computers are 
> supposed to be good for :-)

Still, DMD does not inline functions containing asm and always
adds a function prolog and epilog around asm blocks, even in an
otherwise empty function (correct me if I'm wrong). Avoiding
that with "naked" means you have to duplicate code for the
different calling conventions, in particular Win32.

Your view of GCC (and LLVM) may be a bit biased. First of all
you don't need to tell it exactly which registers to use. A
rough classification is enough and gives the compiler a good
idea of where calculations should be stored upon arrival at
the asm statement. You can be specific down to the register
name or let the backend choose freely with "rm" (= any register
or memory).
An example: We have a variable x that is computed inside a
function, followed by an asm block that multiplies it by
something else. Typically you would "MOV EAX, [x]" to load x
into the register that the MUL instruction expects. With
extended assemblers you can be declarative about that and just
state that x is needed in EAX as an input. You drop the MOV
from the asm block and let the compiler figure out in its
codegen how x will end up in EAX. That's a step FORWARD in
usability.
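
A rough sketch of that MUL example, assuming GDC's GCC-style
extended asm syntax on x86 or x86-64 and falling back to plain D
everywhere else. The helper name mulWide and the version
identifier GccAsmX86 are made up for this illustration.

```d
// Take the extended asm path only for GDC on x86/x86-64.
version (GNU)
{
    version (X86)    version = GccAsmX86;
    version (X86_64) version = GccAsmX86;
}

/// Hypothetical helper: 32x32 -> 64-bit unsigned multiply.
ulong mulWide(uint x, uint y)
{
    version (GccAsmX86)
    {
        uint lo, hi;
        // "a" (x):  x is required in EAX, so no "MOV EAX, [x]" here.
        // "rm" (y): any register or memory location will do.
        // MUL leaves the product in EDX:EAX ("=d" and "=a").
        asm { "mull %3" : "=a" (lo), "=d" (hi) : "a" (x), "rm" (y) : "cc"; }
        return (cast(ulong) hi << 32) | lo;
    }
    else
    {
        // Portable path for every other compiler and target.
        return cast(ulong) x * y;
    }
}
```

The asm block contains only the MUL; getting x into EAX and
picking up the two result halves is left to the register
allocator.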

> DMD doesn't inline functions with asm in them, but that is not the fault of the 
> inline assembler.
> 
> The only real weakness in the DMD inline assembler is it doesn't support "let 
> the compiler select the register". DMD's strong support for compiler builtins, 
> however, mitigates this to an acceptable level.

Yes, I've witnessed that in multiply with overflow check.
DMD generates very efficient code for 'mulu'. It's just that
the compiler cannot have builtins for everything. (I
personally was looking for 64-bit multiply with 128-bit
result and SSE4 string scanning.)
The extended assemblers in GCC and LLVM allow me to write
intrinsics, often as a single(!) instruction, that seamlessly
inline into the surrounding code, just as DMD's builtins
would do.
And it seems to me we could have less backend complexity if we
were able to implement intrinsics as library code with the
same efficiency. ;) But most of the time when I want to access
a specialized CPU instruction for speed with asm in DMD, the
generic pure D code is faster. I would advise using asm there
only if the concept is not expressible in pure D at the moment.
You might add that we shouldn't write asm in the first place,
because compilers have become smart enough, but it's not
as if I were writing large chunks of asm. I use it to write
"compiler builtins" in D source code.
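
For reference, the overflow-checked multiply mentioned above
needs no asm at all. A minimal sketch, assuming mulu from
core.checkedint; the wrapper name checkedMul is hypothetical:

```d
import core.checkedint : mulu;

/// Hypothetical wrapper: stores a * b in product and returns
/// false if the multiplication overflowed 64 bits.
bool checkedMul(ulong a, ulong b, out ulong product)
{
    bool overflow = false; // mulu only sets this flag, never clears it
    product = mulu(a, b, overflow);
    return !overflow;
}
```

This is the pure D case: the concept is already expressible, so
there is nothing to gain from writing the MUL by hand.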

-- 
Marco