Any usable SIMD implementation?

Mon Apr 11 14:29:11 PDT 2016

On 4/11/2016 7:24 AM, Marco Leise wrote:
> On Mon, 4 Apr 2016 11:43:58 -0700,
> Walter Bright <newshound2 at digitalmars.com> wrote:
>
>> On 4/4/2016 9:21 AM, Marco Leise wrote:
>>>     To put this to good use, we need a reliable way - basically
>>>     a global variable - to check for SSE4 (or POPCNT, etc.). What
>>>     we have now does not work across all compilers.
>>
>> http://dlang.org/phobos/core_cpuid.html
>
> That's what I implied in "what we have now":
>
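> 	import core.cpuid;
> 	import std.stdio;
>
> 	writeln( mmx );  // prints 'false' with GDC
> 	// D_InlineAsm_X86 and D_InlineAsm_X86_64 are the predefined identifiers
> 	version(D_InlineAsm_X86)
> 		writeln("DMD and LDC support the Dlang inline assembler");
> 	else version(D_InlineAsm_X86_64)
> 		writeln("DMD and LDC support the Dlang inline assembler");
> 	else
> 		writeln("GDC has the GCC extended inline assembler");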

There's no reason core.cpuid, which has a platform-independent API, cannot be 
made to work with GDC and LDC. Adding more global variables to do the same thing 
would add no value and would not be easier to implement.
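
As a rough sketch of how the existing API can already serve as that "global
variable", the snippet below picks a bit-counting routine once, based on
core.cpuid's hasPopcnt property. The names countLoop, countBuiltin,
selectCounter and CountFn are made up for illustration, and whether
core.bitop.popcnt actually compiles down to the POPCNT instruction depends on
the compiler and its flags.

```d
import core.bitop : popcnt;
import core.cpuid : hasPopcnt;

// Hypothetical alias for the two interchangeable implementations below.
alias CountFn = uint function(ulong);

// Portable fallback: clears the lowest set bit until none are left.
uint countLoop(ulong v)
{
    uint n = 0;
    while (v != 0)
    {
        v &= v - 1;
        ++n;
    }
    return n;
}

// Uses the druntime builtin; a compiler may or may not emit POPCNT for it.
uint countBuiltin(ulong v)
{
    return popcnt(v);
}

// Consults core.cpuid once and returns the routine to use from then on.
CountFn selectCounter()
{
    if (hasPopcnt)
        return &countBuiltin;
    return &countLoop;
}
```

A caller would store the result of selectCounter() at startup and call through
that pointer, so the CPU check is not repeated in the hot path.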

> Both LLVM and GCC have moved to "extended inline assemblers"
> that require you to provide information about input, output
> and scratch registers as well as memory locations, so the
> compiler can see through the asm-block for register allocation
> and inlining purposes. It's more difficult to get right, but
> also more rewarding, as it enables you to write no-overhead
> "one-liners" and "intrinsics" while having calling conventions
> still handled by the compiler.

I know, but "more difficult" is a bit of an understatement. For example, 
core.cpuid has not been implemented using those assemblers.

BTW, dmd's inline assembler does know which registers each instruction reads 
and writes, and uses that when inserting the code so that it works with the 
rest of the code generator's register usage tracking.

I find needing to tell gcc which registers are read/written by a particular 
instruction to be a step BACKWARDS in usability. This is what computers are 
supposed to be good for :-)

> An example for GDC:
>
> 	struct DblWord { ulong lo, hi; }
>
> 	/// Multiplies two machine words and returns a double
> 	/// machine word.
> 	DblWord bigMul(ulong x, ulong y)
> 	{
> 		DblWord tmp = void;
> 		// '=a' and '=d' are outputs to RAX and RDX
> 		// respectively that are bound to the two
> 		// fields of 'tmp'.
> 		// '"a" x' means that we want 'x' as input in
> 		// RAX and '"rm" y' places 'y' wherever it
> 		// suits the compiler (any general purpose
> 		// register or memory location).
> 		// 'mulq %3' multiplies with the ulong
> 		// represented by the argument at index 3 (y).
> 		asm {
> 			"mulq %3"
> 			 : "=a" tmp.lo, "=d" tmp.hi
> 			 : "a" x, "rm" y;
> 		}
> 		return tmp;
> 	}
>
> In the above example the compiler has enough information to
> inline the function or directly return the result in RAX:RDX
> without writing to memory first. The same thing in DMD would
> likely have turned out slower than emulating this using
> several uint->ulong multiplies.

DMD doesn't inline functions with asm in them, but that is not the fault of the 
inline assembler.
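
For comparison, here is a sketch of the pure D emulation mentioned in the
quoted text, building the 128-bit product from 32-bit partial products. The
name bigMulPortable is an assumption; DblWord is repeated from the quoted
example so the snippet stands alone. It contains no asm, so nothing about it
prevents inlining, though no claim is made here about its speed.

```d
struct DblWord { ulong lo, hi; }

/// Portable 64x64 -> 128-bit multiply from four 32x32 -> 64-bit products.
DblWord bigMulPortable(ulong x, ulong y) pure nothrow @nogc @safe
{
    immutable ulong xl = x & 0xFFFF_FFFF, xh = x >> 32;
    immutable ulong yl = y & 0xFFFF_FFFF, yh = y >> 32;

    immutable ulong ll = xl * yl; // contributes to bits 0..63
    immutable ulong lh = xl * yh; // contributes to bits 32..95
    immutable ulong hl = xh * yl; // contributes to bits 32..95
    immutable ulong hh = xh * yh; // contributes to bits 64..127

    // Everything that lands in bits 32..63, plus its carry into bit 64 and up.
    // At most 3 * (2^32 - 1), so the sum cannot overflow a ulong.
    immutable ulong mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);

    DblWord r;
    r.lo = (mid << 32) | (ll & 0xFFFF_FFFF);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
```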

The only real weakness in the DMD inline assembler is that it doesn't support 
"let the compiler select the register". DMD's strong support for compiler 
builtins, however, mitigates this to an acceptable level.
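
To illustrate what is meant by builtins, here is a small sketch using
functions from core.bitop, which compilers are free to treat as intrinsics so
that register selection stays with the compiler. The wrapper names floorLog2,
trailingZeros and swapBytes are invented for this example, and which
instructions are actually emitted depends on the compiler and target.

```d
import core.bitop : bsf, bsr, bswap;

/// Index of the highest set bit, i.e. floor(log2(v)). v must be nonzero.
int floorLog2(uint v) pure nothrow @nogc @safe
{
    assert(v != 0, "bsr is undefined for 0");
    return bsr(v);
}

/// Number of trailing zero bits. v must be nonzero.
int trailingZeros(uint v) pure nothrow @nogc @safe
{
    assert(v != 0, "bsf is undefined for 0");
    return bsf(v);
}

/// Reverses the byte order of a 32-bit value.
uint swapBytes(uint v) pure nothrow @nogc @safe
{
    return bswap(v);
}
```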

> Although less powerful, the LDC team implemented Dlang inline
> assembly according to the specs and so core.cpuid works for
> them. GDC on the other hand is out of the picture until either
> 1) GDC adds Dlang inline assembly
> 2) core.cpuid duplicates most of its assembly code to support
>     the GCC extended inline assembler
>
> I would prefer a common extended inline assembler though,
> because when you use it for performance reasons you typically
> cannot go with non-inlinable Dlang asm, so you end up with pure
> D for DMD, GCC asm for GDC and LDC asm - three code paths.
>