Any usable SIMD implementation?

Sun Apr 17 07:24:56 PDT 2016

On Sat, 16 Apr 2016 21:46:08 -0700,
Walter Bright <newshound2 at digitalmars.com> wrote:

> On 4/16/2016 2:40 PM, Marco Leise wrote:
> > Tell me again, what's more elegant!
> 
> If I wanted to write in assembler, I wouldn't write in a high level language, 
> especially a weird one like GNU version.

I hate the many pitfalls of extended asm: Forget to mention a
side effect in the "clobbers" list and the compiler assumes
that register or memory location still holds the value from
before the asm. Have an _input_ reg clobbered? Must NOT name
it in the clobber list but use it as a dummy output with a
dummy variable assignment. The learning curve is steep and, as
you said, the result is usually unintelligible without prior
knowledge.
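
To make the clobbered-input pitfall concrete, here is a minimal sketch in C with GCC extended asm. The helper name `fill_bytes` is made up for illustration. It uses the "+" (read-write) operand modifier, which is one way to express the "dummy output" idiom described above; the portable branch is only there so the snippet builds on non-x86 targets.

```c
#include <stddef.h>

/* Hypothetical helper: fill n bytes at dst with value. */
static void fill_bytes(void *dst, unsigned char value, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* REP STOSB advances (R|E)DI and counts (R|E)CX down to zero, so both
       "inputs" are modified by the instruction. They must not appear in
       the clobber list; declaring them read-write ("+") tells the compiler
       that their old values are gone afterwards. The "memory" clobber
       says that the buffer behind dst was written. */
    __asm__ volatile ("rep stosb"
                      : "+D" (dst), "+c" (n)
                      : "a" (value)
                      : "memory");
#else
    /* Plain C fallback for other targets. */
    unsigned char *p = dst;
    while (n--)
        *p++ = value;
#endif
}
```

Leaving out either "+" or "memory" would still compile, which is exactly the trap: the compiler would silently keep using stale register or memory contents.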

But what I really miss from the last generation of inline
assemblers are these points:

1. In most cases you can make the asm transparent to the
   optimizer leading to:
   1.a Inlining of asm
   1.b Dead-code removal of asm blocks

2. Asm Template arguments (e.g. input variables) are bound via
   constraints:
   2.a Can use output constraint `"=a" var` to mean any of "AL",
       "AX", "EAX" or "RAX" depending on size of 'var'
   2.b `"r" ptr` can bind 32-bit and 64-bit pointers often
       eliminating the need for duplicate asm blocks that only
       differ in one mention of e.g. RSI vs. ESI.
   2.c Compiler seamlessly integrates host code variables
       with the asm. No need to manually pick tmp
       registers to move parameters and output. `"r" myUint`
       is all it takes for 'myUint' to end up in any of EAX,
       EDX, ... (whatever the register allocator deems
       efficient at that point)
   2.d As a net result, asm templates often reduce to a single
       mnemonic and work with X86, X32 and AMD64.

3. In DMD I often see "naked" used to get rid of function
   prolog and epilog in an attempt to get an intrinsic-like,
   fast function. This requires extra care to get the calling
   convention right and may require more code duplication for
   e.g. Win32. Asm templates in GCC and LLVM benefit from this
   speedup automatically, because the backend will remove
   unneeded prolog/epilog code and even inline small functions.
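
A rough sketch of what points 1 to 3 look like in practice, again in C with GCC extended asm. The function name `lowest_set_bit` is invented for this example. The template is a single mnemonic, the register allocator picks the registers, and because the asm is not marked volatile the optimizer may inline it or drop it when the result is unused.

```c
/* Hypothetical intrinsic-like helper: index of the lowest set bit.
   Precondition: x != 0 (BSF leaves the destination undefined for 0). */
static inline unsigned lowest_set_bit(unsigned x)
{
    unsigned idx = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "=r" lets the compiler choose any general purpose register for the
       result, "rm" lets it feed x from a register or straight from memory.
       "cc" records that the flags are modified. No registers are named
       by hand and no prolog/epilog is written manually. */
    __asm__ ("bsfl %1, %0"
             : "=r" (idx)
             : "rm" (x)
             : "cc");
#else
    /* Plain C fallback for other targets. */
    while (!(x & 1u)) {
        x >>= 1;
        idx++;
    }
#endif
    return idx;
}
```

The same source builds for 32-bit and 64-bit x86 targets, since nothing in the template depends on the pointer size or the calling convention.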

GCC's historically grown template syntax based on multiple
_external_ assembler backends ain't that great and it is a
PITA that it cannot understand the mnemonics and figure out
side effects itself like DMD. But I hope I could highlight a
few points where classic assemblers as found in Delphi or DMD
fall behind in modern convenience and native efficiency.

When C was invented it matched the CPUs quite well, but today
we have dozens of instructions that C and D syntax has no
expression for. All modern compilers devote a considerable
amount of backend code to pattern matching code constructs
like a layman's POPCNT and replacing them with optimal CPU
instructions. More and more we turn to browsing the list of
readily available compiler built-ins first, and the next step
is to acknowledge the need and make inline
assemblers powerful enough for programmers to efficiently
implement non-existing intrinsics in library code.
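
For illustration, this is roughly what a "layman's POPCNT" next to a compiler built-in looks like in C. The function names are made up; `__builtin_popcount` is the GCC/Clang built-in. Whether a given optimizer recognizes the loop and turns it into a single instruction depends on the compiler version and target flags, so take that part as an assumption rather than a guarantee.

```c
/* Hand-written bit count: clears the lowest set bit once per iteration.
   Some optimizers may recognize this idiom and emit POPCNT for it when
   the target CPU supports the instruction. */
static unsigned popcount_loop(unsigned x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* drop the lowest set bit */
        n++;
    }
    return n;
}

/* The same operation requested directly via a compiler built-in. */
static unsigned popcount_builtin(unsigned x)
{
#if defined(__GNUC__)
    return (unsigned)__builtin_popcount(x);
#else
    return popcount_loop(x);   /* fallback without the built-in */
#endif
}
```

Both functions return the same values; the difference is only in how much guessing is left to the backend.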

-- 
Marco