Any usable SIMD implementation?

Tue Apr 12 16:29:38 PDT 2016

On Tue, 12 Apr 2016 13:22:12 -0700,
Walter Bright <newshound2 at digitalmars.com> wrote:

> On 4/12/2016 9:53 AM, Marco Leise wrote:
> > LDC implements InlineAsm_X86_Any (DMD style asm), so
> > core.cpuid works. GDC is the only compiler that does not
> > implement it. We agree that core.cpuid should provide this
> > information, but what we have now - core.cpuid in a mix with
> > GDC's lack of DMD style asm - does not work in practice for
> > the years to come.  
> 
> Years? Anyone who needs core.cpuid could translate it to GDC's inline asm style 
> in an hour or so. It could even be simply written separately in GAS and linked 
> in. Since this has not been done, I can only conclude that core.cpuid has not 
> been an actual blocker.

You mean it would be ok if I duplicated most of the asm in
there and created a pull request?

> > Still, DMD does not inline asm and always adds a function
> > prolog and epilog around asm blocks in an otherwise
> > empty function (correct me if I'm wrong).  
> 
> Not if you use "naked".
> 
> > "naked" means you
> > have to duplicate code for the different calling conventions,
> > in particular Win32.  
> 
> Why complain about it adding a prolog/epilog, and complain about it not adding it?

Yeah, I didn't make this clear. To reduce code repetition I'd
like to avoid "naked" and have the compiler handle the
calling conventions. Let's compare the earlier example in both
GDC and DMD in a coding style that is agnostic wrt. the
calling convention. First GDC:

  struct DblWord { ulong lo, hi; }
  DblWord bigMul(ulong x, ulong y)
  {
      DblWord tmp;
      asm {
          "mulq %[y]"
          : "=a" tmp.lo, "=d" tmp.hi : "a" x, [y] "rm" y;
      }
      return tmp;
  }

This is turned into the following instruction sequence (AT&T):

  mov    %rdi,%rax
  mul    %rsi
  retq

Note how elegantly GCC handles the calling convention for us.
The prolog reduces to moving 'x' from RDI to RAX where I asked
it to place it for the MUL to use as the implicit operand.
After multiplying it by the explicit operand in RSI, the
result sits in RDX:RAX (low word in RAX, high word in RDX).
I created a data structure to return those two and told GCC to
tie tmp.lo to RAX and tmp.hi to RDX. Since the calling
convention happens to return structs of 2 machine words in
RAX:RDX, the whole assignment to 'tmp' and the return become
no-ops. With inlining enabled only the 'mul' would remain.
This is the ideal outcome. Now let's look at the DMD
implementation - again letting the compiler figure out the
calling convention:

  DblWord bigMul(ulong x, ulong y)
  {
      DblWord tmp;
      asm
      {
          mov RAX, x;
          mul y;
          mov tmp+DblWord.lo.offsetof, RAX;
          mov tmp+DblWord.hi.offsetof, RDX;
      }
      return tmp;
  }

This generates the following:

  push   %rbp
  mov    %rsp,%rbp
  sub    $0x20,%rsp
  mov    %rdi,-0x10(%rbp)
  mov    %rsi,-0x8(%rbp)
  lea    -0x20(%rbp),%rax
  xor    %ecx,%ecx
  mov    %rcx,(%rax)
  mov    %rcx,0x8(%rax)
  mov    -0x8(%rbp),%rax
  mulq   -0x10(%rbp)
  mov    %rax,-0x20(%rbp)
  mov    %rdx,-0x18(%rbp)
  mov    -0x18(%rbp),%rdx
  mov    -0x20(%rbp),%rax
  mov    %rbp,%rsp
  pop    %rbp
  retq

In practice GDC will just replace the invocation with a single
'mul' instruction while DMD will emit a call to this
18-instruction function. Now you keep telling me extended
assembly is a step backwards. :)

> It's a step backwards because I can't just say "MUL EAX".

You could write this; you'd only have to tell the assembler
that EAX and EDX will be overwritten, something that DMD
already knows.
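
As a rough illustration of that point, here is a minimal sketch in C
for GCC or Clang, whose extended asm the GDC syntax follows. It
issues the literal "MUL EAX" (AT&T: mull %eax), which computes
EDX:EAX = EAX * EAX, and only declares EDX and the flags as
clobbered. The name square_low32 is made up for this example, and
the non-x86 branch is a plain C fallback added so the snippet
compiles elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: returns the low 32 bits of v * v. */
uint32_t square_low32(uint32_t v)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+a" ties v to EAX as both input and output.
     * The clobber list tells the compiler that EDX (high half of
     * the product) and the flags are overwritten as well. */
    __asm__("mull %%eax"
            : "+a"(v)
            :
            : "edx", "cc");
#else
    /* Fallback for other targets: unsigned wrap-around gives the
     * same low 32 bits. */
    v *= v;
#endif
    return v;
}
```

Leaving "edx" out of the clobber list would still compile, but the
compiler would be free to keep a live value in EDX across the
statement.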

> I have to tell GCC what register the result gets put in.

And by doing this you allow it to figure out the shortest way
to return the result in compliance with the calling convention.

> This is, to my mind, ridiculous.

I too find it annoying that I have to inform it about the
scratch registers used in the asm, but the rest seems legit to
me. At some point you will have to connect variables in the
host language with registers in assembly. Doing this in a
declarative manner instead of with explicit assembly code allows
the backend to find the optimal code (literally), as demonstrated
above.
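
For readers who want to try the declarative binding outside of GDC,
the following is a sketch of the same bigMul in C with GCC/Clang
extended asm. It is an approximation under assumptions, not code
from any compiler's runtime: the name big_mul, the added "cc"
clobber and the unsigned __int128 fallback for non-x86-64 targets
are all choices made for this example.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } DblWord;

/* Hypothetical C counterpart of the GDC bigMul shown earlier. */
DblWord big_mul(uint64_t x, uint64_t y)
{
    DblWord tmp;
#if defined(__GNUC__) && defined(__x86_64__)
    /* Outputs: low word from RAX ("=a"), high word from RDX ("=d").
     * Inputs: x must start in RAX ("a"); y may live in any register
     * or in memory ("rm"). The compiler arranges the moves. */
    __asm__("mulq %[y]"
            : "=a"(tmp.lo), "=d"(tmp.hi)
            : "a"(x), [y] "rm"(y)
            : "cc");
#elif defined(__SIZEOF_INT128__)
    /* Fallback for other 64-bit targets with 128-bit integers. */
    unsigned __int128 p = (unsigned __int128)x * y;
    tmp.lo = (uint64_t)p;
    tmp.hi = (uint64_t)(p >> 64);
#else
#error "this sketch needs x86-64 inline asm or unsigned __int128"
#endif
    return tmp;
}
```

Compiling this with optimizations and inspecting the disassembly is
an easy way to check how much of the surrounding register shuffling
the compiler removes on a given toolchain.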

> GCC's inline assembler apparently has no knowledge of what
> the opcodes actually do.

Agreed. It seems to treat the assembly text merely as a
text template. It is the same with LLVM's extended assembler
which borrows heavily from GCC's. This is probably due to the
fact that the assembler is historically a standalone
executable and as such the authority for interpreting the asm
code is outside of the scope of the host language compiler.
Under these circumstances we might have gone for the same
implementation.

-- 
Marco