Adapting Tree Structures for Processing with SIMD Instructions

Wed Sep 23 02:58:33 PDT 2015

On Tue, 22 Sep 2015 16:36:40 +0000,
Iakh <iaktakh at gmail.com> wrote:

> On Tuesday, 22 September 2015 at 13:06:39 UTC, Andrei 
> Alexandrescu wrote:
> > A paper I found interesting: 
> > http://openproceedings.org/EDBT/2014/paper_107.pdf -- Andrei
> 
> _mm_movemask_epi8 is a cornerstone of the topic, and it is
> currently not implemented/supported in D :(
> AFAIK it has an irregular result format.

Yes, it cannot be expressed in dmd's SIMD intrinsic "template",
so it is unsupported there. Before Walter went into
challenge-accepted mode towards LLVM and GDC, it was also not
considered important to speed up algorithms with SIMD on dmd.
You would just be told to use one of the other compilers.

Manu's std.simd also doesn't attempt to support it, because he
was interested in unifying the SIMD instructions available on
all architectures, and movemask is somewhat x86-specific. (I
asked him about that instruction specifically a few years ago.)

That said, in my code for string operations I use the intrinsic
for GCC and LDC2 and fall back to emulated SIMD using uint or
ulong on DMD. Where movemask returns packed bits (one per byte)
that you might scan with 'bsf', in a ulong you usually end up
with one high bit per byte. If you call bsf() on it and divide
the result by 8, you get the byte index, just as with
bsf(movemask(...)).
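
A minimal C sketch of the two approaches, searching for the first
zero byte. Assumptions: a little-endian target, GCC or Clang
(whose __builtin_ctz/__builtin_ctzll stand in for bsf), and
hypothetical function names. The SWAR fallback uses the
well-known "has zero byte" bit trick, which is an illustration
here and not necessarily the emulation used in the string code
mentioned above.

```c
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* SWAR emulation (hypothetical helper): index of the first zero byte
   among 8 bytes at p, or 8 if there is none. Assumes little-endian,
   so byte 0 lands in the lowest 8 bits of the integer. */
static int first_zero_byte_swar(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);   /* safe unaligned load */
    /* Sets bit 7 of every byte that was zero. Bytes above the first
       zero byte may be flagged spuriously (borrow propagation), but
       the lowest set bit is always exact. */
    uint64_t hi = (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
    if (hi == 0)
        return 8;
    /* One high bit per byte: bit index / 8 == byte index. */
    return __builtin_ctzll(hi) / 8;
}

#ifdef __SSE2__
/* SSE2 path (hypothetical helper): index of the first zero byte
   among 16 bytes at p, or 16 if there is none. */
static int first_zero_byte_sse2(const unsigned char *p)
{
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    /* 0xFF in every lane whose byte equals zero, 0x00 elsewhere. */
    __m128i eq = _mm_cmpeq_epi8(chunk, _mm_setzero_si128());
    /* Packs the top bit of each of the 16 lanes into bits 0..15. */
    int mask = _mm_movemask_epi8(eq);
    /* Packed bits: the bit index already is the byte index. */
    return mask ? __builtin_ctz((unsigned)mask) : 16;
}
#endif
```

For a buffer starting with 'a', 'b', 'c', 0, both functions
return 3; the only difference is the final division by 8 in the
SWAR version.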

On a related note, LLVM and GCC also offer extended inline
assemblers that are transparent to the optimizer. You just ask
for registers and/or stack memory to use and tell the compiler
which registers or memory locations will be overwritten. The
compiler can then hand you a few spare registers and knows
which registers it needs to save before the asm block. As a
result there are absolutely no seams where you placed your
asm, unlike with earlier generations of inline asm.
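
A small illustration in C using GCC/Clang extended asm syntax,
assuming an x86-64 target; the function name is hypothetical.
The constraints only say "give me some register for the result"
("=r"), "the input may live in a register or in memory" ("rm")
and "the flags are clobbered" ("cc"), so register allocation
stays with the compiler. Other targets fall back to a builtin.

```c
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit of x, i.e. what
   x86 'bsf' computes. x must be nonzero (bsf leaves the destination
   undefined for 0, and __builtin_ctzll is undefined for 0 too). */
static inline unsigned lowest_set_bit(uint64_t x)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t idx;
    /* AT&T operand order: bsfq src, dst.
       %0 = idx: any general-purpose register the compiler picks.
       %1 = x:   register or memory operand, compiler's choice.
       "cc":     bsf modifies the flags register. */
    __asm__("bsfq %1, %0" : "=r"(idx) : "rm"(x) : "cc");
    return (unsigned)idx;
#else
    /* Portable fallback with the same result for nonzero x. */
    return (unsigned)__builtin_ctzll(x);
#endif
}
```

Because the asm is not marked volatile and declares all of its
effects, the compiler may schedule, merge or drop it like
ordinary code, which is what makes the block seamless.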

-- 
Marco