Benchmark memchr (with GCC builtins)

Fri Oct 30 14:29:46 PDT 2015

I continue to play with SIMD. I tried to use std.simd, but it
still has a lot left to be implemented. I also gave up on
core.simd.__simd because of problems with the PMOVMSKB
instruction (it is not implemented).

Today I was playing with memchr for gdc:
memchr: http://www.cplusplus.com/reference/cstring/memchr/
My implementations, with a benchmark:
http://dpaste.dzfl.pl/4c46c0cf340c

Benchmark results:
-----
Naive:        21.9    TickDuration(136456491)
SIMD:         3.04    TickDuration(18920182)
SIMDM:        2.44    TickDuration(15232176)
SIMDU:         1.8    TickDuration(11210454)
C:               1    TickDuration(6233963)
-----

The middle column is the duration relative to the C implementation
(core.stdc.string).
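
For reference, the relative figures can be reproduced from the tick counts. This is only a small sketch: the helper name `relative` is made up for illustration and is not part of the benchmark code.

```d
import std.stdio : writefln;

/// Ratio of a measured tick count to the baseline tick count.
/// (Hypothetical helper, not taken from the benchmark source.)
double relative(long ticks, long baseline)
{
    return cast(double) ticks / baseline;
}

void main()
{
    // Tick counts copied from the benchmark results in this post.
    immutable string[] names = ["Naive", "SIMD", "SIMDM", "SIMDU", "C"];
    immutable long[] ticks = [136456491, 18920182, 15232176, 11210454, 6233963];
    immutable baseline = ticks[$ - 1]; // the C implementation is the baseline

    foreach (i, name; names)
        writefln("%-6s %6.3g", name, relative(ticks[i], baseline));
}
```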

memchrSIMD splits the input into three parts: an unaligned
beginning, an aligned middle, and an unaligned end.
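
A rough sketch of how such a split can be computed, assuming a 16-byte alignment requirement. The names `Split` and `splitForSimd` are hypothetical and only illustrate the idea; the real code is in the paste linked above.

```d
import std.stdio : writefln;

/// Sizes (in bytes) of the three parts of a buffer.
/// Hypothetical type, used only for this illustration.
struct Split
{
    size_t head;   // unaligned bytes before the first aligned address
    size_t middle; // aligned part, always a multiple of the alignment
    size_t tail;   // leftover bytes after the aligned part
}

/// Splits `length` bytes starting at address `addr` into head/middle/tail.
Split splitForSimd(size_t addr, size_t length, size_t alignment = 16)
{
    // Bytes needed to reach the next aligned address (0 if already aligned).
    size_t head = (alignment - addr % alignment) % alignment;
    if (head > length)
        head = length; // buffer ends before an aligned address is reached

    immutable rest = length - head;
    immutable middle = rest - rest % alignment;
    return Split(head, middle, rest - middle);
}

void main()
{
    ubyte[100] buffer;
    // Start one byte into the buffer so that at most one of the two
    // starting addresses can be aligned.
    immutable s = splitForSimd(cast(size_t)(buffer.ptr + 1), buffer.length - 1);
    writefln("head=%s middle=%s tail=%s", s.head, s.middle, s.tail);
}
```

The head and tail would be searched byte by byte, and only the middle part goes through the 16-byte SIMD loop.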

memchrSIMDM uses this code instead of pmovmskb:
------
         if (Mask mask = *cast(Mask*)(result.array.ptr))
         {
             // A match in the first Mask.sizeof bytes of the result.
             return ptr + bsf(mask) / BitsInByte;
         }
         else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
         {
             // A match in the following Mask.sizeof bytes.
             return ptr + bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
         }
------
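
To show why `bsf(mask) / BitsInByte` gives the byte index, here is a self-contained sketch that does not need SIMD at all. It assumes a little-endian target (as on x86) and a 64-bit mask; `Compare`, `compareBytes` and `firstMatch` are names invented for this example, and `compareBytes` is only a scalar stand-in for the PCMPEQB result.

```d
import core.bitop : bsf;
import std.stdio : writeln;

enum BitsInByte = 8;

/// 16 bytes of comparison output, viewable as bytes or as two 64-bit words.
union Compare
{
    ubyte[16] bytes;
    ulong[2] words;
}

/// Scalar stand-in for PCMPEQB: 0xFF where the byte matches, 0x00 otherwise.
Compare compareBytes(const(ubyte)[] chunk, ubyte value)
{
    assert(chunk.length == 16);
    Compare c; // all bytes start as zero
    foreach (i, b; chunk)
        if (b == value)
            c.bytes[i] = 0xFF;
    return c;
}

/// Index of the first matching byte, or -1 if there is none.
/// Assumes little-endian byte order, so the lowest set bit of a word
/// belongs to the lowest-addressed matching byte.
int firstMatch(Compare c)
{
    if (ulong lo = c.words[0])
        return bsf(lo) / BitsInByte;
    else if (ulong hi = c.words[1])
        return bsf(hi) / BitsInByte + cast(int) ulong.sizeof;
    return -1;
}

void main()
{
    auto text = cast(const(ubyte)[]) "hello, world!!!!"; // exactly 16 bytes
    writeln(firstMatch(compareBytes(text, cast(ubyte) 'w')));
}
```

Each matching byte contributes eight set bits, so the position of the lowest set bit divided by eight is the byte offset inside that word.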

memchrSIMDU (unaligned) applies SIMD instructions starting from
the first array element.

SIMD part of the function:
------
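         // Broadcast the searched byte to all 16 lanes.
         ubyte16 niddles;
         niddles.ptr[0..16] = value;
         ubyte16 result;
         ubyte16 arr;

         for (; ptr < alignedEnd; ptr += 16)
         {
             // Load 16 bytes and compare them lane by lane with the needle.
             arr.ptr[0..16] = ptr[0..16];
             result = __builtin_ia32_pcmpeqb128(arr, niddles);
             // Collect the top bit of every lane into a 16-bit mask.
             int i = __builtin_ia32_pmovmskb128(result);
             if (i != 0)
             {
                 // The lowest set bit is the index of the first match.
                 return ptr + bsf(i);
             }
         }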
------
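
For anyone without gdc at hand, the same loop can be emulated in plain D and checked against core.stdc.string.memchr. This is just a sketch of the logic, not the benchmarked code: `matchMask` is a scalar stand-in for the PCMPEQB + PMOVMSKB pair, and both it and `findByte` are hypothetical names.

```d
import core.bitop : bsf;
import core.stdc.string : memchr;
import std.stdio : writeln;

/// Scalar emulation of PCMPEQB followed by PMOVMSKB:
/// bit i of the result is set when chunk[i] == value.
uint matchMask(const(ubyte)[] chunk, ubyte value)
{
    assert(chunk.length == 16);
    uint mask = 0;
    foreach (i, b; chunk)
        if (b == value)
            mask |= 1u << i;
    return mask;
}

/// Searches 16 bytes at a time, then finishes the remainder byte by byte.
const(ubyte)* findByte(const(ubyte)[] data, ubyte value)
{
    size_t pos = 0;
    for (; pos + 16 <= data.length; pos += 16)
    {
        immutable mask = matchMask(data[pos .. pos + 16], value);
        if (mask != 0)
            return data.ptr + pos + bsf(mask); // lowest set bit = first match
    }
    foreach (i; pos .. data.length)
        if (data[i] == value)
            return data.ptr + i;
    return null;
}

void main()
{
    auto data = cast(const(ubyte)[]) "The quick brown fox jumps over the lazy dog";
    auto hit = findByte(data, cast(ubyte) 'j');
    auto expected = cast(const(ubyte)*) memchr(data.ptr, 'j', data.length);
    assert(hit is expected); // must agree with the C implementation
    writeln(hit is null ? -1 : hit - data.ptr);
}
```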