More on vectorized comparisons

Sean Cavanaugh WorksOnMyMachine at gmail.com
Thu Aug 23 11:27:13 PDT 2012


On 8/22/2012 7:19 PM, bearophile wrote:
> Some time ago I have suggested to add support to vector comparisons in
> D, because this is sometimes useful and in the modern SIMD units there
> is hardware support for such operations:
>
>
> I think that code is semantically equivalent to:
>
> void main() {
>      double[] a = [1.0, 1.0, -1.0, 1.0, 0.0, -1.0];
>      double[] b = [10,   20,   30,  40,  50,   60];
>      double[] c = [1,     2,    3,   4,   5,    6];
>      foreach (i; 0 .. a.length)
>          if (a[i] > 0)
>              b[i] += c[i];
> }
>
>
> After that code b is:
> [11, 22, 30, 44, 50, 60]
>
>
> This means the contents of the 'then' branch of the vectorized
> comparison is done only on items of b and c where the comparison has
> given true.
>
> This looks useful. Is it possible to implement this in D, and do you
> like it?

Well, right now the comparison operators ==, !=, >=, <=, > and < are
required to return bool instead of allowing a user-defined type, which
prevents a lot of the sugar you would want to make the code nice to
write.  Without the sugar the code ends up like this:

foreach(i; 0 .. a.length)
{
     float4 mask = greaterThan(a[i], float4(0,0,0,0));
     b[i] = select(mask, b[i] + c[i], b[i]);
}
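
For anyone who wants to try the idea without real SIMD types, here is a
minimal scalar sketch of the two helpers used above. Float4 and Mask4
are hypothetical aliases invented for this example (not core.simd
types), and greaterThan and select are written as plain loops, so this
only illustrates the semantics, not the generated code:

```d
// Hypothetical stand-ins for a SIMD float4 and its comparison mask.
alias Float4 = float[4];
alias Mask4 = bool[4];

// Lane-wise a > b. A real SIMD compare would leave all-ones or
// all-zeros in each lane instead of a bool.
Mask4 greaterThan(Float4 a, Float4 b)
{
    Mask4 m;
    foreach (i; 0 .. 4)
        m[i] = a[i] > b[i];
    return m;
}

// Lane-wise pick: onTrue where the mask is set, onFalse elsewhere.
Float4 select(Mask4 mask, Float4 onTrue, Float4 onFalse)
{
    Float4 r;
    foreach (i; 0 .. 4)
        r[i] = mask[i] ? onTrue[i] : onFalse[i];
    return r;
}

void main()
{
    // First four lanes of the arrays from the quoted example.
    Float4 a = [1.0f, 1.0f, -1.0f, 1.0f];
    Float4 b = [10, 20, 30, 40];
    Float4 c = [1, 2, 3, 4];
    Float4 zero = [0, 0, 0, 0];

    Float4 sum;
    sum[] = b[] + c[];                        // computed for every lane
    b = select(greaterThan(a, zero), sum, b); // kept only where a > 0

    assert(b[] == [11.0f, 22.0f, 30.0f, 44.0f]);
}
```

Note that the addition is done for all four lanes and the mask only
decides which results are kept; that is what makes it branch-free.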

In GPU shader land this expression is at least simpler to write:

foreach(i; 0 .. a.length)
{
     b[i] = (a[i] > 0) ? (b[i] + c[i]) : b[i];
}


All of these implementations are equivalent and remove the branch from
the code flow, which is pretty nice for the CPU pipeline.  In SIMD the
comparisons generate masks in a register which you can use immediately.
On modern (SSE4) CPUs the select is a single instruction; on older ones
it takes three: (mask & A) | (~mask & B), but it's all better than a
real branch.
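
The three-operation fallback can be sketched on a single 32-bit lane as
follows. selectBits and selectFloat are names made up for this example,
and the mask is assumed to be all ones or all zeros, which is what a
SIMD compare leaves in each lane:

```d
// Blend two 32-bit lanes with an all-ones / all-zeros mask:
// and, and-not, or.
uint selectBits(uint mask, uint a, uint b)
{
    return (mask & a) | (~mask & b);
}

// Float version: reinterpret the bits, blend, reinterpret back.
float selectFloat(bool cond, float a, float b)
{
    // Emulate the lane mask a SIMD compare would produce.
    uint mask = cond ? 0xFFFF_FFFF : 0;
    uint bits = selectBits(mask, *cast(uint*)&a, *cast(uint*)&b);
    return *cast(float*)&bits;
}

void main()
{
    assert(selectBits(0xFFFF_FFFF, 0x1234, 0xABCD) == 0x1234);
    assert(selectBits(0, 0x1234, 0xABCD) == 0xABCD);
    assert(selectFloat(true, 1.5f, -2.0f) == 1.5f);
    assert(selectFloat(false, 1.5f, -2.0f) == -2.0f);
}
```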

If you have a large amount of code needing a branch, you can take the
mask generated by the compare, extract it into a CPU register, and test
it for zero, nonzero, or specific bits set.  A float4 comparison ends up
generating 4 bits, so the code with a real branch looks like:

if (any(a[i] > 0))
{
     // do stuff if any of a[i] are greater than zero
}
if (all(a[i] > 0))
{
     // do stuff if all of a[i] are greater than zero
}
if ((getMask(a[i] > 0) & 0x7) == 0x7)
{
     // do stuff if the first three elements are greater than zero
}
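
A rough scalar sketch of what any, all and getMask could look like,
assuming the comparison result is available as four per-lane booleans.
Mask4 is a hypothetical alias and the bit layout (lane 0 in bit 0) is
an assumption modeled on what SSE's movmskps produces:

```d
// Hypothetical per-lane comparison result.
alias Mask4 = bool[4];

// Pack the four lane results into the low four bits, lane 0 in bit 0.
uint getMask(Mask4 m)
{
    uint bits = 0;
    foreach (i; 0 .. 4)
        if (m[i])
            bits |= 1u << i;
    return bits;
}

// True if at least one lane passed the comparison.
bool any(Mask4 m) { return getMask(m) != 0; }

// True only if all four lanes passed the comparison.
bool all(Mask4 m) { return getMask(m) == 0xF; }

void main()
{
    Mask4 m = [true, true, true, false];
    assert(getMask(m) == 0x7);
    assert(any(m));
    assert(!all(m));
    // First three lanes set, regardless of the fourth.
    assert((getMask(m) & 0x7) == 0x7);
}
```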



