Optimization problem: bulk Boolean operations on vectors

Fri Dec 23 17:06:54 PST 2016

On Friday, 23 December 2016 at 22:11:31 UTC, Walter Bright wrote:
> On 12/23/2016 10:03 AM, hardreset wrote:
>
> For this D code:
>
> enum SIZE = 100000000;
>
> void foo(int* a, int* b) {
>     int* atop = a + 1000;
>     ptrdiff_t offset = b - a;
>     for (; a < atop; ++a)
>         *a &= *(a + offset);
> }
>
> The following asm is generated by DMD:
>
>                 push    EBX
>                 mov     EBX,8[ESP]
>                 sub     EAX,EBX
>                 push    ESI
>                 cdq
>                 and     EDX,3
>                 add     EAX,EDX
>                 sar     EAX,2
>                 lea     ECX,0FA0h[EBX]
>                 mov     ESI,EAX
>                 cmp     EBX,ECX
>                 jae     L2C
> L20:            mov     EDX,[ESI*4][EBX]
>                 and     [EBX],EDX
>                 add     EBX,4
>                 cmp     EBX,ECX
>                 jb      L20
> L2C:            pop     ESI
>                 pop     EBX
>                 ret     4
>
> The inner loop is 5 instructions, whereas the one you wrote is 
> 7 instructions (I didn't benchmark it). With some more source 
> code manipulation the divide can be eliminated, but that is 
> irrelevant to the inner loop.
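
One possible reading of the "source code manipulation" mentioned in the quote, as a sketch only: the cdq/and/add/sar sequence in the prolog is the signed divide by int.sizeof that `b - a` implies, and the loop then scales the result back up by 4. Keeping the distance in bytes avoids that round trip at the source level. The name fooBytes is made up for illustration, and whether DMD really drops the divide for this form would need checking against the generated asm.

```d
// Hypothetical variant of foo: keep the distance between the two
// arrays in bytes so no division by int.sizeof is implied.
void fooBytes(int* a, int* b)
{
    int* atop = a + 1000;
    // Byte distance; subtracting byte pointers needs no scaling.
    ptrdiff_t byteOffset = cast(byte*) b - cast(byte*) a;
    for (; a < atop; ++a)
        *a &= *cast(int*)(cast(byte*) a + byteOffset);
}

void main()
{
    auto a = new int[1000];
    auto b = new int[1000];
    a[] = 0xF0F0;
    b[] = 0x0FF0;
    fooBytes(a.ptr, b.ptr);
    // Every element of a should now hold 0xF0F0 & 0x0FF0.
    assert(a[0] == 0x00F0 && a[999] == 0x00F0);
}
```

As the quote says, this only touches the prolog; the inner loop should be unaffected either way.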

I patched up the prolog code and timed it, and the timing came out 
identical to my asm. I tried the ptrdiff C-like code, and that 
still comes out 20% slower here. I'm compiling with:

rdmd test.d -O -release -inline

Am I missing something? How do I get the asm output?
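
For reference, a minimal sketch of how a comparison like this might be timed. It is not the harness actually used here: the repeat count REPS, the function names, and the indexed baseline are all assumptions for illustration, and the hand-written asm version is not reproduced. It uses std.datetime.stopwatch from current Phobos; older releases had StopWatch in std.datetime instead.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

enum N = 1000;       // elements per call, matching the quoted loop
enum REPS = 100_000; // hypothetical repeat count

// Pointer-offset variant, as in the quoted D code.
void andPtr(int* a, int* b)
{
    int* atop = a + N;
    ptrdiff_t offset = b - a;
    for (; a < atop; ++a)
        *a &= *(a + offset);
}

// Plain indexed loop, used here only as a baseline.
void andIndexed(int[] a, const(int)[] b)
{
    foreach (i; 0 .. a.length)
        a[i] &= b[i];
}

void main()
{
    auto a = new int[N];
    auto b = new int[N];
    a[] = -1;
    b[] = 0x55;

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. REPS)
        andPtr(a.ptr, b.ptr);
    auto tPtr = sw.peek;

    sw.reset(); // keeps running, elapsed time restarts from zero
    foreach (_; 0 .. REPS)
        andIndexed(a, b);
    auto tIdx = sw.peek;

    // Sanity check (compiled out under -release).
    assert(a[0] == 0x55 && a[N - 1] == 0x55);
    writefln("pointer: %s  indexed: %s", tPtr, tIdx);
}
```

With only 1000 ints per call the data stays in cache, so a harness like this mostly measures the loop itself rather than memory bandwidth.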