gdc internals: weird way to translate "while" loops makes autovectorization impossible

Sun Feb 25 19:26:50 PST 2007

Downs wrote:
> downs wrote:
> 
>> At the moment, gdc seems to be incapable of vectorizing simple loops.
>> Example.
>> The following C program will, if compiled with gcc -O3 
>> -ftree-vectorize -msse, generate SSE vectorized assembler code.
>> This can be verified by compiling with -ftree-vectorizer-verbose=7.
>>
>>> #include "stdlib.h"
>>>
>>> int main() {
>>>  float*ptr=malloc(sizeof(float)*8);
>>>  float *tptr=ptr+8;
>>>  while (ptr!=tptr) { *ptr=1.0; ptr++; }  return 0;
>>> }
>>>
>>
>> I translated the program to D. Result:
>>
>>> import std.c.stdlib : malloc;
>>>
>>> int main() {
>>>  float*ptr=cast(float*)malloc(float.sizeof*8);
>>>  float *tptr=ptr+8;
>>>  while (ptr!=tptr) { *ptr=1.0; ptr++; }  return 0;
>>> }
>>>
>>
>> Using the same flags, gdc doesn't vectorize the loop.
>> I took the gdc vectorization process apart, added a few fprintf 
>> statements, and got the following log:
>> test.d.t77.vect
>>
>>> ;; Function main (_Dmain)
>>>
>>>
>>> test.d:6: note: ===== analyze_loop_nest =====
>>> test.d:6: note: === vect_analyze_loop_form ===
>>> test.d:6: note: not vectorized: too many BBs in loop.
>>>
>>>> I added that below.
>>>
>>> dropping additional loop info ... ;;
>>> ;; Loop 1:
>>> ;;  header 4, latch 3
>>> ;;  depth 1, level 1, outer 0
>>> ;;  nodes: 4 3 2
>>> dropping block info ...  Block 4
>>> ;; basic block 4, loop depth 1, count 0
>>> ;; prev block 3, next block 5
>>> ;; pred:       3 [100.0%]  (fallthru,exec) 1 [100.0%]  (fallthru,exec)
>>> ;; succ:       2 [100.0%]  (fallthru,exec)
>>> # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>;
>>> # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>;
>>> # ptr_17 = PHI <ptr_9(3), ptr_3(1)>;
>>> <L2>:;
>>> #   TMT.5_13 = V_MAY_DEF <TMT.5_18>;
>>> *ptr_17 = 1.0e+0;
>>> ptr_9 = ptr_17 + 4;
>>> goto <bb 2> (<L1>);
>>>
>>>  Block 3
>>> ;; basic block 3, loop depth 1, count 0
>>> ;; prev block 2, next block 4
>>> ;; pred:       2 [88.9%]  (dfs_back,false,exec)
>>> ;; succ:       4 [100.0%]  (fallthru,exec)
>>> <L11>:;
>>>
>>>  Block 2
>>> ;; basic block 2, loop depth 1, count 0
>>> ;; prev block 1, next block 3
>>> ;; pred:       4 [100.0%]  (fallthru,exec)
>>> ;; succ:       5 [11.1%]  (loop_exit,true,exec) 3 [88.9%]  
>>> (dfs_back,false,exec)
>>> <L1>:;
>>> ivtmp.27_10 = ivtmp.27_1 - 1;
>>> if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>;
>>>
>>> Done
>>>
>>>> From here on out, it's normal gdc output.
>>>
>>> test.d:6: note: bad loop form.
>>> test.d:6: note: vectorized 0 loops in function.
>>> main ()
>>> {
>>>   uintD.1160 ivtmp.27D.1244;
>>>   floatD.1163 * ptrD.1186;
>>>   floatD.1163 * tptrD.1189;
>>>   intD.1177 D.1196;
>>>   boolD.1173 D.1195;
>>>   boolD.1173 D.1194;
>>>   voidD.1154 * D.1191;
>>>
>>>   # BLOCK 0 freq:1111, starting at line 4
>>>   # PRED: ENTRY
>>>   [test.d : 4] D.1191_2 = [test.d : 4] malloc (32);
>>>   [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2;
>>>   [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32;
>>>   [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto 
>>> <L3>; else goto <L9>;
>>>   # SUCC: 5 1
>>>
>>>   # BLOCK 1 freq:988
>>>   # PRED: 0
>>> <L9>:;
>>>   goto <bb 4> (<L2>);
>>>   # SUCC: 4
>>>
>>>   # BLOCK 2 freq:8889, starting at line 6
>>>   # PRED: 4
>>> [test.d : 6] <L1>:;
>>>   ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1;
>>>   [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; 
>>> else goto <L11>;
>>>   # SUCC: 5 3
>>>
>>>   # BLOCK 3 freq:7901
>>>   # PRED: 2
>>> <L11>:;
>>>   # SUCC: 4
>>>
>>>   # BLOCK 4 freq:8889, starting at line 6
>>>   # PRED: 3 1
>>>   # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>;
>>>   # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>;
>>> <L2>:;
>>>   [test.d : 6] *ptrD.1186_17 = 1.0e+0;
>>>   [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4;
>>>   [test.d : 6] goto <bb 2> (<L1>);
>>>   # SUCC: 2
>>>
>>>   # BLOCK 5 freq:1111
>>>   # PRED: 2 0
>>> <L3>:;
>>>   return 0;
>>>   # SUCC: EXIT
>>>
>>> }
>>
>>
>> Note the block 3, which only consists of a label and nothing more. 
>> Apparently, having three BBs instead of two blocks gdc from optimizing 
>> the loop.
>> See also, tree-vect-analyze.c:1861.
>> I don't know enough about gdc to see where that empty block is 
>> generated, but could we get rid of it, possibly, please?
>> Greetings .. downs
>>
> 
> Any news on this? I believe the fact that a rather important 
> optimization feature of GCC is completely broken on GDC merits at least 
> a brief response.
> If you don't plan to fix this, then please, do tell me so.
> If I made some kind of obvious, plain mistake, then please do tell me so.
> But either way, I would really appreciate a response.
> 
> 
> Oh, btw: the following code serves to illustrate the problem further.
>  > import std.c.stdlib : malloc;
>  >
>  > int main() {
>  >  float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8;
>  >  version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; }
>  >  else { do { *ptr=1.0; ptr++; } while (ptr!=tptr) }
>  >  return 0;
>  > }
> One of these versions will get autovectorized, the other not.
> Anyway, greetings. --downs

I will fix this, but I am not doing any work on it presently.  My best 
guess so far is that the C compiler puts loop conditions at the end of 
the loop to reduce the number of basic blocks (but has an extra initial 
goto.)  This is the form that the vectorizer expects.

David