gdc internals: weird way to translate "while" loops makes autovectorization impossible

Sat Feb 24 17:20:02 PST 2007

downs wrote:
> At the moment, gdc seems to be incapable of vectorizing simple loops.
> Example.
> The following C program compiles to SSE-vectorized assembler code when built with gcc -O3 -ftree-vectorize -msse.
> This can be verified by compiling with -ftree-vectorizer-verbose=7.
>> #include <stdlib.h>
>>
>> int main() {
>>  float*ptr=malloc(sizeof(float)*8);
>>  float *tptr=ptr+8;
>>  while (ptr!=tptr) { *ptr=1.0; ptr++; } 
>>  return 0;
>> }
>>
> 
> I translated the program to D. Result:
> 
>> import std.c.stdlib : malloc;
>>
>> int main() {
>>  float*ptr=cast(float*)malloc(float.sizeof*8);
>>  float *tptr=ptr+8;
>>  while (ptr!=tptr) { *ptr=1.0; ptr++; } 
>>  return 0;
>> }
>>
> 
> Using the same flags, gdc doesn't vectorize the loop.
> I took the gdc vectorization process apart, added a few fprintf statements, and got the following log:
> test.d.t77.vect
> 
>> ;; Function main (_Dmain)
>>
>>
>> test.d:6: note: ===== analyze_loop_nest =====
>> test.d:6: note: === vect_analyze_loop_form ===
>> test.d:6: note: not vectorized: too many BBs in loop.
>>> I added the output below.
>> dropping additional loop info ... 
>> ;;
>> ;; Loop 1:
>> ;;  header 4, latch 3
>> ;;  depth 1, level 1, outer 0
>> ;;  nodes: 4 3 2
>> dropping block info ... 
>>  Block 4
>> ;; basic block 4, loop depth 1, count 0
>> ;; prev block 3, next block 5
>> ;; pred:       3 [100.0%]  (fallthru,exec) 1 [100.0%]  (fallthru,exec)
>> ;; succ:       2 [100.0%]  (fallthru,exec)
>> # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>;
>> # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>;
>> # ptr_17 = PHI <ptr_9(3), ptr_3(1)>;
>> <L2>:;
>> #   TMT.5_13 = V_MAY_DEF <TMT.5_18>;
>> *ptr_17 = 1.0e+0;
>> ptr_9 = ptr_17 + 4;
>> goto <bb 2> (<L1>);
>>
>>  Block 3
>> ;; basic block 3, loop depth 1, count 0
>> ;; prev block 2, next block 4
>> ;; pred:       2 [88.9%]  (dfs_back,false,exec)
>> ;; succ:       4 [100.0%]  (fallthru,exec)
>> <L11>:;
>>
>>  Block 2
>> ;; basic block 2, loop depth 1, count 0
>> ;; prev block 1, next block 3
>> ;; pred:       4 [100.0%]  (fallthru,exec)
>> ;; succ:       5 [11.1%]  (loop_exit,true,exec) 3 [88.9%]  (dfs_back,false,exec)
>> <L1>:;
>> ivtmp.27_10 = ivtmp.27_1 - 1;
>> if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>;
>>
>> Done
>>> From here on out, it's normal gdc output.
>> test.d:6: note: bad loop form.
>> test.d:6: note: vectorized 0 loops in function.
>> main ()
>> {
>>   uintD.1160 ivtmp.27D.1244;
>>   floatD.1163 * ptrD.1186;
>>   floatD.1163 * tptrD.1189;
>>   intD.1177 D.1196;
>>   boolD.1173 D.1195;
>>   boolD.1173 D.1194;
>>   voidD.1154 * D.1191;
>>
>>   # BLOCK 0 freq:1111, starting at line 4
>>   # PRED: ENTRY
>>   [test.d : 4] D.1191_2 = [test.d : 4] malloc (32);
>>   [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2;
>>   [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32;
>>   [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>;
>>   # SUCC: 5 1
>>
>>   # BLOCK 1 freq:988
>>   # PRED: 0
>> <L9>:;
>>   goto <bb 4> (<L2>);
>>   # SUCC: 4
>>
>>   # BLOCK 2 freq:8889, starting at line 6
>>   # PRED: 4
>> [test.d : 6] <L1>:;
>>   ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1;
>>   [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>;
>>   # SUCC: 5 3
>>
>>   # BLOCK 3 freq:7901
>>   # PRED: 2
>> <L11>:;
>>   # SUCC: 4
>>
>>   # BLOCK 4 freq:8889, starting at line 6
>>   # PRED: 3 1
>>   # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>;
>>   # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>;
>> <L2>:;
>>   [test.d : 6] *ptrD.1186_17 = 1.0e+0;
>>   [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4;
>>   [test.d : 6] goto <bb 2> (<L1>);
>>   # SUCC: 2
>>
>>   # BLOCK 5 freq:1111
>>   # PRED: 2 0
>> <L3>:;
>>   return 0;
>>   # SUCC: EXIT
>>
>> }
> 
> Note block 3, which consists of a label and nothing else. Apparently, having three basic blocks in the loop instead of two prevents gdc from vectorizing it.
> See also tree-vect-analyze.c:1861.
> I don't know enough about gdc to see where that empty block is generated, but could we please get rid of it, if possible?
> Greetings .. downs
> 

Any news on this? I believe the fact that a rather important optimization feature of GCC is completely broken in GDC merits at least a brief response.
If you don't plan to fix this, then please do tell me so.
If I made some kind of obvious, plain mistake, then please do tell me so.
But either way, I would really appreciate a response.
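
For reference, the rejection in the log above ("too many BBs in loop") looks like a plain basic-block count check. Below is a minimal C model of that rule, assuming the vectorizer accepts only loops made of exactly two basic blocks. The names toy_loop and toy_check_loop_form are made up for illustration; this is not GCC's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the vectorizer's view of a loop.
   Hypothetical type, NOT GCC's struct loop. */
struct toy_loop {
    unsigned num_nodes; /* number of basic blocks that belong to the loop */
};

/* Assumed rule: only loops with exactly two basic blocks pass.
   Returns NULL when the loop form is accepted, otherwise the
   rejection message seen in the log. */
const char *toy_check_loop_form(const struct toy_loop *loop)
{
    assert(loop != NULL);
    if (loop->num_nodes != 2)
        return "not vectorized: too many BBs in loop.";
    return NULL;
}
```

Under this model, the loop from the gdc dump (nodes 4, 3 and 2, so num_nodes == 3) is rejected, while a loop without the empty block 3 would pass.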

Oh, btw: the following code serves to illustrate the problem further.
 > import std.c.stdlib : malloc;
 >
 > int main() {
 >  float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8;
 >  version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; }
 >  else { do { *ptr=1.0; ptr++; } while (ptr!=tptr); }
 >  return 0;
 > }
One of these versions will get autovectorized; the other will not.
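
For comparison against plain gcc, the same two loop shapes can be written in C. This is only a sketch: fill_goto and fill_do_while are hypothetical names, both assume at least one element (the loops test at the bottom), and nothing is claimed here about which form gcc vectorizes.

```c
#include <assert.h>
#include <stddef.h>

/* Bottom-tested loop written with goto, mirroring version(spaghetti). */
void fill_goto(float *ptr, size_t n)
{
    assert(n > 0); /* the body runs once before the first test */
    float *tptr = ptr + n;
loop:
    *ptr = 1.0f;
    ptr++;
    if (ptr != tptr) goto loop;
}

/* The same loop as a do/while, mirroring the else branch. */
void fill_do_while(float *ptr, size_t n)
{
    assert(n > 0); /* the body runs once before the first test */
    float *tptr = ptr + n;
    do { *ptr = 1.0f; ptr++; } while (ptr != tptr);
}
```

Building this with gcc -O3 -ftree-vectorize -msse -ftree-vectorizer-verbose=7 should show how gcc treats each form.
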
Anyway, greetings. --downs