gdc internals: weird way to translate "while" loops makes autovectorization impossible
Downs
default_357-line at yahoo.de
Sat Feb 24 17:20:02 PST 2007
downs wrote:
> At the moment, gdc seems to be incapable of vectorizing simple loops.
> Example.
> The following C program will, if compiled with gcc -O3 -ftree-vectorize -msse, generate SSE vectorized assembler code.
> This can be verified by compiling with -ftree-vectorizer-verbose=7.
>> #include "stdlib.h"
>>
>> int main() {
>> float*ptr=malloc(sizeof(float)*8);
>> float *tptr=ptr+8;
>> while (ptr!=tptr) { *ptr=1.0; ptr++; }
>> return 0;
>> }
>>
>
> I translated the program to D. Result:
>
>> import std.c.stdlib : malloc;
>>
>> int main() {
>> float*ptr=cast(float*)malloc(float.sizeof*8);
>> float *tptr=ptr+8;
>> while (ptr!=tptr) { *ptr=1.0; ptr++; }
>> return 0;
>> }
>>
>
> Using the same flags, gdc doesn't vectorize the loop.
> I took the gdc vectorization process apart, added a few fprintf statements, and got the following log:
> test.d.t77.vect
>
>> ;; Function main (_Dmain)
>>
>>
>> test.d:6: note: ===== analyze_loop_nest =====
>> test.d:6: note: === vect_analyze_loop_form ===
>> test.d:6: note: not vectorized: too many BBs in loop.
>>> I added that below.
>> dropping additional loop info ...
>> ;;
>> ;; Loop 1:
>> ;; header 4, latch 3
>> ;; depth 1, level 1, outer 0
>> ;; nodes: 4 3 2
>> dropping block info ...
>> Block 4
>> ;; basic block 4, loop depth 1, count 0
>> ;; prev block 3, next block 5
>> ;; pred: 3 [100.0%] (fallthru,exec) 1 [100.0%] (fallthru,exec)
>> ;; succ: 2 [100.0%] (fallthru,exec)
>> # ivtmp.27_1 = PHI <ivtmp.27_10(3), 8(1)>;
>> # TMT.5_18 = PHI <TMT.5_13(3), TMT.5_12(1)>;
>> # ptr_17 = PHI <ptr_9(3), ptr_3(1)>;
>> <L2>:;
>> # TMT.5_13 = V_MAY_DEF <TMT.5_18>;
>> *ptr_17 = 1.0e+0;
>> ptr_9 = ptr_17 + 4;
>> goto <bb 2> (<L1>);
>>
>> Block 3
>> ;; basic block 3, loop depth 1, count 0
>> ;; prev block 2, next block 4
>> ;; pred: 2 [88.9%] (dfs_back,false,exec)
>> ;; succ: 4 [100.0%] (fallthru,exec)
>> <L11>:;
>>
>> Block 2
>> ;; basic block 2, loop depth 1, count 0
>> ;; prev block 1, next block 3
>> ;; pred: 4 [100.0%] (fallthru,exec)
>> ;; succ: 5 [11.1%] (loop_exit,true,exec) 3 [88.9%] (dfs_back,false,exec)
>> <L1>:;
>> ivtmp.27_10 = ivtmp.27_1 - 1;
>> if (ivtmp.27_10 == 0) goto <L3>; else goto <L11>;
>>
>> Done
>>> From here on out, it's normal gdc output.
>> test.d:6: note: bad loop form.
>> test.d:6: note: vectorized 0 loops in function.
>> main ()
>> {
>> uintD.1160 ivtmp.27D.1244;
>> floatD.1163 * ptrD.1186;
>> floatD.1163 * tptrD.1189;
>> intD.1177 D.1196;
>> boolD.1173 D.1195;
>> boolD.1173 D.1194;
>> voidD.1154 * D.1191;
>>
>> # BLOCK 0 freq:1111, starting at line 4
>> # PRED: ENTRY
>> [test.d : 4] D.1191_2 = [test.d : 4] malloc (32);
>> [test.d : 4] ptrD.1186_3 = (floatD.1163 *) D.1191_2;
>> [test.d : 5] tptrD.1189_4 = ptrD.1186_3 + 32;
>> [test.d : 6] if (ptrD.1186_3 == tptrD.1189_4) [test.d : 6] goto <L3>; else goto <L9>;
>> # SUCC: 5 1
>>
>> # BLOCK 1 freq:988
>> # PRED: 0
>> <L9>:;
>> goto <bb 4> (<L2>);
>> # SUCC: 4
>>
>> # BLOCK 2 freq:8889, starting at line 6
>> # PRED: 4
>> [test.d : 6] <L1>:;
>> ivtmp.27D.1244_10 = ivtmp.27D.1244_1 - 1;
>> [test.d : 6] if (ivtmp.27D.1244_10 == 0) [test.d : 6] goto <L3>; else goto <L11>;
>> # SUCC: 5 3
>>
>> # BLOCK 3 freq:7901
>> # PRED: 2
>> <L11>:;
>> # SUCC: 4
>>
>> # BLOCK 4 freq:8889, starting at line 6
>> # PRED: 3 1
>> # ivtmp.27D.1244_1 = PHI <ivtmp.27D.1244_10(3), 8(1)>;
>> # ptrD.1186_17 = PHI <ptrD.1186_9(3), ptrD.1186_3(1)>;
>> <L2>:;
>> [test.d : 6] *ptrD.1186_17 = 1.0e+0;
>> [test.d : 6] ptrD.1186_9 = ptrD.1186_17 + 4;
>> [test.d : 6] goto <bb 2> (<L1>);
>> # SUCC: 2
>>
>> # BLOCK 5 freq:1111
>> # PRED: 2 0
>> <L3>:;
>> return 0;
>> # SUCC: EXIT
>>
>> }
>
> Note the block 3, which only consists of a label and nothing more. Apparently, having three BBs instead of two blocks gdc from optimizing the loop.
> See also, tree-vect-analyze.c:1861.
> I don't know enough about gdc to see where that empty block is generated, but could we get rid of it, possibly, please?
> Greetings .. downs
>
Any news on this? I believe the fact that a rather important optimization feature of GCC is completely broken on GDC merits at least a brief response.
If you don't plan to fix this, then please, do tell me so.
If I made some kind of obvious, plain mistake, then please do tell me so.
But either way, I would really appreciate a response.
Oh, btw: the following code serves to illustrate the problem further.
> import std.c.stdlib : malloc;
>
> int main() {
> float*ptr=cast(float*)malloc(float.sizeof*8); float*tptr=ptr+8;
> version(spaghetti) { loop: *ptr=1.0; ptr++; if (ptr!=tptr) goto loop; }
> else { do { *ptr=1.0; ptr++; } while (ptr!=tptr) }
> return 0;
> }
One of these versions will get autovectorized, the other not.
Anyway, greetings. --downs
More information about the D.gnu
mailing list