[Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps

Iain Buclaw ibuclaw at gdcproject.org
Fri Mar 23 22:14:35 UTC 2018


On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
> On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
>> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>>
>> --- Comment #1 from Iain Buclaw <ibuclaw at gdcproject.org> ---
>>> See the long list of useless conditional jumps towards the 
>>> end of the first function in the asm output (whose demangled 
>>> name is test.t1(unit))
>>
>> Well, you'd never use -O3 if you care about speed anyway. :-)
>>
>> And they are not useless jumps; it's just the foreach loop 
>> unrolled in its entirety.  You can see that it's a feature of 
>> the gcc-7 series and later: regardless of the target, they 
>> all produce the same unrolled loop.
>>
>> https://explore.dgnu.org/g/vD3N4Y
>>
>> It might be a nice experiment to add pragma(ivdep) and 
>> pragma(unroll) support
>> to give you more control.
>>
>> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>>
>> I wouldn't hold my breath though (this is not strictly a bug).
>
> Agreed. It is possibly not a bug, because I don't see that the 
> code is dysfunctional, though I haven't looked through it. But 
> since the backend is doing optimisation here with unrolling, 
> producing this weird sub-optimal code is imho a bug, in the 
> sense that the goal of _optimisation_ is not attained.
>
> Now, I understand this is nothing to do with D, and I 
> understand that this is unrolling.
>
> But notice that the jumps all target the same location, and 
> the sequence finishes off with an unconditional jump to that 
> same location.
>

Not quite; if you look a little closer, some jump to other 
branches hidden in between.

> I feel this is partly just a quirk of unrolling, but that's 
> not all, as the jumps don't make sense:
>
>  cmp #n / jxx L3
>  cmp #m / jxx L3
>  jmp L3
>
> is what we have, so it all basically does absolutely nothing, 
> unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is an incredibly bad way 
> of testing ( x>=1 && x<=4 ); but with 30-odd tests it isn't 
> very funny.
>

If you compile with -fdump-tree-optimized=stdout you will see 
that it's the middle-end that has lowered the code to a series of 
if jumps.

The backend consumer doesn't really have any chance of improving 
it.
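
For illustration, here is a rough C sketch (my own, not GCC's 
actual GIMPLE dump; the function name is made up) of the shape 
that lowering takes: the foreach over shift amounts in the 
contract is fully peeled, leaving one compare-and-jump per power 
of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the peeled form the middle-end produces:
   each loop iteration becomes an independent equality test that
   jumps to the same "found" target. */
static int is_pow2_peeled(uint32_t x)
{
    if (x == 1u << 0) return 1;
    if (x == 1u << 1) return 1;
    if (x == 1u << 2) return 1;
    /* ...in the fully peeled code, 29 more such comparisons follow,
       one per remaining bit position; a loop stands in for them here. */
    for (unsigned s = 3; s < 32; ++s)
        if (x == 1u << s) return 1;
    return 0;
}
```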

> I know this is merely debug-only code, but am wondering what 
> else might happen if you are misguided enough to use the crazy 
> -O3 with unrolled loops that have conditionals in them.
>
> My other complaint about GCC back-end's code generation is that 
> it (sometimes) doesn't go for jump-less movcc-style operations 
> when it can. For x86/x64, LDC sometimes generates jump-free 
> code using conditional instructions where GCC does not.
>
> I can fix the problem with GDC by using a single & instead of 
> &&, which happens to be legal here. Sometimes in the past I 
> have needed to take steps to make sure that I can do such an 
> operator substitution trick in order to get jump-free, far 
> faster code - faster in cases where the alternatives are 
> extremely short (and side-effect-free) and branch prediction 
> failure is a certainty.
>

You could also try compiling with -O2.  I couldn't really see 
this in your given example, but honestly, if you want to optimize 
really aggressively you must be willing to coax the compiler in 
strange ways anyway.
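
The &-for-&& substitution described above can be sketched in C 
(an illustration under the assumption that both operands are 
cheap and side-effect free; the function names are made up):

```c
#include <assert.h>

/* With &&, the right operand must not be evaluated when the left is
   false, which often forces a conditional branch.  With plain &, both
   0/1-valued operands are evaluated unconditionally, so the compiler
   is free to use flag-materialising instructions (setcc on x86)
   instead of a jump.  Safe only when both sides are side-effect free. */
static int is_pow2_and_and(int x) { return ((x & (x - 1)) == 0) && (x > 0); }
static int is_pow2_bitand(int x)  { return ((x & (x - 1)) == 0) &  (x > 0); }
```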

> I don't know if there are ways in which the backend could try 
> to ascertain whether the results of certain unrolling are 
> really bad. In some cases they could be bad because the code 
> is too long and causes problems with code cache size, or 
> won't fit into a loop buffer. A highly CPU-specific check on 
> the generated code size would need to be carried out, at 
> least in all cases where there is still a loop left (as 
> opposed to full unrolling of known size), as every kind of 
> AMD and Intel processor is different, as Agner Fog warns us. 
> Here, though, I didn't even explicitly ask for unrolling, so 
> you might harshly say that it is the compiler's job to work 
> out whether it is actually an anti-optimisation, regardless 
> of the possible reasons why the result may be bad news, never 
> mind just based on total generated code size not fitting into 
> some per-CPU limit.
>

Well again, from past experience -O3 doesn't really care about 
code size or cache lines so much.  All optimization passes which 
lower the code this way do so during SSA transformations, 
irrespective of what is being targeted.

> My reason for reporting this was to inquire about how loop 
> unrolling behaves in later versions of the back end, to ask 
> about jump generation vs jump-free alternatives (LDC showing 
> the correct way to do things), and to ask if there are any 
> suboptimality nasties lurking in code that does not merely 
> come down to driving an assert.
>
> I would hope for an optimisation that handles the case of 
> dense-packed cmp #1 | cmp #2 | cmp #3 | cmp #4 etc, especially 
> with no holes, in the case where _all the jumps go to the same 
> target_, so this can get reduced down into a two-test range 
> check - a huge optimisation. I would also hope that conditional 
> jumps followed by an unconditional jump to the same target 
> could be spotted and handled too. (A general low-level peephole 
> optimisation then? i.e. jxx L1 / jmp L1 = jmp L1.)
>

But is what -O3 is doing really sub-optimal?  Have you 
benchmarked it?  I certainly hadn't, until now.
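
As an aside, the range-check reduction hoped for above is real: 
a dense chain of equality tests that all jump to the same target 
is equivalent to a single two-ended range check, and the classic 
unsigned-subtraction idiom folds both bounds into one comparison. 
A C sketch (illustrative only; no claim that GCC performs exactly 
this rewrite here, and the function names are made up):

```c
#include <assert.h>

/* The cmp #1 / cmp #2 / cmp #3 / cmp #4 chain, as written. */
static int in_range_chain(int x)
{
    return x == 1 || x == 2 || x == 3 || x == 4;
}

/* The folded form: (unsigned)(x - 1) wraps negative x and any x > 4
   to a value above 3u, so one comparison tests both bounds. */
static int in_range_folded(int x)
{
    return (unsigned)(x - 1) <= 3u;
}
```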

---
bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
out ( ret )
{
     assert( ret == true || ret == false );
     debug
     {
         // Brute-force cross-check: x is a power of two iff it
         // equals 1 << s for some shift s in [0, 8 * T.sizeof).
         bool b = false;
         foreach( s; 0 .. 8 * x.sizeof )
             b = b || ( x == (1uL << s) );
         assert( ret == b );
     }
}
body
{
     // x & (x - 1) clears the lowest set bit, so it is zero iff
     // at most one bit is set; the bitwise & (rather than &&)
     // with x > 0 keeps the code jump-free.
     return ( ( x & (x - 1) ) == 0 ) & (x > 0);
}

bool t1( int x )
{
     return IsPowerOf2( x );
}

void main(string[] args)
{
     int test()
     {
         return t1(cast(int)args.length);
     }

     import std.datetime.stopwatch;
     import std.stdio;

     auto r = benchmark!(test)(10000000);
     writeln(r);
}
---

[54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param 
max-peel-branches=32
[168 ms, 71 μs, and 1 hnsec]:  gdc -O3 -fdebug --param 
max-peel-branches=24

Looks like, despite your complaint, it is 3x faster with what you 
call "strange nonsensical code generation". :-)

> Perhaps this is all being generated too late - optimisations 
> have already happened, and the opportunities those optimisers 
> provide have been and gone. Would it be possible for the 
> backend to include a number of repeat optimisation passes of 
> certain kinds after unrolled code is generated, or doesn't it 
> work like that?
>

I had a quick look, and there is a control parameter for this: 
'--param max-peel-branches'.  The default value is 32; reducing 
it to 24 is enough to ensure that the complete unrolling never 
happens, but see the benchmarks above for why changing this may 
not be a good idea.

> Anyway, this is not for you but for that particular backend, I 
> suspect. I was wondering if someone could have a word, pass it 
> on to the relevant people. I think it's worth making -O3 more 
> generally usable, rather than crazy because it features good 
> ideas gone bad.

No; as language implementers, our job is only to guarantee that 
the semantics never break.  Anything else is not relevant to us.

In this case, it looks like everything is fine and there's 
nothing to report.



More information about the D.gnu mailing list