[Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps
Cecil Ward
d at cecilward.com
Fri Mar 23 00:39:13 UTC 2018
On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>
> --- Comment #1 from Iain Buclaw <ibuclaw at gdcproject.org> ---
>> See the long list of useless conditional jumps towards the end
>> of the first function in the asm output (whose demangled name
>> is test.t1(uint))
>
> Well, you'd never use -O3 if you care about speed anyway. :-)
>
> And they are not useless jumps, it's just the foreach loop
> unrolled in its entirety. You can see that it's a feature of
> the gcc-7 series and later; regardless of the target, they
> all produce the same unrolled loop.
>
> https://explore.dgnu.org/g/vD3N4Y
>
> It might be a nice experiment to add pragma(ivdep) and
> pragma(unroll) support
> to give you more control.
>
> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>
> I wouldn't hold my breath though (this is not strictly a bug).
Agreed, it is possibly not a bug in the sense that the code is not
dysfunctional, though I haven't looked through it thoroughly. But
since the backend is performing an optimisation here, namely
unrolling, producing this weird sub-optimal code is in my opinion
still a bug, in that the goal of _optimisation_ is not attained.
No, I understand this is nothing to do with D, and I understand
that this is unrolling.
But notice that the jumps all target the same location, and the
sequence finishes off with an unconditional jump to that same
location. I suspect this is partly just a quirk of unrolling, but
that's not all, because the jumps don't make sense:
cmp #n / jxx L3
cmp #m / jxx L3
jmp L3
is what we have, so it all basically accomplishes nothing, unless
cmp 1 / cmp 2 / cmp 3 / cmp 4 is meant as an incredibly bad way of
testing ( x >= 1 && x <= 4 ), and with 30-odd such tests it isn't
very funny.
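For illustration, here is the kind of equivalence I mean, as a
hypothetical D sketch of my own (not the actual test.t1 code):

    // Hypothetical sketch: a membership test whose unrolled
    // comparisons all branch to the same place...
    bool inSet(uint x)
    {
        foreach (i; 1 .. 5)
            if (x == i)      // unrolled: cmp #1..#4, jxx to one label
                return true;
        return false;
    }

    // ...and the two-test range check it ought to reduce to.
    bool inSetFast(uint x)
    {
        return x >= 1 && x <= 4;
    }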
I know this is merely debug-only code, but I am wondering what else
might happen if you are misguided enough to use the crazy -O3 on
unrolled loops that have conditionals in them.
My other complaint about the GCC back-end's code generation is that
it (sometimes) doesn't go for jump-less cmov-style operations when
it can. For x86/x64, LDC sometimes generates jump-free code using
conditional instructions where GCC does not.
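For example (my own hedged sketch, not taken from the bug's test
case), a simple select like this is a natural candidate for a
conditional move on x86-64:

    // Hypothetical example: a branch-free select. On x86-64 a
    // compiler can lower this to cmovl rather than a compare and
    // branch, provided both operands are cheap and side-effect-free.
    int select(int a, int b, int x, int y)
    {
        return a < b ? x : y;
    }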
I can fix the problem with GDC by using a single & instead of &&,
which happens to be legal here. Sometimes in the past I have had to
take steps to make sure such an operator-substitution trick is
possible, in order to get jump-free and far, far faster code in
cases where the alternatives are extremely short (and
side-effect-free) and branch prediction failure is a near-certainty.
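A sketch of the operator-substitution trick, assuming (as here)
that the right-hand operand is cheap and side-effect-free:

    // With &&, the second comparison is only evaluated when the
    // first is true, which invites a short-circuit branch.
    bool both(int a, int b, int c, int d)
    {
        return a < b && c < d;
    }

    // With a single &, both comparisons are always evaluated, so
    // the compiler is free to compute two flag/setcc results and
    // combine them without any jump. Only legal when the right-hand
    // side is safe to evaluate unconditionally.
    bool bothBranchless(int a, int b, int c, int d)
    {
        return (a < b) & (c < d);
    }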
I don't know if there are ways in which the backend could try to
ascertain whether the results of a particular unrolling are really
bad. In some cases they could be bad because the code becomes too
long and causes problems with instruction-cache pressure, or no
longer fits into a loop buffer. A check on generated code size would
need to be highly specific to the particular CPU sub-variant, at
least in all cases where a loop remains (as opposed to full
unrolling of a known trip count), because every kind of AMD and
Intel processor is different, as Agner Fog warns us. Here, though, I
didn't even explicitly ask for unrolling, so you might harshly say
that it is the compiler's job to work out whether the transformation
is actually an anti-optimisation, whatever the reason the result may
be bad news, not just whether the total generated code size exceeds
some per-CPU limit.
My reason for reporting this was to inquire about how loop unrolling
behaves in later versions of the back end, to ask about jump
generation versus jump-free alternatives (LDC showing the correct
way to do things), and to ask whether there are any suboptimality
nasties lurking in code that does not merely come down to driving an
assert.
I would hope for an optimisation that handles the case of
dense-packed cmp #1 / cmp #2 / cmp #3 / cmp #4 etc., especially with
no holes, where _all the jumps go to the same target_, so that the
whole chain can be reduced to a two-test range check, a huge
optimisation. I would also hope that a conditional jump followed by
an unconditional jump to the same target could be spotted and
handled too. (A general low-level peephole optimisation then? i.e.
jxx L1 / jmp L1 => jmp L1.)
Perhaps this is all being generated too late, after those
optimisations have already happened, so the opportunities they
provide have been and gone. Would it be possible for the backend to
run repeat optimisation passes of certain kinds after unrolled code
is generated, or doesn't it work like that?
Anyway, this is not for you but for that particular backend, I
suspect. I was wondering if someone could have a word and pass it on
to the relevant people. I think it's worth making -O3 more generally
usable, rather than crazy because of good ideas gone bad.