[Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps
Cecil Ward
d at cecilward.com
Fri Mar 23 00:39:13 UTC 2018
On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>
> --- Comment #1 from Iain Buclaw <ibuclaw at gdcproject.org> ---
>> See the long list of useless conditional jumps towards the end
>> of the first function in the asm output (whose demangled name
>> is test.t1(uint))
>
> Well, you'd never use -O3 if you care about speed anyway. :-)
>
> And they are not useless jumps, it's just the foreach loop
> unrolled in its entirety. You can see that it's a feature of
> the gcc-7 series and later; regardless of the target, they
> all produce the same unrolled loop.
>
> https://explore.dgnu.org/g/vD3N4Y
>
> It might be a nice experiment to add pragma(ivdep) and
> pragma(unroll) support
> to give you more control.
>
> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>
> I wouldn't hold my breath though (this is not strictly a bug).
Agreed, it is possibly not a bug in the sense that the code is not
dysfunctional, though I haven't looked through it thoroughly. But
since the backend is performing an optimisation here, namely
unrolling, producing this weird sub-optimal code is in my opinion
still a bug, in that the goal of _optimisation_ is not attained.
No, I understand this is nothing to do with D, and I understand
that this is unrolling.
But notice that the jumps all target the same location, and the
sequence finishes off with an unconditional jump to that same
location. I suspect this is partly just a quirk of unrolling, but
that's not all, because the jumps don't make sense:
cmp #n / jxx L3
cmp #m / jxx L3
jmp L3
is what we have, so it all basically accomplishes nothing, unless
cmp 1 / cmp 2 / cmp 3 / cmp 4 is meant as an incredibly bad way of
testing ( x >= 1 && x <= 4 ), and with 30-odd such tests it isn't
very funny.
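For illustration, here is the kind of equivalence I mean, as a
hypothetical D sketch of my own (not the actual test.t1 code):

    // Hypothetical sketch: a membership test whose unrolled
    // comparisons all branch to the same place...
    bool inSet(uint x)
    {
        foreach (i; 1 .. 5)
            if (x == i)      // unrolled: cmp #1..#4, jxx to one label
                return true;
        return false;
    }

    // ...and the two-test range check it ought to reduce to.
    bool inSetFast(uint x)
    {
        return x >= 1 && x <= 4;
    }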
I know this is merely debug-only code, but I am wondering what else
might happen if you are misguided enough to use the crazy -O3 on
unrolled loops that have conditionals in them.
My other complaint about the GCC back-end's code generation is that
it (sometimes) doesn't go for jump-less cmov-style operations when
it can. For x86/x64, LDC sometimes generates jump-free code using
conditional instructions where GCC does not.
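For example (my own hedged sketch, not taken from the bug's test
case), a simple select like this is a natural candidate for a
conditional move on x86-64:

    // Hypothetical example: a branch-free select. On x86-64 a
    // compiler can lower this to cmovl rather than a compare and
    // branch, provided both operands are cheap and side-effect-free.
    int select(int a, int b, int x, int y)
    {
        return a < b ? x : y;
    }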
I can fix the problem with GDC by using a single & instead of &&,
which happens to be legal here. Sometimes in the past I have had to
take steps to make sure such an operator-substitution trick is
possible, in order to get jump-free and far, far faster code in
cases where the alternatives are extremely short (and
side-effect-free) and branch prediction failure is a near-certainty.
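A sketch of the operator-substitution trick, assuming (as here)
that the right-hand operand is cheap and side-effect-free:

    // With &&, the second comparison is only evaluated when the
    // first is true, which invites a short-circuit branch.
    bool both(int a, int b, int c, int d)
    {
        return a < b && c < d;
    }

    // With a single &, both comparisons are always evaluated, so
    // the compiler is free to compute two flag/setcc results and
    // combine them without any jump. Only legal when the right-hand
    // side is safe to evaluate unconditionally.
    bool bothBranchless(int a, int b, int c, int d)
    {
        return (a < b) & (c < d);
    }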
I don't know if there are ways in which the backend could try to
ascertain whether the results of a particular unrolling are really
bad. In some cases they could be bad because the code becomes too
long and causes problems with instruction-cache pressure, or no
longer fits into a loop buffer. A check on generated code size would
need to be highly specific to the particular CPU sub-variant, at
least in all cases where a loop remains (as opposed to full
unrolling of a known trip count), because every kind of AMD and
Intel processor is different, as Agner Fog warns us. Here, though, I
didn't even explicitly ask for unrolling, so you might harshly say
that it is the compiler's job to work out whether the transformation
is actually an anti-optimisation, whatever the reason the result may
be bad news, not just whether the total generated code size exceeds
some per-CPU limit.
My reason for reporting this was to inquire about how loop unrolling
behaves in later versions of the back end, to ask about jump
generation versus jump-free alternatives (LDC showing the correct
way to do things), and to ask whether there are any suboptimality
nasties lurking in code that does not merely come down to driving an
assert.
I would hope for an optimisation that handles the case of
dense-packed cmp #1 / cmp #2 / cmp #3 / cmp #4 etc., especially with
no holes, where _all the jumps go to the same target_, so that the
whole chain can be reduced to a two-test range check, a huge
optimisation. I would also hope that a conditional jump followed by
an unconditional jump to the same target could be spotted and
handled too. (A general low-level peephole optimisation then? i.e.
jxx L1 / jmp L1 => jmp L1.)
Perhaps this is all being generated too late, after those
optimisations have already happened, so the opportunities they
provide have been and gone. Would it be possible for the backend to
run repeat optimisation passes of certain kinds after unrolled code
is generated, or doesn't it work like that?
Anyway, this is not for you but for that particular backend, I
suspect. I was wondering if someone could have a word and pass it on
to the relevant people. I think it's worth making -O3 more generally
usable, rather than crazy because of good ideas gone bad.