[Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps
Cecil Ward
d at cecilward.com
Mon Mar 26 04:48:58 UTC 2018
On Friday, 23 March 2018 at 22:14:35 UTC, Iain Buclaw wrote:
> On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
>> On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
>>> https://bugzilla.gdcproject.org/show_bug.cgi?id=288
>>>
>>> --- Comment #1 from Iain Buclaw <ibuclaw at gdcproject.org> ---
>>>> See the long list of useless conditional jumps towards the
>>>> end of the first function in the asm output (whose demangled
>>>> name is test.t1(unit))
>>>
>>> Well, you'd never use -O3 if you care about speed anyway. :-)
>>>
>>> And they are not useless jumps, it's just the foreach loop
>>> unrolled in its entirety. You can see that it's a feature of
>>> the gcc-7 series and later, regardless of the target; they
>>> all produce the same unrolled loop.
>>>
>>> https://explore.dgnu.org/g/vD3N4Y
>>>
>>> It might be a nice experiment to add pragma(ivdep) and
>>> pragma(unroll) support
>>> to give you more control.
>>>
>>> https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html
>>>
>>> I wouldn't hold my breath though (this is not strictly a bug).
>>
>> Agreed. It is possibly not a bug, because I don't see that the
>> code is dysfunctional, although I haven't looked through it all.
>> But since the backend is performing an optimisation here with
>> unrolling, producing this weird sub-optimal code is imho a bug,
>> in the sense that the goal of _optimisation_ is not attained.
>>
>> No, I understand this is nothing to do with D, and I understand
>> that this is unrolling.
>>
>> But notice that the jumps all target the same location, and the
>> sequence finishes off with an unconditional jump to that same
>> location.
>>
>
> Not quite; if you look a little closer, some jump to other
> branches hidden in between.
>
>> I feel this is partly just a quirk of unrolling, but that's not
>> all, as the jumps don't make sense:
>> cmp #n / jxx L3
>> cmp #m / jxx L3
>> jmp L3
>>
>> is what we have, so it all basically does absolutely nothing,
>> unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is an incredibly bad way
>> of testing ( x >= 1 && x <= 4 ); with 30-odd tests it isn't
>> very funny.
>>
>
> If you compile with -fdump-tree-optimized=stdout you will see
> that it's the middle-end that has lowered the code to a series
> of if jumps.
>
> The backend consumer doesn't really have any chance for
> improving it.
>
>> I know this is merely debug-only code, but am wondering what
>> else might happen if you are misguided enough to use the crazy
>> -O3 with unrolled loops that have conditionals in them.
>>
>> My other complaint about the GCC back-end's code generation is
>> that it (sometimes) doesn't go for jump-less movcc-style
>> operations when it can. For x86/x64, LDC sometimes generates
>> jump-free code using conditional instructions where GCC does
>> not.
>>
>> I can fix the problem with GDC by using a single & instead of
>> &&, which happens to be legal here. Sometimes in the past I
>> have needed to take steps to make sure that I can do such an
>> operator substitution trick in order to get jump-free, far
>> faster code, in cases where the alternatives are extremely
>> short (and side-effect-free) and branch prediction failure is a
>> certainty.
>>
>
> You could also try compiling with -O2. I couldn't really see
> this in your given example, but honestly, if you want to
> optimize really aggressively you must be willing to coax the
> compiler in strange ways anyway.
>
>> I don't know if there are ways in which the backend could try
>> to ascertain whether the results of certain unrollings are
>> really bad. In some cases they could be bad because the code is
>> too long and causes problems with code cache size, or won't fit
>> into a loop buffer. A highly CPU-specific check on the
>> generated code size would need to be carried out, at least in
>> all cases where there is still a loop left (as opposed to full
>> unrolling of a known size), as every kind of AMD and Intel
>> processor is different, as Agner Fog warns us. Here, though, I
>> didn't even explicitly ask for unrolling, so you might harshly
>> say that it is the compiler's job to work out whether it is
>> actually an anti-optimisation, regardless of the possible
>> reasons why the result may be bad news, never mind just whether
>> the total generated code size fits into some per-CPU limit.
>>
>
> Well again, from past experience -O3 doesn't really care about
> code size or cache lines so much. All optimization passes
> which lower the code this way do so during SSA transformations,
> irrespective of what is being targeted.
>
>> My reason for reporting this was to inquire about how loop
>> unrolling behaves in later versions of the back end, to ask
>> about jump generation vs jump-free alternatives (LDC showing
>> the correct way to do things), and to ask if there are any
>> suboptimality nasties lurking in code that does not merely come
>> down to driving an assert.
>>
>> I would hope for an optimisation that handles the case of
>> densely-packed cmp #1 | cmp #2 | cmp #3 | cmp #4 etc,
>> especially with no holes, in the case where _all the jumps go
>> to the same target_, so this can be reduced down to a two-test
>> range check, a huge optimisation. I would also hope that
>> conditional jumps followed by an unconditional jump could be
>> spotted and handled too. (General low-level peephole
>> optimisation then? i.e. jxx L1 / jmp L1 = jmp L1)
>>
>
> But is what -O3 is doing really sub-optimal? Have you
> benchmarked it? I certainly didn't.
>
> ---
> bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
> out ( ret )
> {
>     assert( ret == true || ret == false );
>     debug
>     {
>         bool b = false;
>         foreach( s; 0 .. 8 * x.sizeof )
>             b = b || ( x == (1uL << s) );
>         assert( ret == b );
>     }
> }
> body
> {
>     return ( ( x & (x - 1) ) == 0 ) & (x > 0);
> }
>
> bool t1( int x )
> {
>     return IsPowerOf2( x );
> }
>
> void main(string[] args)
> {
>     int test()
>     {
>         return t1(cast(int)args.length);
>     }
>
>     import std.datetime.stopwatch;
>     import std.stdio;
>
>     auto r = benchmark!(test)(10000000);
>     writeln(r);
> }
> ---
>
> [54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param
> max-peel-branches=32
> [168 ms, 71 μs, and 1 hnsec]: gdc -O3 -fdebug --param
> max-peel-branches=24
>
> Looks like, despite your complaint, it is 3x faster with what
> you call "strange nonsensical code generation". :-)
>
>> Perhaps this is all being generated too late: the optimisations
>> have already happened, and the opportunities those optimisers
>> provide have been and gone. Would it be possible for the
>> backend to include a number of repeat optimisation passes of
>> certain kinds after unrolled code is generated, or doesn't it
>> work like that?
>>
>
> I had a quick look, and there is a control parameter for this -
> '--param max-peel-branches', the default value is 32, reducing
> it to 24 is enough to ensure that the complete unrolling never
> happens, but see benchmarks for why changing this may not be a
> good idea.
>
>> Anyway, this is not for you but for that particular backend, I
>> suspect. I was wondering if someone could have a word, pass it
>> in to the relevant people. I think it's worth making -O3 more
>> generally usable rather than crazy because it features good
>> ideas gone bad.
>
> No, as a language implementer, our job is only to guarantee
> that the semantics never break. Anything else is not relevant
> to us.
>
> In this case, though, it looks like everything is fine and
> there's nothing to report.
Iain, you and I are fully in agreement, but I was hoping some
feedback could be given to the relevant backend guys. Those runs
of multiple cmps could in part be reduced down to range checks,
and a conditional jump followed by an unconditional jump to the
same target is silly machine code generation.
Everything is indeed fine with GDC per se, as you say. And in no
way was it meant to be a whinge about GDC. It's a tool I have a
lot of admiration for, and we are lucky to have _two_ high
quality code generators for x64 for D source, in GDC and LDC. The
backend implementors can perhaps even keep an eye on one another's
tricks.
When I used the terms optimal / suboptimal I was thinking like an
asm programmer, at the level of replacement of individual groups
of instructions and the removal of instructions that either (a)
literally have no useful reason to be there or (b) are very far
from what any sane human asm programmer would write.
This was not intended as a comment about optimisation of an
actual _program_: after all, the snippet of D code in question
is merely a debug-build-only assert, so its very function is to
slow the program down :-) because it is just sanity-checking a
return value. So I certainly wouldn't have benchmarked it, as it's
only debug code and in the release build it's entirely gone.
I understand your helpful point that the relevant
transformations have already been triggered at a higher level
and so cannot be undone later. It must be, then, that there is
no last-chance, ultra-low-level x64 peephole optimiser that
takes care of whatever happens to come out, of the form cmp #n /
jcc label1, cmp #n+1 / jcc label1, where the immediate values
are dense: in that case a dozen tests could be replaced by two.
And the jcond label1 / jmp label1 pair is of course always daft
when the two jump targets are identical, so that too would be an
opportunity for a peephole optimiser. I don't know if any
backends do such a thing under some circumstances anyway.
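The jcond/jmp collapse is easy to state as a peephole rule. Here is a toy model in C (entirely invented for illustration; real backends work on RTL or similar IR, not a struct array like this) of a pass that deletes a conditional jump immediately followed by an unconditional jump to the same label:

```c
#include <stddef.h>

/* A toy symbolic instruction stream: just enough to express the
 * "jcc L followed by jmp L" pattern discussed above. */
enum kind { OTHER, JCC, JMP };

struct insn { enum kind k; int target; };

/* Drops any conditional jump that is immediately followed by an
 * unconditional jump to the same label (the jcc is dead: whether the
 * condition holds or not, control ends up at the same place).
 * Compacts the stream in place and returns the new length. */
size_t peephole(struct insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (code[i].k == JCC && i + 1 < n &&
            code[i + 1].k == JMP &&
            code[i + 1].target == code[i].target)
            continue;   /* skip the useless jcc */
        code[out++] = code[i];
    }
    return out;
}
```

When the targets differ, the pair is left alone, which matches the caveat above that some of the jumps in the real output go to other branches hidden in between.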
I can't remember whether I spotted that some jump targets were
different; good point you made, and I didn't mean to mislead
you. I meant that there were _runs_: adjacent groups of jumps to
identical targets with dense check values, and then the finish
to it.
I asked about this because the GCC etc. compilers normally do
not produce x64 code with the _latter_ idiom, in my limited
experience. Perhaps this is only so because the middle ends
protect all the backends, and the backends do not need to
reinvent the wheel by implementing those kinds of
machine-instruction-group checks themselves internally. I once
wrote some transformational optimiser code analogous to this
myself, though not in a normal programming-language compiler: I
wrote a very complex thing to minimise XML, with a dozen layers
of transformations. I found that deciding the order of optimiser
transformations is a pain, and so I ended up doing the same
transformation in more than one layer, just in case an
intermediate optimisation produced output containing a kind of
bad idiom that would have been caught and stripped out by a
higher-level optimiser but, at a lower layer, is now too late to
catch there, so I catch it below instead with a re-run.
One nuisance with the GCC family's code generation is that it is
more willing to use conditional jumps than LLVM is, from my
brief experience of comparing code quality between GDC and LDC.
I wish there were a way of giving the compiler stack a big hint
to tell it to avoid branches like the absolute plague wherever
possible. The process of deciding which way to go must be very
difficult and complex. Somehow LDC seems to get lucky.
There is also the question of the difference between & and &&.
Changing the source code from && to & (which happens to be
legitimate in this particular case) makes the jumping go away.
There is something in the middle of the compilers, which I don't
understand, that has to decide whether to do things one way or
the other within the limits of per-language semantics: the
guarantee (short-circuit evaluation) that && and || will not
execute some of the intermediate tests. Of course, most of the
time analysis could possibly show that there is no issue at all,
and perhaps evaluating everything (because the tests are safe)
across all the &&s and ||s does no harm in some particular
special case, despite what the language guarantees. These
compilers are certainly good enough to exploit conditional moves
and sequences of arithmetic operations to do ?:-type operations
jump-free, and I may have seen generated code in some places
that is jump-free and uses AND and OR instructions in
conjunction with e.g. x86 SETcc instructions, doing a truly
excellent job of staying away from conditional jumps that would
overpower branch predictors and dominate execution time in
certain small tight loops.
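As a concrete sketch of the &-for-&& substitution (again in C rather than D, names invented, and only an illustration of the idea): here is the power-of-two test written both ways. Because both comparisons are cheap, side-effect-free, and yield 0 or 1, the bitwise & form computes the same result while handing the optimiser an easy setcc/and sequence instead of a branch.

```c
/* Short-circuit form: the compiler must honour && semantics unless it
 * can prove the right-hand side is safe to evaluate unconditionally,
 * which can leave a conditional jump in the generated code. */
int is_pow2_shortcircuit(unsigned x)
{
    return ((x & (x - 1)) == 0) && (x > 0);
}

/* Bitwise form: both comparisons produce 0 or 1, so & gives the same
 * answer, with no short-circuit obligation and hence no forced branch. */
int is_pow2_branchless(unsigned x)
{
    return ((x & (x - 1)) == 0) & (x > 0);
}
```

The substitution is only valid when, as here, both operands are already 0/1-valued and free of side effects; that is exactly the condition mentioned earlier for the operator-substitution trick.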
Apologies for bothering you, anyway. It's just some feedback
aimed at other people, but I wouldn't have any idea where to go,
which is my ignorant fault.
Many thanks for all you do.