Standard way to supply hints to branches

Wed Sep 11 22:23:52 UTC 2024

On Wed, 11 Sept 2024 at 18:26, Walter Bright via Digitalmars-d <
digitalmars-d at puremagic.com> wrote:

> On 9/11/2024 4:44 AM, Manu wrote:
> > Okay, I see. You're depending on the optimiser to specifically collapse
> the goto
> > into the branch as a simplification.
>
> Actually, the same code is generated without optimization. All it's doing
> is
> removing blocks that consist of nothing but "goto". It's a trivial
> optimization,
> and was there in the earliest version of the compiler.
>
>
> > Surely that's not even remotely reliable. There are several ways to
> optimise
> > that function, and I see no reason an optimiser would reliably choose a
> > construct like you show.
>
> gcc -O does more or less the same thing.
>
>
> > I'm actually a little surprised; a lifetime of experience with this sort
> of
> > thing might have lead me to predict that the optimiser would /actually/
> shift
> > the `return 0` up into the place of the goto, effectively eliminating
> the
> > goto... I'm sure I've seen optimisers do that transformation before, but
> I can't
> > recall ever noting an instance of code generation that looks like what
> you
> > pasted... I reckon I might have spotted that before.
>
> The goto remains in the gcc -O version.
>
>
> > ... and turns out, I'm right. I was so surprised with the codegen you
> present
> > that I pulled out compiler explorer and ran some experiments.
> > I tested GCC and Clang for x86, MIPS, and PPC, all of which I am
> extremely
> > familiar with, and all of them optimise the way I predicted. None of
> them showed
> > a pattern like you presented here.
>
> gcc -O produced:
>
> ```
> foo:
>      mov       EAX,0
>      test      EDI,EDI
>      jne       L1B
>      sub       RSP,8
>      call      bar at PC32
>      mov       EAX,1
>      add       RSP,8
> L1B:    rep
>      ret
> baz:
>      mov       EAX,0
>      test      EDI,EDI
>      jne       L38
>      sub       RSP,8
>      call      bar at PC32
>      mov       EAX,1
>      add       RSP,8
> L38:    rep
>      ret
> ```
>
> > If I had to guess; I would actually imagine that GCC and Clang will very
> > deliberately NOT make a transformation like the one you show, for the
> precise
> > reason that such a transformation changes the nature of static branch
> prediction
> > which someone might have written code to rely on. It would be dangerous
> for the
> > optimiser to transform the code in the way you show, and so it doesn't.
>
> The transformation is (intermediate code):
> ```
> if (i) goto L2; else goto L4;
> L2:
>     goto L3;
> L4:
>     bar();
>     return 1;
> L3:
>     return 0;
> ```
> becomes:
> ```
> if (!i) goto L3; else goto L4;
> L4:
>      bar();
>      return 1;
> L3:
>      return 0;
> ```
> I.e. the goto->goto was replaced with a single goto.
>
> It's not dangerous or weird at all, nor does it interfere with branch
> prediction.
>

It inverts the condition. In the case on trial, that inverts the branch
prediction.

But that aside, I'm even more confused; I couldn't reproduce that in any of
my tests.
Here's a bunch of my test copiles... they all turn out the same:

gcc:

baz(int):
        test    edi, edi
        je      .L10
        xor     eax, eax
        ret
.L10:
        sub     rsp, 8
        call    bar()
        mov     eax, 1
        add     rsp, 8
        ret

clang:

baz(int):
        xor     eax, eax
        test    edi, edi
        je      .LBB0_1
        ret
.LBB0_1:
        push    rax
        call    bar()@PLT
        mov     eax, 1
        add     rsp, 8
        ret

gcc-powerpc:

baz(int):
        cmpwi 0,3,0
        beq- 0,.L9
        li 3,0
        blr
.L9:
        stwu 1,-16(1)
        mflr 0
        stw 0,20(1)
        bl bar()
        lwz 0,20(1)
        li 3,1
        addi 1,1,16
        mtlr 0
        blr

arm64:

baz(int):
        cbz     w0, .L9
        mov     w0, 0
        ret
.L9:
        stp     x29, x30, [sp, -16]!
        mov     x29, sp
        bl      bar()
        mov     w0, 1
        ldp     x29, x30, [sp], 16
        ret

clang-mips:

baz(int):
        beqz    $4, $BB0_2
        addiu   $2, $zero, 0
        jr      $ra
        nop
$BB0_2:
        addiu   $sp, $sp, -24
        sw      $ra, 20($sp)
        sw      $fp, 16($sp)
        move    $fp, $sp
        jal     bar()
        nop
        addiu   $2, $zero, 1
        move    $sp, $fp
        lw      $fp, 16($sp)
        lw      $ra, 20($sp)
        jr      $ra
        addiu   $sp, $sp, 24

Even if you can manage to convince a compiler to write the output you're
alleging, I would never imagine for a second that's a reliable strategy.
The optimiser could do all kinds of things... even though in all my
experiments, it does exactly what I predicted it would.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20240911/db7a05be/attachment-0001.htm>