goto [variable], and address of code labels

Manu turkeyman at gmail.com
Fri Oct 28 16:15:25 PDT 2011


On 28 October 2011 22:30, Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:

> On 28.10.2011 22:57, ponce wrote:
>
>> Provided some hairy conditions, the switch instruction will optimize to
>> a jump table in GCC and probably most C compilers.
>>
>>
> They do, but you have no guarantees. It will either fly or crawl, depending
> on the sparseness of the values. Even so, it *usually* makes a table; just
> check on it from time to time ;).


It does, sure, but that switch must then be wrapped in looping logic, which
adds extra code and branching to the loop.
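
For reference, the loop-wrapped switch being discussed looks roughly like
the sketch below. It is written in C, and the four-instruction bytecode and
the name run_switch are made up for illustration; they are not taken from
any real VM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-instruction bytecode, invented for this sketch. */
enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

/* Loop-plus-switch interpreter: every opcode handler returns to the top
   of the loop and goes through the same single dispatch point. */
int run_switch(const uint8_t *program)
{
    assert(program != NULL);
    size_t pc = 0;
    int acc = 0;

    for (;;) {                       /* the extra looping logic */
        switch (program[pc++]) {     /* the one shared dispatch jump */
        case OP_INC:    acc += 1; break;
        case OP_DEC:    acc -= 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return -1;   /* unknown opcode */
        }
    }
}
```

Given the hypothetical program { OP_INC, OP_INC, OP_DOUBLE, OP_DEC,
OP_HALT }, run_switch returns 3. Whether the switch actually becomes a jump
table is still up to the compiler.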


>> In ICC, some static analysis is even used to optimize out the test
>> before the switch.
>>
> But switch-jump alone is not enough to get an efficient VM interpreter.
> The thing is this tiny bit at the end of every instruction code:
> ....
>  goto opcodes[pProgram[regs.PC++]];
>
> This is instruction dispatch. The trick is in the branch prediction, which
> operates on a per-branch basis, so a VM with a single switch-jump dispatch
> will mispredict jumps most of the time. I've seen claims of up to 99% on
> average. If you place a whole switch at the end of every instruction's code,
> I'm not sure the compiler will find its way out of this mess, or even
> optimize it. I tried that with DMD; the problem is that it has no idea how
> to use the *same* jump *table* for all of the identical switches.


I don't quite follow what you mean here... so forgive me if I'm way off, or
if you're actually agreeing with me :)
Yes, this piece of code is effectively identical to the switch that may fall
at the top of an execute loop (assuming the switch compiles a jump table),
but there are some subtle differences...
As said before, there's no outer loop, which saves some code and a branch.
More importantly, on architectures that have branch target registers, the
compiler can schedule the load of the branch target earlier (while the rest
of the 'opcode' is being executed). That gives the processor more time to
preload the instruction stream from the branch destination, which reduces
the impact of the branch misprediction.
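
A rough sketch of that style, using GNU C's "labels as values" extension
(&&label and goto *ptr, supported by GCC and Clang), since it is the closest
existing thing to the goto [variable] being asked for. The bytecode and the
name run_threaded are hypothetical. The sketch assumes a well-formed program
that ends in OP_HALT and does no bounds checking on opcodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-instruction bytecode, invented for this sketch. */
enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

/* Threaded dispatch: no outer loop, each handler jumps straight to the
   next one. Requires the GNU C labels-as-values extension. */
int run_threaded(const uint8_t *program)
{
    /* Table of handler addresses, indexed by opcode. */
    static void *const opcodes[] = {
        [OP_INC]    = &&op_inc,
        [OP_DEC]    = &&op_dec,
        [OP_DOUBLE] = &&op_double,
        [OP_HALT]   = &&op_halt,
    };

    assert(program != NULL);
    size_t pc = 0;
    int acc = 0;
    void *next;

    next = opcodes[program[pc++]];
    goto *next;

op_inc:
    next = opcodes[program[pc++]];  /* load the next target first... */
    acc += 1;                       /* ...then do the opcode's work */
    goto *next;                     /* per-handler indirect jump */

op_dec:
    next = opcodes[program[pc++]];
    acc -= 1;
    goto *next;

op_double:
    next = opcodes[program[pc++]];
    acc *= 2;
    goto *next;

op_halt:
    return acc;
}
```

Writing the load before the work only expresses the intent in source; the
compiler's scheduler is free to reorder it, and whether the early load pays
off depends on the target architecture.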

I'm not sure what you're trying to say about the branch prediction, but this
isn't a 'branch' (nor is a switch that produces a jump table), it's a jump,
and it will never predict since it doesn't have a binary target. Many
architectures attempt to alleviate this sort of unknown target penalty by
introducing a branch target register, which once loaded, will immediately
start fetching opcodes from the target. The key for the processor is to
load the target address into that register as early as possible to hide the
instruction fetch latency. Code written in the style I illustrate will give
the best possible outcome to that end.