goto [variable], and address of code labels

Sat Oct 29 06:47:25 PDT 2011

On 29.10.2011 15:16, Manu wrote:
> On 29 October 2011 10:55, Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
>
>     On 29.10.2011 3:15, Manu wrote:
>
>
>             This is instruction dispatch; the trick is in the branch
>             prediction, which operates on a per-branch basis, so a VM with
>             a single switch-jump dispatch will mispredict jumps most of the
>             time. I have seen claims of up to 99% on average.
>             If you place a whole switch at the end of every instruction's
>             code, I'm not sure the compiler will find its way out of this
>             mess, or even optimize it. I tried that with DMD; the problem
>             is that it has no idea how to use the *same* jump *table* for
>             all of the identical switches.
>
>
>         I don't quite follow what you mean here... so forgive me if I'm way
>         off, or if you're actually agreeing with me :)
>         Yes, this piece of code is effectively identical to the switch that
>         may fall at the top of an execute loop (assuming the switch compiles
>         a jump table), but there are some subtle differences...
>         As said before, there's no outer loop, which saves some code and a
>         branch,
>
>
>     It doesn't save code: you replace an unconditional jump to the start
>     of the switch with an extra table load + indirect jump per VM opcode.
>
>
> True, it may appear there is more code, but the pipeline is far more
> overlapped, it can load the jump target earlier, also there is still
> less work whichever way you slice it.

Yes, I was only arguing that the code size is, *strictly speaking*, larger.
It is still faster, but saving the extra unconditional jump is not the only
reason.
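
To make the two dispatch styles concrete, here is a minimal sketch in GNU C.
It assumes GCC or Clang, because `&&label` and `goto *ptr` are the "labels as
values" extension and not standard C (nor D, which is what this thread asks
for). The opcode set, `sample_program`, `run_switch` and `run_threaded` are
made up for illustration, and the threaded version assumes every opcode is
valid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes, for illustration only. */
enum { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Pushes 1 three times, adds them up, doubles the sum. */
static const unsigned char sample_program[] = {
    OP_PUSH1, OP_PUSH1, OP_PUSH1, OP_ADD, OP_ADD, OP_DOUBLE, OP_HALT
};

/* Classic dispatch: every opcode goes through the one switch in the loop. */
static int run_switch(const unsigned char *pc)
{
    int stack[16];
    size_t sp = 0;
    for (;;) {
        switch (*pc++) {
        case OP_PUSH1:  assert(sp < 16); stack[sp++] = 1; break;
        case OP_ADD:    assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; break;
        case OP_DOUBLE: assert(sp >= 1); stack[sp - 1] *= 2; break;
        case OP_HALT:   assert(sp >= 1); return stack[sp - 1];
        default:        return -1;   /* unknown opcode */
        }
    }
}

/* Threaded dispatch: each handler ends with its own table load and
   indirect jump, so there is no jump back to a common dispatch point. */
static int run_threaded(const unsigned char *pc)
{
    /* Label addresses, indexed by opcode (GNU C extension). */
    static void *const table[OP_COUNT] = {
        &&op_push1, &&op_add, &&op_double, &&op_halt
    };
    int stack[16];
    size_t sp = 0;

#define DISPATCH() goto *table[*pc++]
    DISPATCH();
op_push1:
    assert(sp < 16); stack[sp++] = 1; DISPATCH();
op_add:
    assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; DISPATCH();
op_double:
    assert(sp >= 1); stack[sp - 1] *= 2; DISPATCH();
op_halt:
    assert(sp >= 1); return stack[sp - 1];
#undef DISPATCH
}
```

Both functions return 6 for `sample_program`. The threaded one repeats the
dispatch sequence in every handler, which is the code-size cost mentioned
above.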

>
>         but more importantly, on architectures that have branch target
>         registers, the compiler can schedule the load to the branch target
>         earlier (while the rest of the 'opcode' is being executed), which
>         will give the processor more time to preload the instruction stream
>         from the branch destination, which will reduce the impact of the
>         branch misprediction.
>
>
>     I was thinking x86 here, but there are other processors that do not
>     have branch target register. And depending on the pipeline length
>     prediction it should be done earlier then loading of destination
>     address.
>
>
> I'm not sure what you mean here about pipeline length prediction...

Ouch.
It should have been: depending on the pipeline length, prediction should
start earlier than the load of the destination address.

> but
> with respect to x86, and other performance orientated architectures, I
> don't see how this could disadvantage the processor in any way. While an
> out-of-order processor like x86 may be able to look ahead, resolve an
> unconditional jump back to the top of a switch, and begin resolving the
> next target ahead of time, having the dispatch inline just means there
> is no unconditional jump, a couple less instructions to process.
>

No objections here, it is faster.

>
>         I'm not sure what you're trying to say about the branch prediction,
>         but this isn't a 'branch' (nor is a switch that produces a jump
>         table), it's a jump, and it will never predict since it doesn't have
>         a binary target.
>
>
>     I'm not sure about terminology, but branch === jump, and both can be
>     conditional. By switch-jump I mean tabulated indirect jump, that is
>     target is loaded from table.
>     It's an indirect jump and as such can go wherever it wants to. And
>     yes it's still predicted, mostly on basis of "it will go to the same
>     address as last time". Actually, the branch prediction mechanisms I
>     have heard of don't know whether a jump is conditional at all (that would
>     complicate it), the end result is just a block of "branch/jump at
>     address X is predicted to go to address Y".
>
>
> I would say branch != jump; a branch is conditional (I think the term
> 'branch' implies conditionality), and is a candidate for prediction.
> Branch opcodes are more often than not to absolute targets too. A jump
> is unconditional, the branch prediction hardware has no effect on jump
> opcodes, and naturally it may be an absolute jump (performs more or less
> like a noop), or an indirect one, which is the worst case.

Indirect jumps are still predicted; that's what I've been arguing.
Anyway, this is a well-known technique called threaded code, and I'm sure
there are better sources than my mumbling about optimizing it:
http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf
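
As a rough illustration of why per-handler dispatch helps such a predictor,
here is a small C model. It assumes the simplest possible scheme, one
"same target as last time" entry per indirect jump site; real predictors
differ and may do better. The opcode numbering, `abc_loop`, `misses_shared`
and `misses_threaded` are all made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { N_OPS = 3 };   /* hypothetical opcodes A=0, B=1, C=2 */

/* A VM loop body "A B C" executed four times. */
static const int abc_loop[12] = { 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2 };

/* One shared indirect jump dispatches every opcode, so a single
   predictor entry remembers only the most recent target. */
static size_t misses_shared(const int *prog, size_t n)
{
    int last = -1;                 /* -1: nothing predicted yet */
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (prog[i] != last) misses++;
        last = prog[i];
    }
    return misses;
}

/* Each handler ends in its own indirect jump, so there is one
   predictor entry per opcode that was just executed. */
static size_t misses_threaded(const int *prog, size_t n)
{
    int last[N_OPS] = { -1, -1, -1 };
    size_t misses = n > 0 ? 1 : 0; /* the cold entry dispatch always misses */
    for (size_t i = 1; i < n; i++) {
        int site = prog[i - 1];    /* jump at the end of the previous handler */
        assert(site >= 0 && site < N_OPS);
        if (prog[i] != last[site]) misses++;
        last[site] = prog[i];
    }
    return misses;
}
```

For `abc_loop`, the shared jump mispredicts all 12 dispatches in this model,
while the per-handler jumps mispredict 4 (the cold entry plus one warm-up miss
per handler).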

> CPUs will
> typically try and optimise this by prefetching the code at the earliest
> moment it can know the target. The mechanics of which are quite
> different for out-of-order processors, and architectures like PPC, which
> have a specific register for indirect jump targets.
> An x86 or some other OoO processor may look ahead far enough to find the
> target in advance, but I'm not sure about that. Simpler architectures
> will not be able to do that (Intel Atom/PPC/ARM/MIPS).
>
>
>         Many architectures attempt to alleviate this sort of unknown target
>         penalty by introducing a branch target register, which, once loaded,
>         will immediately start fetching opcodes from the target. The key for
>         the processor is to load the target address into that register as
>         early as possible to hide the instruction fetch latency. Code written
>         in the style I illustrate will give the best possible outcome to that
>         end.
>
>
>     Nice to know, which ones by the way?
>
>
> PowerPC being the most important one (only one that's relevant these
> days), ie, all games consoles. But also some popular microcontrollers.
> But I'm fairly sure most popular architectures will do the same thing
> where the jump target comes from a general register (ARM). The sooner
> the target is known, the better.
>
>
> I don't see any real disadvantage to this syntax. It's optional, has no
> side effects, and allows you to express something that can't be
> expressed any other way.
> I imagined it would be easy to implement, and it's definitely useful in
> certain circumstances, a handy tool to have, particularly when working
> on embedded platforms where compiler backends/optimisers are always weak
> and immature, and these sorts of performance considerations are more
> important (ie, games consoles, microcontrollers).

-- 
Dmitry Olshansky