goto [variable], and address of code labels
Dmitry Olshansky
dmitry.olsh at gmail.com
Sat Oct 29 06:47:25 PDT 2011
On 29.10.2011 15:16, Manu wrote:
> On 29 October 2011 10:55, Dmitry Olshansky <dmitry.olsh at gmail.com
> <mailto:dmitry.olsh at gmail.com>> wrote:
>
> On 29.10.2011 3:15, Manu wrote:
>
>
> This is instruction dispatch, the trick in the branch prediction
> that operates on per branch basis, thus single switch-jump
> based VM
> dispatch will mispredict jumps most of the time. I seen
> claims of up
> to 99% on average.
> If you place a whole switch at the end of every instruction
> code I'm
> not sure compiler will find it's way out of this mess, or even
> optimize it. I tried that with DMD, the problem is it have
> no idea
> how to use the *same* jump *table* for all of identical
> switches.
>
>
> I don't quite follow what you mean here... so forgive me if I'm
> way off,
> or if you're actually agreeing with me :)
> Yes, this piece of code is effectively identical to the switch
> that may
> fall at the top of an execute loop (assuming the switch compiles
> a jump
> table), but there are some subtle differences...
> As said before, there's no outer loop, which saves some code and a
> branch,
>
>
> It doesn't save code - you replace unconditional jump to start of
> switch, with an extra load from table + indirect jump per VM opcode.
>
>
> True, it may appear there is more code, but the pipeline is far more
> overlapped, it can load the jump target earlier, also there is still
> less work whichever way you slice it.
Yes, I was just arguing that code size is *strictly speaking* larger.
Yet it's faster, but saving on extra uncoditional jump is not the only
reason.
>
> but more importantly, on architectures that have branch target
>
> registers, the compiler can schedule the load to the branch target
> earlier (while the rest of the 'opcode' is being executed),
> which will
> give the processor more time to preload the instruction stream
> from the
> branch destination, which will reduce the impact of the branch
> misprediction.
>
>
> I was thinking x86 here, but there are other processors that do not
> have branch target register. And depending on the pipeline length
> prediction it should be done earlier then loading of destination
> address.
>
>
> I'm not sure what you mean here about pipeline length prediction...
Ouch.
It should have been: depending on the pipeline length, prediction should
have started earlier then loading destination address.
but
> with respect to x86, and other performance orientated architectures, I
> don't see how this could disadvantage the processor in any way. While an
> out-of-order processor like x86 may be able to look ahead, resolve an
> unconditional jump back to the top of a switch, and begin resolving the
> next target ahead of time, having the dispatch inline just means there
> is no unconditional jump, a couple less instructions to process.
>
No objections here, it is faster.
>
> I'm not sure what you're trying to say about the branch
> prediction, but
> this isn't a 'branch' (nor is a switch that produces a jump
> table), it's
> a jump, and it will never predict since it doesn't have a binary
> target.
>
>
> I'm not sure about terminology, but branch === jump, and both can be
> conditional. By switch-jump I mean tabulated indirect jump, that is
> target is loaded from table.
> It's an indirect jump and as such can go wherever it wants to. And
> yes it's still predicted, mostly on basis of "it will go to the same
> address as last time". Actually, branch benediction mechanisms I
> heard of don't know if some jump is conditional at all (that would
> complicate it), the end result is just a block of "branch/jump at
> address X is predicted to go to address Y".
>
>
> I would say branch != jump; a branch is conditional (I think the term
> 'branch' implies conditionality), and is a candidate for prediction.
> Branch opcodes are more often than not to absolute targets too. A jump
> is unconditional, the branch prediction hardware has no effect on jump
> opcodes, and naturally it may be an absolute jump (performs more or less
> like a noop), or an indirect one, which is the worst case.
Indirect jumps are still being predicted, that's what I've been arguing.
Anyway, it's a well known stuff called threaded code, and I'm sure there
are better sources then my mumbling about optimizing it:
http://www.complang.tuwien.ac.at/anton/lvas/sem06w/revucky.pdf
cpu's will
> typically try and optimise this by prefetching the code at the earliest
> moment it can know the target. The mechanics of which are quite
> different for out-of-order processors, and architectures like PPC, which
> have a specific register indirect jump targets.
> An x86 or some other OoO processor may look ahead far enough to find the
> target in advance, but I'm not sure about that. Simplier architectures
> will not be able to do that (intel atom/ppc/arm/mips).
>
>
> Many architectures attempt to alleviate this sort of unknown target
> penalty by introducing a branch target register, which once
> loaded, will
> immediately start fetching opcodes from the target.. the key for the
> processor is to load the target address into that register as
> early as
> possible to hide the instruction fetch latency. Code written in the
> style I illustrate will give the best possible outcome to that end.
>
>
> Nice to know, which ones by the way?
>
>
> PowerPC being the most important one (only one that's relevant these
> days), ie, all games consoles. But also some popular microcontrollers.
> But I'm fairly sure most popular architectures will do the same thing
> where the jump target comes from a general register (ARM). The sooner
> the target is known, the better.
>
>
> I don't see any real disadvantage to this syntax. It's optional, has no
> side effects, and allows you to express something that can't be
> expressed any other way.
> I imagined it would be easy to implement, and it's definitely useful in
> certain circumstances, a handy tool to have, particularly when working
> on embedded platforms where compiler backends/optimisers are always weak
> and immature, and these sorts of performance considerations are more
> important (ie, games consoles, microcontrollers).
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list