Faster Virtual Method Dispatch
kris
foo at bar.com
Sun Apr 23 12:31:46 PDT 2006
Craig Black wrote:
> Most programmers do not realize the overhead involved with calling a method
> virtually. A virtual method call is an O(1) operation: just a vtable
> lookup and invoking a function pointer. However, we must also consider
> that the CPU has no way of performing branch prediction with vtable lookups,
> so the pipeline is trashed. Thus, virtual method calls incur a large
> performance penalty.
>
> Interestingly, I found that one of the MSVC++ 2005 compiler's profile-guided
> optimizations replaces the vtable lookup with if statements for the most
> commonly called virtual methods, falling back to the vtable lookup, so that
> branch prediction can be performed in the majority of cases.
>
> I do not consider myself an expert in the area of compiler optimization, but
> knowing the little that I do know I have come up with a simple approach that
> may improve performance. This approach is described below.
>
> Performing a binary search using if statements, and then invoking the
> methods directly rather than using function pointers, would allow the method
> dispatch to take advantage of pipelining. Algorithmically, however, this is
> no longer O(1), but O(log n), which is bad. So I would assume that it would
> only be faster if n (or the number of methods in the vtable) is lower than a
> given threshold. This threshold would be determined by testing and would
> perhaps be different on different processors. Thus, in order to ensure that
> this approach is always faster, a relatively low number should be used.
>
> Like I said I'm not an expert when it comes to low-level optimization
> strategies. I don't know how well such an algorithm would work in practice.
> Comments are obviously welcome.
>
> -Craig
>
>
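For concreteness, I read the binary-search dispatch as something along these
lines; a toy C++ sketch, where Shape/Circle/Square/Tri and the per-class
kind id are invented purely for illustration:

// Toy sketch only: the hierarchy and the "kind" id are made up.
struct Shape {
    enum Kind { K_Circle, K_Square, K_Tri };
    Kind kind;
    Shape(Kind k) : kind(k) {}
    virtual ~Shape() {}
    virtual double area() const = 0;   // ordinary virtual, for comparison
};

struct Circle : Shape {
    double r;
    Circle(double r) : Shape(K_Circle), r(r) {}
    double area() const { return 3.14159265 * r * r; }
};

struct Square : Shape {
    double s;
    Square(double s) : Shape(K_Square), s(s) {}
    double area() const { return s * s; }
};

struct Tri : Shape {
    double b, h;
    Tri(double b, double h) : Shape(K_Tri), b(b), h(h) {}
    double area() const { return 0.5 * b * h; }
};

// Dispatch by comparing the id (a binary search once the set grows), making
// direct, predictable, inlinable calls instead of one indirect call.
double area_dispatch(const Shape& s) {
    if (s.kind <= Shape::K_Square) {
        if (s.kind == Shape::K_Circle)
            return static_cast<const Circle&>(s).Circle::area();
        return static_cast<const Square&>(s).Square::area();
    }
    return static_cast<const Tri&>(s).Tri::area();
}

Whether a couple of compares plus a direct call actually beats one indirect
call is exactly the threshold-on-n question, I suppose.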
I'm no expert either, Craig, yet it seems that there's more to it? An
indirect address lookup (via a vtable) doesn't trash the pipeline per
se; I forget the exact cycle counts, but there's very little overhead at
all if the vtable is a "popular" one, or otherwise happens to be sitting
in the cache. None of the other in-flight instructions have to be flushed
at that time. If the vtable is not cached, then a bubble will appear in
the pipeline while main memory is accessed. But, again (as I understand
it), this doesn't invalidate any in-flight instructions; it just feeds
the core less efficiently.
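For the record, my mental model of a virtual call is roughly two dependent
loads plus an indirect call. A hand-waved C-style sketch (real vtable
layouts differ per compiler and ABI; VTable and Object below are just
stand-ins):

typedef double (*AreaFn)(const void* self);

struct VTable { AreaFn area; /* ...other slots... */ };
struct Object { const VTable* vptr; /* ...fields... */ };

double call_area(const Object* p) {
    const VTable* vt = p->vptr;   // load 1: fetch the vptr from the object
    AreaFn fn = vt->area;         // load 2: fetch the slot from the vtable
    return fn(p);                 // indirect call through the pointer
}

If either load misses you get the bubble; as I understand it, a flush only
happens when the indirect branch target itself is mispredicted.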
On the other hand, a branch misprediction must invalidate a large number
of in-flight instructions; usually all of them (unless multi-path
speculative execution is occurring). The P4 is particularly maimed by
this, compared to its peers, because of its exceptionally long pipeline
(which, in turn, was needed to reach the frequencies Intel marketeers
felt they needed to drive AMD off the map).
Obviously I don't know enough about this, but it might be interesting to
research how one might introduce some locality-of-reference bias for the
more commonly made calls. For instance, some fancy runtime might rewrite
"common" calls to use a dynamically constructed "global vtable". That
would perhaps avoid the misprediction concerns, whilst managing to retain
the globally popular set of lookups within the cache?
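Thinking out loud, maybe something of this shape? Everything below
(HotEntry, g_hot_table, lookup_hot) is invented; it's only meant to suggest
copying the hottest (vptr, slot) targets into one small table that stays
resident in cache:

typedef void (*Thunk)(void* self);

// One entry per "hot" (dynamic type, virtual slot) pair, filled in by some
// profiling runtime; the whole table is meant to fit in a cache line or two.
struct HotEntry {
    const void* vptr;   // identifies the dynamic type
    unsigned    slot;   // which virtual method
    Thunk       fn;     // target copied out of the real vtable
};

static HotEntry g_hot_table[8];
static unsigned g_hot_count = 0;

// Call-site helper: probe the shared hot table first, otherwise fall back to
// the ordinary per-class vtable dispatch the caller would have done anyway.
inline Thunk lookup_hot(const void* vptr, unsigned slot, Thunk fallback) {
    for (unsigned i = 0; i < g_hot_count; ++i)
        if (g_hot_table[i].vptr == vptr && g_hot_table[i].slot == slot)
            return g_hot_table[i].fn;
    return fallback;
}

Whether the probe would ever pay for itself, I've no idea; the point is more
about keeping the popular targets in one warm spot than about removing the
indirection.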
Clearly MS compiler engineers have noticed something beneficial. On the
other hand, it seems a bit like flogging a dead horse to introduce
something that will likely cause at least one branch misprediction? I
suppose the idea is that a direct call (versus an indirect one) will
enable speculative execution within the impending function call itself?
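That is, something like this at source level, I'd guess (the compiler
presumably compares vptrs rather than typeids, and Node/ConstNode here are
made up purely to show the shape of the transform):

#include <typeinfo>

// Made-up hierarchy, just to illustrate the guard.
struct Node {
    virtual ~Node() {}
    virtual int eval() const { return 0; }
};
struct ConstNode : Node {
    int v;
    ConstNode(int v) : v(v) {}
    int eval() const { return v; }
};

// Guard on the profiled-hot dynamic type, then make a direct call that can
// be inlined and speculated into; otherwise fall back to normal dispatch.
int eval_guarded(const Node& n) {
    if (typeid(n) == typeid(ConstNode))                            // usually taken, well-predicted
        return static_cast<const ConstNode&>(n).ConstNode::eval(); // direct, inlinable call
    return n.eval();                                               // cold path: indirect call
}

The win, if any, comes from inlining and speculating into the hot target,
paid for by one (usually well-predicted) compare per call.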
Then again, heavyweight speculative execution is starting to lose favour
against a grouping of simple symmetrical cores on one die (a la Cell,
Kilocore, Niagara, etc). The idea there is that it doesn't matter if a
bubble appears, or a core stalls; if you partition an app to take
advantage of more than one execution thread, then it will (overall)
execute faster anyway. The silicon saved via simplicity is applied to
other areas such as massive on-die bandwidth. Of course, that doesn't
help single-threaded applications, but the industry is changing, albeit
slowly :)
Wither the Transputer? <g>
2 Cents