Faster Virtual Method Dispatch
kris
foo at bar.com
Sun Apr 23 12:31:46 PDT 2006
Craig Black wrote:
> Most programmers do not realize the overhead involved with calling a method
> virtually. A virtual method call is an O(1) operation: just a vtable
> lookup and invoking a function pointer. However, we must also consider
> that the CPU has no way of performing branch prediction with vtable lookups,
> so the pipeline is trashed. Thus, virtual method calls incur a large
> performance penalty.
>
> Interestingly, I found that one of the MSVC++ 2005 compiler's profile-guided
> optimizations replaces the vtable lookup with if statements for the most
> commonly called virtual methods, falling back to the vtable lookup, so that
> branch prediction can be performed in the majority of cases.
>
> I do not consider myself an expert in the area of compiler optimization, but
> knowing the little that I do know I have come up with a simple approach that
> may improve performance. This approach is described below.
>
> Performing a binary search using if statements, and then invoking the
> methods directly rather than using function pointers, would allow the method
> dispatch to take advantage of pipelining. Algorithmically, however, this is
> no longer O(1), but O(log n), which is bad. So I would assume that it would
> only be faster if n (or the number of methods in the vtable) is lower than a
> given threshold. This threshold would be determined by testing and would
> perhaps be different on different processors. Thus, in order to ensure that
> this approach is always faster, a relatively low number should be used.
>
> Like I said I'm not an expert when it comes to low-level optimization
> strategies. I don't know how well such an algorithm would work in practice.
> Comments are obviously welcome.
>
> -Craig
>
>
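For concreteness, I read the binary-search dispatch as something along these
lines; a toy C++ sketch, where Shape/Circle/Square/Tri and the per-class
kind id are invented purely for illustration:

// Toy sketch only: the hierarchy and the "kind" id are made up.
struct Shape {
    enum Kind { K_Circle, K_Square, K_Tri };
    Kind kind;
    Shape(Kind k) : kind(k) {}
    virtual ~Shape() {}
    virtual double area() const = 0;   // ordinary virtual, for comparison
};

struct Circle : Shape {
    double r;
    Circle(double r) : Shape(K_Circle), r(r) {}
    double area() const { return 3.14159265 * r * r; }
};

struct Square : Shape {
    double s;
    Square(double s) : Shape(K_Square), s(s) {}
    double area() const { return s * s; }
};

struct Tri : Shape {
    double b, h;
    Tri(double b, double h) : Shape(K_Tri), b(b), h(h) {}
    double area() const { return 0.5 * b * h; }
};

// Dispatch by comparing the id (a binary search once the set grows), making
// direct, predictable, inlinable calls instead of one indirect call.
double area_dispatch(const Shape& s) {
    if (s.kind <= Shape::K_Square) {
        if (s.kind == Shape::K_Circle)
            return static_cast<const Circle&>(s).Circle::area();
        return static_cast<const Square&>(s).Square::area();
    }
    return static_cast<const Tri&>(s).Tri::area();
}

Whether a couple of compares plus a direct call actually beats one indirect
call is exactly the threshold-on-n question, I suppose.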
I'm no expert either, Craig, yet it seems that there's more to it? An
indirect address lookup (via a vtable) doesn't trash the pipeline per
se; I forget the exact cycle counts, but there's very little overhead at
all if the vtable is a "popular" one, or otherwise happens to be sitting
in the cache. None of the other in-flight instructions have to be flushed
at that time. If the vtable is not cached, then a bubble will appear in
the pipeline while main memory is accessed. But, again (as I understand
it), this doesn't invalidate any in-flight instructions; it just feeds
the core less efficiently.
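For the record, my mental model of a virtual call is roughly two dependent
loads plus an indirect call. A hand-waved C-style sketch (real vtable
layouts differ per compiler and ABI; VTable and Object below are just
stand-ins):

typedef double (*AreaFn)(const void* self);

struct VTable { AreaFn area; /* ...other slots... */ };
struct Object { const VTable* vptr; /* ...fields... */ };

double call_area(const Object* p) {
    const VTable* vt = p->vptr;   // load 1: fetch the vptr from the object
    AreaFn fn = vt->area;         // load 2: fetch the slot from the vtable
    return fn(p);                 // indirect call through the pointer
}

If either load misses you get the bubble; as I understand it, a flush only
happens when the indirect branch target itself is mispredicted.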
On the other hand, a branch misprediction must invalidate a large number
of in-flight instructions; usually all of them (unless multi-path
speculative execution is occurring). The P4 is particularly maimed by
this, compared to its peers, because of its exceptionally long pipeline
(which, in turn, was needed to reach the frequencies Intel marketeers
felt they needed to drive AMD off the map).
Obviously I don't know enough about this, but it might be interesting to
research how one might introduce some locality-of-reference bias for the
more commonly made calls. For instance, some fancy runtime might rewrite
"common" calls to use a dynamically constructed "global vtable". That
would perhaps avoid the misprediction concerns, whilst managing to retain
the globally popular set of lookups within the cache?
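Thinking out loud, maybe something of this shape? Everything below
(HotEntry, g_hot_table, lookup_hot) is invented; it's only meant to suggest
copying the hottest (vptr, slot) targets into one small table that stays
resident in cache:

typedef void (*Thunk)(void* self);

// One entry per "hot" (dynamic type, virtual slot) pair, filled in by some
// profiling runtime; the whole table is meant to fit in a cache line or two.
struct HotEntry {
    const void* vptr;   // identifies the dynamic type
    unsigned    slot;   // which virtual method
    Thunk       fn;     // target copied out of the real vtable
};

static HotEntry g_hot_table[8];
static unsigned g_hot_count = 0;

// Call-site helper: probe the shared hot table first, otherwise fall back to
// the ordinary per-class vtable dispatch the caller would have done anyway.
inline Thunk lookup_hot(const void* vptr, unsigned slot, Thunk fallback) {
    for (unsigned i = 0; i < g_hot_count; ++i)
        if (g_hot_table[i].vptr == vptr && g_hot_table[i].slot == slot)
            return g_hot_table[i].fn;
    return fallback;
}

Whether the probe would ever pay for itself, I've no idea; the point is more
about keeping the popular targets in one warm spot than about removing the
indirection.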
Clearly MS compiler engineers have noticed something beneficial. On the
other hand, it seems a bit like flogging a dead horse to introduce
something that will likely cause at least one branch misprediction? I
suppose the idea is that a direct call (versus an indirect one) will
enable speculative execution within the impending function call itself?
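That is, something like this at source level, I'd guess (the compiler
presumably compares vptrs rather than typeids, and Node/ConstNode here are
made up purely to show the shape of the transform):

#include <typeinfo>

// Made-up hierarchy, just to illustrate the guard.
struct Node {
    virtual ~Node() {}
    virtual int eval() const { return 0; }
};
struct ConstNode : Node {
    int v;
    ConstNode(int v) : v(v) {}
    int eval() const { return v; }
};

// Guard on the profiled-hot dynamic type, then make a direct call that can
// be inlined and speculated into; otherwise fall back to normal dispatch.
int eval_guarded(const Node& n) {
    if (typeid(n) == typeid(ConstNode))                            // usually taken, well-predicted
        return static_cast<const ConstNode&>(n).ConstNode::eval(); // direct, inlinable call
    return n.eval();                                               // cold path: indirect call
}

The win, if any, comes from inlining and speculating into the hot target,
paid for by one (usually well-predicted) compare per call.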
Then again, heavyweight speculative execution is starting to lose favour
against a grouping of simple symmetrical cores on one die (a la Cell,
Kilocore, Niagara, etc). The idea there is that it doesn't matter if a
bubble appears, or a core stalls; if you partition an app to take
advantage of more than one execution thread, then it will (overall)
execute faster anyway. The silicon saved via simplicity is applied to
other areas such as massive on-die bandwidth. Of course, that doesn't
help single-threaded applications, but the industry is changing, albeit
slowly :)
Wither the Transputer? <g>
2 Cents