Faster Virtual Method Dispatch

BCS BCS_member at pathlink.com
Sun Apr 23 13:23:54 PDT 2006


In article <e2gkms$rur$1 at digitaldaemon.com>, kris says...
>
>Craig Black wrote:
>> Most programmers do not realize the overhead involved with calling a method 
>> virtually.  A virtual method call is an O(1) operation: just a vtable 
>> lookup and an invocation through a function pointer.  However, we must also 
>> consider that the CPU has no way of performing branch prediction with vtable 
>> lookups, so the pipeline is trashed.  Thus, virtual method calls incur a 
>> large performance penalty.
>> 
>> Interestingly, I found that one of the MSVC++ 2005 compiler's profile-guided 
>> optimizations is replacing the vtable lookup with if statements for the most 
>> commonly called virtual methods followed by the vtable lookup, so that 
>> branch prediction can be performed in the majority of cases.
>> 
>> I do not consider myself an expert in the area of compiler optimization, but 
>> knowing the little that I do know I have come up with a simple approach that 
>> may improve performance.  This approach is described below.
>> 
>> Performing a binary search using if statements, and then invoking the 
>> methods directly rather than using function pointers, would allow the method 
>> dispatch to take advantage of pipelining.  Algorithmically, however, this is 
>> no longer O(1), but O(log n), which is bad.  So I would assume that it would 
>> only be faster if n (or the number of methods in the vtable) is lower than a 
>> given threshold.  This threshold would be determined by testing and would 
>> perhaps be different on different processors.  Thus, in order to ensure that 
>> this approach is always faster, a relatively low number should be used.
>> 
>> Like I said, I'm not an expert when it comes to low-level optimization 
>> strategies, so I don't know how well such an algorithm would work in practice. 
>> Comments are obviously welcome.
>> 
>> -Craig 
>> 
>> 
>
>
>I'm no expert either, Craig, yet it seems that there's more to it? An 
>indirect address lookup (via a vtable) doesn't trash the pipeline per 
>se; I forget the cycles involved but there's very little overhead at all 
>if the vtable is a "popular" one, or otherwise happens to be in the 

I'm not following your train of thought completely, but...
I don't think the cache hit is the issue; I think the issue is that the
function call is to an indirect address. The CPU might not be able to
safely predict where it's supposed to go until it gets all the way to the
function call. As a result, the instruction queue gets completely drained.

>cache lines. None of the other in-flight instructions have to be flushed 
>at that time. If the vtable is not cached, then a bubble will appear in 
>the pipeline while main-memory is addressed. But, again (as I understand 
>it), this doesn't invalidate any in-flight instructions; just feeds the 
>core less efficiently.
>
>On the other hand, a branch misprediction must invalidate a large number 
>of in-flight instructions; usually all of them (unless multi-path 
>speculative execution is occurring). The P4 is particularly maimed by 
>this, as compared to its peers, because of its particularly long 
>pipeline (which, in turn, was needed to reach the frequencies Intel 
>marketeers felt they needed to drive AMD off the map).
>
>Obviously I don't know enough about this, but it might be interesting to 
>research how one might introduce some locality-of-reference bias on the 
>more common calls made. For instance, some fancy runtime might rewrite 
>"common" calls to a dynamically constructed "global vtable". That would 
>perhaps avoid the misprediction concerns, whilst managing to retain the 
>globally popular set of lookups within cache?
>
>Clearly MS compiler-engineers have noticed something beneficial. On the 
>other hand, it seems a bit like flogging a dead-horse to introduce 
>something that will likely cause at least one branch misprediction? I 
>suppose the idea is that a direct-call (versus an indirect one) will 
>enable speculative execution within the impending function-call itself?
>
>Then again, heavyweight speculative execution is starting to lose favour 
>against a grouping of simple symmetrical cores on one die (a la Cell, 
>Kilocore, Niagara, etc). The idea there is that it doesn't matter if a 
>bubble appears, or a core stalls; if you partition an app to take 
>advantage of more than one execution thread, then it will (overall) 
>execute faster anyway. The silicon saved via simplicity is applied to 
>other areas such as massive on-die bandwidth. Of course, that doesn't 
>help single-threaded applications, but the industry is changing, albeit 
>slowly :)
>
>Whither the Transputer? <g>
>
>2 Cents

More information about the Digitalmars-d mailing list