Yet another strike against the current AA implementation

Mon Apr 27 07:51:22 PDT 2009

dsimcha wrote:
[snip]
> 
> Output:
> Direct:  2343
> Virtual:  5695
> opApply:  3014
> 
> Bottom line is that these pretty much come in the order you would expect them to,
> but there are no particularly drastic differences between any of them.  To put
> these timings in perspective, 5700 ms for 1 billion iterations is roughly (on a
> 2.7 GHz machine) 15 clock cycles per iteration.  How often does anyone really have
> code that is performance critical *and* where the contents of the loop don't take
> long enough to dwarf the 15 clock cycles per iteration loop overhead *and* you
> need the iteration to be polymorphic?

I edited this code to work with ldc (D1) + Tango, and saw the Direct and opApply 
cases generate identical code (inc, cmp, jne, with the loop counter in a 
register) [1], so they're equally fast (modulo process scheduling randomness).
Virtual was roughly 10 times slower than either on my machine (also with ldc).

Unfortunately, I can't directly compare timings between ldc and dmd, because 
dmd is likely at a disadvantage due to being 32-bit in a 64-bit world.

Although... the Virtual case takes about equal time with ldc- and dmd-compiled 
code, so maybe the slowness of Direct/dmd compared to Direct/ldc (the dmd code 
is a factor of 3 slower) is due to dmd apparently not register-allocating the 
loop variable.

The opApply case was another factor of 2 slower than Direct with dmd on my 
machine, probably because opApply and the loop body don't get inlined.
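
For reference, a minimal sketch of why inlining matters so much for opApply. 
This is not dsimcha's snipped benchmark: Range, sumForeach and sumLowered are 
hypothetical names, and the code assumes a current D2 compiler rather than 
D1 + Tango. A foreach over a type with opApply turns the loop body into a 
delegate, so unless the compiler inlines both opApply and that delegate, every 
iteration pays for an indirect call.

```d
// Hypothetical illustration only; not the code that was timed above.
struct Range
{
    uint n;

    // foreach over a Range calls this, passing the loop body as a delegate.
    int opApply(scope int delegate(ref uint) dg)
    {
        for (uint i = 0; i < n; i++)
        {
            // A non-zero result means the body did a break/return/goto.
            if (auto r = dg(i))
                return r;
        }
        return 0;
    }
}

ulong sumForeach(uint n)
{
    ulong total = 0;
    foreach (i; Range(n))   // the body below becomes a delegate
        total += i;
    return total;
}

ulong sumLowered(uint n)
{
    ulong total = 0;

    // Roughly what the compiler builds from the foreach body above.
    int bodyDg(ref uint i)
    {
        total += i;
        return 0;           // 0 = keep iterating
    }

    Range(n).opApply(&bodyDg);
    return total;
}

void main()
{
    // Both forms compute the same sum of 0 .. 999.
    assert(sumForeach(1000) == sumLowered(1000));
}
```

If only opApply is inlined, the call through dg stays in the loop. If both are 
inlined, the loop can collapse to the same counter loop as the direct version.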

It seems gdc is the only compiler to realize the first loop can be completely 
optimized away. It's again about equally fast for Virtual, but for opApply it's 
roughly a factor of 3 slower than ldc; it seems to inline only opApply itself, 
not the loop body.

[1]: -O3 -release (with inlining), x86_64