Inlining Code Test

Thu Dec 16 22:50:53 PST 2010

On Mon, 13 Dec 2010 01:50:50 +0300, Craig Black <craigblack2 at cox.net>  
wrote:

> The following program illustrates the problems with inlining in the dmd  
> compiler.  Perhaps with some more work I can reduce it to a smaller test  
> case.  I was playing around with a simple Array template, and noticed  
> that similar C++ code performs much better.  This is due, at least in  
> part, to opIndex not being properly inlined by dmd.  There are two sort  
> functions, quickSort1 and quickSort2.  quickSort1 indexes an Array data  
> structure. quickSort2 indexes raw pointers.  quickSort2 is roughly 20%  
> faster on my core i7.

Compiled with dmd v2.050/win32  -g -O -inline -release

First, I looked at the actual asm in a debugger, and I must say the
inlining is done very well. The code for the two versions is almost
identical, with a slight overhead in the Array case because there is an
extra level of indirection in data access, inlining or not.
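
To illustrate the extra level of indirection, here is a minimal sketch. The
Array struct below is a hypothetical stand-in (the template from the original
post is not reproduced in this message), and sumArray/sumPointer are names
made up for the example. Even with opIndex fully inlined, the Array version
has to fetch the data pointer from the struct before it can fetch the
element; whether that fetch is hoisted out of the loop is up to the
optimizer.

```d
import std.stdio;

// Hypothetical stand-in for the Array template discussed in the thread.
struct Array(T)
{
    T* data;
    size_t length;

    // Once inlined, a[i] becomes a.data[i]: load a.data, then the element.
    ref T opIndex(size_t i)
    {
        return data[i];
    }
}

// Indexes through the wrapper (one extra load to reach the data pointer).
int sumArray(ref Array!int a)
{
    int s = 0;
    foreach (i; 0 .. a.length)
        s += a[i];
    return s;
}

// Indexes a raw pointer directly.
int sumPointer(int* p, size_t n)
{
    int s = 0;
    foreach (i; 0 .. n)
        s += p[i];
    return s;
}

void main()
{
    int[] buf = [3, 1, 4, 1, 5];
    auto a = Array!int(buf.ptr, buf.length);
    writeln(sumArray(a));                     // 14
    writeln(sumPointer(buf.ptr, buf.length)); // 14
}
```

Both functions compute the same result; the only difference is how the
element address is reached.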

Second, I see anywhere from 3.3% to 6.7% difference in performance, but no
more than that. Tested on a Core2Duo E6300, Windows XP SP3. I increased the
number of iterations for benchmark!() to 5 to reduce the volatility of the
results. That's the only change I made to the source.

Third... Now here is a funny thing. The absolute times and the difference
between the implementations depend on how you run the program. I was
dumbfounded as to how that could matter, but the fact is that I get the
aforementioned average 5% difference if I run it from the command line as
"inline.exe". If I run it as "inline", without the extension, I get a
difference of around 15%, and the absolute times are notably smaller.

X:\d\tests\craig>inline.exe
Sorting with Array.opIndex: 6533
Sorting with pointers: 6264
4.11756 percent faster

X:\d\tests\craig>inline
Sorting with Array.opIndex: 5390
Sorting with pointers: 4674
13.2839 percent faster

Something like that. It's not a fluke. I tested it on my old Athlon XP with
XP SP2 and saw exactly the same picture (btw, the difference in % between
the implementations was about the same).
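
For reference, the "percent faster" figures in the output above are
consistent with taking the time saved relative to the slower (Array) run.
The benchmark source is not shown here, so the formula and the name
percentFaster below are assumptions chosen to match the printed numbers,
not a copy of the test program.

```d
import std.stdio;

// Time saved by the faster run, as a percentage of the slower run.
// Assumption: this is how the last line of each output block is derived.
double percentFaster(double slow, double fast)
{
    return (slow - fast) / slow * 100.0;
}

void main()
{
    writeln(percentFaster(6533, 6264)); // times from the "inline.exe" run
    writeln(percentFaster(5390, 4674)); // times from the "inline" run
}
```

With the times quoted above this gives roughly 4.12 and 13.28 percent.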

I ran both variants under stracent and found no difference, except that one
pointer on the stack was always off by 4 bytes when LeaveCriticalSection
and GetCurrentThreadId were called. This got me thinking. The only
observable difference is the length of the command line. And indeed,
renaming the program showed that only the length of the command line
matters, not its content.

Further tests suggest that some value is either aligned to 8 bytes or not,
depending on the length of the command line, and this makes all the
difference (which happens to be greater than the difference between the
sorting implementations). I couldn't find which value causes the slowdown,
though.
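
One way to probe this kind of effect is to print where a stack value lands
relative to an 8-byte boundary and compare runs started under program names
of different lengths. This is only a sketch: misalignment is a helper
invented for the example, the local variable is an arbitrary probe rather
than the value actually responsible, and a compiler may choose to align
such a local on its own.

```d
import std.stdio;

// Distance of an address from the previous multiple of `alignment`.
size_t misalignment(const(void)* p, size_t alignment)
{
    return cast(size_t) p % alignment;
}

void main(string[] args)
{
    double local = 0.0; // arbitrary stack-resident probe value

    // args[0] is the program name as it was typed on the command line.
    writefln("args[0] = %s (%s chars)", args[0], args[0].length);
    writefln("&local mod 8 = %s", misalignment(&local, 8));
}
```

If the reported remainder changes when the executable is renamed, the
command-line length is shifting the stack layout.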
