Inlining Code Test
Nick Voronin
elfy.nv at gmail.com
Thu Dec 16 22:50:53 PST 2010
On Mon, 13 Dec 2010 01:50:50 +0300, Craig Black <craigblack2 at cox.net>
wrote:
> The following program illustrates the problems with inlining in the dmd
> compiler. Perhaps with some more work I can reduce it to a smaller test
> case. I was playing around with a simple Array template, and noticed
> that similar C++ code performs much better. This is due, at least in
> part, to opIndex not being properly inlined by dmd. There are two sort
> functions, quickSort1 and quickSort2. quickSort1 indexes an Array data
> structure. quickSort2 indexes raw pointers. quickSort2 is roughly 20%
> faster on my core i7.
Compiled with dmd v2.050/win32 -g -O -inline -release
First, I looked at the actual asm in a debugger, and I must say the
inlining is done very well. The code for the two versions is almost
identical, with a slight overhead in the Array case because there is an
extra level of indirection in data access, inlining or not.
Second, I see anywhere from a 3.3% to a 6.7% difference in performance,
but no more than that. Tested on a Core2Duo E6300, Windows XP SP3. I
increased the number of iterations for benchmark!() to 5 to reduce the
volatility of the results. That's the only change I made to the source.
Third... Now here is a funny thing. The absolute times and the difference
between the implementations depend on how you run the program. I was
dumbfounded as to how that could matter, but the fact is that I get the
aforementioned average 5% difference if I run it from the command line as
"inline.exe". If I run it as "inline", without the extension, I get a
difference of around 15%, and the absolute times are notably smaller.
X:\d\tests\craig>inline.exe
Sorting with Array.opIndex: 6533
Sorting with pointers: 6264
4.11756 percent faster
X:\d\tests\craig>inline
Sorting with Array.opIndex: 5390
Sorting with pointers: 4674
13.2839 percent faster
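
For reference, the percentages above can be reproduced from the two
printed times. This is only a sketch in Python: the formula is inferred
from the figures in the output, not taken from the benchmark source, and
the name percent_faster is made up for illustration.

```python
def percent_faster(slow_time, fast_time):
    """Relative speedup of fast_time over slow_time, in percent.

    Assumption: the formula is inferred from the printed numbers,
    not copied from the original benchmark code.
    """
    return (slow_time - fast_time) / slow_time * 100.0


# Times copied from the two runs shown above: (Array.opIndex, pointers).
runs = {
    "inline.exe": (6533, 6264),
    "inline": (5390, 4674),
}

for command, (array_time, pointer_time) in runs.items():
    speedup = percent_faster(array_time, pointer_time)
    print(f"{command}: {speedup:.4f} percent faster")
```

Note that the baseline is the slower (Array.opIndex) time, so a shorter
absolute run makes the same gap look larger in percent.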
Something like that. It's not a fluke. I tested it on my old AthlonXP with
XP SP2 and saw exactly the same picture (btw, the difference in % between
the implementations was about the same).
I ran both variants under stracent and found no difference, except that
one pointer on the stack at the time LeaveCriticalSection and
GetCurrentThreadId are called was always off by 4 bytes. This got me
thinking. The only observable difference is the length of the command
line. And indeed, renaming the program showed that only the length of the
command line matters, not its content.
Further tests suggest that some value is either aligned to 8 bytes or not,
depending on the length of the command line, and this makes all the
difference (which happens to be greater than the difference between the
two sorting implementations). I couldn't find which value causes the
slowdown, though.
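
To illustrate how a command-line length alone could flip 8-byte
alignment, here is a toy model in Python. Everything in it is an
assumption made for illustration: the stack top address, the fixed offset,
and the idea that the command line is stored just above the value with
4-byte rounding. The real layout was not determined.

```python
def is_aligned(address, alignment=8):
    """True if address is a multiple of alignment."""
    return address % alignment == 0


def value_address(stack_top, command_line, offset=0x40):
    """Address of some value placed below a copy of the command line.

    Hypothetical layout: the command line (plus a terminating NUL) is
    rounded up to a multiple of 4 bytes, and the value of interest sits
    a fixed offset below it.
    """
    reserved = (len(command_line) + 1 + 3) // 4 * 4
    return stack_top - reserved - offset


STACK_TOP = 0x0012FFF0  # hypothetical, 16-byte aligned

for command in ("inline", "inline.exe"):
    address = value_address(STACK_TOP, command)
    state = "aligned" if is_aligned(address) else "misaligned"
    print(f"{command!r}: value at {address:#010x}, {state} to 8 bytes")
```

In this model the two command lines shift the value by 4 bytes, which is
the same size of shift seen in the stack pointer, and only one of the two
addresses ends up on an 8-byte boundary.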