dmd codegen improvements

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Thu Aug 20 16:56:29 PDT 2015


On Tue, Aug 18, 2015 at 04:30:26PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/18/2015 4:05 PM, H. S. Teoh via Digitalmars-d wrote:
> >Maybe when I get some free time this week, I could look at the
> >disassembly of one of my programs again to give some specific
> >examples.
> 
> Please do.

Rather than try to reduce one of my programs into a self-contained test
case, I decided instead to write a contrived range-based program that
exemplifies the limitations of dmd as I have observed. The program code
follows; the comments at the end describe the results, analysis, and my
conclusions.

I didn't include the assembly listings, as it's rather long, but if
people are interested I'll trim them down to the parts of interest and
post them in a follow-up post.

----------------------snip-----------------------
/* Crude benchmark for comparing dmd/gdc output. */

import std.algorithm : map, filter, reduce;
import std.conv : to;
import std.range : iota;
import std.stdio : writeln;

auto fun(int n)
{
	return iota(n)
		.map!(a => a*7)
		.filter!(a => a % 2 == 0)
		.reduce!((a,b) => a/2 + b);
}

void main() {
	writeln(fun(100_000_000));
}

/* RESULTS:

Compiled with:

	dmd -release -O -inline test.d -oftest.dmd
	gdc -frelease -finline -O3 test.d -o test.gdc

(dmd git HEAD, gdc 5.2.1)

Execution times:

	% time test.dmd ; time test.gdc
	1399999944

	real	0m0.628s
	user	0m0.627s
	sys	0m0.001s
	1399999944

	real	0m0.168s
	user	0m0.167s
	sys	0m0.000s
	%

As can be seen, the executable produced by gdc runs about 3.7 times faster than
the executable produced by dmd. Why?

Looking at the disassembly, the first thing that stands out is that gdc has
inlined the call to writeln, whereas dmd calls a separate function. While this
isn't the bottleneck, it gives a hint of the things to come. We look next at
fun(), which in both cases are standalone functions.

The dmd version of fun() is pretty straightforward: it calls iota() to create
the range, then map(), then filter(), and finally reduce() where most of the
work takes place. This is pretty much a direct translation of the code.

The gdc version of fun() is markedly different. We immediately notice that the
only function call in it is a call to enforce(). Not only the first level calls
to iota, map, filter, reduce have been inlined, pretty much their *entire* call
trees have been inlined, except for the call to enforce().

Here I'd like to highlight the fact that in the dmd version of the code, the
call to iota() has another function call to the ctor of the returned range. So
we see that gdc has inlined two levels of function calls where dmd has inlined
none, even though one would expect that with -inline, at least the call from
iota() to the ctor of the range should have been inlined, since it's the only
place where that ctor would be called; iota() itself being merely a thin
wrapper around it. (The ctor itself is also pretty simple; I'm not sure why dmd
fails to inline it.)

Similar remarks apply to the calls to map() and filter() as well.

Now let's look at reduce(), which is where the actual action takes place. The
dmd version, of course, involves a separate function call, which in the grand
scheme of things isn't all that important, since it's only a single function
call. However, a look at the disassembly of reduce() shows that dmd has not
inlined the calls to .empty, .front, and .popFront. In fact, the function calls
yet another function -- reduceImpl -- where the main loop sits. Inside this
main loop, .empty, .front, and .popFront are again called with no inlining --
even though .empty has a trivial body, .front involves only 1 multiplication,
and .popFront only 1 multiplication and a single odd/even test.  On top of
this, each of these nested function calls involve a certain amount of
boilerplate: twiddling with the stack registers, shuffling call arguments
about, etc., that add up to quite a large overhead in reduceImpl's inner loop.

The gdc version, by contrast, inlines *everything*, except the call to
enforce() which is outside the inner loop. This aggressive inlining allowed gdc
to trim the loop body down to about only 18 instructions with no function
calls.  While the dmd inner loop itself has only 15 instructions, it involves 3
function calls, with .front having 8 instructions, .empty also 8 instructions,
and .popFront 13 instructions, making a total of 44 instructions per iteration.
A significant percentage of these instructions are function call boilerplate.
The entire inner loop in the gdc version would fit in about 4-5 CPU cache
lines, whereas the dmd version would require a lot more.

To dmd's credit, it did manage to inline the nested calls in .empty, .front,
and .popFront(), which would have involved more function calls when no inlining
at all is done (each wrapper range forwards the calls to the next). This
probably helped to reduce the cost of running their respective function bodies.
However, this isn't quite enough, since the overhead of 3 function calls in the
inner loop is pretty expensive when the cost could have been eliminated
completely, as gdc had done.

*/
----------------------snip-----------------------


T

-- 
Political correctness: socially-sanctioned hypocrisy.


More information about the Digitalmars-d mailing list