Disappointing performance from DMD/Phobos

Jonathan M Davis newsgroup.d at jmdavisprog.com
Tue Jun 26 03:17:31 UTC 2018


On Monday, June 25, 2018 19:10:17 Manu via Digitalmars-d wrote:
> Some code:
> ---------------------------------
> struct Entity
> {
>   enum NumSystems = 4;
>   struct SystemData
>   {
>     uint start, length;
>   }
>   SystemData[NumSystems] systemData;
>   @property uint systemBits() const { return systemData[].map!(e => e.length).sum; }
> }
> Entity e;
> e.systemBits(); // <- call the function, notice the codegen
> ---------------------------------
>
> This property sums 4 ints... that should be insanely fast. It should
> also be something like 5-8 lines of asm.
> Turns out, that call to sum() is eating 2.5% of my total perf
> (significant among a substantial workload), and the call tree is quite
> deep.
>
> Basically, the inliner tried, but failed to seal the deal, and leaves a
> call stack 7 levels deep.
>
> Pipeline programming is hip and also *recommended* D usage. The
> optimiser must do a good job. This is such a trivial work loop, and
> with constant length (4).
> I expect 3 integer adds to unroll and inline. A call-tree 7 levels
> deep is quite a ways from the mark.
>
> Maybe this is another instance of Walter's "phobos begat madness"
> observation? The unoptimised callstack is mental. Compiling with -O trims
> most of the noise in the call tree, but it fails to inline the remaining
> work, which ends up 7 levels down a redundant call tree.

dmd's inliner is notoriously poor, but I don't know how much effort has
really been put into fixing the problem. I do recall it being argued several
times that the inliner should only be in the backend and that there shouldn't
be one in the frontend, but either way, the typical solution seems to be to
use ldc instead of dmd if you really care about the performance of the
generated binary.

I don't follow dmd PRs closely, but I get the impression that far more
effort gets put into feature-related work and bug fixes than into performance
improvements. Walter at least occasionally works on performance, but when he
talks about it, a number of folks seem to react negatively, arguing that his
time would be better spent on features and the like, since anyone who cares
about performance just uses ldc anyway.

So, all in all, the result is not great for dmd's performance. I don't know
what the solution is, though I agree that we're better off if dmd generates
fast code in general even if it's not as good as what ldc does.

Regardless, if you can provide simple test cases that clearly should be
generating far better code than they are, then at least there's a concrete
target for improvement rather than just "dmd should generate faster code,"
so there's something actionable to work from.
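
For example, a self-contained reduction along these lines (the file name and
the hand-written comparison function are illustrative, not from your code)
would give whoever picks it up something concrete to measure:
---------------------------------
// reduced.d -- compile with: dmd -O -inline -release reduced.d
import std.algorithm.iteration : map, sum;

struct Entity
{
    enum NumSystems = 4;
    struct SystemData { uint start, length; }
    SystemData[NumSystems] systemData;

    // The range-based version from your post.
    uint bitsPipeline() const { return systemData[].map!(e => e.length).sum; }

    // A hand-written baseline that should compile to essentially the same asm.
    uint bitsByHand() const
    {
        uint total = 0;
        foreach (sd; systemData)
            total += sd.length;
        return total;
    }
}

void main()
{
    Entity e;
    assert(e.bitsPipeline() == e.bitsByHand());
}
---------------------------------
Comparing the codegen of the two member functions (e.g. with obj2asm or
objdump) then shows exactly where the pipeline version stops inlining.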

- Jonathan M Davis


