Compilation times and idiomatic D code

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Fri Jul 14 15:45:44 PDT 2017


On Thu, Jul 13, 2017 at 03:27:31PM -0700, H. S. Teoh via Digitalmars-d wrote:
> On Thu, Jul 13, 2017 at 04:16:50PM -0400, Steven Schveighoffer via Digitalmars-d wrote:
> [...]
> > http://www.schveiguy.com/blog/2016/05/have-your-voldemort-types-and-keep-your-disk-space-too/
> [...]
> 
> Whoa.  I just tried your 'horcrux' trick, and discovered that it
> causes a further reduction in executable size!
> 
> Original sizes:
> 	38144296 bytes (38.1 MB)
> 	61412056 bytes (61.4 MB)
> 
> After factoring out Voldemort structs as module-global private structs:
> 
> 	 8332352 bytes (8.3 MB)
> 	10376584 bytes (10.4 MB)
> 
> After further refactoring to use the "horcrux" trick:
> 	 8147392 bytes (8.1 MB)
> 	10124136 bytes (10.1 MB)
[...]

Here's a further update to the saga of combating ridiculously large
symbol sizes.

So yesterday I wrote a new module that also heavily uses UFCS chains.
When I linked my initial draft with the main program (particularly
with the UFCS chain that had led to the 600MB executable sizes seen
above), it caused another explosion in symbol size: one symbol
actually managed to reach 100MB, triggering a DMD termination
complaining about possible infinite template recursion. :-D  Funnier
still, after temporarily simplifying part of the chain to bring symbol
sizes down, I got the largest symbol below 100MB, but then ran into
linker segfaults and ELF errors because it was still ridiculously
huge.

Eventually, this drove me to refactor two Phobos functions that are
used heavily in my code, std.range.chunks and std.algorithm.joiner,
using the same "horcrux" technique (see Phobos PRs #5610 and #5611).
This, together with some further refactoring in my own code,
eventually brought executable sizes down to the 20MB range.
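For illustration, here's a minimal sketch of the underlying idea, with
made-up names; Steven's blog post linked in the quote above gives the
actual "horcrux" formulation, which is a bit more refined about
keeping the type hidden:

	// Before: a classic Voldemort type. The struct's mangled name
	// embeds the enclosing function, whose signature in turn embeds
	// the struct, so names balloon when such functions are chained.
	auto myChunks(R)(R input, size_t n)
	{
	    static struct Result
	    {
	        R source;
	        size_t chunkSize;
	        // ... range primitives ...
	    }
	    return Result(input, n);
	}

	// After: the implementation lives at module scope, so its
	// mangled name no longer embeds the function's signature;
	// callers still only ever see `auto`.
	private struct MyChunksImpl(R)
	{
	    R source;
	    size_t chunkSize;
	    // ... range primitives ...
	}

	auto myChunks2(R)(R input, size_t n)
	{
	    return MyChunksImpl!R(input, n);
	}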

Then an idea occurred to me: the reason these symbols got so large is
that the UFCS chain preserves *all* type information about every
component type used to build the final type, so by necessity the type
name has to encode all of this information unambiguously.  Now,
arguably, DMD's current mangling scheme is at fault because it
contains too many repeated components, but even disregarding that, the
fact remains that if you have 50+ components in your overall UFCS
chain, the symbol length cannot be less than 50*n, where n is the
average length of a single component's type name plus some overhead
for the mangling scheme's syntax.  Say n averages 20-25 characters,
and round it up to 35 to account for mangling syntax: you're still
looking at upwards of 1750 characters *minimum*.  Couple that with the
current O(n^2) / O(n^3) mangling scheme, and you easily reach
megabytes of symbol length (just squaring those 1750 characters
already puts you in the ~3MB ballpark).  We can compress the symbols
all we want, but there's a limit to how much compression will help: at
the end of the day, you still have to represent those 50+ components
*somehow*.
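You can actually watch this growth happen: every type has a .mangleof
property, and its length balloons as the chain deepens.  A quick
sketch (exact numbers depend on the compiler version):

	import std.algorithm : map;
	import std.range : iota;
	import std.stdio : writeln;

	void main()
	{
	    // Each link's type name embeds the previous link's full
	    // type name, so .mangleof grows superlinearly with depth.
	    auto r1 = iota(0, 10).map!(x => x + 1);
	    auto r2 = r1.map!(x => x + 1);
	    auto r3 = r2.map!(x => x + 1);
	    writeln(typeof(r1).mangleof.length);
	    writeln(typeof(r2).mangleof.length);
	    writeln(typeof(r3).mangleof.length);
	}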

But what if most of this type information is actually *unnecessary*?
To use a range example: if all you care about at the end of the chain
is that it's a forward range of ubyte, then why even bother with
multi-MB symbols encoding type information that's never actually used?
Maybe a little type erasure would help, by hiding those 50+ component
types behind an opaque runtime OO polymorphic interface.  Phobos does
have facilities for this, in the form of the InputRange, ForwardRange,
etc., interfaces in std.range.interfaces.  In my case, however, part
of the chain uses another generic type (a kind of generalization of 2D
arrays).  But either way, the idea is simple: at the appropriate point
in the UFCS chain, wrap the result so far in a class object that
inherits from a generic interface encoding only the primitives of the
2D array concept and the element type.  The class object itself, of
course, must still retain knowledge of the full-fledged type, but the
trick is that once this type erasure is inserted in the middle of the
chain, later components no longer have to encode the type names of
earlier components.
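For the plain range case, Phobos makes the erasure a one-liner.  A
minimal sketch (the functions here are placeholders, but
std.range.interfaces and inputRangeObject are the real Phobos pieces):

	import std.algorithm : filter, map;
	import std.range : iota;
	import std.range.interfaces : ForwardRange, inputRangeObject;

	// Downstream code is written against the interface, so its
	// symbols mention only ForwardRange!ubyte, not the full type of
	// the upstream chain.
	auto downstream(ForwardRange!ubyte r)
	{
	    return r.filter!(b => b != 0);
	}

	void main()
	{
	    // Upstream chain with a long, statically-known type name.
	    auto chain = iota(0, 256).map!(x => cast(ubyte)(x & 0xff));

	    // Type erasure: hide the chain behind a runtime interface.
	    ForwardRange!ubyte erased = inputRangeObject(chain);
	    auto result = downstream(erased);
	}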

So, tl;dr:

	// Before:
	auto r = input
	     .ufcsFunc1
	     .ufcsFunc2
	     .map!(e => e.ufcsFunc3.ufcsFunc4 ...)
	     ...
	     .ufcsFuncn
	     .flattenToRange
	     .rangeFunc1
	     .rangeFunc2
	     ...;

Result: 20MB executable size (600MB before the "horcrux" technique was
applied).

	// After:
	auto r = input
	     .ufcsFunc1
	     .ufcsFunc2
	     .map!(e => e.ufcsFunc3.ufcsFunc4 ...)
	     ...
	     .ufcsFuncn
	     .wrapWithInterface	// <--- type erasure happens here
	     .flattenToRange
	     .rangeFunc1
	     .rangeFunc2
	     ...;

Result: 8MB executable size (!).
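(wrapWithInterface above is just a placeholder name; for a plain range
pipeline such a helper can be a thin shim over the Phobos machinery
shown earlier, whereas my actual version wraps the 2D-array interface
instead:)

	// Hypothetical helper for the plain-range case:
	auto wrapWithInterface(R)(R r)
	{
	    import std.range.interfaces : inputRangeObject;
	    return inputRangeObject(r);
	}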

Of course, this comes at a runtime cost of an added level of indirection
(now you have to call a virtual function through the interface to
integrate with earlier components in the pipeline).  But at a 60%
reduction in executable size, I think it's a worthwhile tradeoff.  :-)


T

-- 
What's a "hot crossed bun"? An angry rabbit.

