reduce mangled name sizes via link-time symbol renaming

Fri Jan 26 19:09:08 UTC 2018

On Fri, Jan 26, 2018 at 08:34:50AM +0100, Johannes Pfau via Digitalmars-d wrote:
[...]
> What is the benefit of using link-time renaming (a linker specific
> feature) instead of directly renaming the symbol in the compiler? We
> could be quite radical and hash all symbols > a certain threshold. As
> long as we have a hash function with strong enough collision
> resistance there shouldn't be any problem.

I think this is something worthwhile to implement, or at least try out.
Huge symbols have been an ongoing source of trouble in D code, esp. when
there's heavy template usage.  Even after Rainer's symbol backref PR was
merged, which largely alleviated the recursive symbol bloat problem, we
still have cases like object.__switch that need to be addressed.

> AFAICS we only need the mapping hashed_name ==> full name for
> debugging. So maybe we can simply stuff the full, mangled name somehow
> into DWARF debug information? We can even keep DWARF debug information
> in external files and support for this is just being added to GCC's
> libbacktrace, so even stack traces could work fine.
[...]

I dunno, I'm skeptical that a 10,000-character symbol is of any use to
anyone, even for debugging. I mean, what are you going to do with it?
Visually scan 10,000 characters to see if it's the same symbol as
another 10,000-character symbol in the program? If the only way to make
practical use of it is to use a program to compare it, then substituting
it with a hash is not any different.

It seems to me that the most useful parts of a long symbol are basically
its initial segment, which is usually the module name, useful for
narrowing down where the symbol came from, and the ending segment,
usually the last symbol(s) of a UFCS chain, or some argument types,
useful for determining the function name, or which overload is being
called. Given a long enough symbol, the middle portion is pretty much
never looked at; it might as well be random characters.  Which suggests
the following scheme: if a symbol S exceeds N characters, for a
suitably-chosen N (I'd say somewhere around 500 or 1000, as a rough
initial stab), then replace it with:

	S[0 .. 80] ~ hashOf(S) ~ S[$-80 .. $]

This gives you 160 human-readable characters of the most useful parts of
the symbol, with the largely-useless middle part replaced with a
fixed-length hash, so in the worst case, the symbol will be around 2-3
lines long and no more.
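
For concreteness, here's a rough sketch of what that could look like as
a standalone D function. This is not compiler code: compressSymbol, its
parameter names, and the defaults are all made up for illustration, and
I'm assuming a hex-encoded SHA-1 from std.digest as the fixed-length
hash (any hash with good enough collision resistance would do).

```d
import std.digest : toHexString;
import std.digest.sha : sha1Of;

/// Hypothetical helper, not an existing compiler or druntime function.
/// Symbols longer than `threshold` keep their first and last `keep`
/// characters; the middle is replaced by a hash of the whole symbol.
string compressSymbol(string s, size_t threshold = 500, size_t keep = 80)
{
    // The head and tail slices must not overlap.
    assert(threshold >= 2 * keep);

    if (s.length <= threshold)
        return s;

    // Assumption: SHA-1, hex-encoded, which is always 40 characters.
    auto hex = toHexString(sha1Of(s));

    return s[0 .. keep] ~ hex[].idup ~ s[$ - keep .. $];
}

void main()
{
    import std.array : replicate;

    // A fake 10,000+ character symbol with a recognizable head and tail.
    string longSym = "_D3foo3bar" ~ replicate("x", 10_000) ~ "FiZv";
    auto c = compressSymbol(longSym);

    assert(c.length == 80 + 40 + 80);
    assert(c[0 .. 10] == "_D3foo3bar");
    assert(c[$ - 4 .. $] == "FiZv");

    // Short symbols pass through untouched.
    assert(compressSymbol("_D3foo3barFiZv") == "_D3foo3barFiZv");
}
```

Note that the hash is taken over the entire symbol, not just the
dropped middle, so two symbols that differ anywhere end up with
different compressed names (barring hash collisions).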

I chose 80 arbitrarily; it can be longer or shorter, but it's
approximately the length of 1 line of code, which presumably should be
enough to uniquely identify the source module of the symbol as well as
the last function name / parameter types.  Perhaps it can be increased
to about 200 or so, give or take, so that compressed symbols come out
close to N characters long. Or N can be reduced to match 160 plus the
ASCII-encoded size of the hash.
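
To put numbers on that, here's the length arithmetic as a quick
compile-time check. The 40-character hash length is an assumption (a
hex-encoded 160-bit digest), and the function names are just for
illustration.

```d
// Assumed: hex-encoded 160-bit digest, i.e. 40 ASCII characters.
enum size_t hashLen = 40;

// Length of a compressed symbol when `keep` characters survive per side.
size_t compressedLength(size_t keep) { return 2 * keep + hashLen; }

// Largest `keep` such that the compressed symbol is at most n characters.
// Only meaningful for n >= hashLen.
size_t keepFor(size_t n) { return (n - hashLen) / 2; }

static assert(compressedLength(80) == 200);   // the 80/80 split
static assert(compressedLength(200) == 440);  // roughly N for N = 500
static assert(keepFor(500) == 230);           // exact fit for N = 500

void main() {}
```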

T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi