A modest proposal: eliminate template code bloat

Dmitry Olshansky dmitry.olsh at gmail.com
Sun Apr 8 11:49:43 PDT 2012


On 08.04.2012 21:24, H. S. Teoh wrote:
>
> Yeah, that's what I was thinking of. This would be a very big gain for
> the new AA implementation, for example. I wouldn't have to worry so much
> about template bloat if most of the instantiations are going to get
> merged anyway. :-)
>

Right, the advantage is that one doesn't have to use tricks. One can 
simply assume the compiler is doing its job. That's what happened with 
inlining some 10+ years ago.

>
> [...]
>>> Not really... string pooling can take advantage of overlapping
>>> (sub)strings, but I don't think you can do that with code. But I
>>> think your idea has a lot of merit. I'm for making linkers smarter
>>> than they currently are.
>>>
>>
>> Sorry, it's just me running ahead of train somewhat. Basically once
>> this initial version is in place I have one cool refinement for it in
>> mind. For now we just need to keep the hash function transitive and
>> associative for the gods sake. 128/64bit checksum please ;)
> [...]
>
> And what would the cool refinement be? :-)
>
>
OK, you sort of forced my hand.
I hope you have been warned about spoilers before ;)


The refinement is, of course, merging prefixes and suffixes.
And for that one needs to calculate hashes for all prefixes and all 
suffixes. I will define _all_ later on.

The first observation is that if you have calculated partial checksums 
for all prefixes, you also have the sums for all suffixes, and vice versa.
Namely:
// prefix ends at i, suffix starts at i
sum_prefix[i] = total - sum_suffix[i];

That is derived from the fact that:
total = sum_prefix[i] + sum_suffix[i];
(taking into account that + and - wrap around consistently in the 
modular arithmetic of the checksum)
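
To make this concrete, here is a tiny sketch in D (names made up, 
assuming a plain additive checksum, i.e. a sum of code bytes that wraps 
around at 64 bits):

ulong[] prefixSums(const(ubyte)[] code)
{
    // sums[i] is the checksum of the first i bytes; sums[$ - 1] is the total
    auto sums = new ulong[code.length + 1];
    foreach (i, b; code)
        sums[i + 1] = sums[i] + b; // wraps around mod 2^64
    return sums;
}

// the suffix starting at offset i costs nothing extra:
ulong suffixSum(const(ulong)[] sums, size_t i)
{
    return sums[$ - 1] - sums[i];
}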

Now take the original idea, substitute the single checksum with an array 
of partial sums and lengths (btw, lengths follow the same prefix<->suffix 
rule) and you know what this refinement is all about.
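
A rough sketch of the comparison itself (again, just an illustration; in 
practice a final memcmp of the actual bytes would confirm the match 
before anything gets merged):

// longest shared prefix of two functions, in whatever units the partial
// checksums were recorded in
size_t commonPrefix(const(ulong)[] sumsA, const(ulong)[] sumsB)
{
    size_t n = sumsA.length < sumsB.length ? sumsA.length : sumsB.length;
    size_t len = 0;
    foreach (i; 0 .. n)
    {
        if (sumsA[i] != sumsB[i])
            break;
        len = i; // the prefix covering the first i units still matches
    }
    return len;
}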

In fact, all of this is the easy part; the trick is fitting it into the 
time frame of compilation.

To that end one should do full duplicate elimination first, and only 
then mess with merging.
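
That first pass might look roughly like this (a naive sketch, nothing 
like the real object file plumbing; bucket by total checksum, then 
confirm byte-for-byte):

ulong checksum(const(ubyte)[] code)
{
    ulong total = 0;
    foreach (b; code)
        total += b; // additive, so the prefix/suffix rule above holds
    return total;
}

// for each function, the index of an earlier identical copy (or itself)
size_t[] foldDuplicates(const(ubyte[])[] funcs)
{
    auto target = new size_t[funcs.length];
    size_t[][ulong] buckets; // total checksum -> indices of kept functions
    foreach (i, code; funcs)
    {
        target[i] = i;
        foreach (j; buckets.get(checksum(code), null))
            if (funcs[j].length == code.length && funcs[j] == code)
            {
                target[i] = j; // byte-for-byte identical, fold i into j
                break;
            }
        if (target[i] == i)
            buckets[checksum(code)] ~= i;
    }
    return target;
}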

Then we see that in general we are about to do O((M1+...+Mn)^2) work, 
where n is the number of functions and Mi is the number of prefixes 
(= suffixes) we defined for each: every recorded prefix may end up being 
compared against every other one. The constant C in front of 
(M1+...+Mn)^2 is not that big, since it's mostly comparing checksums and 
lengths with only an occasional memcmp, but keep in mind that the total 
number of functions is large.

So we have to get around this monstrous workload. At the moment the only 
way I see of doing this is to use coarse-grained prefixes and to 
introduce some threshold on the maximum number of prefixes we account 
for. That is what defines the above-mentioned _all_ for prefixes.

Coarse-grained means we store partial checksums only on a fixed block 
boundary (say every 16 or 32 bytes), and only if that boundary aligns 
properly with instructions; if not, we just skip this prefix and move on.
(This also, hopefully, limits the memory used by the partial sums array.)
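
Something along these lines (a sketch; the 16-byte block size and the 
instruction-boundary test are placeholders for whatever the backend can 
actually provide):

enum blockSize = 16;

ulong[] coarseSums(const(ubyte)[] code,
                   bool delegate(size_t) atInstructionBoundary)
{
    ulong[] sums;
    ulong running = 0;
    foreach (i, b; code)
    {
        running += b;
        // record a partial sum only on an aligned block boundary that also
        // falls between two instructions; otherwise just skip this prefix
        if ((i + 1) % blockSize == 0 && atInstructionBoundary(i + 1))
            sums ~= running;
    }
    return sums;
}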

Another threshold is that we don't mess with partial sums for really BIG 
functions at all. (This might not be needed.)

I think there is some room for other heuristics to try out, but they are 
mostly along the same lines.

P.S. Damn, I could have done a nice paper on that... too late :)

-- 
Dmitry Olshansky

