Possible solution to template bloat problem?

Mon Aug 19 18:31:37 PDT 2013

On Tue, Aug 20, 2013 at 02:48:27AM +0200, Dicebot wrote:
> On Tuesday, 20 August 2013 at 00:34:38 UTC, Ramon wrote:
> >Well, I'm afraid that's what templates are. One (or the compiler)
> >fills them in and that's it.
> >
> >In other words: Templates are compile time while (real) generics
> >are run time. This basically comes down to have some way of
> >designating classes as, for instance, comparable and then either
> >running along the object chain comparing all built in objects
> >(with built in compare functionality) or having a compare
> >implemented (Of course, there is also arithmetic functions, etc.).

In that case, the "generics" you're talking about amounts basically to
OO polymorphism. You have the same machine code that can process diverse
types by using indirection to abstract away the implementation details.
This is no doubt useful, as OO itself proves, but it does come with a
cost: using indirection incurs a (small, but nevertheless non-zero)
performance hit. Inside inner loops, this can be a performance killer.

At the machine code level, it's actually not possible to use the same
code for, e.g., comparing two ints vs. comparing two floats. You need
different machine instructions for them, so there is no single piece of
code that can be reused for both types. You have to somehow switch
between them at runtime depending on what types are being passed in. (It
gets even hairier if you're comparing, say, ints and floats, in which
case additional instructions must be used for promoting one type to
another so that they are comparable.) So, say you call your function
"less". In order to actually run, you need one version for comparing
ints, another for comparing floats, etc.. This is what templates do.

Alternatively, you use indirection: int and float can be associated with
some static data structure that describes each type, say it has a
function pointer that, for ints, point to the function that compares
int, and for floats, point to the function that compares floats. Then
the caller doesn't actually directly call the low-level
int/float-specific functions, but they always look up the function
pointer. This is, in essence, how OO polymorphism works: each class has
a vtable with function pointers to the specific implementation of each
overloaded method. Then at runtime, you call whatever function the
vtable points to, thus achieving runtime genericity.

The problem with indirection is that it's expensive: given two objects
to be compared, you need to dereference the pointer to those objects,
then dereference the pointer to their respective vtables, then
dereference the function pointer in the vtables in order to call the
function.

Templates, OTOH, are compile-time bound: the compiler ascertains at
compile-time that your particular piece of code is comparing two ints,
so it bypasses all of the expensive runtime dereferencing, and directly
calls the function for comparing ints. The resulting code is inflexible,
in the sense that you can't change, at runtime, the arguments to floats
-- it would fail horribly -- but it avoids 3 pointer dereferences. When
inside an inner loop, this can mean the difference between smooth
animation and unusably jerky animation.

The cost, of course, is that if you need the same piece of code for
comparing both ints and floats, then the compiler has to generate two
copies of the code, one to handle ints, and the other to handle floats.

The saving grace, as Dicebot points out, is that if these copies of code
are small enough, they will be inlined, so you can even save on the cost
of a function call. This somewhat reduces the resulting code size -- you
save on function call instructions and stack push/pops, etc., but you're
still paying for the duplicated code.

This is the current state of the art.

Now, my dream is that one day, perhaps compilers will get smart enough
that you don't even need to worry about the distinction between
templates and runtime polymorphism anymore -- you specify what you want
to get done, and the compiler determines, based on characteristics of
the target machine and how the program as a whole will use that
particular piece of code, whether to use indirection or template
expansion. Or maybe even a mix of both, depending on the situation.

[...]
> Reminds me: how hard is writing own linker is again? :)

Honestly, I think it's about time linker technology is rethought and
developed further. Possible developements are automatic elision of
unused code sections (already implemented in some linkers), automatic
merging of identical sections (not sure if implemented yet -- may
require language support), link-time inlining, reordering of symbols to
increase code locality during execution (optimize for CPU caches), etc..

Or more ambitiously, better integration with compiler so that the linker
has access to compile-time structures to help it make decisions about
optimization. Present-day object file formats are too far along the
process to their executable form to permit many optimizations that could
in theory be performed by the linker, requiring instead hacks like weak
symbols, unused section GC, etc..

On the more mundane side, we need better algorithms for improving linker
performance. Current symbol resolution algorithms don't scale very well
when your object files are large or have large numbers of symbols.
Surely there are ways of improving the asymptotic complexity of these
things!

T

-- 
It is impossible to make anything foolproof because fools are so ingenious. -- Sammy