C++ concepts, templates, reducing code bloat

Tue Jul 21 03:19:11 PDT 2009

bearophile wrote:
> An interesting thread on Reddit:
> http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/
> 
> Related to the compilation and use of generic code, various languages work in different ways.
> D/C++ (and ShedSkin, using full type inference) compile a different function for each different set of template types.
> 
> Java uses a single function, because even if it now has generics, at run time there's a single function that uses Objects.

[...]
> In theory D strategy is the one that leads to faster code at runtime (and doesn't require a VM like the C# case), but the binary has to contain all the compiled versions of a templated function, this can grow the binary size a lot. A big binary can be a problem also because there's more code that comes through the half-L1 cache (just 32 KB), this can and do reduce performance.
> 
> I can see two possible solutions to this problem:
> - Many functions in D programs aren't performance-critical. If you "compile" one of such templated functions into a single dynamically typed function the performance of the code doesn't change, while the binary size is kept low. I think this is what the JavaHot spot does. LLVM that's behind LDC may be able to do this, and it can also probably allow to instantiate templates at run-time, but this looks more fit for a VM-based language, something different from the current D2 design (because it's generally not easy to know at compile time what template must be compiled and what one can be implemented with a dynamically typed piece of code, you need a profiled-guided compilation, or a VM that monitors the code at runtime like in the Java/C# case).
> - Another possible solution is to find shared equal asm code in different functions coming from a single template function, and put such pieces of code only one time inside the compiled binary. This requires to split the functions in sections, and such sections have to be joined by nonconditional jumps. You may think such jumps slow down the code (and sometimes they surely slow down code, so you have to avoid juping in the middle of inner loops). Time ago I have read an article about this, it shows that this strategy also reduces binary size, and this reduces cache misses in L1-code enough to usually balance the slowdown caused by jumps (in many examples there was even a performance gain).

I commonly find that I need to use templates for integral types, and 
that's only because of the silly implicit casting rules from C.
For example, initially I write
foo(long x) { ... }
and this covers long, int, uint, short, ushort, byte, ubyte.

But ulong needs to be treated specially. So I add:
foo(ulong x) { ... }

But now, thanks to the silly implicit casting rules, foo(1) is now 
ambiguous! So I need to add overloads for all of the other integral 
types. That's ridiculous, so I make foo() a template. And I get code bloat.

If there was a way to say:
foo(explicit ulong x) { ... } // MUST really be ulong, don't allow 
implicit integral conversions

then a huge fraction of the code bloat templates would disappear.

(for 'explicit', substitute any keyword of your choice).