C++ concepts, templates, reducing code bloat

bearophile bearophileHUGS at lycos.com
Mon Jul 20 15:41:33 PDT 2009


An interesting thread on Reddit:
http://www.reddit.com/r/programming/comments/92tnd/concepts_removed_from_c0x/


In the meantime, while I was doing some testing, Walter had already answered a question raised by "deong":
>Possibly. I don't really know D, and I'm not completely in the know on C++ concepts either. One objection to the approach you've demonstrated versus C++ concepts is that your code is littered with the checks. The concepts way would define the constraints externally, and client code would simply use the defined name of the concept. You could mix concepts freely without needing to figure out how to assert the union of all the constraints. But since D already has at least basic functionality, and C++ apparently isn't going to get it anytime soon, I say you can feel as smug as you like. :)<


This code, shown there by Downs, doesn't work:

bool compare(T)(T left, T right) if (is(typeof(left < right))) {
  return left < right;
}

because of this:
http://d.puremagic.com/issues/show_bug.cgi?id=2782

This is an answer to deong:

import std.stdio: writeln;

template Comparable(T) {
    enum bool Comparable = is(typeof(T.init < T.init));
}

bool compare(T)(T left, T right) if (Comparable!T) {
    return left < right;
}

void main() {
    writeln(compare(5, 6), " ", compare(6.5, 5.5));
    // These calls are rejected at compile time, because the two
    // arguments don't share a single type T:
    // writeln(compare("hello", 6), " ", compare(5, "hello"));
}

(__traits(compiles, ...) can also be used.)
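For instance, this minimal variation (compare2 is a hypothetical name of mine) expresses the same constraint inline:

bool compare2(T)(T left, T right) if (__traits(compiles, T.init < T.init)) {
    return left < right;
}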

----------------------

Regarding the compilation and use of generic code, different languages work in different ways.
D and C++ (and ShedSkin, which uses full type inference) compile a different function for each distinct set of template types.
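For example, this tiny D illustration (names are mine) compiles two separate functions into the binary:

// Each distinct set of argument types yields a separately compiled
// instance: twice!int and twice!double are two different functions.
T twice(T)(T x) { return x + x; }

void main() {
    auto a = twice(5);    // instantiates twice!int
    auto b = twice(5.0);  // instantiates twice!double
}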

Java uses a single function: even though it now has generics, they are erased at compile time, so at run time there's a single function that works on Objects.

Haskell used to (and maybe still does) create a dictionary for each generic function, generate code for each set of input types (even lazily at run time, if necessary), and then use that dictionary at run time to find the right compiled block of code for the function (this is not dynamic typing, but it doesn't look that far from it to me).
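Here is a rough D sketch of the dictionary idea (my own construction; GHC's actual machinery is more complex):

// The generic function receives a table of operations for its argument
// type, instead of being instantiated once per type:
struct OrdDict(T) {
    bool function(T, T) less; // one slot per operation of the "type class"
}

bool compareWithDict(T)(OrdDict!T dict, T left, T right) {
    return dict.less(left, right); // dispatch through the dictionary
}

void main() {
    auto intOrd = OrdDict!int(function(int a, int b) { return a < b; });
    assert(compareWithDict(intOrd, 5, 6));
}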

Dynamic languages like Python, Ruby and JavaScript (and Common Lisp, sometimes) keep a single version of the function, and each datum carries its own type at run time. This is a low-performance way to implement generic functions, but it's also simple to implement, use and understand.

Just-in-time compilers for Python (Psyco), JavaScript (TraceMonkey, V8, etc.; they use very different strategies: TraceMonkey is closer to what I'm describing here, while V8 uses something that's worse and better at the same time) and Lua create new instances of the functions at run time, and then use them.

C# acts in yet another interesting way, not too far from the Psyco strategy:
http://www.osl.iu.edu/publications/prints/2006/Gregor06:Concepts.pdf
>The adoption of the template inclusion model of compilation precludes separate compilation for C++ templates. In this C++ constrained generics (concepts) differ from C# and Java generics. In C# and Java the default is that all instances of a generic method share the same native code. However, C# has some flavor of the instantiation model: different code is generated for each different instantiation of a generic method whose type arguments are value types. To attain apparent separate compilation, this instantiation-specific code is generated at run time.<


In theory the D strategy is the one that leads to the fastest code at run time (and it doesn't require a VM, as the C# case does), but the binary has to contain all the compiled versions of a templated function, and this can grow the binary size a lot. A big binary can also be a problem because more code has to pass through the code half of the L1 cache (just 32 KB), and this can and does reduce performance.

I can see two possible solutions to this problem:
- Many functions in D programs aren't performance-critical. If you "compile" one of those templated functions into a single dynamically typed function, the performance of the code doesn't change, while the binary size is kept low (see the sketch after this list). I think this is close to what Java HotSpot does. LLVM, which is behind LDC, may be able to do this, and it can probably also instantiate templates at run time, but that looks like a better fit for a VM-based language, something different from the current D2 design (because it's generally not easy to know at compile time which templates must be compiled and which ones can be implemented with a dynamically typed piece of code; you need profile-guided compilation, or a VM that monitors the code at run time as in the Java/C# case).
- Another possible solution is to find identical blocks of asm code in the different functions generated from a single template, and put each such block only once inside the compiled binary. This requires splitting the functions into sections joined by unconditional jumps. You may think such jumps slow down the code (and sometimes they surely do, so you have to avoid jumping into the middle of inner loops). Some time ago I read an article about this strategy: it reduces binary size, and that reduces L1 code-cache misses enough to usually balance the slowdown caused by the jumps (in many examples there was even a performance gain).
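Here is a minimal D sketch of the first idea (my own construction, not an actual compiler transformation): a single dynamically typed function, using std.variant, replaces all 2^n instantiations of the sum template shown below, trading speed for binary size:

import std.stdio: writeln;
import std.variant: Variant;

// One compiled function handles every mix of int and double arguments,
// at the cost of run-time type tests:
double dynSum(Variant[] args...) {
    double total = 0.0;
    foreach (a; args) {
        if (a.type == typeid(int))
            total += a.get!int;
        else // here we assume the only other argument type is double
            total += a.get!double;
    }
    return total;
}

void main() {
    writeln(dynSum(Variant(1), Variant(1.5), Variant(2))); // prints 4.5
}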

To show an extreme and unrealistic example I have created this Python program, which defines a D function template and then calls it with every combination of double and integer arguments (2^n possibilities; here n = 12):


from itertools import product

# Skeleton of the generated D program; the %s slots receive the template
# parameter list, the function parameters, the sum expression, and the
# generated calls:
template = """\
version(Tango)
    import tango.stdc.stdio: printf;
else
    import std.stdio: printf;

float sum(%s)(%s) {
    return %s;
}

void main() {
%s
}"""

n = 12
s1 = ", ".join("T%d" % i for i in xrange(n))           # T0, T1, ...
s2 = ", ".join("T%d x%d" % (i, i) for i in xrange(n))  # T0 x0, T1 x1, ...
s3 = " + ".join("x%d" % i for i in xrange(n))          # x0 + x1 + ...

# One printf call for each of the 2^n int/double argument combinations;
# str(args) turns a tuple like (1, 1.5) into the argument list "(1, 1.5)":
template2 = """    printf("%%f\\n", sum%s);"""
s4 = "\n".join(template2 % str(args) for args in product([1, 1.5], repeat=n))

result = template % (s1, s2, s3, s4)
file("test2.d", "w").write(result)



Compilation timings, n = 12:
  DMD 2.031 (Windows):  compile time (s)   binary size (bytes)   RAM used
    Normal:                  6.82               928_796            136
    -O:                     17.03               947_228            138
  LDC (Pubuntu):        compile time (s)   binary size (bytes)   RAM used
    Normal:                 28.7             2_601_228             198
    -O3:                    26.7             1_115_888             159

On Pubuntu the LDC -O3 binary is 1_115_888 bytes; after strip it's 721_908 bytes, and zipped it shrinks to 85_207 bytes.


Later I tried a shorter D program, with similar results (the binary was even larger and the compilation time longer):

version(Tango)
    import tango.stdc.stdio: printf;
else
    import std.stdio: printf;

float sum(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11)(T0 x0, T1 x1, T2 x2, T3 x3, T4 x4, T5 x5, T6 x6, T7 x7, T8 x8, T9 x9, T10 x10, T11 x11) {
    return x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11;
}

template Tuple(T...) { alias T Tuple; }

void main() {
    auto v = Tuple!(1, 1.5);
    foreach (i0; v) foreach (i1; v) foreach (i2; v) foreach (i3; v)
    foreach (i4; v) foreach (i5; v) foreach (i6; v) foreach (i7; v)
    foreach (i8; v) foreach (i9; v) foreach (i10; v) foreach (i11; v)
        printf("%f\n", sum(i0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11));
}



This second D program, once compiled, produces asm like this (cleaned up a bit):

_D5test332__T3sumTdTiTdTiTiTdTdTdTiTdTdTiZ3sumFdidiidddiddiZf:
        enter   4,0
        mov -4[EBP],EAX
        fild    dword ptr 044h[EBP]
        fadd    qword ptr 048h[EBP]
        fadd    qword ptr 03Ch[EBP]
        fadd    dword ptr 038h[EBP]
        fadd    dword ptr 034h[EBP]
        fadd    qword ptr 02Ch[EBP]
        fadd    qword ptr 024h[EBP]
        fadd    qword ptr 01Ch[EBP]
        fadd    dword ptr 018h[EBP]
        fadd    qword ptr 010h[EBP]
        fadd    qword ptr 8[EBP]
        fadd    dword ptr -4[EBP]
        leave
        ret 048h

_D5test332__T3sumTdTiTdTiTiTdTdTdTiTdTdTdZ3sumFdidiidddidddZf:
        fild    dword ptr 048h[ESP]
        fadd    qword ptr 04Ch[ESP]
        fadd    qword ptr 040h[ESP]
        fadd    dword ptr 03Ch[ESP]
        fadd    dword ptr 038h[ESP]
        fadd    qword ptr 030h[ESP]
        fadd    qword ptr 028h[ESP]
        fadd    qword ptr 020h[ESP]
        fadd    dword ptr 01Ch[ESP]
        fadd    qword ptr 014h[ESP]
        fadd    qword ptr 0Ch[ESP]
        fadd    qword ptr 4[ESP]
        ret 050h


I don't think these two functions share blocks of asm code that can be moved away and unified.
But real code containing real templates has more block-level redundancy.

Bye,
bearophile


