backporting features to D1

Christopher Wright dhasenan at gmail.com
Sun Oct 12 07:59:18 PDT 2008


Frank Benoit wrote:
 > How can you guess anything here?
 >
 > Old behavior == just the direct call, no extra cost
 > New behavior == same direct call plus a heap allocation
 >                 with unbound cost

There are plenty of randomized algorithms with potentially unbounded 
cost that are nonetheless in everyday use -- rejection sampling, for 
instance, has no worst-case bound at all, only a good expected running 
time. In practice, potentially unbounded operations either complete in 
a reasonable amount of time or get replaced.

The major reason that a heap allocation incurs a potentially unbounded 
cost is that allocation can trigger a collection, and a collection is 
potentially unbounded in duration because of user-defined destructors. 
If you write no destructors, then the cost of garbage collection is 
bounded by the amount of allocated memory. If your destructors complete 
in constant time, the same bound holds.
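
To make the distinction concrete, here is a minimal D sketch (the class 
names are invented for illustration): a type with no destructor costs 
the collector a bounded amount of work per instance, while a 
user-defined destructor puts arbitrary code inside the collection pause.

---
class Cheap
{
    int payload;
    // No destructor: reclaiming an instance is O(1) for the GC,
    // so total collection cost is bounded by allocated memory.
}

class Expensive
{
    int[] data;
    ~this ()
    {
        // User code runs during collection. If this is slow --
        // walking a large structure, doing I/O -- the GC pause is
        // slow with it, and the allocation that triggered the
        // collection has no useful bound.
        foreach (x; data) {}
    }
}
---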

It's like saying "The cost of invoking a method is unbounded." 
Technically, this is true; but if you're trying to benchmark the runtime 
or the language, that isn't the truth you're interested in.

 > There is nothing like a typical application.

This argument is a bit solipsistic, and not remotely useful. You simply 
lack the data to determine which library types and functions are used, 
and with what frequency.

 > There is nothing like a
 > typical calling rate to full closures. Either you use them or you don't.

Again, you simply lack sufficient data. If an application uses closures, 
it might create one every minute or one every second or a thousand per 
second, on average. You can track this.

 > It depends on your code. So even if you port some code and do some
 > tests, there is no significant result for my code.
 >
 > There _can_ be 100x performance hit. Code that was allocation free is
 > now no more. That is super bad.

Well then, how about a microbenchmark?

I tried creating a delegate in a tight loop and executing it -- 10 
million iterations took 0.506 seconds using D1. Using closures in D2, it 
took 3.958 seconds. This is doing nothing but allocating and using closures.

Using more local variables would result in more allocation, but with 5 
integers instead of 1, the time only increased to 5.484 seconds. That's 
still 1.8 million delegate allocations per second. (If you have a very 
large struct on the stack, though, this cost can increase significantly, 
but that's a potential performance issue in itself.)

Considering these results, if overzealous heap activity due to delegates 
is the main issue you have to worry about, your application must use a 
very large number of closures.

While this does violate Tango's contracts about heap allocation, once 
scope delegates are available, it won't take much work to replace 
"delegate" with "scope delegate" everywhere in Tango.
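
For comparison, a before/after sketch of what that replacement would 
look like, assuming the "scope" storage class on delegate parameters 
works as proposed (hypothetical syntax at the time of writing):

---
// Plain delegate parameter: the compiler must assume dg can escape,
// so any locals it captures are copied to a heap-allocated frame.
void each (void delegate(int) dg);

// Scope delegate parameter: the callee promises dg will not escape,
// so the caller's frame can stay on the stack -- no allocation.
void each (scope void delegate(int) dg);
---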


The benchmarks:
---
// d1
int count = 0;

void foo (void delegate(int) dg)
{
     dg (1);
     dg (2);
}

void main ()
{
     uint max = 10_000_000;
     for (int j = 0; j < max; j++)
     {
         foo (makeDelegate);
     }
}

// x lives at module scope, so the delegate literal below can refer
// to it without capturing a stack frame -- no heap allocation.
int x = 2;
void delegate(int) makeDelegate ()
{
     return (int i)
         {
             count += i;
             count += x;
         };
}
---

---
// d2
int count = 0;

void foo (void delegate(int) dg)
{
     dg (1);
     dg (2);
}

void main ()
{
     uint max = 10_000_000;
     for (int j = 0; j < max; j++)
     {
         foo (makeDelegate);
     }
}

void delegate(int) makeDelegate ()
{
     int x = 2; // a stack local: capturing it forces a heap-allocated closure
/* // Uncomment to see the effects of additional local variables.
     int a = 2 * x;
     int b = 2 + a;
     int c = (2 - 6 * a) + b;
     int d = x * (3 + c);
*/
     return (int i)
         {
             count += i;
             count += x;
/*
             count += a;
             count += b;
             count += c;
             count += d;
*/
         };
}
---

The test machine:
Dell Optiplex GX110
Processor: Pentium III clocked at 1GHz (997.473 MHz according to 
/proc/cpuinfo)
Memory: 128MB; unknown type



More information about the Digitalmars-d mailing list