Escape analysis (full scope analysis proposal)

Wed Nov 12 21:53:50 PST 2008

Michel Fortin wrote:
> On 2008-11-12 10:02:02 -0500, Andrei Alexandrescu 
> <SeeWebsiteForEmail at erdani.org> said:
> 
>> Michel Fortin wrote:
>>> On 2008-11-09 10:10:03 -0500, Andrei Alexandrescu 
>>> <SeeWebsiteForEmail at erdani.org> said:
>>> So my first point is that since we have a garbage collector in D, and 
>>> moreover since we're likely to get one heap per thread in D2, we 
>>> don't need dynamic regions. The remaining regions are: 1) the shared 
>>> heap, 2) the thread-local heap, 3) All the stack frames; and you 
>>> can't allocate other stack frames than the current one. Because none 
>>> of these regions require a handle to allocate into, we (A) don't need 
>>> region handles.
>>>
>>> We still have many regions. Beside the two heaps (shared, 
>>> thread-local), each function's stack frame, and each block within 
>>> them, creates a distinct memory region. But nowhere we need to know 
>>> exactly which region a function parameter comes from; what we need to 
>>> know is which address outlives which pointer, and then we can forbid 
>>> assigning addresses to pointers that outlive them. All we need is a 
>>> relative ordering of the various regions, and for that we don't need 
>>> to attach *names* to the regions so that you can refer explicitly to 
>>> them in the syntax. Instead, you could say something like "region of 
>>> (x)", or "region of (*y)" and that would be enough.
>>
>> But how do you type then the assignment example?
>>
>> void assign(int** p, int * r) { *p = *r; }
>>
>> How do you reflect the requirement that r's region outlives *p's region?
>>
>> But that's not even the point. Say you define some notation, such as:
>>
>> void assign(int** p, int * r) if (region(r) <= region(p));
>>
>> But the whole point of regions was to _simplify_ notations like the 
>> above into:
>>
>> void assign(region R)(int*R* p, int *R r);
>>
>> So although you think you simplified things by using region(symbol) 
>> instead of symbolic names, you complicated things. The compiler still 
>> needs to infer regions for each value, so it is as complicated as a 
>> named-regions compiler, and in addition you require the user to write 
>> bulkier expressions because you disallow use of symbols. So everybody 
>> is worse off. Note how in the example using a symbolic region the 
>> outlives relationship is enforced implicitly by using the same symbol 
>> name in two places.
> 
> Everywhere I said there was no need for named regions, I also said named 
> regions could be kept to ease the syntax. That said, I'm not so sure 
> named regions are that good at simplifying the syntax. In your assign 
> example above, the named-region version has an error: it forces the two 
> pointers to be of the same region. That could be fine, but, assuming 
> you're assigning to *p, it'd be more precise to write it like that:
> 
>     void assign(region R1, region R2)(int*R1* p, int*R2 r) if (R1 <= R2);

No, the code is correct as written (without the if). You may want to 
reread the paper with an eye for region subtyping rules. This partly 
backs up my point: understanding region analysis may be quite a burden 
for the average programmer. Even you, who took pains to think through 
everything and absorb the paper, are having trouble. And me too to be 
honest :o).

> Once we get there, I think the no-named region syntax is better.

This is invalidated by the wrong assertion above.

> That 
> said, for the swap example, where both values need to share the same 
> region, the named region notation is simpler:
> 
>     void swap(region R)(int*R a, int*R b);
>     void swap(int* a, int* b) if (region(a) == region(b));

No, for that swap there is no need to specify any region. You can swap 
ints in any two regions. Probably you meant to use int** throughout.

> But I'd argue that most of the time regions do not need to be equal, but 
> are subset or superset of each other, so reusing variable names makes 
> more sense in my opinion.

Don't forget that using a region name twice may actually work with two 
different regions, so far as they are in a subtyping relationship. 
Region subtyping is key to both simplifying code and to understanding 
code after simplification.

> In any case, I prefer a notation where regions constrains are attached 
> directly to the type instead of being expressed somewhere else. 
> Something like this (explained below):
> 
>     void assign(int*(r)* p, int* r) { *p = r; }
>     void swap(ref int*(b) a, ref int*(a) b);

Sure. I'm sure there's understanding that that doesn't make anything any 
simpler or any easier to implement or understand. It's just a minor 
change in notation, and IMHO not to the better.

> Here, a parenthesis suffix after a pointer indicates the region 
> constrain of the pointer, based on the region of another pointer.

I thought it means pointer to function. Oops.

> In the 
> first example, int*(r)* means that the integer pointer "*p" must not 
> live beyond the value pointed by "r" (because we're going to assign "r" 
> to "*p"). In the second example, the value pointed by "a" must not live 
> longer than the one pointed by "b" and the value pointed by "b" must not 
> live longer than the one pointed "a"; the net result is that they must 
> have the same lifetime and need to be in the same region.
> 
> For something more complicated, you could give multiple commas-separated 
> constrains:
> 
>     void choose(ref int*(a,b) result, int* a, int* b)
>     {
>         result = rand() > 0.5 ? a : b;
>     }

This all is irrelevant. You essentially change the syntax. Syntax is, 
again, the least of the problems to be solved.

>> I suspect there are things you can't even express without symbolic 
>> regions. Consider this example from Dan's slides:
>>
>> struct ILst(region R1, region R2) {
>>      int *R1 hd;
>>      ILst!(R1, R2) *R2 tl;
>> }
>>
>> This code reflects the fact that the list holds pointer to integers in 
>> one region, whereas the nodes themselves are in a different region. It 
>> would be a serious challenge to tackle that without symbolic regions, 
>> and simpler that won't be for anybody.
> 
> Today's templates are just fine for that. Just propagate variables 
> through template arguments and apply region constrains to the members:
> 
>     struct ILst(alias var1, alias var2) {
>         int*(var1) hd;
>         ILst!(var1, var2)*(var2) tl;
>     }
>     
>     int z;
>     int*(z) a, b;
>     ILst!(a, b) lst1;
>     ILst!(&z, &z) lst2;

I hope you agree that this is just written symbols without much meaning. 
This is not half-baked. It's not even rare. The cow is still moving. I 
can't eat that! :o) I can't even start replying to it because there are 
so many actual and potential issues, I'd need to get to work on them first.

> We could even allow regions to propagate through type arguments too:
> 
>     struct ILst2(T1, T2) {
>         int*(T1) hd;
>         ILst2!(T1, T2)*(T2) tl;
>     }
>     ILst2!(typeof(&z), typeof(b)) lst3;
> 
> I think this example is a good case for attaching region constrains 
> directly to types instead of expressing them as conditional expressions 
> elsewhere, as in "if (region a <= region b)".

I am thoroughly lost here, sorry. I can't even answer "this is so wrong" 
or "this is pure genius". Probably it's somewhere in between :o). At any 
rate, I suggest you develop a solid understanding of Cyclone if you want 
to build something related to it.

[In the interest of coherence I snipped away unrelated parts of the 
discussion.]

>> I'm sorry about how you feel. Now we're in a conundrum of sorts. You 
>> seem to strongly believe you can make some nice simplified regions 
>> work, and make people like them. Taking that to a proof is hard. The 
>> conundrum is, you are facing the prospect of putting work into it and 
>> creating a system that, albeit correct, is not enticing.
> 
> Currently, I'm just trying to convince you (and any other potential 
> silent listeners) that it can work.

I understand I've been blunt throughout this post, but please side with 
me for a minute. I'm doing so for the following reasons: (a) I'm 
essentially writing this post in negative time; (b) I believe you 
currently don't have an attack on the problem you're trying to solve; 
(c) I believe it's worthwhile for you to develop an attack on the 
problem, (d) I think "we" = "the D community" should seriously consider 
safety and consequently things like region analysis.

You can now stop siding with me and side again with yourself. At this 
point you can easily guess that all of the above was to prepare you for 
an even blunter comment. Here goes.

You say you want to convince people "it can work". But right now there 
is no "it". You have no "it". Much less an "it" that can work.

But there is of course good hope that an "it" could emerge, and I 
encourage you to continue working towards that goal. It's just a lot 
more work than it might appear.

Andrei