Escape analysis (full scope analysis proposal)

Fri Nov 14 04:45:44 PST 2008

On 2008-11-13 00:53:50 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> Michel Fortin wrote:
>> Everywhere I said there was no need for named regions, I also said 
>> named regions could be kept to ease the syntax. That said, I'm not so 
>> sure named regions are that good at simplifying the syntax. In your 
>> assign example above, the named-region version has an error: it forces 
>> the two pointers to be of the same region. That could be fine, but, 
>> assuming you're assigning to *p, it'd be more precise to write it like 
>> that:
>> 
>>     void assign(region R1, region R2)(int*R1* p, int*R2 r) if (R1 <= R2);
> 
> No, the code is correct as written (without the if). You may want to 
> reread the paper with an eye for region subtyping rules. This partly 
> backs up my point: understanding region analysis may be quite a burden 
> for the average programmer. Even you, who took pains to think through 
> everything and absorb the paper, are having trouble. And me too to be 
> honest :o).

Ok, I've reread that part and it's true that using Cyclone's subtyping 
rules it'd work fine with only one region name because Cyclone 
implicitly creates two regions from that, the first being a subset of 
the other, just as I wrote explicitly here. But what I missed out was 
one of Cyclone's syntactic construct, not a concept of regions.

... or perhaps we have a different notion of what is a syntax and what 
is a concept?

>> Once we get there, I think the no-named region syntax is better.
> 
> This is invalidated by the wrong assertion above.

Yes and no. It's true that Cyclone's region subtyping makes the syntax 
prettier. On the other side, the programmer has to be aware of how it 
works, and especially aware that changing the order his arguments will 
implicitly change the region relationship between them.

>> That said, for the swap example, where both values need to share the 
>> same region, the named region notation is simpler:
>> 
>>     void swap(region R)(int*R a, int*R b);
>>     void swap(int* a, int* b) if (region(a) == region(b));
> 
> No, for that swap there is no need to specify any region. You can swap 
> ints in any two regions. Probably you meant to use int** throughout.

Hum, you're right, I meant to make these "ref int*".

>> But I'd argue that most of the time regions do not need to be equal, 
>> but are subset or superset of each other, so reusing variable names 
>> makes more sense in my opinion.
> 
> Don't forget that using a region name twice may actually work with two 
> different regions, so far as they are in a subtyping relationship. 
> Region subtyping is key to both simplifying code and to understanding 
> code after simplification.

I'm not convinced that region subtyping is so simple to understand for 
neophytes, especially because you may assume the same region at first 
glance. Cyclone isn't C++, but this region subtyping rule makes me 
think of one of those many little known corners in C++ such as Koenig 
name lookup.

But I consider this just a syntactic issue about how to express regions 
though. And I may be completely wrong about its unintuitiveness.

>> In any case, I prefer a notation where regions constrains are attached 
>> directly to the type instead of being expressed somewhere else. 
>> Something like this (explained below):
>> 
>>     void assign(int*(r)* p, int* r) { *p = r; }
>>     void swap(ref int*(b) a, ref int*(a) b);
> 
> Sure. I'm sure there's understanding that that doesn't make anything 
> any simpler or any easier to implement or understand. It's just a minor 
> change in notation, and IMHO not to the better.

Ok, then we disagree here. I think this notation is better because it 
makes you think about things in term of pointer lifetime vs. the 
pointed data lifetime, which I think is much less abstract than 
variables being part of different regions where some regions encompass 
other regions. It's a shift in perspective from the syntactic approach 
of Cyclone, although under the hood the compiler would do mostly the 
same work.

>> Here, a parenthesis suffix after a pointer indicates the region 
>> constrain of the pointer, based on the region of another pointer.
> 
> I thought it means pointer to function. Oops.

And I though the syntax was the least of your concern right now? :-) 
This probably can't be the final syntax, but I think it makes things 
clear enough talk about about the concepts... for now.

>> In the first example, int*(r)* means that the integer pointer "*p" must 
>> not live beyond the value pointed by "r" (because we're going to assign 
>> "r" to "*p"). In the second example, the value pointed by "a" must not 
>> live longer than the one pointed by "b" and the value pointed by "b" 
>> must not live longer than the one pointed "a"; the net result is that 
>> they must have the same lifetime and need to be in the same region.
>> 
>> For something more complicated, you could give multiple 
>> commas-separated constrains:
>> 
>>     void choose(ref int*(a,b) result, int* a, int* b)
>>     {
>>         result = rand() > 0.5 ? a : b;
>>     }
> 
> This all is irrelevant. You essentially change the syntax. Syntax is, 
> again, the least of the problems to be solved.

Ok then. Let's go to the real problems.

>>> I suspect there are things you can't even express without symbolic 
>>> regions. Consider this example from Dan's slides:
>>> 
>>> struct ILst(region R1, region R2) {
>>>      int *R1 hd;
>>>      ILst!(R1, R2) *R2 tl;
>>> }
>>> 
>>> This code reflects the fact that the list holds pointer to integers in 
>>> one region, whereas the nodes themselves are in a different region. It 
>>> would be a serious challenge to tackle that without symbolic regions, 
>>> and simpler that won't be for anybody.
>> 
>> Today's templates are just fine for that. Just propagate variables 
>> through template arguments and apply region constrains to the members:
>> 
>>     struct ILst(alias var1, alias var2) {
>>         int*(var1) hd;
>>         ILst!(var1, var2)*(var2) tl;
>>     }
>>         int z;
>>     int*(z) a, b;
>>     ILst!(a, b) lst1;
>>     ILst!(&z, &z) lst2;
> 
> I hope you agree that this is just written symbols without much 
> meaning. This is not half-baked. It's not even rare. The cow is still 
> moving. I can't eat that! :o) I can't even start replying to it because 
> there are so many actual and potential issues, I'd need to get to work 
> on them first.

If you mean there aren't any explanation, then you're right that 
explanations were somewhat missing from my last post. Sorry. I guess I 
was too tired to notice the lack of instructions.

Basically you apply the same rules as for the function signatures in 
the preceding function examples. For instance, "int*(var1)" means the 
ht pointer points to an int that lives at least as long as the one 
pointed by var1 (var1 must be an "int*" pointer). This means that you 
can assign the content of var1 to it, or anything else that will live 
at least as long as var1. It also mean you can take its value and place 
it in var1, or any pointer with a shorter life.

Then, we have "ILst!(var1, var2)*(var2)". It's the same rules as the 
first, except that we have a different type beyond the pointer which 
must be valid through var2's lifetime.

The last code snippet shows how to use that template.

    int z;
    int*(z) a, b;
    ILst!(a, b) lst1;
    ILst!(&z, &z) lst2;

Here, we're declaring "int*(z)", which is a pointer to an int whose 
lifetime is equal or longer than the address of z. (ok, there's an 
error here, it should have been "int*(&z)"). And normally, you wouldn't 
explicitly write that, "int*" would be enough: the compiler should 
determine the default constrains automatically.

Then when you instanciate ILst!(a, b), the template will take the 
lifetime of a and b (which is the lifetime of the address of z) and 
apply it to pointers inside the struct.

>> We could even allow regions to propagate through type arguments too:
>> 
>>     struct ILst2(T1, T2) {
>>         int*(T1) hd;
>>         ILst2!(T1, T2)*(T2) tl;
>>     }
>>     ILst2!(typeof(&z), typeof(b)) lst3;

Again, some explanations were missing... Basically, 
region/scoping/lifetime constrains are attached to pointers. Which 
means that propagating a type ought to be enough to propagate the 
lifetime constrains too. "ILst2!(typeof(&z), typeof(b))" is exactly the 
same as "ILst!(&z, b)". ILst takes its constrains from variables while 
ILst2 takes its constrains from types.

But the two previous examples are a little stretched to make the 
concept more similar to Cyclone. With my proposal, you can do much 
better than this.

I think in most cases where you want to propagate constrains, you'll 
want to propagate a type too. If what you want is a linked list, it'd 
be better expressed generically like this:

	struct ListRoot(T) {
		ListNode!(T)* first;
	}
    struct ListNode(T) {
        T hd;
        ILst2!(T)* tl;
    }

	int global;
	void foo() {
		int a;
		ListRoot!(int*) listRoot;
		ListNode!(int*) listNode;
		listRoot.first = &listNode;
		listNode.hd = &a;
		listNode.hd = &global;
	}

Notice how there is absolutely no special annotation here; it's already 
valid template code.

Now, let the compiler apply some defaults according to these rules: 
types declared in local variables will be allowed to point to values of 
their own region, and structs members will be allowed to point to 
values of the same region the struct comes from. Annotated explicitly, 
the default annotations would look like this:

	struct ListRoot(T) {
		ListNode!(T)*(this) first; // pointer to something in the same region as this
	}
    struct ListNode(T) {
        T value; // if T is a pointer, it holds its own region annotations
        ILst2!(T)*(this) next; // pointer to something in the same 
region as this
    }

	int global;
	void foo() {
		int a;
		ListRoot!(int*(&listRoot)) listRoot;
		ListNode!(int*(&listNode)) listNode;
		listRoot.first = &listNode;
		listNode.value = &a;
		listNode.value = &global;
	}

With this scheme, the lifetime of all nodes in the linked list need to 
be equal or longer than the one of the preceding node (normally, they 
will all be equal), and the lifetime of the value pointer is determined 
by the type you give as a template argument to ListRoot and ListNode. 
Therefore, it becomes possible to construct the linked list on the 
stack when the root is on the stack, with no need for explicit 
annotations.

There is still one problem though. If you want to swap two nodes, you 
can't, because there is no guarenty that the lifetime of the "this" 
pointer of a ListNode is equal to lifetime of the "next" pointer. (In 
fact, the next pointer lifetime is longer or equal to the struct 
lifetime). So if we're going to swap or reorder nodes, we'll need a way 
to constrain the "this" pointer against the "next" pointer to create a 
circular reference and thus forcing the two pointers to point to the 
same region... perhaps something like this:

    struct ListNode(T) {
		ListNode*(next) this;
        T value;
        ILst2!(T)*(this) next;
    }

Not a very good syntax though.

>> I think this example is a good case for attaching region constrains 
>> directly to types instead of expressing them as conditional expressions 
>> elsewhere, as in "if (region a <= region b)".
> 
> I am thoroughly lost here, sorry. I can't even answer "this is so 
> wrong" or "this is pure genius". Probably it's somewhere in between 
> :o). At any rate, I suggest you develop a solid understanding of 
> Cyclone if you want to build something related to it.

I'll side with "pure genius", but I also consider myself biased. :-)

>>> I'm sorry about how you feel. Now we're in a conundrum of sorts. You 
>>> seem to strongly believe you can make some nice simplified regions 
>>> work, and make people like them. Taking that to a proof is hard. The 
>>> conundrum is, you are facing the prospect of putting work into it and 
>>> creating a system that, albeit correct, is not enticing.
>> 
>> Currently, I'm just trying to convince you (and any other potential 
>> silent listeners) that it can work.
> 
> I understand I've been blunt throughout this post, but please side with 
> me for a minute. I'm doing so for the following reasons: (a) I'm 
> essentially writing this post in negative time; (b) I believe you 
> currently don't have an attack on the problem you're trying to solve; 
> (c) I believe it's worthwhile for you to develop an attack on the 
> problem, (d) I think "we" = "the D community" should seriously consider 
> safety and consequently things like region analysis.

I don't mind about (a) and I agree about (d).

I'll say that because of my lack of expertise with Cyclone I have some 
difficulty expressing my proposal as a comparaison of what is different 
from Cyclone (it's difficult enough without it). You're the one asking 
for such a comparison and increasing the difficulty. I do not dislike 
the challenge, but I don't think you can take this as a proof that I 
don't understand well the problem I'm trying to solve when I may just 
be mixing some things about the approach taken by Cyclone.

Another thing not helping is that my original proposal has evolved a 
little since the first time I started the "full scope analysis 
proposal" thread. I also revamped the syntax I use to talk about the 
problem (and apparently I should do it again to avoid a conflicts with 
function names). Hunting in previous post the details I leave out in 
the more recent ones doesn't help anyone understanding what I'm talking 
about.

I'm thinking that maybe I should put everything in one document to have 
a coherent proposal that could evolve as a whole instead of one 
scattered on various post between which the syntax I use and some 
concepts have evolved.

> You can now stop siding with me and side again with yourself. At this 
> point you can easily guess that all of the above was to prepare you for 
> an even blunter comment. Here goes.
> 
> You say you want to convince people "it can work". But right now there 
> is no "it". You have no "it". Much less an "it" that can work.
> 
> But there is of course good hope that an "it" could emerge, and I 
> encourage you to continue working towards that goal. It's just a lot 
> more work than it might appear.

I'm pretty sure I hold that "it" just now, or something very near it. 
It's just that it seems I haven't explained it well enough for you (and 
probably anyone) to understand correctly. I should probably write it 
all down in one coherent and more formal document rather than 
scattering all the details over many different posts as half-documented 
concept-name-changing written-too-fast examples.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/