Escape analysis (full scope analysis proposal)

Wed Nov 12 20:54:57 PST 2008

On 2008-11-12 10:02:02 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> Michel Fortin wrote:
>> On 2008-11-09 10:10:03 -0500, Andrei Alexandrescu 
>> <SeeWebsiteForEmail at erdani.org> said:
>> So my first point is that since we have a garbage collector in D, and 
>> moreover since we're likely to get one heap per thread in D2, we don't 
>> need dynamic regions. The remaining regions are: 1) the shared heap, 2) 
>> the thread-local heap, 3) All the stack frames; and you can't allocate 
>> other stack frames than the current one. Because none of these regions 
>> require a handle to allocate into, we (A) don't need region handles.
>> 
>> We still have many regions. Beside the two heaps (shared, 
>> thread-local), each function's stack frame, and each block within them, 
>> creates a distinct memory region. But nowhere we need to know exactly 
>> which region a function parameter comes from; what we need to know is 
>> which address outlives which pointer, and then we can forbid assigning 
>> addresses to pointers that outlive them. All we need is a relative 
>> ordering of the various regions, and for that we don't need to attach 
>> *names* to the regions so that you can refer explicitly to them in the 
>> syntax. Instead, you could say something like "region of (x)", or 
>> "region of (*y)" and that would be enough.
> 
> But how do you type then the assignment example?
> 
> void assign(int** p, int * r) { *p = *r; }
> 
> How do you reflect the requirement that r's region outlives *p's region?
> 
> But that's not even the point. Say you define some notation, such as:
> 
> void assign(int** p, int * r) if (region(r) <= region(p));
> 
> But the whole point of regions was to _simplify_ notations like the above into:
> 
> void assign(region R)(int*R* p, int *R r);
> 
> So although you think you simplified things by using region(symbol) 
> instead of symbolic names, you complicated things. The compiler still 
> needs to infer regions for each value, so it is as complicated as a 
> named-regions compiler, and in addition you require the user to write 
> bulkier expressions because you disallow use of symbols. So everybody 
> is worse off. Note how in the example using a symbolic region the 
> outlives relationship is enforced implicitly by using the same symbol 
> name in two places.

Everywhere I said there was no need for named regions, I also said 
named regions could be kept to ease the syntax. That said, I'm not so 
sure named regions are that good at simplifying the syntax. In your 
assign example above, the named-region version has an error: it forces 
the two pointers to be of the same region. That could be fine, but, 
assuming you're assigning to *p, it'd be more precise to write it like 
that:

	void assign(region R1, region R2)(int*R1* p, int*R2 r) if (R1 <= R2);

Once we get there, I think the no-named region syntax is better. That 
said, for the swap example, where both values need to share the same 
region, the named region notation is simpler:

	void swap(region R)(int*R a, int*R b);
	void swap(int* a, int* b) if (region(a) == region(b));

But I'd argue that most of the time regions do not need to be equal, 
but are subset or superset of each other, so reusing variable names 
makes more sense in my opinion.

In any case, I prefer a notation where regions constrains are attached 
directly to the type instead of being expressed somewhere else. 
Something like this (explained below):

	void assign(int*(r)* p, int* r) { *p = r; }
	void swap(ref int*(b) a, ref int*(a) b);

Here, a parenthesis suffix after a pointer indicates the region 
constrain of the pointer, based on the region of another pointer. In 
the first example, int*(r)* means that the integer pointer "*p" must 
not live beyond the value pointed by "r" (because we're going to assign 
"r" to "*p"). In the second example, the value pointed by "a" must not 
live longer than the one pointed by "b" and the value pointed by "b" 
must not live longer than the one pointed "a"; the net result is that 
they must have the same lifetime and need to be in the same region.

For something more complicated, you could give multiple 
commas-separated constrains:

	void choose(ref int*(a,b) result, int* a, int* b)
	{
		result = rand() > 0.5 ? a : b;
	}

> I suspect there are things you can't even express without symbolic 
> regions. Consider this example from Dan's slides:
> 
> struct ILst(region R1, region R2) {
>      int *R1 hd;
>      ILst!(R1, R2) *R2 tl;
> }
> 
> This code reflects the fact that the list holds pointer to integers in 
> one region, whereas the nodes themselves are in a different region. It 
> would be a serious challenge to tackle that without symbolic regions, 
> and simpler that won't be for anybody.

Today's templates are just fine for that. Just propagate variables 
through template arguments and apply region constrains to the members:

	struct ILst(alias var1, alias var2) {
		int*(var1) hd;
		ILst!(var1, var2)*(var2) tl;
	}

	int z;
	int*(z) a, b;
	ILst!(a, b) lst1;
	ILst!(&z, &z) lst2;

We could even allow regions to propagate through type arguments too:

	struct ILst2(T1, T2) {
		int*(T1) hd;
		ILst2!(T1, T2)*(T2) tl;
	}
	ILst2!(typeof(&z), typeof(b)) lst3;

I think this example is a good case for attaching region constrains 
directly to types instead of expressing them as conditional expressions 
elsewhere, as in "if (region a <= region b)".

> I'll insert a few more points below in this sprawling discussion.
> 
>> Which makes me think of this:
>> 
>>     struct A { int i; this(); }
>>     ref A foo(ref A a) { return a; }
>> 
>>     ref A bar()
>>     {
>>         foo(A()).i = 1;
>> 
>>         ref A a = foo(A()); // illegal, ref cannot be used outside 
>> function signature
>>         a.i = 1;
>> 
>>         return foo(A()); // illegal ?
>>     }
> 
> foo(A()) is illegal because ref does not bind to an rvalue.

Ah, you're right.

>> Also, I'd like to point out that ref (and out) being storage classes 
>> somewhat hinder me from using them where it makes sense in the 
>> D/Objective-C bridge, since there most functions are instanciated by 
>> templates where template arguments give the type of each function 
>> argument. Perhaps there should be a way to specify "ref" and "out" in 
>> template arguments...
> 
> I agree. Something like that is on the list.

Great!

>> Also, I'd still like to have a non-null pointer type, especially for 
>> clarifying function sigatures. A template can do. If it was in the 
>> language however it be used by more people, which would be better.
> 
> I don't grok this notion "if it's in the language it would be used by 
> more people". How does that come about?

No, I really think it's true that if it is in the language, explained 
right alongside nullable pointers, more people would learn them more 
and use them more. Isn't it this exact notion that made Walter add Ddoc 
and unit tests directly into the language?

> Does it mean templates are at such a high syntactic disadvantage? Maybe 
> we should do something about that then, such as replacing !() with 
> something else :o). If we put it in phobos (which after integration 
> will be usable alongside with tango) could it count as being in the 
> language?

Pointers that shouldn't be null are pretty common, possibly even more 
common that can-be-null pointers, which is why I think it deserves a 
good, short, easy to read and remember syntax. I'd even suggest 
changing the standard syntax for pointer "*" so it only allows non-null 
pointers, and having something else "*?" for nullable ones. This would 
force people into giving more consideration before allowing nullable 
pointers, and the same syntax could apply to objects too.

That said, having a non-nullable pointer in the standard library would 
certainly be better than nothing. And the standard library should make 
use of it everywhere it makes sense. But is a standard-libary solution 
going to work with "extern (C)" functions? I think it'd be sad if it 
didn't, and it would look strange if it did (C functions with template 
arguments!).

>> As I said below, I think many people in this group are already 
>> confortable with using pointers, which may explain why they're not so 
>> interested. Having no one interested in something doesn't necessarly 
>> mean they won't appreciate it when it comes.
> 
> That I totally agree with. It's happened a couple of times with D features.
> 
>> It does, however reduce the incitative for continuing forward. So I 
>> understand why you're backing off, even if it displease me somewhat.
> 
> I'm sorry about how you feel. Now we're in a conundrum of sorts. You 
> seem to strongly believe you can make some nice simplified regions 
> work, and make people like them. Taking that to a proof is hard. The 
> conundrum is, you are facing the prospect of putting work into it and 
> creating a system that, albeit correct, is not enticing.

Currently, I'm just trying to convince you (and any other potential 
silent listeners) that it can work. I haven't given much though about 
the syntax before today as I wanted to clear up the concepts first. But 
now, in part because of your syntactic arguments above, I'm wondering 
if this was the good path to take.

I don't mind much if it never gets into the language, although I'd like 
it very much. I doing it for myself too, to better understand how you 
can document and analyse the region/scope relationship of various 
variables in a program piece by piece.

>> I never said the need for dynamic regions would be rare: I said garbage 
>> collector obsoletes it. If we can justify the need for dynamic regions 
>> later, we can add them back (with all the added complexity it requires) 
>> but I'd try without them first.
> 
> Let's not forget that symbolic regions (for typing purposes) should not 
> be confused with dynamic regions (for efficiency purposes). I agree we 
> can do away with the latter and put them in later if we care. I 
> disagree that dropping symbolic regions simplifies things.

I was under the impression that Cyclone requirement for named regions 
came with its use of dynamic regions, which I now believe was 
incorrect. If I take this example from the paper:

	char?p rstrdup(region_t<p>, const char? s);

you *need* a name for the region handle. Since region handles are there 
for supporting dynamic regions, it therefore follows that you need 
named regions to make things work at all... well here's the catch: you 
need named *region handles* as variables, not necessarily named 
regions, as you could always arrange the syntax so that the returned 
pointer is of the region of the region handle... or something like that.

>> But dynamic arrays *are* pointers, how are you oblivating the need for 
>> them? If you find a solution for dynamic arrays, you'll have a solution 
>> for pointers too.
>> 
>> You could forbid dynamic arrays from refering to stack-allocated static 
>> ones, or automatically dynamically allocate those when they escape in a 
>> dynamic array. And if I were you, whatever you choose for arrays I'd 
>> allow it for pointers too, to keep things consistent. Pointer to heap 
>> objects should be retained in my opinion.
> 
> But a possible path is to make arrays safe and leave pointers for those 
> cases in which efficiency is of utmost importance. With luck, those 
> cases are rare.

"make arrays safe"... by forcing dynamic ones to always be on the heap? 
Or by implementing a full region system that applies only to arrays? 
Obviously it's not the later; the former is the only choice I can see.

And I think you should at least allow pointers to work with heap 
variables in SafeD... otherwise people will work around that by 
creating one-item arrays. :-)

>> But are you reducing the need for pointers or hiding and restricting them?
> 
> Of course - that's the whole point. In fact, I'll insert a small 
> correction: we are reducing the need for pointers BY hiding and 
> restricting them. And that's a good thing. If you can do most of your 
> work with restricted pointers (e.g. ref), then that's a net win.

Whether you can work effectively only with ref or not remains to be seen.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/