My Reference Safety System (DIP???)

Zach the Mystic via Digitalmars-d digitalmars-d at puremagic.com
Thu Feb 26 17:25:32 PST 2015


On Thursday, 26 February 2015 at 21:33:53 UTC, Marc Schütz wrote:
> On Thursday, 26 February 2015 at 17:56:14 UTC, Zach the Mystic 
> wrote:
>> On Wednesday, 25 February 2015 at 21:26:33 UTC, Marc Schütz 
>> wrote:
> struct A {
>     B* b;
>     ~this() {
>         b.doSomething();
>     }
> }
>
> struct B {
>     void doSomething();
> }
>
> void foo() {
>     A a;      // declscope(1)
>     B b;      // declscope(1)
>     a.b = &b; // refscope(1) <= declscope(1): OK
>     // end of scope:
>     // `b` is destroyed
>     // `a`'s destructor is called
> // => you're calling a method on a destroyed object
> }
>
> Basically, every variable needs to get its own declscope; all 
> declscopes form a strict hierarchy (no partial overlaps).

Well, technically you only need one per variable with a 
destructor. Fortunately, this doesn't seem hard to add. Just 
another few bits, allowing as many declarations with destructors 
as seem necessary (4 bits = 15 variables, 5 bits = 31 variables, 
etc.), with the last being treated conservatively as unsafe. (I 
think anyone declaring 31+ variables with destructors in a 
function, and taking the addresses of those variables has bigger 
problems than memory safety!)
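None of this is implemented, of course, but the saturating id allocation I have in mind is trivial to sketch. (Python just for illustration, since the compiler side doesn't exist yet; the names and the 4-bit width are invented for the example.)

```python
# Hypothetical sketch: a fixed-width counter handing out ids to
# declarations with destructors. The maximum value saturates and is
# treated conservatively as "unknown/unsafe".

DTOR_BITS = 4                     # assumed width for the example
DTOR_MAX = (1 << DTOR_BITS) - 1   # 15: the saturating "too many" id

def next_dtor_id(current: int) -> int:
    """Allocate the next destructor-scope id, saturating at DTOR_MAX."""
    return current + 1 if current + 1 < DTOR_MAX else DTOR_MAX

ids = []
n = 0
for _ in range(20):               # imagine 20 declarations with dtors
    n = next_dtor_id(n)
    ids.append(n)

# The first 14 declarations get distinct ids; the 15th and beyond all
# collapse into the conservative DTOR_MAX bucket.
```

With 5 bits the same scheme gives 31 distinct-plus-saturating ids, and so on.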

>> I guess this is the "Language versus Legacy" issue. I think 
>> D's strength is in its language, not its huge legacy 
>> codebase. Therefore, I find myself going with the 
>> #pleasebreakourcode crowd, for the sake of extending D's lead 
>> where it shines.
>
> I am too, actually, but it would be a really hard sell.

But look, Walter and Andrei were fine with adding `return ref` 
parameters. There's hope yet!

>> I'm not sure all references in safe code need to be `scope` - 
>> that would break a lot of code unto itself, right?
>
> Not sure how much would be affected. I actually suspect that 
> most of it already behaves as if it were scope, with the 
> exception of newly allocated memory. But those should ideally 
> be "owned" instead.
>
> But you're right, there still needs to be an opt-out possibility, 
> most likely static.

I don't even have a use for `scope` itself in my proposal. The 
only risk I'm running is a lot of false positives -- safe 
constructs which the detection mechanism conservatively treats as 
unsafe because it can't follow the program logic. Still, it's 
hard for me to imagine even these appearing very much. And they 
can be put into @trusted lambdas -- all @trusted functions are 
treated as if they copy no references, effectively canceling any 
parameter attributes to the contrary.

>> T* fun(T* a, T** b) {
>>  T* c = new T;
>>  c = a;
>>  *b = c;
>>  return c;
>> }
>
> Algorithm for inference of ref scopes (= parameter annotations):
>
> 1) Each variable, parameter, and the return value get a ref 
> scope (or ref depth). A ref scope can either be another 
> variable (including `return` and `this`) or `static`.
>
> 2) The initial ref scope of variables is themselves.

Actually, no. The *declaration* scope is themselves. The initial 
ref scope is whatever the variable is initialized with, or just 
null if nothing. We could even have a bit for "could be null". 
You might get some null-checking out of this for free. But then 
you'd need more attributes in the signature to indicate "could be 
null!" But crashing due to null is not considered a safety issue 
(I think!), so I haven't gone there yet.

> 3) Each time a variable (or something reachable through a 
> variable) is assigned (returning is assignment to the return 
> value), i.e. for each location in the function that an 
> assignment happens, the new scope ref will be:
>
> 3a) the scope of the source, if it is larger or equal to the 
> old scope

If scope depth is >= 1, you inherit the maximum of the source and 
the target. If it's 0, you do a bitwise OR on the mystery scopes 
(unless the compiler can easily prove it doesn't need to), so you 
can accumulate all possible origins of the assigned-to scope.
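To sketch that rule concretely (Python for illustration, since none of this exists in the compiler; the `(depth, mystery_bits)` encoding and the name `assign` are my invention):

```python
# Hypothetical sketch of the assignment rule described above. A scope
# is modeled as (depth, mystery_bits): depth counts how local the
# reference may be (0 = parameter/global level), and mystery_bits
# records which parameters it might have originated from.

def assign(target, source):
    """Return the target's new scope after `target = source`."""
    t_depth, t_bits = target
    s_depth, s_bits = source
    depth = max(t_depth, s_depth)
    if depth >= 1:
        # A local scope is involved: keep the narrowest (deepest) depth.
        return (depth, 0)
    # Depth 0: OR together all possible parameter origins.
    return (0, t_bits | s_bits)
```

So two depth-0 assignments accumulate origin bits, while any local scope dominates and pins the result to the deepest depth seen.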

> 3b) otherwise (for disjunct scopes, or assignment from smaller 
> to larger scope), it is an error (could potentially violate 
> guarantees)

I don't have "disjunct scopes". There's just greater than and 
less than. The mystery scopes are for figuring out what the 
parameter attributes are, and, in the absence of inference, for 
causing errors in @safe code when the parameters are not 
accurately marked.

> 4) If a source scope refers to a variable (apart from the 
> destination itself), for which not all assignments have been 
> processed yet, it is put into a queue, to be evaluated later. 
> For code like `a = b; b = a;` there can be dependency cycles. 
> Such code will be disallowed.

No, my system is simpler. I want to make this proposal appealing 
from the implementation side as well as from the language side. 
You analyze the code in lexical order:

T* dum(T* a) {
   T* b = a; // b accumulates a
   return b; // okay: lexically, b holds only a's scope so far
   T c;      // unreachable, but still analyzed in lexical order
   b = &c;   // now b accumulates scopedepth(1)
   return b; // error here, but *only* here
}

The whole process relies on accumulating the scopes as the 
compiler encounters them. There are cases, branching 
conditionals combined with goto labels, or the beginnings of 
loops, where the logical order can differ from the lexical 
order. Only *these* cases are pushed onto an array and 
revisited once the branching construct is complete. Because 
it's more likely (possibly mathematically certain) to catch all 
problems, these statements are "reheated" in reverse order. My 
reasoning for this is to keep compiler passes to a minimum, to 
save compilation time. In theory, all the scope assignments could 
be traversed again and again, until no scope was left unturned, 
so to say, but I wanted to come up with something with what you 
call an O(1) compilation time.
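A toy simulation of that single lexical pass with "reheating" (Python for illustration; statements are reduced to assign/return tuples, and the `in_loop` flag stands in for the compiler's "logical order may precede lexical order" detection):

```python
# Hypothetical sketch: one lexical pass over the statements, checking
# as we go; statements flagged as possibly out of logical order are
# pushed onto a list and rechecked in reverse once the construct ends.

def analyze(stmts):
    """stmts: ("assign", dst, src_depth, in_loop) or
    ("return", var, in_loop) tuples. Returns indices of errors."""
    scopes = {}    # variable -> accumulated scope depth
    deferred = []  # statements to "reheat" later
    errors = []

    def check(i, stmt):
        if stmt[0] == "assign":
            _, dst, depth, _ = stmt
            scopes[dst] = max(scopes.get(dst, 0), depth)  # accumulate
        else:
            _, var, _ = stmt
            if scopes.get(var, 0) >= 1:   # escaping a local reference
                errors.append(i)

    for i, stmt in enumerate(stmts):
        check(i, stmt)
        if stmt[-1]:                      # logical order may differ
            deferred.append((i, stmt))

    for i, stmt in reversed(deferred):    # reheat in reverse order
        check(i, stmt)
    return sorted(set(errors))
```

The `dum` example above maps onto this directly: the first `return b` passes, and only the return after `b = &c` errors. A `return` lexically *before* an assignment, but inside a loop, gets caught on the reverse reheat.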

Honestly, it's almost impossible to say what the tax in 
compilation time will be until something's implemented (something 
I learned from Walter).

> How exactly the scope of a complex expression has to be 
> computed is left open here.

If you call a function, the return value (if a reference) will 
have a scope which can be deduced from the function signature. 
You inherit the scope of what you pass accordingly, and pass 
those scopes on to the next function (if you're in a function 
chain), or the "out!" parameters, if need be:

T* fun(return T* a, T* b, out!b T** c); // signature only

void gun() {
   T e; // local
   T* f;
   T** g = new T*;
   f = fun(&e, f, g); // f inherits the scope of &e; *g inherits f's
}

The results of a called function are just inherited as indicated 
by the function signature. I don't know what other kinds of 
"complex expression" you are referring to.

> In the end, if there was no error, all variables, parameters 
> and the return value will have a minimum reference scope 
> assigned. If that scope is the variable itself, they can be 
> inferred as `scope`. If it is a parameter, that parameter get 
> an `out!identifier` or `return` annotation.

In the case of attribute inference, the function's final return 
scope is used to mark the appropriate parameters `return` in the 
final signature; conversely, when the function is called, its 
parameter attributes are used to deduce the scope of the 
returned reference.

> Note that the order in which the "assignments" occur inside the 
> function doesn't matter. This is more restrictive than strictly 
> necessary, but it's certainly ok in most cases, easy to work 
> around when not, and it doesn't require data/control flow 
> analysis.

This is different from my proposal. I aim to just go in lexical 
order, with a little extra work done when the lexical order is 
detected as possibly differing from the logical order (e.g. in a 
conditional inside a loop).

> (By the way: inference cannot work for recursive functions.)

I would like to see a "best effort" approach taken for solving 
the problem of recursive function inference. I think a function 
should be considered "innocent until proven guilty" as regards 
'pure', for example. It's one of those things which seems like 
it's really hard to screw up. How could a function which is 
otherwise pure become impure just because it calls itself?

T hun(...) {
   [no impure code]
   hun(...);
   [no impure code]
}

I may be wrong, but I can't figure out how this function could 
magically become impure just because it calls itself. The same 
goes for the other attributes. And you can use the same trick, of 
pushing questionable expressions onto a stack or array, and just 
revisiting them at the end of the function to check for attribute 
violations. But I admit I don't really understand why attributes 
can't be inferred with recursive calls in the general case. Maybe 
somebody can explain to me what I'm missing here.
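One standard way to get "innocent until proven guilty" is optimistic inference over the call graph: assume every function pure, strip purity wherever impure evidence appears, and iterate to a fixed point. Under that scheme a self-call alone never makes a function impure. (Python sketch for illustration; it assumes a precomputed call graph, which the real compiler would build itself.)

```python
# Hypothetical sketch: optimistic purity inference with recursion.
# calls: fn -> set of callees; impure_ops: fns whose bodies contain
# directly impure operations.

def infer_pure(calls, impure_ops):
    pure = {fn: True for fn in calls}   # innocent until proven guilty
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for fn in calls:
            if not pure[fn]:
                continue
            guilty = fn in impure_ops or any(
                callee in pure and not pure[callee]
                for callee in calls[fn]
            )
            if guilty:
                pure[fn] = False
                changed = True
    return pure
```

Here `hun`, which only calls itself, stays pure, while impurity propagates up through callers as expected.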

> Your example:
>
> T* fun(T* a, T** b) {
>     // => S(a) = a
>     // => S(b) = b
>     // => S(return) = <doesn't matter>
>     T* c; // == (T*).init == null
>     // => S(c) = c
>     c = new T;
>     // `new` returns static, which is wider than c

`c`'s reference hasn't been assigned until now, so it's neither 
wider nor narrower. We're not tracking null references yet, so 
I'm just treating them like they're global.

>     // => S(c) = static
>     c = a;
>     // => invalid, narrowing not allowed
>     // (this is what I asked about, and now I
>     // see why it's necessary)

Actually this is fine, I think. Even if `c` inherited something 
narrower than "new T" (i.e. depth 1), it would be fine, because 
it would now be considered depth(1) and could no longer be copied 
to anything with depth <1. It might or might not store a global, 
but for safety reasons it must now be treated with the narrowest 
it could possibly have. The error now would be if you copied it 
*back* to a parameter or a global. (Note the difference between 
`c`'s declaration scope, `&c` = depth(1), and its reference 
scope, which is null until otherwise assigned.)

>     // let's assume it didn't happen, so that
>     // the next two statements work
>     *b = c;
>     // => S(b) = S(c) = static
>     return c;
>     // => S(return) = S(c) = static
> }

This would be fine, since your code only has a `new T` and a `T*` 
parameter copied to c so far. In the case of inference, the 
function now infers: "T* fun(return T* a, out!a T** b)". In the 
absence of inference, it gives errors on both counts (in @safe 
code of course, as always). And we're not tracking null yet 
(which is a different issue), so I won't worry about that. Also, 
in non-branching code, the compiler could actually know that c 
was no longer null at this time.

> Something else that needs consideration: What happens when 
> parameters alias each other? I think it is ok, because the 
> checking phase will naturally prohibit calling functions in a 
> way that would break the guarantees, but I haven't thought it 
> through completely.

I'm not sure what you mean. I don't think it's a problem.

>> I'm actually thinking of reusing `noscope` as a function 
>> attribute (`@noscope` perhaps) which says that the function 
>> may return a heap or global reference. This is all that's 
>> necessary to complete an ownership system. If a scope has 
>> exactly 1 "mystery" bit set, and is known not to come from the 
>> heap or a global, then you know that it *must* contain a 
>> reference to exactly the parameter for which the mystery bit 
>> is set. You know exactly what it contains == ownership.
>
> I will have to think about this, but I believe you cannot 
> express such concepts as deadalnix's islands, or "const 
> borrowing". But maybe, if we're lucky, I'm wrong :-)

We'll see!

