My Reference Safety System (DIP???)

Tue Feb 24 17:12:14 PST 2015

So I've been thinking about how to do safety for a while, and 
this is how I would do it if I got to start from scratch. I think 
it can be harnessed to D, but I'm worried that people will be 
confused by it, or that there might be a show-stopping use case I 
haven't thought of, or that it is simply too cumbersome to be 
taken seriously, but I'll make a DIP when it overcomes these 
three obstacles.

I'm feeding off the momentum built by the approval of DIP25, and 
off of other recent `scope` proposals:
http://wiki.dlang.org/DIP25
http://wiki.dlang.org/User:Schuetzm/scope
http://wiki.dlang.org/DIP69

This system goes farther than either DIP25 or DIP69 towards 
complete safety, but is simpler and easier to implement I (I 
think) than Mark Schutz's and deadalnix's proposal. It is not an 
ownership or reference counting system, but can serve as the 
foundation to one. Which leads to...

Principle 1: Memory safety is indispensable to ownership, but not 
the other way around. Memory safety focuses on all the things 
which *might* happen, and casts a wide net, akin to an algebraic 
union, whereas ownership targets specific things, focuses on what 
*will* happen, and is akin to the algebraic intersection of 
things. I will therefore present the memory safety system first, 
leave grafting an ownership system on top of it for later.

Principle 2: The Function is the key unit of memory safety. The 
compiler must never need to leave the function it is compiling to 
verify that it is safe. This means that no information important 
to safety can be excluded from the signatures of the functions 
that the compiling function is calling. This principle has 
already been conceded in part by Walter and Andrei's acceptance 
of `return ref` parameters in DIP25, which simply implements the 
most common use case where safety is needed. Here I am taking 
this principle to the extreme, in the interest of total safety. 
But speaking of function signatures,

Principle 3: Extra function and parameter attributes are the 
tradeoff for great memory safety. There is no other way to 
support both encapsulation of control flow (Principle 2) and the 
separate-compilation model (indispensable to D). Function 
signatures pay the price for this with their expanding size. I 
try to create the new attributes for the rare case, as opposed to 
the common one, so that they don't appear very often.

Principle 4: Scopes. My system has its own notion of scopes. They 
are compile time information, used by the compiler to ensure 
safety. Every declaration which holds data at runtime must have a 
scope, called its "declaration scope". Every reference type 
(defined below in Principle 6) will have an additional scope 
called its "reference scope". A scope consists of a very short 
bit array, with a minimum of approximately 16 bits and reasonable 
maximum of 32, let's say. For this proposal I'm using 16, in 
order to emphasize this system's memory efficiency. 32 bits would 
not change anything fundamental, only allow the compiler to be a 
little more precise about what's safe and what's not, which is 
not a big deal since it conservatively defaults to @system when 
it doesn't know.

So what are these bits? Reserve 4 bits for an unsigned integer 
(range 0-15) I call "scopedepth". Scopedepth is easier for me to 
think about than lifetime, of which it is simply the inverse, 
with (0) scopedepth being infinite lifetime, 1 having a lifetime 
at function scope, etc. Anyway, a declaration's scopedepth is 
determined according to logic similar that found in DIP69 and 
Mark Schutz's proposal:

int r; // declaration scopedepth(0)

void fun(int a /*scopedepth(0)*/) {
   int b; // depth(1)
   {
     int c; // depth(2)
     {
       int d; // (3)
     }
     {
       int e; // (3)
     }
   }
   int f; // (1)
}

Principle 5: It's always un at safe to copy a declaration scope from 
a higher scopedepth to a reference variable stored at lower 
scopedepth. DIP69 tries to banish this type of thing only in 
`scope` variables, but I'm not afraid to banish it in all @safe 
code period:

void gun() @safe {
   T* t; // t's declaration depth: 1
   T u;
   {
     T* uu = &u; // fine, this is normal
     T tt;
     t = &tt; // t's reference depth: 2, error, un at safe
   }
   // now t is corrupted
}

So you'd have to enclose "t = &tt;" above in a @trusted lambda or 
a @system block. The truth is, it is absurd to copy the address 
of something with shorter lifetime into something with longer 
lifetime... what use would you ever have for it in the 
longer-lived variable? I'm therefore simplifying the system by 
making all instances of this unsafe.

Looking at Principle 5, I realize I forgot:

Principle 6: Reference variables: Any data which stores a 
reference is a "reference variable". That includes any pointer, 
class instance, array/slice, `ref` parameter, or any struct 
containing any of those. For the sake of simplicity, I boil _all_ 
of these down to "T*" in this proposal. All reference types are 
effectively the _same_ in this regard. DIP25 does not indicate 
that it has any interest in expanding beyond `ref` parameters. 
But all reference types are unsafe in exactly the same way as 
`ref` is. (By the way, see footnote [1] for why I think `ref` is 
much different from `scope`). I don't understand the restriction 
of dIP25 to `ref` paramteres only. Part of my system is to expand 
`return` parameter to all reference types.

Principle 7: In this system, all scopes are *transitive*: any 
reference type with double indirections inherits the scope of the 
outermost reference. Think of it this way:

T** grun() {
   T** tpp = new T*; // reference scopedepth(0)
   return tpp; // fine, safe

   static T st; // decl depth(0)
   T* tp = &st; // ref depth(0)
   *tpp = tp;
   return tpp; // safe, all depths still 0

   T t; // decl depth(1)
   tp = &t; // tp reference depth now (1)
   *tpp = &tp; // safe, depths all 1
   return tpp; // un at safe
}

If a reference type contains *any* pointer, no matter how 
indirect, to a local scope, the *whole* type is corrupted when 
the scope finishes.

Principle 8: Any time a reference is copied, the reference scope 
inherits the *maximum* of the two scope depths:

T* gru() {
   static T st; // decl depth(0)
   T t; // decl depth(1)
   T* tp = &t; // ref depth(1)
   tp = &st; // ref depth STILL (1)
   return tp; // error!
}

If you have ever loaded a reference with a local scope, it 
retains that scope level permanently, ensuring the safety of the 
reference.

Whatever your worries about scopedepth, I want to introduce the 
purpose of the other 12 bits in a scope.

I said a scope consisted of 16 bits, and I only used 4 so far. 
What are the other 12 for, then? Simple, we need one bit for each 
of the function's parameters. Let's reserve 8 bits for them. All 
references copied to or from the 8th parameter or above are 
treated as if they copied to *all* of them. Very few functions 
will do this, so we paint them all with a broad brush, for safety 
reasons. (Likewise, all scopedepths above 15 are treated the 
same.)

We have 4 bits left. These are for the "special" parameters: One 
for the implicit `this` parameter of member functions, one bit 
for the context of a nested function, one special bit to 
symbolize access to or from global or heap variables, and one bit 
left over in case I missed something. Remember, the "luxury" 
version would have a whole 32, or even 64 bits to play around 
with, but 16 will suffice in most cases.

Each of the functions parameters is initialized with its own bit 
set. All these bits represent "mystery scopes" -- that is, we 
don't know what their scope is in the calling function, but:

Principle 8: We don't need to know! For all intents and purposes, 
a reference parameter has infinite lifetime for the duration of 
the function it is compiled in. Whenever we copy any reference, 
we do a bitwise OR on *all* of the mystery scopes. The new 
reference accumulates every scope it has ever had access to, 
directly or indirectly.

T* fun(T* a, T* b, T** c) {
   // the function's "return scope" accumulates `a` here
   return a;
   T* d = b; // `d's reference scope accumulates `b`

   // the return scope now accumulates `b` from `d`
   return d;

   *c = d; // now mutable parameter `c` gets `d`

   static T* t;
   *t = b; // this might be safe, but only the caller can know
}

All this accumulation results in the implicit function signature:

T* fun(return T* a, // DIP25
        return noscope T* d, // DIP25 and DIP71
        out!b T** c  // from DIP71
        ) @safe;

(See footnote [2] for a comment on on the `out!` and `noscope` 
attributes.)

Principle 9: When calling a function, DIP25 (expanded to all 
reference types) in combination with DIP71 gives you everything 
you need to know to ensure total memory safety. If we have a 
function signature:

T* gun(return T* a, noscope T* b, out!b T** c) @safe;

T* hun(return T* a1, T** b2) {
   T t;
   T* tp, tp2;
   tp = new T; // depth zero
   tp2 = gun(a1,  // tp2 accumulates a1 based on fun()'s signature
            tp, // okay to copy a new T to a global pointer
            b2); // b2 now loaded with tp's global only scope
   return tp2; // okay, all we have so far is a1, marked `return`

   tp = &t; // tp now loaded with local t's scope
   return gun(tp, // error, gun() inherits tp's local scope
              tp2, // tp2 has a1 only right now
              b2, // error, b2 not marked `out!a1`
}

The point is that there's nothing gun() can do to corrupt hun() 
on its own, since all its exits are blocked.

Principle 10: You'll probably have noticed that all scopes 
accumulate each other according to lexical ordering, and that's 
good news, because any sane person assigns and return references 
in lexical order. The fun part of this proposal is that for 
99.99% of uses the safety mechanism will catch the load ordering 
accurately on the first pass, with hardly any compiler effort. 
It's safe because it accumulates and never loses information. But 
there is a way to break this system, although there are only two 
types of people who would ever do it: malicious programmers 
trying to break the safety system, and fools. This is how you do 
it:

T* what() {
   T t;
   T* yay;
   foreach(i; 1..4) {
     if (i == 3)
       yay = new T;
     else if (i == 2)
       return yay;
     else if (i == 1)
       yay = &t;
   }
}

The good news is that even this kind of malicious coding can be 
detected. The bad news is that checking for this 0.01% of code 
may take up an unfriendly amount of compile time. Here's the way 
I thought of to check even for this malicious code:

The lexical ordering can only be different from the logical order 
of execution when one is inside a branching conditional which is 
inside a "jumpback" situation, where the code can be revisited. A 
jumpback can only occur after a jump label has been found (rare), 
or inside a loop (common). Anytime a reference is copied under 
the potentially dangerous condition, push the statement that 
copied it onto a stack. When the end of the conditional has been 
reached, revisit each statement in reverse order and "reheat" the 
relevant scopes.

Aside from this unfortunate "gotcha", D would be 100% memory safe 
with this system (at least in single-threaded code -- exceptions 
and thread safety different issues I haven't fully thought 
through).

Conclusion

1. With this system as foundation, an effective ownership system 
is easily within reach. Just confine the outgoing scopes to a 
single parameter and no globals, and you have your ownership. You 
might need another (rare) function attribute to help with this, 
and a storage class (e.g. `scope`, `unique`) to give you an error 
when you do something wrong, but the groundwork is 90% laid.

2. Do I realize that it's weird dressing up function parameters 
with so much information about what they do? Yes I do. But I 
think it's important to see what 100% safety would actually look 
like, even if it's rejected on account of being too burdensome. 
And it wouldn't even *be* burdensome if attribute inference were 
made uniform throughout the language. The function signatures 
could then appear dressed up in their full glory typically only 
in compiler generated interface files, and other places where 
programmers, not compilers, wanted them. Anyway, this is my 
reference safety system. Pop it with your needles!

[1] The problems with `ref` come from the fact that it is the 
only storage class which changes the way a program works without 
giving you an error:

void notRef(/*ref*/ int a) { ++a; }
void yesRef(  ref   int a) { ++a; }

void test() {
   int a = 0;
   yesRef(a); // a == 1
   notRef(a); // a still 1
}

Both yesRef() and notRef() are accepted, but it changes what 
happens which one you use. Adding or subtracting any other 
attribute will at most give you an error, but won't silently 
change things. `ref`, an "immutable pointer with value 
semantics," is a complicated beast, a type but not a type. I say 
this because `scope` and its variants are not so complicated. 
`scope` is like most other attributes. All is does is help the 
compiler optimize things and generate errors when misused. Its 
presence or absence will never change what the program actually 
does, and therefore it should not be lumped together with the 
problems associated with `ref`. [End 1]

[2] Since the discussion to DIP71:

http://forum.dlang.org/post/xjhvpmjrlwhhgeqyoipv@forum.dlang.org

...which proposes `out!` and `noscope` parameters as a way of 
warning the caller what is done inside the function, I have 
started to consider the issue of ownership in addition to 
reference safety. I'm not wedded to the name `noscope` in the 
role I proposed for it. Mark Schutz suggested reusing keyword 
`static` instead, to indicate that a reference is copied to a 
global variable. This may be wise, in light of the fact that an 
ownership system may require something like `noscope` for a 
subtly different purpose. But there's no point in discussing 
details unless the whole proposal gains traction first. [End 2]