Escape Analysis & Owner Escape Analysis
Richard (Rikki) Andrew Cattermole
richard at cattermole.co.nz
Tue Sep 3 03:00:20 UTC 2024
I've done an almost complete rewrite, I expect this to be close to the
final version:
- Globals only need to be loaded from, after that they may become an owner
- I changed how ``scope`` works, it is no longer a flag, its instead a
relationship strength, establishing a relationship strength is what
changed the majority of the document.
- New attribute ``@move`` inferred, allows for functions like swap and
move to be modelled without you needing to add said attribute (DIP1000
doesn't do any of this).
- Acknowledgement that DIP1000 attributes can live side by side these
ones, meaning the migration from DIP1000 would be a smooth one.
Current:
https://gist.github.com/rikkimax/0c1de705bf6d9dbc1869d60baee0fa81/5dba16b3b1fbe250b31ae237dda2ffc7b66f9399
As I believe this is more less complete, I'll include a copy here.
--------------------------------------------------------------------------------------------------------------
# Escape Analysis
| Field | Value
|
|-----------------|-----------------------------------------------------------------|
| DIP: | (number/id -- assigned by DIP Manager)
|
| Author: | Richard (Rikki) Andrew Cattermole
<firstname at lastname.co.nz> |
| Implementation: | (links to implementation PR if any)
|
| Status: | Draft
|
## Abstract
The movement of pointers within a program graph easily escapes known
points that own that memory within the call stack of a given thread.
This logical error can result in program termination or undetected
corruption of the program state. This proposal is a new attempt at
preventing this corruption within the ``@safe`` code.
## Contents
* [Rationale](#rationale)
* [Prior Work](#prior-work)
* [Description](#description)
* [Breaking Changes and Deprecations](#breaking-changes-and-deprecations)
* [Reference](#reference)
* [Copyright & License](#copyright--license)
* [Reviews](#reviews)
## Rationale
In a review of the existing escape analysis solution implemented in D's
reference compiler DIP1000, there is one major limitation of what it
models and assumption growth to facilitate functionality.
The implementation of DIP1000 models a single output variable per
function, this is the return value or if ``void`` the first parameter
(could be the ``this`` pointer). In practice functions typically have
more than one output, this includes mutable pointers in, ``ref`` and
``out`` function parameters.
```d
int* /* output */ func();
struct S {
int* /* output */ method1();
void method2() /* output */;
}
```
The relationship between parameters is modelled using the ``return ref``
and ``return scope`` attributes. These communicate to the compiler the
varying input and how it relates to the output for that parameter.
Needing two different attributes to determine the relationship status
between parameters has been highly incommunicable to experienced
programmers.
Due to it not being able to model multiple outputs, a lot of typical D
code cannot be safely represented using DIP1000. The design does not
protect you from extending past the modelled subset of the language.
To resolve both of these core issues in the existing design, an escape
set must be modelled per parameter. While this resolves the callee's
side, it does not protect the caller from misusing the callee. The
design DIP1000 attempts to solve this by modelling the relationship
between parameters using the two different attributes.
Another solution to this problem is to utilize the information provided
by escapes and inverse it, given an output and given the inputs that
form it, protect the inputs so that nothing can invalidate the output.
This resulted in the proposal that was ``@live``, an opt-in analysis
that does not communicate to either the callee or caller any guarantees
cross-function, making it functionally irrelevant to the guarantees of
DIP1000.
An opt-in solution to ownership does not allow for reference counting to
occur safely. To safely do this, the referenced counted owner must be
pinned and made effectively read-only so that both a reference to it and
the borrowed resource may be passed around. This was a
[blocker](https://forum.dlang.org/post/v0eu64$23bj$1@digitalmars.com)
determined by Walter and Timon for adding reference counting to the
language.
Furthermore without the entry point to escape analysis having analysis
associated with it, there is no differentiation of what can constitute
of a safe to borrow from source and what can't be. An example of this is
with a global, in the case of a variable thread local storage, it is
possible in fully ``@safe`` code with DIP1000 turned on to cause a segfault.
```d
import std;
int* tlsGlobal;
@safe:
void main() {
tlsGlobal = new int(2);
assert(*tlsGlobal == 2);
toCall();
assert(*tlsGlobal == 2); // Segfault
}
void toCall() {
tlsGlobal = null;
}
```
## Prior Work
Escape analysis as a subject matter is primarily an [analysis of
graphs](https://dl.acm.org/doi/10.1145/320385.320400). How they are
mutated and who owns them at what places. Modelling this can be an
expensive set of operations in the form of data flow analysis. For
sufficient and best experience, a full program analysis is needed with a
full graph of manipulation and logic therein analysed.
Native programming languages do not align themselves to full program
analysis, due to the separate compilation model. D is a native
programming language that uses this model almost exclusively. For this
reason, it cannot use a full program analysis and full program graph
analysis for modelling escaping. Instead, a flattened view of the graph
must be representable inside a function signature.
At the time of this proposal, a solution for escape analysis has been
implemented in the D reference compiler that is commonly referred to by
its DIP number, DIP1000. This does not cover memory ownership
guarantees, instead ``@live`` as an opt-in attribute enables some
localized to the given function guarantees.
In Rust ownership is a [transfer based
system](https://doc.rust-lang.org/book/ch04-01-what-is-ownership.html),
so that only one variable has any ownership of memory. In contrast to D,
where this is modelled and attempting to enforce this would not match
how garbage collected memory would be used. Further
[guarantees](https://doc.rust-lang.org/book/ch04-02-references-and-borrowing.html)
are given, in that when a borrow occurs from an owner, only one mutable
borrow is allowed in a given scope. This complements the ownership
transfer system as it guarantees nobody else has the potential for aliasing.
## Description
This proposal introduces escape analysis, owner escape analysis along
with a way to know if a variable associated with an argument has changed
its value post function call.
What escape analysis and its complement owner escape analysis does, is
it protects against invalidation of memory ownership whilst one or more
borrows exist.
The grammar changes for the new function are described here, removal of
DIP1000 and ``@live`` are done in its own heading ``Removal of Existing
Language Elements``.
```diff
AtAttribute:
+ @ EscapeAttribute
+ @ move
ParameterAttributes:
+ @ EscapeAttribute
+ EscapeAttribute:
+ escape ( Identifiers )
+ escape ( )
+ escape
FuncAttr:
+ FuncAttrMove
+ FuncAttrMove:
+ No
```
The semantic analysis for both analysis, is done at the same time as
they are guarantees provided in complement of each other and do not
exist in isolation. A switch should be provided to disable this analysis
should a use case is required to not perform it in the form of
``--disable-memorysafety``. If it is not set, it will be enabled for a
given edition and above by default. For any edition below this will
include the inferring of ``@escape`` and ``@move`` attribute, however no
errors will be generated for either attribute.
There is some potential for the escape set on a parameter to be
explosive in nature for mangling. At this time no specific mangling
scheme is suggested, but it is allowed in this proposal for one to be
implemented.
### Relationships
An expression is said to have a set of relationships between its inputs
and outputs, with some kind of transformation applied to the inputs to
get the outputs. This is described using the formula ``T(inputs...) =
(outputs...)``.
A function prototype, or function pointer declares this relationship,
without providing the transformation function. An example of a
transformation in the form of an identity function is given:
```d
int* identity(/* has relationship to return */ int* input) => input;
```
The return value of the ``identity`` function is the output and has a
relationship to the ``input`` parameter.
Each relationship of an input to its outputs can be described as having
a strength. These strengths are:
- No relationship
- Weak, comes from input and influences output
- Strong, requires the input to be valid for the output to be valid
For example, to get a strong relationship you can take a pointer to
stack memory:
```d
int input;
/*has a strong relationship*/ int* output = &var;
```
In this example, the variable ``input`` must outlive the variable
``output``, if you don't you will get stack corruption.
Another way you can get a strong relationship is to assign a copy of the
value that is stored in an already strong relationship variable.
```d
/*has a strong relationship*/ int* input = ...;
int* output = var; // has a strong relationship
```
The way to explicitly establish the link between a variable and its
initializer as strong is to annotate it with ``scope``.
```d
scope int* input = ...;
int* output = var; // has a strong relationship
```
These three behaviors of establishing a relationship between two
variables also applies when a containing type is in play:
```d
struct Input {
int* ptr;
int field;
}
Input input1 = ...;
int* output1 = &input1.field; // has a strong relationship between
`output1` and `input1`
scope Input input2 = ...;
int* output2 = input2.ptr; // has a strong relationship between
`output2` and `input2`
```
The default relationship of a function's outputs to its inputs is weak,
unless the argument for a given input has a strong relationship. In the
following example the ``input`` variable has a strong relationship to
the stack. So when it gets the output from the ``identity`` function
call it has a strong relationship to its input too.
```d
scope input = new int;
int* output = identity(input);
// `output` variable has a strong relationship to `input` variable
```
A way to require that a given input has a strong relationship to its
outputs is by marking a function parameter as scope.
```d
int* strongIdentity(/* has relationship to return */ scope int* input)
=> input;
```
Now you can take an input that does not have a strong relationship, and
require the output has a strong relationship to it!
```d
int* input = new int;
int* output = strongIdentity(input);
// `output` variable has a strong relationship to `input` variable
```
A weak relationship between an input and output, does not limit the
output. It only establishes that there is an relationship to be had.
This can be quite useful to composed types like tuples and static arrays:
```d
int* transformation(int* input) {
int*[2] array;
array[0] = input; // `array` has a weak relationship to `input`
array[1] = new int; // GC allocation has no relationships without a
constructor call or initializer to form one
return array[0];
}
```
In the above example, the output value of the function will not be
constrained by the variable ``array``. But the weak relationship from
the ``array`` variable to ``input`` parameter, will be inherited by the
return value, giving the following prototype:
```d
int* transformation(/* has relationship to return */ int* input);
```
Setting up a relationship within a function body is one thing, but to do
it within a function signature? That is much harder. In the following
example, a weak relationship is established where an input pointer is
stored inside an output static array.
```d
void assignElement(ref int*[2] output, /* has a relationship to output
*/ int* input) {
output[0] = input;
output[1] = new int;
}
```
The previous examples in this heading used a raw pointer ``int*`` to
establish relationships, the full list of types that affect the
formation of relationships are:
- Slices: ``T[]``
- Raw pointers: ``T*``
- Associative arrays: ``T[U]``
- Pointer-containing fields or elements:
- Structs
- Unions
- Static arrays
- Tuples
- Any by-ref input parameter or variable will have an implicitly strong
relationship to its output if it is also by-ref.
Other types, that behave as a value type like ``int`` are unaffected and
do not establish a relationship. This means that they may be interacted
with without establishing a relationship.
### Escape set
Previously this proposal has limited the terminology to establishing a
relationship between a given input to its outputs. In this heading the
method for describing this relationship in code is presented.
A new attribute is provided, ``@escape(...)``, within the brackets an
escape set is provided using the following identifiers as elements
within it:
- Nothing
- ``return``
- ``this``, also applies to the context pointer of a delegate.
- ``__unknown``, for exceptions, and globals.
- ``__parameters``, for all parameters, except the current one.
- Any function parameter names.
This attribute may be placed on function parameters and on the function
which is represents the ``this`` pointer.
When the attribute is missing its escape set, it defaults the escape set
to ``@escape(return, this, __unknown, __parameters)``. When the
annotation is missing, this will indicate that it is to be inferred.
```d
alias D = T delegate(@escape U input1, @escape(return) V input2)
@escape(return);
T freeFunction(@escape U input1, @escape(return) V input2);
class C {
T method(@escape U input1, @escape(return) V input2) @escape(return);
}
```
The escape set only applies to types that are pointers. Non-pointers do
not have an escape set, and therefore no ``@escape`` attribute. Function
parameters that are non-pointers will have their attribute removed if it
is specified. The ``this`` pointer is always a pointer type, even for
structs.
For easier reading the empty escape set may be elided for variables that
are marked ``scope``. A non-empty escape set must remain on the variable
and cannot be elided.
```d
void doSomething(@escape() scope int* input);
```
Will become:
```d
void doSomething(scope int* input);
```
When a function pointer or delegate does not have an annotation for an
escape set, it is assumed to be the empty set. This is a safe assumption
thanks to the type system enforcing it.
```d
alias F = int* function(int*);
int* identity(@escape(return) int* input){
return input;
}
F func = &identity; // Error: Variable `func` has type `int*
function(int*)` and cannot be assigned `int* function(@escape(return) int*)`
```
Functions that are ``@trusted`` have their function signatures inferred
for escapes, but will not error within the body or when the body does
not match the signature. For ``@safe`` functions these are inferred but
will error within the body and when the signature does not match the
body. Lastly ``@system`` functions will not be analysed for escapes and
any annotation of escapes upon its signature will be ignored.
The compiler has no way to assume what an escape set contains for a
function declaration without a body. To verify it there is an implicit
assumption that the linker will catch it by comparing symbol names with
the help of mangling. To prevent accidental assumptions creeping into
``@safe`` code, any function without a body that is not fully annotated
for the escape sets, will be downgraded to ``@system``. The following
function declaration would be treated as if it wasn't annotated as
``@safe``.
```d
int* someFunction(int*, @escape() int*) @safe;
```
But this will be ``@safe``:
```d
int* someFunction(scope int*, @escape() int*) @safe;
```
Not all ABI's support name mangling of escape sets. By taking the
responsibility of escape annotation requirement off the linker, this
guarantees the compiler is able to provide stronger guarantees for
memory safety analysis without the linker providing a backdoor using
innocuous looking code.
When ``scope`` is placed upon a variable, it requires that when a
variable is converged to not escape into unknown locations. This means
that ``__unknown`` is not allowed to appear in the escape set. This also
applies when a weak relationship parameter is upgraded to strong by the
argument.
```d
void func1(@escape(__unknown) scope int* ptr); // Error the parameter
`ptr` cannot have an escape set that includes `__unknown` and be marked
as having a strong relationship `scope`
void func2(@escape(__unknown) int* ptr);
scope int* ptr;
func2(ptr); // Error variable `ptr` has a strong relation and cannot be
escaped out through a `__unknown` parameter
```
Overriden methods in classes must have an escape set per parameter that
is less than or equal to the parent method's set.
```d
class Parent {
int* method() @escape(return);
}
class Child : Parent {
override int* method() @escape(return, __unknown); // Error: the escape
set for the `this` pointer on `method` must be equal or lesser than the
parent which is `return` not `return, __unknown`
}
```
#### An Input Changed its Known Value
Sometimes an argument will have its value changed from the input. This
is quite important for by-ref parameters who may have its value being
tracked. To indicate to the compiler that it should not consider the
value prior to a call is the same as the one after, the attribute
``@move`` on a parameter will indicate it will have changed. Common
functions that demonstrate this behavior are ``swap`` and ``move``.
```d
T move(T)(@move @escape(return) T input) {
return input;
}
```
At most one escape in the escape set of a parameter, to an output that
has only one input may be used to allow the compiler to track movement
of a given value between function calls.
```d
void swap(T)(@move @escape(input2) ref T input1,
@move @escape(input1) ref T input2) {
T temp = input1;
input1 = input2;
input2 = temp;
}
int* a, b;
swap(a, b);
// Compiler can see that b is in a
// Compiler can see that a is in b
```
All ``out`` parameters will have ``@move`` applied to it automatically
and need not be programmer applied.
If the ``@move`` attribute is applied to a parameter that is not by-ref,
templated or the parameter type does not have move constructors it is an
error.
As an attribute ``@move`` may be inferred if the compiler can see that
the input was changed for a given parameter at the end of the called
function's body.
```d
struct Unique {
int* ptr;
this(/*@move*/ ref Unique other) {
this.ptr = other.ptr;
other = Unique.init; // the input into `other` was changed
}
}
```
#### Analysis
The goal of escape analysis, is to have an accurate accounting of where
inputs go to their outputs and how to converge it between scopes. It
provides protection from false assumptions on lifetimes creeping into
``@safe`` code.
An example of two scopes, whereupon assignment resets the escape set of
an inner variable:
```d
int* outer;
{
int* inner = ...;
outer = inner;
// @escape(outer) inner
// Converge `outer` with any owners of `inner` lifetimes
inner = ...;
// @escape() inner
}
```
When converging on multiple sets instead of taking the minimum set and
erroring, the analysis will take the maximum set of all the scopes:
```d
int* func(int* input) {
if (input is null) {
return new int;
// @escape() input
} else {
return input;
// @escape(return) input
}
// @escape(return) input
}
```
Elements in an array, fields in a class/struct/union are conflated with
the variable that stores them in. Supporting awareness and the
differentiation of each of these cases is not included in this proposal
but a subset coudl be done.
```d
struct S {
int* field;
}
void handle(int* ptr) {
S s;
s.field = ptr;
// @escape(s) ptr
}
```
The point of convergence matters for lifetime analysis. It occurs like
regular function destructor cleanup for a given scope. It happens in
reverse order of the declarations. This has consequences, it allows a
variable that has a strong relationship, to grow its escape set during
its scope, but be a lot smaller at the end.
```d
struct S {
int* field;
}
int* acquire(ref S s) @safe {
return s.field;
}
void caller() @safe {
int x = 2;
S s = S(&x);
*acquire(s) = 3;
}
```
Is equivalent to:
```d
struct S {
int* field;
this(@escape(this) int* field) @safe {
this.field = field;
}
}
int* acquire(@escape(return) ref S s) @safe {
return s.field;
}
void caller() @safe {
int x = 2;
scope xPtr = &x;
// @escape(xPtr) x, escape set cannot grow
S s = S(xPtr);
// @escape(s) xPtr
int* fooReturn = acquire(s);
// @escape(fooReturn) s
*fooReturn = 3;
__cleanup(fooReturn); // Cleanup code from compiler such as
destructors get injected here
// @escape() s
__cleanup(s); // Cleanup code from compiler such as destructors get
injected here
// @escape() xPtr
__cleanup(xPtr); // Cleanup code from compiler such as destructors
get injected here
// @escape() x
__cleanup(x); // Cleanup code from compiler such as destructors get
injected here
// x escape set is empty, therefore ok
}
```
### Owner Escape Analysis
Seeing what variable contributes to another (or becomes), is one thing,
but that does not provide guarantees in of itself. For guarantees to be
made the relationship between variables must be made inversely. This
inverse relationship describes an output variable as being a borrow to
one or more owner input variables.
To establish a borrow, a variable must have one or more relationships to
it that are strong.
```d
int owner;
int* borrowed = &owner;
// `borrowed` has a strong relationship to `owner`
```
Function calls:
```d
int* identity(/*@escape(return)*/ int* input) {
return input;
}
int owner;
int* borrowed = identity(&owner); // Due to `&owner` `borrowed` has a
strong relationship to `owner`
```
Borrowed memory is only ever valid, as long as the owners are not
mutated. Mutation of the owners could unmap the borrowed memory, or
change it in such a way that the program becomes corrupted. When a
borrow is seen, the compiler protects the owner from mutation by
requiring it to be "effectively const" as long as borrows exist. It
cannot be assigned to, or be passed to methods or functions mutably.
```d
struct Top {
int* field;
}
void func(ref Top owner) @safe {
int* field = owner.field;
// owner is now effectively const, it cannot be mutated
owner = Top.init; // Error: The variable `owner` has a borrow and
cannot be mutated
owner.field = null; // Error: The variable `owner` has a borrow and
cannot be mutated
if (field !is null) {
writeln(*field);
*field = 2; // ok, fully mutable
}
}
```
When converging between multiple scopes, the borrowed variables must
have the same value in it.
```d
int owner;
int* borrowed;
if (random() > 0.5) {
borrowed = &owner;
} else {
}
// Error: Variable `borrowed` has two different values in it, it can be
owned by `owner` and be null
```
Side effects from method calls must be prevented, otherwise it will be
possible to invalidate a borrow unknowingly. An existing language
element for this is for checking against mutability, whereby mutable is
disallowed but non-mutable allowed.
```d
struct S {
int field;
@safe:
bool isNull() const {
return false;
}
void makeNull() {
}
}
S s;
int* field = &s.field;
writeln(s.isNull); // ok
s.makeNull(); // Error: Variable `s` has a borrow and may not be mutated
by calling `makeNull`.
```
The attribute ``@move`` indicates that a function call will mutate the
input, and therefore if there are borrows from that variable to error.
```d
void someConsumer(@move scope ref int* input);
int* owner = ...;
int** borrowed = &owner;
someConsumer(owner); // Error: Variable `owner` has a borrow and cannot
be moved into the parameter as it would invalidate the borrows
```
#### Global Variables
Not all variables can be tracked throughout a program's lifecycle.
Global variables including those in thread local storage, can appear in
any point in the call stack multiple times. Pinning of specific values
into existance cannot occur for a global for this reason. It can be
changed out from under you with no way to prevent it in ``@safe`` code.
Loading a value that is a pointer (including structs with pointer
fields), into another will apply a flag onto that variable to say it
contains global memory. This corresponds with the ``__unknown``
relationship argument.
```d
int* global;
void func() {
int* ptr = global;
// is a global `ptr`
}
```
This is useful information to have, as it informs any memory that tries
to contribute to it, that it will be escaped out through ``__unknown``
lifetime.
```d
int** global;
void func() {
int** globalPtr = global;
// is a global `globalPtr`
int value;
int* ptr = &value;
// Variable `ptr` is owned by the stack
*globalPtr = ptr; // Error: variable `ptr` which has a shorter lifetime
cannot be placed into globally accessible memory in `globalPtr`
}
```
It isn't limited to a single call frame, it can protect against
cross-function scopes as well.
```d
int** global;
void caller() {
int** globalPtr = global;
// is a global `globalPtr`
int value;
int* ptr = &value;
// Variable `ptr` is owned by the stack
called(globalPtr, ptr); // Error: Variable `ptr` which is owned by the
stack would escape into a longer lifetime memory that is globally
accessible `globalPtr`
}
void called(@escape() int** globalPtr, @escape(globalPtr) int* ptr) {
*globalPtr = ptr;
}
```
### Removal of Existing Language Elements
The language design elements that are being removed are DIP1000 and
``@live``. Together these attempted to do this proposal but in a
non-integrated way that has shown minimal adoption.
```diff
Attribute:
- return
AtAttribute:
- @ live
FuncAttr:
- FuncAttrReturn
- FuncAttrLive
- FuncAttrReturn:
- Nj
- FuncAttrLive:
- Nm
```
No timeline is specified for removal.
## Breaking Changes and Deprecations
DIP1000 will not be able to be turned on at the same time as this proposal.
Any syntax specific (such as ``return`` attribute) to DIP1000 will break.
Any new semantic analysis would only cause errors to be applied to a new
edition and would not affect the base D2 language.
During the transition period from DIP1000 to this proposal, the
attributes from each proposal that is not active do not contribute to
mangling. This enables attributes from each proposal to live side by
side to keep a code base compiling.
## Reference
- [Shape
Analysis](https://en.wikipedia.org/wiki/Shape_analysis_(program_analysis))
(type state & memory escapes)
- [@system variables
DIP](https://github.com/dlang/DIPs/blob/master/DIPs/accepted/DIP1035.md)
- [# Compositional pointer and escape analysis for Java
programs](https://dl.acm.org/doi/10.1145/320385.320400)
- [4.1. What is
Ownership?](https://doc.rust-lang.org/book/ch04-01-what-is-ownership.html)
- [4.2. References and
Borrowing](https://doc.rust-lang.org/book/ch04-02-references-and-borrowing.html)
## Copyright & License
Copyright (c) 2024 by the D Language Foundation
Licensed under [Creative Commons Zero
1.0](https://creativecommons.org/publicdomain/zero/1.0/legalcode.txt)
## History
The DIP Manager will supplement this section with links to forum
discussions and a summary of the formal assessment.
More information about the dip.ideas
mailing list