Dataflow analysis requirements document
Richard Andrew Cattermole (Rikki)
richard at cattermole.co.nz
Sun Jun 15 17:48:11 UTC 2025
A few months ago I wrote up a bit about what I thought the user
experience of a DFA engine tied to the language should be. It was
originally meant for Walter, and he has read it; I am now making
it public, which was always intended. It was presented at a
monthly meeting.
Since then I have been working on a DFA engine described in this
document as the fast DFA.
Currently, its scope of work is focused on truthiness and
nullability, but should it succeed in this role more features
would be added.
It has been approved for merging once it is ready, under the
condition that it is self-contained in its implementation, so no
attributes. It will be behind a preview switch. As experimental
code, it can be expected to fail on code that isn't my own.
For those who have not been watching the progress on Discord here
are a few snippets that currently work:
```d
void loopy7()
{
    int* ptr = new int;

    foreach (i; 0 .. 2)
    {
        if (ptr !is null)
            int val1 = *ptr; // ok
        ptr = null;
    }

    ptr = new int;

    foreach (i; 0 .. 2)
    {
        if (ptr !is null)
            int val1 = *ptr; // ok
        int val2 = *ptr; // error
        ptr = null;
    } // See less than state at testdfa\start.d(262)
      // Due to variable `ptr` was non-null and becomes null
}

void loopy8()
{
    int* ptr = new int;
    bool b = ptr !is null;

    foreach (i; 0 .. 2)
    {
        assert(b); // outdated consequences
        ptr = null; // error
    } // Variable `ptr` was assumed due to assert of variable `b`
      // which gets invalidated at testdfa\start.d(272)
}

void nested1()
{
    static void nested2()
    {
        int* ptr;
        int v = *ptr; // Dereference on null `ptr` at testdfa\start.d(280)
    }

    int* ptr;

    void nested3()
    {
        int v = *ptr;
    }

    nested2;
    nested3;
}
```
Function calls and inference are not working just yet, and my big
codebase isn't compiling with unknown AST detection turned on.
Here is the document:
# Data Flow Analysis Requirements
# Description
In the following sections, examples are given of requirements
that new data flow analysis-based language features should
attempt to satisfy for the D programming language. The examples
are not exhaustive, nor do they introduce syntax that must be
used by future proposals.
An overview of the following sections:
- Attributes
- Locally provable: annotations are not required for analysis to
be beneficial to all D users.
- Attributes extractable: attributes complicate matters when
inferred.
- Explicit attributes: are useful even for those who don't like
using them.
- Inference of attributes results in less: when attributes are
inferred, fewer of them will appear in your code.
- Prove what is true or guarantee it: attributes come in one of
two forms; either something was proven by inference, or the
guarantee was requested by the user.
- Defaults matter: if you need to opt into catching obvious
things, it isn't right. It does not need to catch everything to
be useful.
- Absence means it was not analyzed: the distinction between an
absent attribute and a provided attribute is useful in itself.
- All or nothing, or something in between: sometimes a feature
should be all on or all off, other times it makes sense to make
it selective.
- Type state analysis: not all operations should be allowed on a
variable, dereferencing a null pointer is inherently incorrect
behaviour that can have significant side effects in the usage of
D.
- Preventing aliasing: the difficulty in aliasing and how
ownership transfer with reference counting offers useful
guarantees.
- Escape analysis: establishes relationships between variables
that enable applying restrictions.
- Relationship strength: a relationship alone doesn't tell you
what restrictions should be placed upon it, they must be
specified in addition.
- Extra information: once established relationships allow for
propagating other information such as an output holds GC memory.
## Attributes
### Locally Provable
With information only available local to a function, there are
plenty of things that can be checked for without the use of
annotations.
For example, we can tell that the variable ``ptr`` is never
assigned to and will be ``null``. Yet in the example it is
dereferenced. This is a clear bug: the dereference can never be
valid.
```d
int* ptr;
*ptr = 2;
```
Catching this may seem like a waste of the compiler's time;
however, this is a very simple case. Now consider a 200-line
function with branches. How certain are you that you'll get it
right without 100% code coverage?
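A hedged sketch of how such a bug can hide in branchy code (the
names and the diagnostic comment are invented for illustration):

```d
void branchy(bool which)
{
    int* ptr; // starts null

    if (which)
        ptr = new int;
    // the else path was forgotten; on it, `ptr` stays null

    int v = *ptr; // a local DFA can flag this: `ptr` may be null here
}
```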
A benefit to locally provable things is that at no point were
annotations required. But it doesn't end at a simple non-null
check, it can work for other type states!
In the following example, the variable ``buffer`` is not
initialized, which means it cannot be read from. At no point did
we need to understand what ``writeln`` does; the argument
expression alone tells us the read is an error.
```d
ubyte[24] buffer = void;
writeln(buffer[22]);
```
### Attributes Extractable
Inferred attributes do not assist separate compilation; instead,
they cause numerous issues and can potentially slow down a build.
Without attributes, whole-program analysis would be necessary,
which is not possible with separate compilation.
Therefore DFA attributes must:
1. Not be part of the mangling.
2. Be extractable and passable into a new phase of compilation
to apply.
If an attribute is part of mangling, it would take only one
aspect of it being wrong for a symbol to no longer link. This may
appear to be a feature, but it does not encourage usage. It could
mean that inference silently causes breakage without any obvious
reason in the source.
Once inferred, an attribute must be extractable and then passable
to the next phase of compilation. Without this, both shared and
static libraries (where the source is not available) cannot work
with DFA-based language features without being explicitly
annotated in their entirety.
For shared and static libraries where the entire source code is
available, it matters not that they are in such artifacts. But it
does require that import paths be run through the DFA as well as
the source code being compiled which will slow things down.
To extract the inferred attributes, the .di generator could be
used. However, it currently runs before semantic analysis, making
it incapable of doing this. Furthermore, the rewriting of D
features has no path back through the parse tree, nor can every
rewrite be fully expressed in D (comma expressions and anonymous
classes).
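As a purely hypothetical sketch, extraction could look like a .di
interface file in which inferred attributes have been made
explicit; the output shown is illustrative, not something dmd
produces today:

```d
// original source: the user wrote no annotations
int* identity(int* ptr) { return ptr; }

// hypothetical generated interface file with the inferred
// attributes applied, ready for a later phase of compilation
int* identity(return scope int* ptr);
```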
### Explicit Attributes
Regardless of what attributes the compiler can prove to exist,
there are times when you want to make stronger guarantees to
callers, or yourself.
In the following example, the two functions are identical, except
that the latter guarantees to the caller that the parameter will
not be escaped. Naturally, with inference, both would be
annotated as the latter.
```d
void unknownEscaping(int* ptr) {
    *ptr = 99;
}

void nonEscapingExplicit(scope int* ptr) {
    *ptr = 99;
}
```
This may be needed to assist separate compilation, to give
guarantees to callers, or where inference failed (or was not
attempted due to language features like ``@system``).
### Inference Of Attributes Results In Less
Both attributes and their inference have some consequences, but
they also offer some nice benefits.
For example, in a previous example, we showed that ``scope``
being on a parameter for a given function could be inferred.
```d
void unknownEscaping(int* ptr) {
    *ptr = 99;
}
```
```
The compiler in this case could prove that the variable ``ptr``
doesn't escape the function. So the parameter should be
``scope``. But why didn't the user write it? Do they not know
about escape analysis? Do they not want attributes in their code?
It doesn't matter why, it wasn't written, nor should it be
expected of a human being to write what the compiler can prove.
Following on from the earlier mention of separate compilation and
how attributes must be extractable, being able to extract the
inferred attributes means that only the undeterminable ones
(i.e. when function pointers are involved) would need to be
written by hand.
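A sketch of the kind of case where inference cannot determine the
attribute and a human would need to write it; the names here are
invented, and the written ``scope`` stands in for the user's
guarantee:

```d
void takesCallback(scope int* ptr, void function(int*) callback)
{
    // the compiler cannot see into `callback`, so it cannot
    // prove whether `ptr` escapes; the user-written `scope`
    // supplies the guarantee instead of inference
    callback(ptr);
}
```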
### Prove What Is True Or Guarantee It
When a user of D writes an attribute on code that is being
compiled, it is because they want to guarantee that this
behaviour holds. However, that may not be the complete story for
the analysis of that aspect of their code.
For example, the parameter ``input`` is marked ``scope`` in
``heapOrStack``, yet this parameter gets escaped into heap
memory. Should this cause an error? No!
```d
void heapOrStack(scope int* input) {
    int** heap = new int*;
    *heap = input;
}
```
In this case, only local information is required to determine
that the ``heap`` allocation should be promoted to the stack and
limited in its lifetime. What this indicates is that growth of an
escape set alone does not mean memory escapes. Escape analysis
must consider the escape sets after convergence, not during it.
Furthermore, if the user never had written the ``scope`` on the
parameter ``input``, the compiler could still prove in this case
that ``input`` never escaped. The difference between having
written it and not, is if the user wanted the guarantee that it
does not, and if the compiler can prove it.
```d
void stillNoEscapes(int* input) {
    int** heap = new int*;
    *heap = input;
}
```
If the user needs to write an attribute when function pointers
are not in play, it is either a bug or the DFA cannot model the
problem (which may not be a bug). Writing an attribute should be
tuned so that those who want the _guarantee_ of that attribute
can do so, but the average person does not _need_ to.
### Defaults Matter
Whatever analysis is applied to a function by default (leaving
editions aside), if it does not cover the things that can be
proven locally to a function with 100% certainty to be incorrect
behaviour, then the default is not correct.
Currently, the default in D is ``@system``, with checks being
opt-in. Given the previous statement, this is not the correct
default.
However within "correct default" is a fairly large range of
behaviors. Changing the default to be ``@safe`` or an
intermediary falls within this.
Local information may not be the only verifiable information in a
function. If the required information is available (either
inferred or explicitly attributed), then errors should be emitted
when incorrect code is detected. But this isn't the complete
story: without this information, and without a DFA capable of a
complete control flow graph, it is not possible to prove anything
about a variable after it has become unanalyzable.
This leads to an important property of the default DFA: it must
have the ability to give up on a variable. This prevents the
requirement that all D code be annotated. As a consequence, it
has the potential to be significantly faster than if it had full
control flow graph support.
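A sketch of giving up: once a variable has been passed to an
unanalyzed function, the default DFA stops tracking it rather
than demanding annotations (``unknownFunction`` is an invented
name, and the comments describe hypothetical behaviour):

```d
void unknownFunction(int** ptr); // no body, no attributes: unanalyzed

void givesUp()
{
    int* ptr = new int;
    int v1 = *ptr; // ok, provably non-null locally

    unknownFunction(&ptr);
    // `ptr` is now unanalyzable; the default DFA gives up on it
    int v2 = *ptr; // no error emitted: not proven wrong, just unknown
}
```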
### Absence Means It Was Not Analyzed
If an attribute exists, it is something that has been, or will
be, proven to be true if compilation succeeds. If an attribute
has not been provided, its absence must imply that it could not
be analyzed.
Consider the ``scope`` attribute. It tells the compiler that this
variable, when converged successfully at the end of its scope,
will have an empty escape set.
This gives us the annotation ``@escape()``.
However, this is not enough; we still need a way of
differentiating between a variable potentially escaping into an
unknown location and a variable that hasn't been analyzed.
An annotation that could be used for this is
``@escape(__unknown)``, note how this annotation is functioning
as a set. This gives us three states a variable can be in:
1. No set
2. Set is empty
3. Set contains ``__unknown``
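The three states might be written as follows (the annotation
syntax is this document's proposal, not existing D):

```d
// 1. No set: the parameter was never analyzed
void notAnalyzed(int* ptr);

// 2. Set is empty: proven (or guaranteed) not to escape
void doesNotEscape(@escape() int* ptr);

// 3. Set contains __unknown: escapes somewhere unanalyzable
void escapesSomewhere(@escape(__unknown) int* ptr);
```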
An example of the usefulness of this three-state positive
annotation approach can be seen with ``printf``. As a non-D
function, it would not be analyzed by a D DFA; it would need to
be explicitly attributed to be callable.
In this example, the default DFA, as per the ``Defaults Matter``
section, would give up on and not emit errors for function calls
that are not attributed, allowing the call to ``printf`` to work
in the ``absentAnnotations`` function.
```d
extern(C) int printf(const char*, ...) @trusted;

void absentAnnotations() @safe {
    int* ptr;
    printf("something %p\n", ptr);
}
```
If either the absence meant something (such as
``@escape(__unknown)``) or the default DFA did not give up, this
code would not compile.
## All or Nothing, or Something in Between
The invasiveness of a language feature greatly impacts its usage.
Take ``@live`` as an example: three out of the 65 respondents to
the ``memory safety in D what is your view?`` survey said that
they use it.
As a feature, ``@live`` is opt-in on a per-function basis, with
no transitive guarantees and no verification that it is applied
to a whole graph of functions. This is not what the prevention of
aliasing requires.
Prevention of aliasing, whether for lifetimes or for code
generation purposes, is object-specific behaviour, not
function-specific. In both use cases, it would need to be
specified per variable and transitively applied to any location
the object is transferred into. Not applying it transitively
breaks the guarantee in any use case where lifetimes or code
generation are concerned, leading to minimal adoption.
As a third option, such a feature could be applied to all
variables in every function; as an ownership transfer prevention
mechanism, this would cause great breakage in D.
What this demonstrates is for each DFA-based language feature, a
choice has to be made as to its invasiveness. Some features may
be best turned on for everything, or a subset of functions.
Others benefit from being on per object in what guarantees they
offer.
## Type State Analysis
Each object, and each variable it is stored within, has a type
state that dictates what accesses and manipulations can be
performed on it.
For example, in ``@safe`` code, it should not be possible to read
from an uninitialized variable.
```d
int var = void;
writeln(var); // Error
```
Or to dereference a null pointer in any function:
```d
int* ptr;
writeln(*ptr); // Error
```
To prevent this each variable is given a type state property,
that has one of the following built-in type states (although
there may be user-defined ones):
1. Unreachable: you cannot do anything with it.
2. Reachable: can be written to, which will increase its type
state. May be explicitly opted into with ``= void``.
3. Initialized: can be read or written. This is the default in D.
4. Non-null: can be dereferenced.
Type state is affected by scope, so that checks can change the
type state seen within:
```d
int* ptr = ...;
// `ptr` is in the initialized type state, not non-null (unknown nullability)

if (ptr !is null) {
    int v = *ptr; // ok
} else {
    int v = *ptr; // Error: variable `ptr` is null, it cannot be dereferenced.
}
```
When variables converge the lowest type state wins.
```d
int* ptr = ...;
// `ptr` is in the initialized type state, not non-null (unknown nullability)

if (ptr !is null) {
}

int v = *ptr; // Error: variable `ptr` could be null, it cannot be dereferenced.
```
Type state analysis utilises variable tracking, not value
tracking, to determine the type state. Value tracking may,
however, be used to increase the type state of all variables that
contain a given object within a scope.
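A sketch of the difference: with variable tracking alone, only
the checked variable's type state rises; value tracking would let
the compiler raise every variable holding the same object (the
diagnostics in the comments are hypothetical):

```d
int* a = ...;
int* b = a; // same object, two variables

if (a !is null)
{
    int v = *a; // ok: `a` is non-null here
    int w = *b; // variable tracking alone: `b` is still only initialized;
                // value tracking could prove `b` holds the same object
}
```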
When used, it acts as a first line of defence against program,
data, and operating system corruption-level events, as well as
the introduction of unsafe logical behaviour. Its job is to
prevent these events from happening, not to stop them making
things worse once they have occurred. Without this capability,
especially in a multithreaded environment, it is too easy to
corrupt the data the program is operating on, or to trigger
operating system level behaviour that prevents further use of the
program or machine. Where uptime matters, unrequested program
shutdown is a major risk to a company's ability to use D.
Unknown locations (variables outside of a function) should be
assumed to have the initialized type state, even after a null
check. The reason is that, in between the check and a later
access, the value could have changed.
```d
__gshared int* global;
// thread 1
assert(global !is null);
// thread 2
global = null;
// thread 1
int v = *global; // Null dereference!!!
```
The way to resolve this, ideally with syntax sugar, is to perform
a load:
```d
if ((auto thing = global) !is null) {
    // *thing ok
}
```
The precision of type state analysis may be finer-grained than a
single variable. It may allow for some indirection, for instance
static arrays, structs, ``isolated``, ``immutable``, or tuples.
```d
int*[4] vals;

assert(vals[2] !is null);
int v = *vals[2]; // ok
int w = *vals[1]; // Error: variable `vals` entry 1 could be null, cannot dereference
```
## Preventing Aliasing
Problematic aliasing is when an object is in multiple variables
within a function without the static analyzer knowing this has
occurred. It is problematic for code generation, and the
resulting performance penalties may not be desired; therefore
prevention may be required.
To prevent aliasing, a language feature like ``isolated`` from
Midori may be used. The basic premise is that going into or out
of a function takes place with a transfer. A transfer is when the
old variable(s) are destroyed and the previous value is copied
into the next location.
There are a set of problems associated with preventing aliasing:
1. A transfer must be an atomic transaction, without necessarily
being an atomic instruction. This precludes a destructor, copy,
or move constructor call; see the compare-and-swap atomic
instruction for how this should be implemented when moving into
and out of unknown locations.
2. All function parameters must be assumed to alias.
``@restrict`` can be used to indicate that this is not the case,
but it is not implemented by dmd and there is no enforcement on
the caller side.
3. It is limited to types supported by the compare-and-swap
atomic instruction, which usually supports sizes of one or two
pointers. It is therefore limited to heap-allocated memory, or
stack memory that has had a pointer taken to it.
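Point 1 can be sketched using D's real ``core.atomic.cas``: a
transfer into an unknown (shared) location succeeds only if the
destination is empty, and the source is nulled out without any
destructor or copy hook running. The names ``unknownLocation``
and ``transferIn`` are invented for illustration:

```d
import core.atomic : cas;

shared(int*) unknownLocation;

bool transferIn(ref int* source)
{
    // atomically write `source`, but only if the destination
    // currently holds null
    if (cas(&unknownLocation, cast(int*) null, source))
    {
        source = null; // the old variable no longer holds the value
        return true;
    }
    return false; // destination was occupied, transfer refused
}
```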
To account for these requirements an ownership transfer mechanism
is required. This mechanism is annotated per object, which can be
done as a type qualifier (such as ``isolated(T)``).
When a transfer occurs, the old variable must be put into its
initialized type state; it must not hold a value. A transfer is
very similar to a move, but no function may be called on the old
variable in response, or to assist the move operation.
The compiler must guarantee that no matter how many variables an
object is within, it will be notified only once that its
end-of-life event has occurred; this is especially important when
using reference counting with isolated.
```d
struct S {
    void opRCInc();
    void opRCDec();
}

void takesIsolatedIn1(isolated(S*) input) {
    isolated(S*) temp = input;
    input = typeof(input).init;
    temp.opRCDec();
}

void takesIsolatedIn2(isolated(S*) input) {
    isolated(S*) temp = input;
    // elided as the object is in both `temp` and `input`
    // temp.opRCDec();
    input.opRCDec();
}
```
For this to function, it requires value tracking; both reference
counting and isolated depend on it. For reference counting, it is
used to elide pairs of increment and decrement calls. For added
optimization opportunities, when a function call occurs the
increment happens within the callee, not the caller.
```d
struct S {
    void opRCInc();
    void opRCDec();
}

// the pair of RC calls may be elided in this example
void rcCallee(S input) {
    input.opRCInc();
    input.opRCDec();
}

void rcCaller() {
    S s = ...;
    rcCallee(s);
    s.opRCDec();
}
```
```
Of note is that reference counting establishes safe access and
mutation of memory at runtime, rather than at compile time as
isolated does. When paired together, isolated guarantees the
reference count will only ever be one or zero, while reference
counting offers a hook for when that count reaches zero and
allows multiple references within a function body. Coupled with
escape analysis, this offers a flexible method of memory
management for data processing that enables the best code
generation with respect to aliasing.
Any access to a variable post-transfer (or post-move) should be
prevented, as it likely indicates a logical bug in the code.
However, if a non-null pointer check occurs, then access in that
branch can be allowed.
```d
isolated(int*) obj = new int;

foreach(_; 0 .. 10) {
    if (random() > 0.5) {
        transfer(obj); // uh oh! error
    } else {
        obj = new int;
    }

    if (obj !is null) {
        transfer(obj); // all ok!
    }
}
```
```
Alternatively, type state analysis can be used for a simpler
implementation, wherein a transfer puts all variables an object
is in into the reachable type state, preventing all reads.
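That simpler alternative could behave as follows; ``transfer`` is
an invented helper and the diagnostics in the comments are
hypothetical:

```d
isolated(int*) obj = new int;

transfer(obj);
// `obj` is now in the reachable type state: writable, not readable
int v = *obj;  // Error: cannot read `obj` after transfer
obj = new int; // ok: writing raises its type state again
```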
## Escape Analysis
Establishing a relationship between variables is a useful piece
of information to have. Each relationship, from an input to its
respective output, has a strength that determines what
restrictions are placed upon it.
These relationships form a kind of upside-down tree, which can be
seen in the following example:
```d
int* escapeThese(int* a, int* b, int* c, int* d) {
    int* middle1 = random() > 0.5 ? a : b;
    int* middle2 = random() > 0.5 ? c : d;
    int* bottom = random() > 0.5 ? middle1 : middle2;
    return bottom;
}
```
In that example, there is only one scope and no interesting
language features that would complicate what the data flow
analysis must support. What it is meant to show is how the tree
affects the cleanup process of a single scope.
Given that the DFA has been run forward, it is now time to do the
backwards cleanup at the end of the scope. It occurs in the
following order:
1. The return value, if not annotated, receives the ``bottom``
variable's state; otherwise the annotated state of the return
value must match or exceed that of the ``bottom`` variable.
2. ``bottom`` then has its state bubbled up to ``middle1`` and
``middle2``.
3. ``middle2`` then has its state bubbled up to ``c`` and ``d``.
4. ``middle1`` then has its state bubbled up to ``a`` and ``b``.
5. Finally ``a``, ``b``, ``c`` and ``d``: if not annotated, the
parameters receive the state from the variable; otherwise, the
annotated state of each parameter must match or exceed that of
the variable.
For multiple scopes, convergence takes place at the end of a
scope with its parent, or when other events occur, such as
returning from the function.
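A small sketch of convergence across scopes, in the spirit of the
earlier tree example (the comments describe the intended
analysis, not an existing implementation):

```d
int* escapeOne(int* a, int* b)
{
    int* result = a;

    if (random() > 0.5)
    {
        int* inner = b;
        result = inner;
    } // end of inner scope: `inner`'s state converges into `result`,
      // which now relates to both `a` and `b`

    return result; // returning converges `result` into the return value
}
```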
### Relationship Strength
An important piece of feedback from the survey ``memory safety
in D what is your view?`` was that the annotations DIP1000 uses
for escapes are too confusing.
So let's break down what parts there are to the annotation.
1. The escape set: defines the relationship between an input
variable to its outputs.
2. The relationship strength: what establishes the relationship
between input and output, and how that relationship should be
restricted.
Unfortunately, DIP1000 provides only a limited escape set: it
supports the empty escape set (``scope`` by itself), and
``return`` + ``scope`` means that the input escapes into either
the return value or the first parameter (which could be the this
pointer). There is no attribute for stating that it escapes into
the unknown; see ``Absence Means It Was Not Analyzed`` for why
this matters.
To make matters worse, the relationship strength is determined by
the order of the attributes ``return`` and ``scope``.
The syntax this document uses is of the form ``@escape(output
S, ...)``, where ``S`` is the strength.
At the minimum, the relationship strengths needed are:
1. Assignment: ``=``
2. Taking a pointer to the input and storing it in the output
(which could be a field); the input must outlive the output:
``&``
3. Protecting the input from mutation as long as the output
exists; the input must outlive the output: ``^``
When both the input and the output are by-ref, the relationship
defaults to taking a pointer (``&``); otherwise the default is
assignment (``=``). This default matters; see ``Defaults
Matter`` for why. The goal of having a default for the
relationship strength is to prevent the need for it to be
specified by the user.
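Putting the proposed pieces together, the three strengths might
be written as follows (this syntax is the document's proposal,
not existing D):

```d
// assignment strength: `ptr` may be stored into the return value
int* byAssignment(@escape(return =) int* ptr);

// pointer strength: a pointer to `val` is stored in the output,
// so `val` must outlive the return value
int* byPointer(@escape(return &) ref int val);

// protection strength: `val` must not be mutated while the
// output exists
const(int)* byProtection(@escape(return ^) ref int val);
```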
### Extra Information
Relationships between variables facilitate the transmission of
extra information that could be useful in other analyses. For
example, knowing that a variable contains GC memory with no
restrictions placed upon it means it can be escaped freely.
In the following example, the ``escapableGC`` function has one
parameter, and this parameter contributes to the return value.
There are no other parameters; therefore, if the argument to that
parameter is GC memory, so is the output.
```d
int* escapableGC(@escape(return) int* ptr);
int* someVal = escapableGC(new int);
return someVal; // perfectly safe as `someVal` is GC
```
While this tells us the behaviour of parameters, what if there
are none? Then we would need to annotate the given output with
this extra information.
```d
@gcout int* outGC() {
    return new int;
}
```
The purpose of this extra information is twofold. It allows code
that would otherwise be considered unsafe to be known to be safe.
And it helps to prevent bad behaviour without requiring
attributes to propagate down the call stack, which is very
invasive.
An example of catching bad behaviour would be preventing TLS
memory from crossing a coroutine's yield point. This helps to
prevent some coloring of functions.
```d
int tls;

@tlsout int* grab() {
    return &tls;
}

void co() @async {
    int* ptr = grab();

    @async return;

    int v = *ptr; // Error: the variable `ptr` contains TLS memory and cannot
                  // cross a coroutine yield point due to the possibility of
                  // being on a different thread.
    return;
}
```
# References
- [Report: Analysis of survey: Memory safety in D, what is your
view?](https://forum.dlang.org/post/kijxuuxgtyzciuhjanrq@forum.dlang.org)
- [What color is your
function](https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/)