Feedback Thread: DIP 1040--Copying, Moving, and Forwarding--Community Review Round 1

Sat Mar 6 20:28:08 UTC 2021

On Friday, 5 March 2021 at 12:20:36 UTC, Mike Parker wrote:
> This is the feedback thread for the first round of Community 
> Review of DIP 1040, "Copying, Moving, and Forwarding".

First, very good proposal (in fact, I sent something very similar 
to Andrei, Walter and Atila in 2019, and I hope this had some 
influence over this proposal). For completeness, I will paste 
this email at the end of this reply.

It cover some subtle but important point, notably:
  - The argument is invalid after this move, and is not destructed.
  - Moving an EMO means that the responsibility for destruction of 
that value moves along with it.

This is something I stressed out at the time, because it has 
profound consequences at the ABI level. It is unfortunate that 
this design goal - while achieved - has dropped from the 
rationale. To stress out how important it is, let me contrast 
with C++.

In C++, the caller remains in charge of the destruction of an 
object. This means that structs that are not trivially 
destructible, movable or copyable need to be passed by reference. 
This introduce many indirections in the generated code, but, 
worse, adds a ton of pressure on the alias analysis, which will 
prevent the optimizer to do its job properly.

The second downside is that structs always need to have a "null" 
state. They need to remain destructible, even after the callee 
move the object. This forces the callee to do extra work in order 
to set the moved object into a null state and extra work from the 
caller to check for the null state during destruction. A concrete 
example of this is for instance that a lock which locks a mutex 
in its constructor and unlocks it in its destructor now must 
check for the mutex being nulled in its destruction, introducing 
an unecessary branch in all path that unlock the mutex.

These are serious downside in the C++ way, and this proposal 
should make it explicit in its rationale that both problems must 
be avoided:
  - Ability to pass non POD as value even at the ABI level.
  - Ability to have movable structs without a "null" state.

Another aspect of the proposal that needs improvement is the 
"Last Use" section. the description of this last use is 
incomplete - and maybe doesn't adopt the right approach, because 
listing all scenarii is always going to be a challenge. One such 
scenario is the case of divergent branches:

S s;
if (...) {
     // Even though it is
     fun(s);
} else {
     // ...
}

The way to cover all base is to go for a fixed point analysis. A 
fixed point analysis approach would go ass follow: define a state 
for all local variable, then follow the control flow updating the 
state of said variable. When a merge happen in the control flow, 
verify that both both path through which you arrive at the merge 
point have a different state, then apply a fix and reprocess 
affected branches of the control flow, up to the point all merge 
point are consistent - you reached a fixed point.

Adopting such an approach will ensure the behavior is not only 
defined for current control flow constructs, but also whatever 
might come up in the future. In the pasted email, I go over in 
more detail about what a fixed point analysis could look like for 
this. While it is likely not formal enough to make it into the 
DIP as this, it makes for a good starting point, IMO.

======

Hi both of you,

This is a topic I discussed with Atila and Andrei informally, and 
now I'd to loop you in Walter. If this gains traction, it'll be 
time for a DIP, but I'd really like to gather some feedback 
before, as putting everything in a format that is formal enough 
is seriously time consuming.

As I assume Walter already knows, C++ copy/destruction semantic 
is actually quite an important barrier to optimization. For the 
sake of making the point, I'm going to use unique_ptr and 
shared_ptr as example, and what we imagine equivalent in D would 
look like. We will also consider that the compiler inline 
ctor/dtor and copies.

Let's consider this sample code:

void foo(unique<T> ptr):

unique<T> t = ...;
foo(move(t));

The first thing we consider s that unique<T>, as well as 
shared<T> are not PODs. This is rather obvious, as this is the 
whole point of having them, if we wanted a POD, we'd use a raw 
pointer. In C++, non POD must have an address. in D, they must 
have an address for the ctor to run, but after this, the compiler 
is free to do whatever it wants as D struct are movable by 
default.

An interesting side effect of that fact is that non POD are 
ALWAYS passed by reference in C++, never by value. The compiler 
creates a temporary in the caller in which the copy is stored. 
The code is lowered to something that look like this:

foo(T** ptrref);

T* t = ...;
T* tmp = nullptr;
swap(t, tmp);
foo(&tmp);
if (tmp != null) destroy(tmp);
if (t != null) destroy(t);

The compiler will have no problem figuring out that t is always 
null, and therefore optimize away its destruction, but the tmp 
one remains. A second order problem is that EVERY struct must 
have some sort of 'null' state. While this is fairly obvious what 
it is for a pointer (!) it is the source of many trap when it 
comes to correctness for others types of structs. Imagine a lock, 
for instance, it now has to check every time if the mutex it 
refers to has been nulled or not, introducing a branch every time 
you unlock the mutex.

Now immagine that foo does something like:

void foo(unique<T> ptr) {
   global = move(ptr);
}

Now foo end up having to go through an indirection to get the 
value of ptr, which right there is 3 cycles at best, a cache miss 
at worse. But worse, it MUST store null into ptr that the caller 
will then dutifully check is null and decide to not destroy 
anything. t will do so in 100% of the case, yet the compiler is 
unable to do anything to optimize it away. In addition, store to 
load forwarding has some bizarre perf edge cases on many CPUs 
(that is, loading an address that is still in the store buffer 
and hasn't hit cache yet) so it is something you'd like to see 
optimized.

If we replace the unique<T> by a shared<T> in the example, it 
gets worse because the increment/decrement operation cannot be 
optimized away at all. This result in code where you got to 
sprinkle calls to move all over the place constantly and it's a 
guarantee that you'll forget to put it sometime and end up with a 
ton of unexpected copies.

Here is what I propose.

1/ The ABI for non POD is the exact same as for POD. it means 
that the responsibility of the destruction falls on the the foo 
method in our example, not on the caller. This means it can be 
passed in registers in the case of unique ptr. The obvious gotcha 
is that the adress of the this pointer will be different in foo 
that it was when the copy constructor was called. D provides no 
guarantee on that front anyways, so I think we are good on that 
front.

Another side effect is that we lose the "russian doll" effect of 
construction destruction when things are moved to another 
function. However, by specifying that the copy need to be made in 
LTR order for parameters, and destruction from the last parameter 
to the first, we end up with this remaining true almost all the 
time, which is probably the right tradeof.

In our example, it means that foo can receive the unique<T> in a 
register, then the compiler can see that the value when we exit 
the function is always null and can optimize away the destruction.

2/ Decide when to copy using the following algorithm:

We go over the function body, keeping track of a status per 
lvalue that exist: destroyed, alive or consumed. All parameters, 
globals, or value reached through indirections are considered 
initially alive. All locals are considered destroyed initially, 
then alive once the constructor as run/initial value is set.

When an lvalue is converted into an rvalue, for instance when 
returned, when passed as argument, when assigned to another 
lvalue, etc... the following happens:
  - if it is alive, it is marked consumed.
  - if it is consumed, then we backtrack to the point at which it 
was consumed and insert a copy operation. The lvalue is 
reconsidered alive after the copy, the the algorithm restart from 
there.
  - if it is destroyed, then this is an error (this should 
basically never happen and is likely a compiler bug).

When an lvalue needs to be destroyed, the compiler does as follow:
  - if it is alive, call the desructor.
  - if it is consumed, do nothing.
  - if it is already destroyed, then once again, you probably have 
a compiler bug.

This is relatively straightforward, but there is one more tricky 
situation, is when you reach a "merge" point i the control flow, 
for instance, after an if/then/else construct. In this case, a 
variable can have different state for each codepath. The 
resolution happens as follow:
  - alive/alive : all good.
  - alive/destroyed: error. Some codepath did not initialize the 
variable or you have  compiler bug.
  - alive/consume: backtrack the consume path and insert copies. 
The lvalue marked alive.
  - consume/destroyed: Mark destroyed.

When an lvalue is assigned to, it's previous value is move to a 
temporary and then destroyed immediately after the assignment 
finishes. The temporary also keeps the state so destruction only 
occurs when the variable was alive.

All rvalues must be immediately returned, assigned, passed as 
argument or destroyed by calling the constructor.

Existing optimizations such as RVO naturaly arise out of this 
process, but this is much more generic and powerful, as it 
generate the minimal number of copies possible assuming structs 
are movable, which is what we want.

In case of loops, the state tracked by the algorithm at the end 
of the loop can affect the merge at the top of the loop, so you 
may have to pass over the loop twice. Same wise, backtracking may 
cause you to have to go over some code twice, but if you play 
with the idea you'll figure out that it converge pretty damn 
quickly and that it is not easy to come up with example requiring 
many passes, and it ALWAYS converges, if for no other reason that 
it inserted copy at each point a C++ compiler would.

In practice, this means that

shared<T> t = ...;
foo(t);

Would simple move t's ownership to foo and never generate a copy. 
However, if later one does

shared<T> t = ...;
foo(t);
bar(t);

Then the call to foo will be patched by the compiler to insert a 
copy. In both cases, t's destructor will never be called because 
it is consumed by the end of the control flow, so destruction is 
a noop.

This is really where we need to go IMO. We avoid so many copy by 
basically moving by default that and solution such as a RC 
solution will be significantly better than C++'s, and on par with 
stuff like ARC.

In addition, the same algorithm can be used to ensure that all 
fields are initialized for constructors. The start in the 
destroyed state and must all be in the alive state by the end of 
the constructor's execution.