Feedback Thread: DIP 1040--Copying, Moving, and Forwarding--Community Review Round 1
deadalnix
deadalnix at gmail.com
Sat Mar 6 20:28:08 UTC 2021
On Friday, 5 March 2021 at 12:20:36 UTC, Mike Parker wrote:
> This is the feedback thread for the first round of Community
> Review of DIP 1040, "Copying, Moving, and Forwarding".
First, very good proposal (in fact, I sent something very similar
to Andrei, Walter and Atila in 2019, and I hope this had some
influence over this proposal). For completeness, I will paste
this email at the end of this reply.
It cover some subtle but important point, notably:
- The argument is invalid after this move, and is not destructed.
- Moving an EMO means that the responsibility for destruction of
that value moves along with it.
This is something I stressed out at the time, because it has
profound consequences at the ABI level. It is unfortunate that
this design goal - while achieved - has dropped from the
rationale. To stress out how important it is, let me contrast
with C++.
In C++, the caller remains in charge of the destruction of an
object. This means that structs that are not trivially
destructible, movable or copyable need to be passed by reference.
This introduce many indirections in the generated code, but,
worse, adds a ton of pressure on the alias analysis, which will
prevent the optimizer to do its job properly.
The second downside is that structs always need to have a "null"
state. They need to remain destructible, even after the callee
move the object. This forces the callee to do extra work in order
to set the moved object into a null state and extra work from the
caller to check for the null state during destruction. A concrete
example of this is for instance that a lock which locks a mutex
in its constructor and unlocks it in its destructor now must
check for the mutex being nulled in its destruction, introducing
an unecessary branch in all path that unlock the mutex.
These are serious downside in the C++ way, and this proposal
should make it explicit in its rationale that both problems must
be avoided:
- Ability to pass non POD as value even at the ABI level.
- Ability to have movable structs without a "null" state.
Another aspect of the proposal that needs improvement is the
"Last Use" section. the description of this last use is
incomplete - and maybe doesn't adopt the right approach, because
listing all scenarii is always going to be a challenge. One such
scenario is the case of divergent branches:
S s;
if (...) {
// Even though it is
fun(s);
} else {
// ...
}
The way to cover all base is to go for a fixed point analysis. A
fixed point analysis approach would go ass follow: define a state
for all local variable, then follow the control flow updating the
state of said variable. When a merge happen in the control flow,
verify that both both path through which you arrive at the merge
point have a different state, then apply a fix and reprocess
affected branches of the control flow, up to the point all merge
point are consistent - you reached a fixed point.
Adopting such an approach will ensure the behavior is not only
defined for current control flow constructs, but also whatever
might come up in the future. In the pasted email, I go over in
more detail about what a fixed point analysis could look like for
this. While it is likely not formal enough to make it into the
DIP as this, it makes for a good starting point, IMO.
======
Hi both of you,
This is a topic I discussed with Atila and Andrei informally, and
now I'd to loop you in Walter. If this gains traction, it'll be
time for a DIP, but I'd really like to gather some feedback
before, as putting everything in a format that is formal enough
is seriously time consuming.
As I assume Walter already knows, C++ copy/destruction semantic
is actually quite an important barrier to optimization. For the
sake of making the point, I'm going to use unique_ptr and
shared_ptr as example, and what we imagine equivalent in D would
look like. We will also consider that the compiler inline
ctor/dtor and copies.
Let's consider this sample code:
void foo(unique<T> ptr):
unique<T> t = ...;
foo(move(t));
The first thing we consider s that unique<T>, as well as
shared<T> are not PODs. This is rather obvious, as this is the
whole point of having them, if we wanted a POD, we'd use a raw
pointer. In C++, non POD must have an address. in D, they must
have an address for the ctor to run, but after this, the compiler
is free to do whatever it wants as D struct are movable by
default.
An interesting side effect of that fact is that non POD are
ALWAYS passed by reference in C++, never by value. The compiler
creates a temporary in the caller in which the copy is stored.
The code is lowered to something that look like this:
foo(T** ptrref);
T* t = ...;
T* tmp = nullptr;
swap(t, tmp);
foo(&tmp);
if (tmp != null) destroy(tmp);
if (t != null) destroy(t);
The compiler will have no problem figuring out that t is always
null, and therefore optimize away its destruction, but the tmp
one remains. A second order problem is that EVERY struct must
have some sort of 'null' state. While this is fairly obvious what
it is for a pointer (!) it is the source of many trap when it
comes to correctness for others types of structs. Imagine a lock,
for instance, it now has to check every time if the mutex it
refers to has been nulled or not, introducing a branch every time
you unlock the mutex.
Now immagine that foo does something like:
void foo(unique<T> ptr) {
global = move(ptr);
}
Now foo end up having to go through an indirection to get the
value of ptr, which right there is 3 cycles at best, a cache miss
at worse. But worse, it MUST store null into ptr that the caller
will then dutifully check is null and decide to not destroy
anything. t will do so in 100% of the case, yet the compiler is
unable to do anything to optimize it away. In addition, store to
load forwarding has some bizarre perf edge cases on many CPUs
(that is, loading an address that is still in the store buffer
and hasn't hit cache yet) so it is something you'd like to see
optimized.
If we replace the unique<T> by a shared<T> in the example, it
gets worse because the increment/decrement operation cannot be
optimized away at all. This result in code where you got to
sprinkle calls to move all over the place constantly and it's a
guarantee that you'll forget to put it sometime and end up with a
ton of unexpected copies.
Here is what I propose.
1/ The ABI for non POD is the exact same as for POD. it means
that the responsibility of the destruction falls on the the foo
method in our example, not on the caller. This means it can be
passed in registers in the case of unique ptr. The obvious gotcha
is that the adress of the this pointer will be different in foo
that it was when the copy constructor was called. D provides no
guarantee on that front anyways, so I think we are good on that
front.
Another side effect is that we lose the "russian doll" effect of
construction destruction when things are moved to another
function. However, by specifying that the copy need to be made in
LTR order for parameters, and destruction from the last parameter
to the first, we end up with this remaining true almost all the
time, which is probably the right tradeof.
In our example, it means that foo can receive the unique<T> in a
register, then the compiler can see that the value when we exit
the function is always null and can optimize away the destruction.
2/ Decide when to copy using the following algorithm:
We go over the function body, keeping track of a status per
lvalue that exist: destroyed, alive or consumed. All parameters,
globals, or value reached through indirections are considered
initially alive. All locals are considered destroyed initially,
then alive once the constructor as run/initial value is set.
When an lvalue is converted into an rvalue, for instance when
returned, when passed as argument, when assigned to another
lvalue, etc... the following happens:
- if it is alive, it is marked consumed.
- if it is consumed, then we backtrack to the point at which it
was consumed and insert a copy operation. The lvalue is
reconsidered alive after the copy, the the algorithm restart from
there.
- if it is destroyed, then this is an error (this should
basically never happen and is likely a compiler bug).
When an lvalue needs to be destroyed, the compiler does as follow:
- if it is alive, call the desructor.
- if it is consumed, do nothing.
- if it is already destroyed, then once again, you probably have
a compiler bug.
This is relatively straightforward, but there is one more tricky
situation, is when you reach a "merge" point i the control flow,
for instance, after an if/then/else construct. In this case, a
variable can have different state for each codepath. The
resolution happens as follow:
- alive/alive : all good.
- alive/destroyed: error. Some codepath did not initialize the
variable or you have compiler bug.
- alive/consume: backtrack the consume path and insert copies.
The lvalue marked alive.
- consume/destroyed: Mark destroyed.
When an lvalue is assigned to, it's previous value is move to a
temporary and then destroyed immediately after the assignment
finishes. The temporary also keeps the state so destruction only
occurs when the variable was alive.
All rvalues must be immediately returned, assigned, passed as
argument or destroyed by calling the constructor.
Existing optimizations such as RVO naturaly arise out of this
process, but this is much more generic and powerful, as it
generate the minimal number of copies possible assuming structs
are movable, which is what we want.
In case of loops, the state tracked by the algorithm at the end
of the loop can affect the merge at the top of the loop, so you
may have to pass over the loop twice. Same wise, backtracking may
cause you to have to go over some code twice, but if you play
with the idea you'll figure out that it converge pretty damn
quickly and that it is not easy to come up with example requiring
many passes, and it ALWAYS converges, if for no other reason that
it inserted copy at each point a C++ compiler would.
In practice, this means that
shared<T> t = ...;
foo(t);
Would simple move t's ownership to foo and never generate a copy.
However, if later one does
shared<T> t = ...;
foo(t);
bar(t);
Then the call to foo will be patched by the compiler to insert a
copy. In both cases, t's destructor will never be called because
it is consumed by the end of the control flow, so destruction is
a noop.
This is really where we need to go IMO. We avoid so many copy by
basically moving by default that and solution such as a RC
solution will be significantly better than C++'s, and on par with
stuff like ARC.
In addition, the same algorithm can be used to ensure that all
fields are initialized for constructors. The start in the
destroyed state and must all be in the alive state by the end of
the constructor's execution.
More information about the Digitalmars-d
mailing list