move+forward as intrinsics, incl. revised forward semantics for perfect forwarding

Sun Oct 13 21:26:24 UTC 2024

On 10/13/24 12:43, kinke wrote:
> IMO we need to make `core.lifetime.{move,forward}` compiler intrinsics, 
> to enable further optimizations that aren't possible with a library 
> solution.
> ...

Thanks for writing this up! I think this is a good starting point, but I 
would make some small tweaks.

> #### Move
> 
> * semantics: move an lvalue to a new rvalue, at a new memory address, 
> 'hijacking' the lvalue resources; the lvalue is reset to T.init (blit, 
> not assignment!) afterwards

Makes sense, though if the compiler can determine that the move is the 
last use of the lvalue, it can optimize out the address change.
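
For reference, a minimal sketch of what the library `core.lifetime.move` 
does today, which is the baseline such an optimization would start from. 
`Handle` is a made-up example type; the empty destructor is there because 
the library only resets the source for structs that have a destructor or 
postblit.

```D
import core.lifetime : move;

// Hypothetical resource-owning type, for illustration only.
struct Handle {
    int* p;
    ~this() {} // a destructor makes `move` reset the source to Handle.init
}

void main() {
    auto a = Handle(new int(42));
    auto b = move(a);      // b lives at a new address and takes over a's payload
    assert(a.p is null);   // a has been reset to Handle.init
    assert(*b.p == 42);    // b now owns the resource
}
```

If `a` were never touched again after the move, both the relocation and 
the reset would be unobservable, which is the case I have in mind.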

> * will be complete with move ctor; syntax needs to be decided, but 
> signature is `(ref T)` (yes, must be an explicit ref)

I can see either syntax working here. What matters most is that it is in 
fact treated as a constructor.

I guess the benefit of `this(S)` is uniformity with `this(ref S)`, and 
the benefit of `=this(ref S)` or `opMove(ref S)` is that it is obvious 
that the destructor will be called by the caller, potentially much later.

>    * allows to opt out of the default blit (memcpy struct payload), 
> e.g., to fix up interior pointers
>    * move ctor interop with C++ should be doable (just getting the 
> extern(C++) mangle right)
>    * problem: handle/avoid all compiler-implicit moves/blits (would have 
> to call move ctor and dtor now; emplace FTW!)
> * would be nice as intrinsic:
>    * not to have to import `core.lifetime` everywhere and end up with 
> complicated template bloat for a basically trivial operation
>    * potential optimization: elide lvalue reset to T.init and its 
> destruction iff:
>      * it is a local (can skip destruction)
>      * and not used after the move
>      * and the destruction of T.init is a noop (modulo mods to the 
> struct's own payload), so its elision not observable
> ...

Well, as I alluded to earlier, I think in such cases the object should 
just keep its original address and the move constructor does not need to 
be called at all. It reduces to a safe version of `__rvalue` in this case.

> #### When move isn't sufficient: perfect forwarding
> 
> forward must become an intrinsic:
> * for vars with `ref` storage class: as-is, yields the original lvalue
> * non-ref lvalues (NEW semantics): 're-interpret as rvalue' - no move, 
> and accordingly no destruction after forwarding (because the rvalue will 
> already be destructed earlier)
>    * only valid for locals (incl. params), the destruction of other 
> lvalues cannot be skipped
>    * invalid/undefined to access the original lvalue after forwarding it 
> (has been destructed already)

I think it would be better to do a `move`, where the `move` will usually 
be optimized to a safe `__rvalue` as above. I think an unsafe `__rvalue` 
should remain possible, but it should not be allowed in `@safe` code.

>    * probably only valid:
>      * as function call argument expressions (glue layer needs to treat 
> it like a frontend-generated temporary, passing it directly by ref)
>      * as assignment right-hand-sides, for move-assign
> (`dst = forward!src;` => `dst.opAssign(forward!src);`)
>      * as return expressions, for move-constructions (but prefer NRVO if 
> possible, for direct emplace)
> * probably needs to keep template syntax (`forward!x`, not `forward(x)`) 
> for backwards compatibility with druntime template
> 
> Let's take a look at an example:
> ```D
> import core.stdc.stdio;
> import core.lifetime;
> 
> struct S {
>      int x;
> 
>      this(int x) {
>          this.x = x;
>          printf("ctor: %p\n", &this);
>      }
> 
>      this(this) {
>          printf("copy: %p\n", &this);
>      }
> 
>      ~this() {
>          printf("dtor: %p\n", &this);
>      }
> }
> 
> void main() {
>      {
>          auto lval = S(1);
>          printf("lval: %p\n", &lval);
>          const r = bar1(lval);
>          printf("   r: %p\n", &r);
>      }
> 
>      {
>          printf("\nrvalue:\n");
>          const r = bar1(S(2));
>          printf("   r: %p\n", &r);
>      }
> }
> 
> S bar1()(auto ref S s) {
>      printf("bar1: %p\n", &s);
>      return bar2(forward!s);
> }
> 
> S bar2()(auto ref S s) {
>      printf("bar2: %p\n", &s);
>      return bar3(forward!s);
> }
> 
> S bar3()(auto ref S s) {
>      printf("bar3: %p\n", &s);
>      return bar4(forward!s);
> }
> 
> S bar4()(auto ref S s) {
>      printf("bar4: %p, got a ref: %d\n", &s, __traits(isRef, s));
>      return s; // copy parameter lvalue to return value
> }
> ```
> 
> Output with DMD (and GDC), no backend optimizations:
> ```
> ctor: 0x7ffebea26460
> lval: 0x7ffebea26460
> bar1: 0x7ffebea26460
> bar2: 0x7ffebea26460
> bar3: 0x7ffebea26460
> bar4: 0x7ffebea26460, got a ref: 1
> copy: 0x7ffebea263d0
>     r: 0x7ffebea26464
> dtor: 0x7ffebea26464
> dtor: 0x7ffebea26460
> 
> rvalue:
> ctor: 0x7ffebea2647c
> bar1: 0x7ffebea26488
> bar2: 0x7ffebea26424
> bar3: 0x7ffebea263e4
> bar4: 0x7ffebea263a4, got a ref: 0
> copy: 0x7ffebea26358
> dtor: 0x7ffebea263a4
> dtor: 0x7ffebea263e4
> dtor: 0x7ffebea26424
> dtor: 0x7ffebea26488
>     r: 0x7ffebea26478
> dtor: 0x7ffebea26478
> ```
> 
> What we see is that current `core.lifetime.forward` propagates the
> ref-ness of the parameter, but has to `core.lifetime.move` it in the non-ref
> case, creating 3 explicit moves + destructions.
> 
> We also see that there are compiler-implicit moves ('optimized', i.e., 
> no reset+destruction of the moved-from value):
> * when passing the `S(2)` rvalue to `bar1` (not sure why, seems like a 
> bug) - note the different addresses of `ctor` and `bar1`
> * for the return values - the addresses of `copy` and `r` diverge 
> (constructed @ 0x7ffebea26358, destructed @ 0x7ffebea26478)
> 
> With LDC, we at least already get perfectly forwarded return values (the 
> addresses of `copy` and `r` are identical):
> ```
> ctor: 0x7ffda922edbc
> lval: 0x7ffda922edbc
> bar1: 0x7ffda922edbc
> bar2: 0x7ffda922edbc
> bar3: 0x7ffda922edbc
> bar4: 0x7ffda922edbc, got a ref: 1
> copy: 0x7ffda922edb8
>     r: 0x7ffda922edb8
> dtor: 0x7ffda922edb8
> dtor: 0x7ffda922edbc
> 
> rvalue:
> ctor: 0x7ffda922eda0
> bar1: 0x7ffda922ed6c
> bar2: 0x7ffda922ed1c
> bar3: 0x7ffda922eccc
> bar4: 0x7ffda922ecc8, got a ref: 0
> copy: 0x7ffda922eda4
> dtor: 0x7ffda922ecc8
> dtor: 0x7ffda922eccc
> dtor: 0x7ffda922ed1c
> dtor: 0x7ffda922ed6c
>     r: 0x7ffda922eda4
> dtor: 0x7ffda922eda4
> ```
> 
> The compiler needs to implement RVO (Return Value Optimization, 
> different to Named-RVO!) to enable perfect forwarding of the return 
> values. In this example, `r` is allocated in `main`, then its address 
> passed and forwarded as hidden pointer all the way to `bar4`, where it 
> gets copy-constructed.
> 
> With the proposed `forward` semantics, we'd get perfect forwarding of 
> the `s` parameters too, without the 3 explicit moves and destructions. 
> The `S(2)` rvalue would be created in `main`, then passed and forwarded 
> directly by ref all the way to `bar4`, where it would get destructed 
> when the `s` param goes out of scope.
> 
> #### Cherry on top: Last-use optimization from DIP 1040
> 
> This would make the compiler automatically `forward` suited lvalues. In 
> the example, we wouldn't have to use a single explicit `forward` in the 
> `barN` trampolines, *and* the copy-construction of the return value in 
> the non-ref version of `bar4` would be optimized to a move-construction 
> (`return forward!s`).

Sounds good, but I think simple cases like this one should be a 
priority. Even if there is no data-flow analysis as advanced as the one 
proposed in DIP 1040, I think it is important that there is no copy in 
the non-ref version of `bar4`.
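
To make the `bar4` point concrete: with today's library `forward`, the 
copy can already be avoided by hand when the argument is an rvalue. A 
minimal sketch, where `sink` and the `copies` counter are names made up 
for illustration and not part of the example above:

```D
import core.lifetime : forward;

struct S {
    int x;
    static int copies;       // counts postblit calls (illustration only)
    this(this) { ++copies; }
    ~this() {}
}

// Hypothetical stand-in for bar4, with an explicit forward on return.
S sink()(auto ref S s) {
    return forward!s; // non-ref s: moved out; ref s: copied into the result
}

void main() {
    auto r = sink(S(2));     // rvalue argument: no postblit runs
    assert(S.copies == 0);

    auto lval = S(1);
    auto r2 = sink(lval);    // lvalue argument: exactly one copy
    assert(S.copies == 1);
}
```

What I would like is for the compiler to do the equivalent without the 
explicit `forward!s` in simple cases like this one.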