Returning variable-sized stack data

Mon Jul 15 10:34:15 UTC 2024

I mentioned in the monthly meeting how I would like to see a more 
convenient way to return variable-sized data to the stack in D. 
Walter mentioned that he wouldn't like to break the C ABI, which 
is understandable, but you can certainly make this work without a 
different ABI. In fact, you can even return variable-sized data 
to the stack in C:
```c
#include <alloca.h>
#include <stdio.h>

struct A{ int a,b,c,d,e,f,g; };
struct B{ int a,b,c; };

int myFnRetSize(int n){ return n == 0 ? sizeof(struct A) : sizeof(struct B); }

void myFn(void* mem, int n){
	if(n == 0){
		struct A a = {1,2,3,4,5,6,7};
		*((struct A*)mem) = a;
	}else{
		struct B b = {1,2,7};
		*((struct B*)mem) = b;
	}
}

int main(){
	int n = 1; //<—— can be any number
	int size = myFnRetSize(n);
	void* mem = alloca(size);
	myFn(mem, n);

	//write out the result:
	for(size_t i=0; i<size/sizeof(int); i++){
		printf("%d ", ((int*)mem)[i]);
	}
	printf("\n");
	return 0;
}
```
Sorry if the code is terrible, but hopefully it demonstrates my 
point adequately. You might say that this is technically 
returning by reference, but at the machine-code level all stack 
access is done via pointers.

You might be wondering what the point of having this feature
would even be.
Well, unions always take as much space as their largest member. 
If a union contains a struct that's (for example) 512 bytes 
large, it will always take 512 bytes, when really we might only 
need to store a 4–8 byte number most of the time. For sum types
whose members differ vastly in size, with the smaller ones being
used most often, variable-sized stack returns could greatly
reduce stack consumption.
Some might say reference types should be used for such a purpose,
but when you're programming a largely data-driven system that
uses masses of structs, the sheer number of heap allocations
could become a huge performance bottleneck, whereas stack
allocation is practically instant. Of course, you can always
reserve a huge buffer on the stack up front, but then a lot of
it will go to waste and your code will be more vulnerable to
stack overflows.
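
To put a rough number on the union overhead described above, here is a minimal C sketch. The names `Big`, `Payload`, `payloadSize`, `smallCaseSize` and `wastedBytes` are made up for illustration, and the 512-byte member simply mirrors the example figure used in the text.

```c
#include <stddef.h>

/* Hypothetical types for illustration only. */
struct Big { unsigned char bytes[512]; };   /* the rare, large case */

union Payload {
	struct Big big;
	long long  small;                       /* the common, small case */
};

/* Bytes every Payload reserves, whichever member is actually in use. */
size_t payloadSize(void){ return sizeof(union Payload); }

/* Bytes a variable-sized return would need when only `small` is stored. */
size_t smallCaseSize(void){ return sizeof(long long); }

/* Stack bytes left unused each time a Payload holds only `small`. */
size_t wastedBytes(void){ return payloadSize() - smallCaseSize(); }
```

Calling `wastedBytes()` gives the per-value overhead that a variable-sized return would avoid in the common case; the exact figure depends on the platform's alignment and the size of `long long`.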

A way of making variable-sized stack returns less cumbersome in D 
would be to have some syntactic sugar that works something like 
this:
```d
struct A{ int a,b,c,d,e,f,g; }
struct B{ int a,b,c; }

@stackArrayReturn myFn(int n){
	auto nSqr      = n * n; //demonstrate how variables can affect the return value
	auto condition = n * n;
	if(condition == 0){
		return A(nSqr+1,2,3,4,5,6,7);
	}else{
		return B(nSqr,2,7);
	}
}

void main(){
	int n = 1; //<—— can be any number
	void[] myMem = myFn(n);
}
```
Which gets lowered to this:
```d
import std.typecons;
struct A{ int a,b,c,d,e,f,g; }
struct B{ int a,b,c; }

size_t myFn(
	out return scope void function(void[] memory, Tuple!(int,"nSqr") context) callback,
	out Tuple!(int,"nSqr") context,
	int n
){
	auto nSqr      = n * n;
	auto condition = n * n;
	if(condition == 0){
		context = Tuple!(int,"nSqr")(nSqr);
		callback = (void[] m, Tuple!(int,"nSqr") ctx){
			*cast(A*)m.ptr = A(ctx.nSqr+1,2,3,4,5,6,7); //void[] can't be indexed, so go through .ptr
		};
		return A.sizeof;
	}else{
		context = Tuple!(int,"nSqr")(nSqr);
		callback = (void[] m, Tuple!(int,"nSqr") ctx){
			*cast(B*)m.ptr = B(ctx.nSqr,2,7); //void[] can't be indexed, so go through .ptr
		};
		return B.sizeof;
	}
}

void main(){
	int n = 1; //<—— can be any number
	void[] myMem;
	{
		scope void function(void[] memory, Tuple!(int,"nSqr") context) __returnCallback;
		Tuple!(int,"nSqr") __returnContext;
		size_t __returnSize = myFn(__returnCallback, __returnContext, n);
		import core.stdc.stdlib: alloca;
		myMem = alloca(__returnSize)[0..__returnSize];
		__returnCallback(myMem, __returnContext);
	}
}
```
This is an example of returning one of two different struct 
types, but this could also be used to return slices of any type 
(e.g. `int[]`).
Ideally there would be a nice way to do this with scoped
delegates, but `alloca` would hand out the very stack memory that
their context pointer points to. Another less stack-wasteful
(albeit potentially significantly more CPU-wasteful) method would
be to have two completely separate functions: one that determines
the return size, and another that does all the other logic.
However, this method would either limit the function's structure
significantly, or generate wasteful code.
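
For the slice case, the same size-then-fill idea can be sketched in C. This is only a rough illustration and not the proposed D lowering itself: `Ctx`, `FillFn`, `fillSquares`, `squares` and `sumOfSquares` are all hypothetical names, and it assumes a platform that provides `<alloca.h>`.

```c
#include <alloca.h>
#include <stddef.h>

/* Hypothetical context: whatever the fill step needs from the first phase. */
struct Ctx { int n; };

/* Hypothetical callback type: writes the result into caller-owned memory. */
typedef void (*FillFn)(void* mem, struct Ctx ctx);

static void fillSquares(void* mem, struct Ctx ctx){
	int* out = mem;
	for(int i=0; i<ctx.n; i++) out[i] = i * i;
}

/* Phase 1: decide the size, and hand back what is needed to fill it later. */
size_t squares(FillFn* callback, struct Ctx* ctx, int n){
	ctx->n = n < 0 ? 0 : n; /* treat negative lengths as empty */
	*callback = fillSquares;
	return (size_t)ctx->n * sizeof(int);
}

/* Caller: allocates in its own frame, then lets the callback fill it. */
int sumOfSquares(int n){
	FillFn fill;
	struct Ctx ctx;
	size_t size = squares(&fill, &ctx, n);
	if(size == 0) return 0;

	int* mem = alloca(size); /* lives until sumOfSquares returns */
	fill(mem, ctx);

	int sum = 0;
	for(size_t i=0; i<size/sizeof(int); i++) sum += mem[i];
	return sum;
}
```

The `alloca` call has to sit in the function that consumes the data (here `sumOfSquares`), since the memory is released as soon as that function returns. For example, `sumOfSquares(4)` sums 0, 1, 4 and 9.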

I would love to hear feedback and suggestions for improvements,
or about other possible implementations of variable-sized stack
returns.