VLA in Assembler

Wed Dec 17 08:21:48 PST 2014

On Wednesday, 17 December 2014 at 16:10:40 UTC, Adam D. Ruppe 
wrote:
> On Wednesday, 17 December 2014 at 14:11:32 UTC, Foo wrote:
>> 	asm {
>> 		mov EAX, n;
>> 		mov [arr + 8], ESP;
>> 		sub [ESP], EAX;
>> 		mov [arr + 0], EAX;
>> 	}
>> but that does not work...
>
> That wouldn't work even with malloc.... remember, an integer 
> is more than one byte long, so your subtract is 1/4 the size it 
> needs to be! Also, since the stack grows downward, you're 
> storing the pointer to the end of the array instead of the 
> beginning of it.
>
>
> NOTE: I've never actually done this before, so I'm figuring it 
> out as I go too. This might be buggy or otherwise mistaken at 
> points. (Personally, I prefer to use a static array sized to 
> the max thing I'll probably need that I slice instead of 
> alloca...)
>
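
A minimal sketch of the "static array that I slice" alternative mentioned in the note above. The names `maxLen` and `sumFirst` and the cap of 64 are made up for illustration. The slice points into the function's own stack frame, so it must not outlive the call.

```d
import std.stdio;

// Hypothetical cap: the largest element count we expect to need.
enum maxLen = 64;

int sumFirst(int n)
{
    assert(n >= 0 && n <= maxLen, "n exceeds the fixed buffer");

    int[maxLen] buffer;          // fixed-size array in this stack frame
    int[] arr = buffer[0 .. n];  // slice of it: no heap, no alloca, length == n

    // initialize the data...
    foreach (i, ref a; arr)
        a = cast(int) i;

    // ...and use it
    int total = 0;
    foreach (a; arr)
        total += a;
    return total;
}

void main()
{
    writeln(sumFirst(8)); // 0 + 1 + ... + 7
}
```

The cost is that the frame always reserves room for `maxLen` elements, whatever `n` is.
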
>
> Here's some code that runs successfully (in 32 bit!):
>
> void vla(int n) {
> 	int[] arr;
>
> 	asm {
> 		mov EAX, [n];
> 		// the first word in an array is the length, store that
> 		mov [arr], EAX;
> 		shl EAX, 2; // number of bytes == n * int.sizeof
> 		sub ESP, EAX; // allocate the bytes
> 		// store the beginning of it in the arr.ptr
> 		mov [arr + size_t.sizeof], ESP;
> 	}
>
> 	import std.stdio;
> 	writeln(arr.length);
> 	writeln(arr.ptr);
>
>         // initialize the data...
> 	foreach(i, ref a; arr)
> 		a = i;
>
> 	writeln(arr); // and print it back out
> }
>
> void main() {
> 	vla(8);
> }
>
>
> This looks right.... but isn't, we changed the stack and didn't 
> put it back. That's usually a no-no. If we disassemble the 
> function, we can take a look at the end and see something scary:
>
>  8084ec6:       e8 9d 6a 00 00          call   808b968 <_D3std5stdio15__T7writelnTAiZ7writelnFAiZv>  // our final writeln call
>  8084ecb:       5e                      pop    esi  // uh oh
>  8084ecc:       5b                      pop    ebx
>  8084ecd:       c9                      leave
>  8084ece:       c3                      ret
>
>
>
> Before the leave instruction, which puts the stack back how it was 
> at the beginning of the function - which saves us from a random 
> EIP being restored upon the ret instruction - the compiler put 
> in a few pop instructions.
>
> main() will have different values in esi and ebx than it 
> expects! Running it in the debugger shows these values changed 
> too:
>
> before
>
> (gdb) info registers
> [...]
> ebx            0xffffd4f4       -11020
> [...]
> esi            0x80916e8        134813416
>
>
> after
>
> ebx            0x1      1
> esi            0x0      0
>
>
> It popped the values of our array. According to the ABI: "EBX, 
> ESI, EDI, EBP must be preserved across function calls." 
> http://dlang.org/abi.html
>
> They are pushed for a reason - the compiler assumes they remain 
> the same.
>
>
> In this little test program, nothing went wrong because no more 
> code was run after vla returned. But, if we were using, say, a 
> struct, it'd probably fault when it tried to access `this`. 
> It'd probably mess up other local variables too. No good!
>
>
> So, we'll need to store and restore the stack pointer... can we 
> use the stack's push and pop instructions? Nope, we're changing 
> the stack! Our own pop would grab the wrong data too.
>
> We could save it in a local variable. How do we restore it 
> though? scope(exit) won't work, it won't happen at the right 
> time and will corrupt the stack even worse.
>
> Gotta do it ourselves - which means we can't do the alloca even 
> as a single mixin, since it needs code added before any return 
> point too!
>
> (There might be other, better ways to do this... and indeed, 
> there are, as we'll see later on. I peeked at the druntime 
> source code and it does it differently. Continue reading...)
>
>
>
>
> Here's code that we can verify in the debugger leaves 
> everything how it should be and doesn't crash:
>
> void vla(int n) {
> 	int[] arr;
> 	void* saved_esp;
>
> 	asm {
> 		mov EAX, [n];
> 		mov [arr], EAX;
> 		shl EAX, 2; // number of bytes == n * int.sizeof
>
>                 // NEW LINE
> 		mov [saved_esp], ESP; // save it for later
>
> 		sub ESP, EAX;
> 		mov [arr + size_t.sizeof], ESP;
> 	}
>
> 	import std.stdio;
> 	writeln(arr.length);
> 	writeln(arr.ptr);
>
> 	foreach(i, ref a; arr)
> 		a = i;
>
> 	writeln(arr);
>
>         // NEW LINE
> 	asm { mov ESP, [saved_esp]; } // restore it before we return
> }
>
>
>
>
> Note that this still isn't quite right - the allocated size 
> should be aligned too. It works for the simple case of 8 ints 
> since that's coincidentally aligned, but if we were doing like 
> 3 bytes, it would mess things up. Gotta be rounded up to a 
> multiple of 4 or 16 on some systems.
>
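
A small sketch of the rounding step described above, assuming the alignment is a power of two (4 or 16 here). `alignUp` is a hypothetical helper name, not something from druntime. The result would be the byte count to subtract from ESP.

```d
import std.stdio;

// Round nbytes up to the next multiple of alignment.
// Assumes alignment is a nonzero power of two.
size_t alignUp(size_t nbytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (nbytes + alignment - 1) & ~(alignment - 1);
}

void main()
{
    writeln(alignUp(3, 4));   // 3 bytes rounded up to a multiple of 4
    writeln(alignUp(3, 16));  // 3 bytes rounded up to a multiple of 16
    writeln(alignUp(32, 16)); // 8 ints: already aligned, unchanged
}
```

In inline asm the same thing is an add followed by an and with the inverted mask.
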
> hmm, I'm looking at the alloca source and there's a touch of a 
> guard page on Windows too. Check out the file: 
> dmd2/src/druntime/src/rt/alloca.d, it is written in mostly 
> inline asm.
>
> Note the comment though:
>
>  * This is a 'magic' function that needs help from the compiler to
>  * work right, do not change its name, do not call it from other compilers.
>
>
>
>
> So, how does this compare with alloca? Let's make a really 
> simple example to compare and contrast with malloc to make the 
> asm more readable:
>
> import core.stdc.stdlib;
>
> void vla(int n) {
>         int[] arr;
>         arr = (cast(int*)alloca(n * int.sizeof))[0 .. n];
> }
>
>
> Program runs, let's see the code.
>
> 0805f3f0 <_D3vla3vlaFiZv>:
>  805f3f0:       55                      push   ebp
>  805f3f1:       8b ec                   mov    ebp,esp
>  805f3f3:       83 ec 10                sub    esp,0x10
>  805f3f6:       c7 45 f0 10 00 00 00    mov    DWORD PTR [ebp-0x10],0x10
>  805f3fd:       89 45 fc                mov    DWORD PTR [ebp-0x4],eax
>  805f400:       c7 45 f4 00 00 00 00    mov    DWORD PTR [ebp-0xc],0x0
>  805f407:       c7 45 f8 00 00 00 00    mov    DWORD PTR [ebp-0x8],0x0
>  805f40e:       8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]
>  805f411:       50                      push   eax
>  805f412:       c1 e0 02                shl    eax,0x2
>  805f415:       50                      push   eax
>  805f416:       8d 4d f0                lea    ecx,[ebp-0x10]
>  805f419:       e8 e2 01 00 00          call   805f600 <__alloca>
>  805f41e:       89 c1                   mov    ecx,eax
>  805f420:       83 c4 04                add    esp,0x4
>  805f423:       58                      pop    eax
>  805f424:       89 45 f4                mov    DWORD PTR [ebp-0xc],eax
>  805f427:       89 4d f8                mov    DWORD PTR [ebp-0x8],ecx
>  805f42a:       c9                      leave
>  805f42b:       c3                      ret
>
>
> Change alloca to malloc:
>
> 0805f3f0 <_D3vla3vlaFiZv>:
>  805f3f0:       55                      push   ebp
>  805f3f1:       8b ec                   mov    ebp,esp
>  805f3f3:       83 ec 0c                sub    esp,0xc
>  805f3f6:       89 45 fc                mov    DWORD PTR [ebp-0x4],eax
>  805f3f9:       c7 45 f4 00 00 00 00    mov    DWORD PTR [ebp-0xc],0x0
>  805f400:       c7 45 f8 00 00 00 00    mov    DWORD PTR [ebp-0x8],0x0
>  805f407:       8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]
>  805f40a:       50                      push   eax
>  805f40b:       c1 e0 02                shl    eax,0x2
>  805f40e:       50                      push   eax
>  805f40f:       e8 0c fc ff ff          call   805f020 <malloc@plt>
>  805f414:       89 c1                   mov    ecx,eax
>  805f416:       83 c4 04                add    esp,0x4
>  805f419:       58                      pop    eax
>  805f41a:       89 45 f4                mov    DWORD PTR [ebp-0xc],eax
>  805f41d:       89 4d f8                mov    DWORD PTR [ebp-0x8],ecx
>  805f420:       c9                      leave
>  805f421:       c3                      ret
>
>
> Differences?
>
>
> We can see on line 3 that there's an extra word allocated for a 
> local variable with alloca. It is loaded with the size of the 
> local variables - 0x10. A pointer to that is passed to alloca.
>
> If we go back to the druntime source code:
>
>  *      ECX     address of variable with # of bytes in locals
>  *              This is adjusted upon return to reflect the additional
>  *              size of the stack frame.
>
>
> It is used in that function:
>         // Copy down to [ESP] the temps on the stack.
>         // The number of temps is (EBP - ESP - locals).
>  // snip
>
>         sub     ECX,[EDX]       ; // ECX = number of temps (bytes) to move.
>         add     [EDX],ESI       ; // adjust locals by nbytes for next call to alloca()
>  // snip
>         rep                     ;
>         movsd                   ;
>
>
>
>
> So, instead of restoring the stack pointer upon function return 
> like I did, this copies the relevant data that was pushed onto 
> the stack to the new location, so a subsequent pop will find 
> what it expects, then it adjusts the hidden local size variable 
> so next time, it can repeat the process. Cool - that's 
> something my solution wouldn't have done super easily (it 
> totally could, just don't overwrite that variable once it is 
> initialized).
>
>
> I guess there is a better way than I had figured above :)
>
>
>
>
>
> We can use that same trick the compiler did by declaring a 
> local variable and moving the magic __LOCAL_SIZE (see: 
> http://dlang.org/iasm.html ) value into it up front, then 
> calling alloca exactly as the C does. The implementation can be 
> the same as from druntime too.
>
>
> That's why it is a magic function: it needs to put the stack 
> how it expects, somehow. My way was to add a store. The way 
> actually used in druntime is to store the size of the locals in 
> a hidden variable. Either way, if you do an iasm alloca 
> yourself, you'll have to account for it as well.
>
>
> Otherwise, remember to store the right pointer and allocate the 
> right number of bytes and you've got it.

That is an awesome explanation! :)
Thank you for your time, I will experiment with your code.