VLA in Assembler
Foo via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed Dec 17 08:21:48 PST 2014
On Wednesday, 17 December 2014 at 16:10:40 UTC, Adam D. Ruppe
wrote:
> On Wednesday, 17 December 2014 at 14:11:32 UTC, Foo wrote:
>> asm {
>> mov EAX, n;
>> mov [arr + 8], ESP;
>> sub [ESP], EAX;
>> mov [arr + 0], EAX;
>> }
>> but that does not work...
>
> That wouldn't work even with malloc.... remember, an integer
> more than one byte long, so your subtract is 1/4 the size it
> needs to be! Also, since the stack grows downward, you're
> storing the pointer to the end of the array instead of the
> beginning of it.
>
>
> NOTE: I've never actually done this before, so I'm figuring it
> out as I go too. This might be buggy or otherwise mistaken at
> points. (Personally, I prefer to use a static array sized to
> the max thing I'll probably need that I slice instead of
> alloca...)
>
>
> Here's some code that runs successfully (in 32 bit!):
>
> void vla(int n) {
> int[] arr;
>
> asm {
> mov EAX, [n];
> // the first word in an array is the length,
> store that
> mov [arr], EAX;
> shl EAX, 2; // number of bytes == n * int.sizeof
> sub ESP, EAX; // allocate the bytes
> mov [arr + size_t.sizeof], ESP; // store the beginning of it
> in the arr.ptr
> }
>
> import std.stdio;
> writeln(arr.length);
> writeln(arr.ptr);
>
> // initialize the data...
> foreach(i, ref a; arr)
> a = i;
>
> writeln(arr); // and print it back out
> }
>
> void main() {
> vla(8);
> }
>
>
> This looks right.... but isn't, we changed the stack and didn't
> put it back. That's usually a no-no. If we disassemble the
> function, we can take a look at the end and see something scary:
>
> 8084ec6: e8 9d 6a 00 00 call 808b968
> <_D3std5stdio15__T7writelnTAiZ7writelnFAiZv> // our final
> writeln call
> 8084ecb: 5e pop esi // uh oh
> 8084ecc: 5b pop ebx
> 8084ecd: c9 leave
> 8084ece: c3 ret
>
>
>
> Before the call to leave, which puts the stack back how it was
> at the beginning of the function - which saves us from a random
> EIP being restored upon the ret instruction - the compiler put
> in a few pop instructions.
>
> main() will have different values in esi and ebx than it
> expects! Running it in the debugger shows these values changed
> too:
>
> before
>
> (gdb) info registers
> [...]
> ebx 0xffffd4f4 -11020
> [...]
> esi 0x80916e8 134813416
>
>
> after
>
> ebx 0x1 1
> esi 0x0 0
>
>
> It popped the values of our array. According to the ABI: "EBX,
> ESI, EDI, EBP must be preserved across function calls."
> http://dlang.org/abi.html
>
> They are pushed for a reason - the compiler assumes they remain
> the same.
>
>
> In this little test program, nothing went wrong because no more
> code was run after vla returned. But, if we were using, say a
> struct, it'd probably fault when it tried to access `this`.
> It'd probably mess up other local variables too. No good!
>
>
> So, we'll need to store and restore the stack pointer... can we
> use the stack's push and pop instructions? Nope, we're changing
> the stack! Our own pop would grab the wrong data too.
>
> We could save it in a local variable. How do we restore it
> though? scope(exit) won't work, it won't happen at the right
> time and will corrupt the stack even worse.
>
> Gotta do it ourselves - which means we can't do the alloca even
> as a single mixin, since it needs code added before any return
> point too!
>
> (There might be other, better ways to do this... and indeed,
> there is, as we'll see later on. I peeked at the druntime
> source code and it does it differently. Continue reading...)
>
>
>
>
> Here's code that we can verify in the debugger leaves
> everything how it should be and doesn't crash:
>
> void vla(int n) {
> int[] arr;
> void* saved_esp;
>
> asm {
> mov EAX, [n];
> mov [arr], EAX;
> shl EAX, 2; // number of bytes == n * int.sizeof
>
> // NEW LINE
> mov [saved_esp], ESP; // save it for later
>
> sub ESP, EAX;
> mov [arr + size_t.sizeof], ESP;
> }
>
> import std.stdio;
> writeln(arr.length);
> writeln(arr.ptr);
>
> foreach(i, ref a; arr)
> a = i;
>
> writeln(arr);
>
> // NEW LINE
> asm { mov ESP, [saved_esp]; } // restore it before we return
> }
>
>
>
>
> Note that this still isn't quite right - the allocated size
> should be aligned too. It works for the simple case of 8 ints
> since that's coincidentally aligned, but if we were doing like
> 3 bytes, it would mess things up. Gotta be rounded up to a
> multiple of 4 or 16 on some systems.
>
> hmm, I'm looking at the alloca source and there's a touch of a
> guard page on Windows too. Check out the file:
> dmd2/src/druntime/src/rt/alloca.d, it is written in mostly
> inline asm.
>
> Note the comment though:
>
> * This is a 'magic' function that needs help from the compiler
> to
> * work right, do not change its name, do not call it from
> other compilers.
>
>
>
>
> So, how does this compare with alloca? Let's make a really
> simple example to compare and contrast with malloc to make the
> asm more readable:
>
> import core.stdc.stdlib;
>
> void vla(int n) {
> int[] arr;
> arr = (cast(int*)alloca(n * int.sizeof))[0 .. n];
> }
>
>
> Program runs, let's see the code.
>
> 0805f3f0 <_D3vla3vlaFiZv>:
> 805f3f0: 55 push ebp
> 805f3f1: 8b ec mov ebp,esp
> 805f3f3: 83 ec 10 sub esp,0x10
> 805f3f6: c7 45 f0 10 00 00 00 mov DWORD PTR
> [ebp-0x10],0x10
> 805f3fd: 89 45 fc mov DWORD PTR
> [ebp-0x4],eax
> 805f400: c7 45 f4 00 00 00 00 mov DWORD PTR
> [ebp-0xc],0x0
> 805f407: c7 45 f8 00 00 00 00 mov DWORD PTR
> [ebp-0x8],0x0
> 805f40e: 8b 45 fc mov eax,DWORD PTR
> [ebp-0x4]
> 805f411: 50 push eax
> 805f412: c1 e0 02 shl eax,0x2
> 805f415: 50 push eax
> 805f416: 8d 4d f0 lea ecx,[ebp-0x10]
> 805f419: e8 e2 01 00 00 call 805f600
> <__alloca>
> 805f41e: 89 c1 mov ecx,eax
> 805f420: 83 c4 04 add esp,0x4
> 805f423: 58 pop eax
> 805f424: 89 45 f4 mov DWORD PTR
> [ebp-0xc],eax
> 805f427: 89 4d f8 mov DWORD PTR
> [ebp-0x8],ecx
> 805f42a: c9 leave
> 805f42b: c3 ret
>
>
> Change alloca to malloc:
>
> 0805f3f0 <_D3vla3vlaFiZv>:
> 805f3f0: 55 push ebp
> 805f3f1: 8b ec mov ebp,esp
> 805f3f3: 83 ec 0c sub esp,0xc
> 805f3f6: 89 45 fc mov DWORD PTR
> [ebp-0x4],eax
> 805f3f9: c7 45 f4 00 00 00 00 mov DWORD PTR
> [ebp-0xc],0x0
> 805f400: c7 45 f8 00 00 00 00 mov DWORD PTR
> [ebp-0x8],0x0
> 805f407: 8b 45 fc mov eax,DWORD PTR
> [ebp-0x4]
> 805f40a: 50 push eax
> 805f40b: c1 e0 02 shl eax,0x2
> 805f40e: 50 push eax
> 805f40f: e8 0c fc ff ff call 805f020
> <malloc at plt>
> 805f414: 89 c1 mov ecx,eax
> 805f416: 83 c4 04 add esp,0x4
> 805f419: 58 pop eax
> 805f41a: 89 45 f4 mov DWORD PTR
> [ebp-0xc],eax
> 805f41d: 89 4d f8 mov DWORD PTR
> [ebp-0x8],ecx
> 805f420: c9 leave
> 805f421: c3 ret
>
>
> Differences?
>
>
> We can see on line 3 that there's an extra word allocated for a
> local variable with alloca. It is loaded with the size of the
> local variables - 0x10. A pointer to that is passed to alloca.
>
> If we go back to the druntime source code:
>
> * ECX address of variable with # of bytes in locals
> * This is adjusted upon return to reflect the
> additional
> * size of the stack frame.
>
>
> It is used in that function:
> // Copy down to [ESP] the temps on the stack.
> // The number of temps is (EBP - ESP - locals).
> // snip
>
> sub ECX,[EDX] ; // ECX = number of temps (bytes) to
> move.
> add [EDX],ESI ; // adjust locals by nbytes
> for next call to alloca()
> // snip
> rep ;
> movsd ;
>
>
>
>
> So, instead of restoring the stack pointer upon function return
> like I did, this copies the relevant data that was pushed onto
> the stack to the new location, so a subsequent pop will find
> what it expects, then it adjusts the hidden local size variable
> so next time, it can repeat the process. Cool - that's
> something my solution wouldn't have done super easily (it
> totally could, just don't overwrite that variable once it is
> initialized).
>
>
> I guess there is a better way than I had figured above :)
>
>
>
>
>
> We can use that same trick the compiler did by declaring a
> local variable and moving the magic __LOCAL_SIZE (see:
> http://dlang.org/iasm.html ) value into it up front, then
> calling alloca exactly as the C does. The implementation can be
> the same as from druntime too.
>
>
> That's why it is a magic function: it needs to put the stack
> how it expects, somehow. My way was to add a store. The way
> actually used in druntime is to store the size of the locals in
> a hidden variable. Either way, if you do an iasm alloca
> yourself, you'll have to account for it as well.
>
>
> Otherwise, remember to store the right pointer and allocate the
> right number of bytes and you've got it.
That is an awesome explanation! :)
Thank you for your time, I will experiment with your code.
More information about the Digitalmars-d-learn
mailing list