Is this a bug?

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Nov 27 14:53:32 PST 2013


[Taking this back to the forum, if you don't mind, since it seems to be
relevant.]

On Wed, Nov 27, 2013 at 03:36:17PM -0500, Jerry wrote:
> "H. S. Teoh" <hsteoh at quickfur.ath.cx> writes:
> 
> > I'm interested in the disassembly of the faulty executable. Maybe
> > you could run `objdump -D $program_name` and send me the output? I'm
> > curious to see what got messed up. (With your reduced test case,
> > that is. The disassembly of your actual program would be too
> > unwieldy.)
[...]
> Here's the disassembly.  I'm sending out of gnus in emacs.  If it
> doesn't come through, let me know and I'll resend from my regular mail
> prog.

It came through, thanks.

I compared the disassembly with my working version of the same code, and
found that the problem appears to be a wrong jump table in the switch
statement.

Here's the relevant snippet from near the end of CC.create():

[Jerry's version]:
  417590:	e8 1f 2d 00 00       	callq  41a2b4 <_d_switch_string>
  417595:	48 89 c6             	mov    %rax,%rsi
  417598:	83 fe 03             	cmp    $0x3,%esi
  41759b:	77 18                	ja     4175b5 <_D9switchbug2CC6createFAyaZC9switchbug2CC+0x59>
  41759d:	ff 24 f5 20 01 43 00 	jmpq   *0x430120(,%rsi,8)
  4175a4:	48 bf a0 72 43 00 00 	movabs $0x4372a0,%rdi
  4175ab:	00 00 00 
  4175ae:	e8 55 1b 00 00       	callq  419108 <_d_newclass>
  4175b3:	c9                   	leaveq 
  4175b4:	c3                   	retq   

In simple terms, what this code does is:

417590: calls _d_switch_string, a function in druntime that does the
string comparisons in a switch over string values, and returns a uint
index of the matching switch case number (uint.max if not found). In
this case, you have 4 switch cases, so they map to indices 0, 1, 2, 3.

417598-41759b: compares the return value of _d_switch_string against 3,
and if it's > 3 (unsigned), branches to <CC.create + 0x59>, which is
where the default case is implemented (not quoted above, but if you look
in your disassembly you'll see that it creates an Exception then calls
the stack unwinding routine). Since we aren't hitting the default case
in our test case, control passes to the next instruction.

41759d: here's the interesting part. This looks up a jump table at
0x430120 using the index returned by _d_switch_string, and branches to
that address. Looking up this address in the disassembly dump, I find
this:

0000000000430120 <_D9switchbug2BB6__initZ>:
  430120:	40 01 43 00          	rex add %eax,0x0(%rbx)
	...
  430130:	73 77                	jae    4301a9 <_D9switchbug2DD6__vtblZ+0x19>

Now this looks odd. _D9switchbug2BB6__initZ looks like the static
initializer for class BB (the mangled name ends in __init); it is
certainly NOT a switch statement jump table!! The first 4 bytes, in
fact, correspond to the address 430140, which contains this:

0000000000430140 <_D9switchbug2BB6__vtblZ>:
  430140:	00 72 43             	add    %dh,0x43(%rdx)
  430143:	00 00                	add    %al,(%rax)
  430145:	00 00                	add    %al,(%rax)
  430147:	00 90 76 41 00 00    	add    %dl,0x4176(%rax)
  43014d:	00 00                	add    %al,(%rax)
  [... snipped ...]

This is the virtual function table of class BB, so it's *definitely* not
a valid jump destination of a switch statement!

So this looks like where the problem came from. If the CPU ends up here,
it would try to interpret function pointers as instructions, and would
basically do random nonsensical things until it hits something that
can't be interpreted as an instruction, or tries to access a random
address that's outside the process address space, upon which the OS
kicks in and sends a SIGSEGV to terminate the program.
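
To make the addressing in that jmpq concrete, here is a rough Python
sketch (a model for illustration, not anything dmd or druntime actually
runs) of how the cmp/ja/jmpq sequence picks a table slot. The names
slot_address, TABLE_BASE, ENTRY_SIZE and NUM_CASES are made up; the
numbers come from the disassembly above.

```python
TABLE_BASE = 0x430120  # base operand of the jmpq in the faulty build
ENTRY_SIZE = 8         # scale factor in *0x430120(,%rsi,8)
NUM_CASES = 4          # cmp $0x3 / ja: only indices 0..3 reach the jmpq


def slot_address(index, base=TABLE_BASE):
    """Return the address the jump target is read from, or None for default."""
    index &= 0xFFFFFFFF  # cmp on %esi followed by ja is an unsigned 32-bit test
    if index > NUM_CASES - 1:
        return None      # the ja branch: default case
    return base + ENTRY_SIZE * index


for i in range(NUM_CASES):
    print(i, hex(slot_address(i)))
print(slot_address(-1))  # a "not found" result falls through to default
```

With the faulty base, every slot from 430120 to 430138 falls inside
_D9switchbug2BB6__initZ rather than inside a jump table.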

...

Now, for comparison, here is the corresponding disassembly from my
working version of your code:

[Teoh's version, in CC.create]:
  417084:	e8 a7 27 00 00       	callq  419830 <_d_switch_string>
  417089:	48 89 c6             	mov    %rax,%rsi
  41708c:	83 fe 03             	cmp    $0x3,%esi
  41708f:	77 18                	ja     4170a9 <_D4test2CC6createFAyaZC4test2CC+0x59>
  417091:	ff 24 f5 60 e9 42 00 	jmpq   *0x42e960(,%rsi,8)
  417098:	48 bf 40 45 63 00 00 	movabs $0x634540,%rdi
  41709f:	00 00 00 
  4170a2:	e8 dd 15 00 00       	callq  418684 <_d_newclass>
  4170a7:	c9                   	leaveq 
  4170a8:	c3                   	retq   

Other than the different addresses, which are to be expected (different
compile environments, etc.), this code is basically the same as in your
version. The only difference lies in the contents of the jump table,
which in my version is at 42e960 (as can be seen from the instruction at
417091 above), which contains:

  42e95f:	00 98 70 41 00 00    	add    %bl,0x4170(%rax)
  42e965:	00 00                	add    %al,(%rax)
  42e967:	00 98 70 41 00 00    	add    %bl,0x4170(%rax)
  42e96d:	00 00                	add    %al,(%rax)
  42e96f:	00 98 70 41 00 00    	add    %bl,0x4170(%rax)
  42e975:	00 00                	add    %al,(%rax)
  42e977:	00 98 70 41 00 00    	add    %bl,0x4170(%rax)
  42e97d:	00 00                	add    %al,(%rax)
	...

(Note that since this part of the code isn't instructions, the
disassembler got a bit confused trying to interpret them as
instructions, so the addresses are 1 byte off. So the jump table
actually starts at the bytes 98 70 41 00 ... .)

Now, *this* looks like a proper jump table. In fact, the aforementioned
bytes represent the address 417098, which, if you look at the
disassembly snippet above, is the very next instruction after the jump
table lookup. This makes sense, since the next thing it does is to call
_d_newclass to create an instance of DD, after which it simply returns
(leaving the address of the new instance of DD in %rax, which is the
register containing the return value as per x86-64 calling conventions).
This corresponds with what the source code says. So here, everything is
correct and the program works as expected.
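
For anyone who wants to check the byte-to-address step, here is a small
Python sketch. It assumes only that the table entries are 8-byte
little-endian pointers, as on x86-64; decode_pointer, JMPQ_ADDRESS and
JMPQ_LENGTH are names I made up for illustration.

```python
def decode_pointer(hex_bytes):
    """Interpret a run of hex bytes as a little-endian address."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


JMPQ_ADDRESS = 0x417091  # address of the jmpq in the working build
JMPQ_LENGTH = 7          # its encoding: ff 24 f5 60 e9 42 00

# First table entry, i.e. the 8 bytes starting at 42e960.
entry = decode_pointer("98 70 41 00 00 00 00 00")

print(hex(entry))
# True if the entry points at the instruction right after the jmpq.
print(entry == JMPQ_ADDRESS + JMPQ_LENGTH)
```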

...

Now, all this raises the question of why dmd produced the wrong code in
your environment, but produced the *right* code in mine. Since we're
both using the same dmd source code (I believe!), it seems that the most
likely culprit must be the linker -- especially since you mentioned
something about gold vs. ld. My conjecture is that somehow the linker
got mixed up, and wrote the wrong address for the switch's jump table
into your executable. (These addresses are generally not fixed until
link time, because the compiler doesn't know in advance exactly where
each symbol will end up in the final executable.)

Further confirmation for this can be found by searching for the byte
sequence a4 75 41 in your disassembly dump (this is the address 4175a4,
the next instruction after the jump table lookup, which is where the
jump destination *should* have been in the first place). This sequence
appears here in your version of the executable:

00000000004301f0 <_TMP3>:
  [... snipped ...]
  4301ff:	00 a4 75 41 00 00 00 	add    %ah,0x41(%rbp,%rsi,2)
  430206:	00 00                	add    %al,(%rax)
  430208:	a4                   	movsb  %ds:(%rsi),%es:(%rdi)
  430209:	75 41                	jne    43024c <_D9switchbug2CC6__vtblZ+0xc>
  43020b:	00 00                	add    %al,(%rax)
  43020d:	00 00                	add    %al,(%rax)
  43020f:	00 a4 75 41 00 00 00 	add    %ah,0x41(%rbp,%rsi,2)
  430216:	00 00                	add    %al,(%rax)
  430218:	a4                   	movsb  %ds:(%rsi),%es:(%rdi)
  430219:	75 41                	jne    43025c <_D9switchbug2CC6__vtblZ+0x1c>
  43021b:	00 00                	add    %al,(%rax)
  43021d:	00 00                	add    %al,(%rax)
	...

Ignoring the instructions on the right (they are basically objdump
getting confused by data that aren't intended to be instructions), we
see that the sequence a4 75 41 00 00 00 00 00 appears exactly 4 times,
consecutively. This corresponds with the 4 cases of the switch
statement. So, this must be where the real jump table is, at 430200, NOT
at 430120 (which is 224 (0xe0) bytes too early).
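
The search itself can be sketched in Python along these lines. The bytes
are copied from the _TMP3 excerpt above, starting at 4301ff; find_all
and the constant names are hypothetical helpers, and the data is only
the 32 bytes quoted here, not the whole executable.

```python
DUMP_START = 0x4301FF
DUMP_BYTES = bytes.fromhex(
    "00 a4 75 41 00 00 00 00 00 a4 75 41 00 00 00 00 "  # 4301ff..43020e
    "00 a4 75 41 00 00 00 00 00 a4 75 41 00 00 00 00"   # 43020f..43021e
)

EXPECTED_TARGET = 0x4175A4  # instruction right after the jmpq
WRONG_BASE = 0x430120       # table base the jmpq actually uses


def find_all(haystack, needle, base):
    """Return the address of every occurrence of needle in haystack."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(base + pos)
        pos = haystack.find(needle, pos + 1)
    return hits


needle = EXPECTED_TARGET.to_bytes(3, "little")  # a4 75 41
hits = find_all(DUMP_BYTES, needle, DUMP_START)

print([hex(h) for h in hits])
print(hex(hits[0] - WRONG_BASE))  # distance from the wrong base to the real table
```

The four hits are 8 bytes apart, which is what you'd expect from four
consecutive 8-byte table entries.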

It seems unlikely that the compiler would screw things up *this* bad
(esp. since it didn't screw up in my environment!), so this seems to
reinforce my conclusion that the fault lies with the linker.

Well, I hope this helps. :)


[...]
> p.s. How do you prefer to be addressed?

I go by my last name.


T

-- 
Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen

