[dmd-internals] Calling Convention Bugs?
David Simcha
dsimcha at gmail.com
Sat Apr 30 13:21:48 PDT 2011
I've spent a whole bunch of time trying to debug my std.parallelism
Heisenbug from Hell. My latest theory is that the root cause is
actually a calling convention bug (either in DMD or in some inline ASM;
the results below were produced using core.atomic, not any of my own
inline ASM), but that it only manifests in std.parallelism under very
specific thread interleavings that lead to infrequently used code paths.
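For context, the status transitions in question go through core.atomic
calls of roughly this shape. This is a minimal sketch, not the actual
std.parallelism internals; the enum declaration is illustrative, though
done == 2 is the real value that matters below:

import core.atomic;

// Illustrative stand-in for std.parallelism's internal task status;
// the real declaration may differ, but done == 2 is what matters.
enum TaskStatus : ubyte { notStarted = 0, inProgress = 1, done = 2 }

void main()
{
    shared TaskStatus status = TaskStatus.inProgress;

    // TaskStatus.done travels into core.atomic as a byte-sized
    // argument.  The theory is that somewhere along such a call, the
    // byte clobbers the low byte of a register still holding a live
    // pointer.
    cas(&status, TaskStatus.inProgress, TaskStatus.done);
    assert(atomicLoad(status) == TaskStatus.done);
}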
On Windows, I get failed unit tests instead of segfaults, which makes
life slightly easier. (Hopefully these have the same root cause; I'm
guessing they do.) I finally managed to instrument my Map pipelining
class to print a whole bunch of data whenever the bug occurs, without
perturbing the timings enough to make the bug stop occurring. The issue
appears to be that, about one out of every 10,000 times, the whole
source buffer gets corrupted. The funny thing is that there's
a very specific pattern to the corruption. For example, the following
buffer is supposed to contain the square roots of all numbers between
1380299 and 1380399, but instead it contains:
[-8.29044e+16, -8.45933e+16, -8.68451e+16, -8.85339e+16, -9.07857e+16,
-9.24746e+16, -9.47264e+16, -9.64152e+16, -9.8667e+16, -1.00356e+17,
-1.02608e+17, -1.04297e+17, -1.06548e+17, -1.08237e+17, -1.10489e+17,
-1.12178e+17, -1.1443e+17, -1.16118e+17, -1.1837e+17, -1.20059e+17,
-1.22311e+17, -1.24e+17, -1.26252e+17, -1.2794e+17, -1.30192e+17,
-1.31881e+17, -1.34133e+17, -1.35822e+17, -1.38074e+17, -1.39762e+17,
-1.42014e+17, -1.43703e+17, -1.47794e+17, -1.51172e+17, -1.55676e+17,
-1.59053e+17, -1.63557e+17, -1.66935e+17, -1.70312e+17, -1.74816e+17,
-1.78194e+17, -1.82697e+17, -1.86075e+17, -1.90579e+17, -1.93956e+17,
-1.9846e+17, -2.01838e+17, -2.06341e+17, -2.09719e+17, -2.14223e+17,
-2.176e+17, -2.22104e+17, -2.25482e+17, -2.29985e+17, -2.33363e+17,
-2.37866e+17, -2.41244e+17, -2.45748e+17, -2.49125e+17, -2.53629e+17,
-2.57007e+17, -2.6151e+17, -2.64888e+17, -2.69392e+17, -2.72769e+17,
-2.77273e+17, -2.80651e+17, -2.85154e+17, -2.88834e+17, -2.97841e+17,
-3.04596e+17, -3.13603e+17, -3.20359e+17, -3.27114e+17, -3.36121e+17,
-3.42877e+17, -3.51884e+17, -3.58639e+17, -3.67647e+17, -3.74402e+17,
-3.83409e+17, -3.90165e+17, -3.99172e+17, -4.05927e+17, -4.14934e+17,
-4.2169e+17, -4.30697e+17, -4.37452e+17, -4.4646e+17, -4.53215e+17,
-4.62222e+17, -4.68978e+17, -4.77985e+17, -4.8474e+17, -4.93747e+17,
-5.00503e+17, -5.0951e+17, -5.16265e+17, -5.25273e+17, 2.45984e-41]
The interesting thing is that there's a very regular pattern to these
numbers. They're always sorted like this, except that the last number
is always tiny. Don't ask me how I thought of this, but I figured the
low-order bits of the pointer to this buffer must somehow be getting
corrupted, so that the numbers are being viewed and/or written
out of frame. Sure enough, when I shift the reading frame of this array
by two bytes, I get:
[1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86,
1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87,
1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87,
1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87,
1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88,
1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88,
1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.89,
1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89,
1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89,
1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.9,
1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9,
1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9,
1174.9]
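Shifting the frame is just a matter of reinterpreting the same bytes
from an offset pointer. Here is a minimal standalone sketch of the idea
(not my instrumented code, just an illustration):

import std.math, std.stdio;

void main()
{
    // Stand-in for the source buffer: 100 floats, as in the real case.
    auto buf = new float[100];
    foreach (i; 0 .. buf.length)
        buf[i] = sqrt(1_380_299.0f + i);

    // View the same memory with the reading frame shifted two bytes,
    // as if the low byte of the pointer had been overwritten with 2.
    enum shift = 2;
    auto shifted = cast(float*)(cast(ubyte*) buf.ptr + shift);

    // The last element of the shifted view would straddle the end of
    // the allocation, so print one element fewer.
    writeln(shifted[0 .. buf.length - 1]);
}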
To within rounding error (since the numbers are only printed to two
decimal places), 1174.86 ^ 2 == 1380299 and 1174.9 ^ 2 == 1380399. The
other failures display similar patterns, with the source buffer
being read or written two bytes out of frame. There's even a
significance to the number 2. This is the enum value that's used to
signal TaskStatus.done. It gets passed to the atomic functions all the
time, probably in an 8-bit register (AL, BL, CL, DL, etc.). Sure
enough, if I change TaskStatus.done to be 3 instead of 2, my reading
frame ends up 3 bytes out of phase instead of 2. Since the source
buffer is 100 elements long, and each element is 4 bytes, it is
allocated in the GC's 512-byte pool and is aligned on 512-byte
boundaries, so the low byte of its address is zero. Therefore, letting
TaskStatus.done overwrite the low-order byte of a 32- or 64-bit
register holding the pointer to the source buffer would shift the
reading frame by exactly TaskStatus.done bytes.
As further evidence that this is the issue, when I print out the
address of the corrupt buffer, it's actually offset by 2 or 3 bytes
(depending on the value of TaskStatus.done) from proper 4-byte float
alignment.
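The arithmetic is easy to sanity-check in isolation. A hypothetical
sketch (the address is made up; the alignment is what matters):

void main()
{
    enum done = 2;  // TaskStatus.done

    // A 512-byte-aligned address, like the GC returns for a 400-byte
    // (100 floats * 4 bytes) allocation from its 512-byte size class.
    size_t p = 0x12345600;
    assert(p % 512 == 0);

    // Overwriting the low byte of a register holding p with `done`
    // moves the pointer by exactly `done` bytes, because the low byte
    // was zero to begin with.
    size_t corrupted = (p & ~cast(size_t) 0xFF) | done;
    assert(corrupted - p == done);
    assert(corrupted % float.sizeof != 0);  // no longer float-aligned
}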
Does anyone know if there's already a bug report about this? I barely
know where to start creating a sane test case that doesn't depend on
some ridiculously complicated code path and thread interleavings, but
I do know that this qualifies as war-story debugging and will make for
a great answer if I'm ever in a job interview and the interviewer asks
about the toughest thing I've ever debugged.