[dmd-internals] Calling Convention Bugs?
David Simcha
dsimcha at gmail.com
Sat Apr 30 13:21:48 PDT 2011
I've spent a whole bunch of time trying to debug my std.parallelism
Heisenbug from Hell. My latest theory is that the root cause is
actually a calling convention bug (either in DMD or in some inline ASM;
the results below were produced using core.atomic, not any of my own
inline ASM), but that it only manifests in std.parallelism under very
specific thread interleavings that lead to infrequently used code paths.
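For context, the status transitions in question go through core.atomic
calls of roughly this shape. This is a minimal sketch, not the actual
std.parallelism internals; the enum declaration is illustrative, though
done == 2 is the real value that matters below:

import core.atomic;

// Illustrative stand-in for std.parallelism's internal task status;
// the real declaration may differ, but done == 2 is what matters.
enum TaskStatus : ubyte { notStarted = 0, inProgress = 1, done = 2 }

void main()
{
    shared TaskStatus status = TaskStatus.inProgress;

    // TaskStatus.done travels into core.atomic as a byte-sized
    // argument.  The theory is that somewhere along such a call, the
    // byte clobbers the low byte of a register still holding a live
    // pointer.
    cas(&status, TaskStatus.inProgress, TaskStatus.done);
    assert(atomicLoad(status) == TaskStatus.done);
}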
On Windows, I get failed unit tests instead of segfaults, which makes
life slightly easier. (Hopefully these have the same root cause; I'm
guessing they do.) I finally managed to instrument my Map pipelining
class to print a whole bunch of data whenever the bug occurs, without
perturbing the timings enough to make the bug stop occurring. The issue
appears to be that, about one out of every 10,000 times, the whole
source buffer gets corrupted. The funny thing is that there's
a very specific pattern to the corruption. For example, the following
buffer is supposed to contain the square roots of all numbers between
1380299 and 1380399, but instead it contains:
[-8.29044e+16, -8.45933e+16, -8.68451e+16, -8.85339e+16, -9.07857e+16,
-9.24746e+16, -9.47264e+16, -9.64152e+16, -9.8667e+16, -1.00356e+17,
-1.02608e+17, -1.04297e+17, -1.06548e+17, -1.08237e+17, -1.10489e+17,
-1.12178e+17, -1.1443e+17, -1.16118e+17, -1.1837e+17, -1.20059e+17,
-1.22311e+17, -1.24e+17, -1.26252e+17, -1.2794e+17, -1.30192e+17,
-1.31881e+17, -1.34133e+17, -1.35822e+17, -1.38074e+17, -1.39762e+17,
-1.42014e+17, -1.43703e+17, -1.47794e+17, -1.51172e+17, -1.55676e+17,
-1.59053e+17, -1.63557e+17, -1.66935e+17, -1.70312e+17, -1.74816e+17,
-1.78194e+17, -1.82697e+17, -1.86075e+17, -1.90579e+17, -1.93956e+17,
-1.9846e+17, -2.01838e+17, -2.06341e+17, -2.09719e+17, -2.14223e+17,
-2.176e+17, -2.22104e+17, -2.25482e+17, -2.29985e+17, -2.33363e+17,
-2.37866e+17, -2.41244e+17, -2.45748e+17, -2.49125e+17, -2.53629e+17,
-2.57007e+17, -2.6151e+17, -2.64888e+17, -2.69392e+17, -2.72769e+17,
-2.77273e+17, -2.80651e+17, -2.85154e+17, -2.88834e+17, -2.97841e+17,
-3.04596e+17, -3.13603e+17, -3.20359e+17, -3.27114e+17, -3.36121e+17,
-3.42877e+17, -3.51884e+17, -3.58639e+17, -3.67647e+17, -3.74402e+17,
-3.83409e+17, -3.90165e+17, -3.99172e+17, -4.05927e+17, -4.14934e+17,
-4.2169e+17, -4.30697e+17, -4.37452e+17, -4.4646e+17, -4.53215e+17,
-4.62222e+17, -4.68978e+17, -4.77985e+17, -4.8474e+17, -4.93747e+17,
-5.00503e+17, -5.0951e+17, -5.16265e+17, -5.25273e+17, 2.45984e-41]
The interesting thing is that there's a very regular pattern to these
numbers. They're always sorted like this, except that the last number
is always tiny. Don't ask me how I thought of this, but I figured the
low-order bits of the pointer to this buffer must somehow be getting
corrupted, so that the numbers are being viewed and/or written
out of frame. Sure enough, when I shift the reading frame of this array
by two bytes, I get:
[1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86, 1174.86,
1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87,
1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87,
1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87, 1174.87,
1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88,
1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88,
1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.88, 1174.89,
1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89,
1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89,
1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.89, 1174.9,
1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9,
1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9, 1174.9,
1174.9]
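Shifting the frame is just a matter of reinterpreting the same bytes
from an offset pointer. Here is a minimal standalone sketch of the idea
(not my instrumented code, just an illustration):

import std.math, std.stdio;

void main()
{
    // Stand-in for the source buffer: 100 floats, as in the real case.
    auto buf = new float[100];
    foreach (i; 0 .. buf.length)
        buf[i] = sqrt(1_380_299.0f + i);

    // View the same memory with the reading frame shifted two bytes,
    // as if the low byte of the pointer had been overwritten with 2.
    enum shift = 2;
    auto shifted = cast(float*)(cast(ubyte*) buf.ptr + shift);

    // The last element of the shifted view would straddle the end of
    // the allocation, so print one element fewer.
    writeln(shifted[0 .. buf.length - 1]);
}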
To within rounding error (since the numbers are only printed to two
decimal places), 1174.86 ^ 2 == 1380299 and 1174.9 ^ 2 == 1380399. The
other failures display similar patterns, with the source buffer
being read or written two bytes out of frame. There's even a
significance to the number 2. This is the enum value that's used to
signal TaskStatus.done. It gets passed to the atomic functions all the
time, probably in an 8-bit register (AL, BL, CL, DL, etc.). Sure
enough, if I change TaskStatus.done to be 3 instead of 2, my reading
frame ends up 3 bytes out of phase instead of 2. Since the source
buffer is 100 elements long, and each element is 4 bytes, it is
allocated in the GC's 512-byte pool and is aligned on 512-byte
boundaries, so the low byte of its address is zero. Therefore, letting
TaskStatus.done overwrite the low-order byte of a 32- or 64-bit
register holding the pointer to the source buffer would shift the
reading frame by exactly TaskStatus.done bytes.
As further evidence that this is the issue, when I print out the
address of the corrupt buffer, it's actually offset by 2 or 3 bytes
(depending on the value of TaskStatus.done) from proper 4-byte float
alignment.
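The arithmetic is easy to sanity-check in isolation. A hypothetical
sketch (the address is made up; the alignment is what matters):

void main()
{
    enum done = 2;  // TaskStatus.done

    // A 512-byte-aligned address, like the GC returns for a 400-byte
    // (100 floats * 4 bytes) allocation from its 512-byte size class.
    size_t p = 0x12345600;
    assert(p % 512 == 0);

    // Overwriting the low byte of a register holding p with `done`
    // moves the pointer by exactly `done` bytes, because the low byte
    // was zero to begin with.
    size_t corrupted = (p & ~cast(size_t) 0xFF) | done;
    assert(corrupted - p == done);
    assert(corrupted % float.sizeof != 0);  // no longer float-aligned
}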
Does anyone know if there's already a bug report about this? I barely
know where to start creating a sane test case that doesn't depend on
some ridiculously complicated code path and thread interleavings, but
I do know that this qualifies as war-story debugging and will make for
a great answer if I'm ever in a job interview and the interviewer asks
about the toughest thing I've ever debugged.