<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
On 4/30/2011 3:26 AM, Walter Bright wrote:
<blockquote cite="mid:4DBBB98D.6020706@digitalmars.com" type="cite">I
have a dual core system, if that helps.
</blockquote>
Yeah, I have no idea what to do about the issues with
std.parallelism. Both subtle concurrency bugs and subtle codegen
bugs tend to be Heisenbugs. I'm not 100% sure which one we have or
whether all the weirdness we've been seeing lately even has the same
root cause. Here are some comments:<br>
<br>
1. The facts that we don't see any similar bugs anywhere else and
that the manifestations are rare and non-deterministic are evidence
that it's a concurrency bug. For a while I thought this was the
classic concurrency bug that only occurs on other people's
machines. Note that the bug is triggered by Map, which I haven't
used in production code yet. I use all the other primitives in
production code on several machines and platforms. If they were
this buggy I'd have noticed by now. Map is by far the least tested
of the primitives.<br>
<br>
2. As I posted on the DMD-internals mailing list, the root cause of
some unit test failures I was getting very rarely on Windows appears
to be clobbering of the low-order bits of pointers with the value of
TaskStatus.done. (See
<a class="moz-txt-link-freetext" href="http://lists.puremagic.com/pipermail/dmd-internals/2011-April/001478.html">http://lists.puremagic.com/pipermail/dmd-internals/2011-April/001478.html</a>)
This is evidence that it's a codegen bug, since clobbering the
low-order 8 bits of a 32-bit register that holds a pointer is the
simplest, most obvious explanation for this behavior. It's hard to
see how a race condition could cause something like this. However,
if it is a codegen bug, the wrong code only shows up on very rarely
taken code paths. (Of course, thread interleavings affect the code
path taken, especially in the Map test that triggers all these
bugs.)<br>
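<br>
A minimal sketch of the suspected failure mode, written in Python rather than D purely for illustration. It models the 32-bit register as four bytes in little-endian order and assumes that TaskStatus.done has the numeric value 2; the pointer value and the helper name clobber_low_byte are made up for the example:<br>
<pre>
```python
# Assumption: TaskStatus.done has the numeric value 2 (third member of the enum).
TASK_STATUS_DONE = 2


def clobber_low_byte(word32, byte_value):
    """Overwrite only the low-order 8 bits of a 32-bit value.

    This mimics a store to an 8-bit register alias (such as AL within EAX)
    while the full 32-bit register still holds a live pointer.
    """
    raw = bytearray(word32.to_bytes(4, "little"))
    raw[0] = byte_value  # index 0 is the low-order byte in little-endian order
    return int.from_bytes(raw, "little")


# Hypothetical pointer value; any 32-bit address would do.
pointer = 0x00A1B2C8
corrupted = clobber_low_byte(pointer, TASK_STATUS_DONE)

# The upper 24 bits survive; only the low-order byte now holds the status value.
print(hex(pointer), "became", hex(corrupted))
```
</pre>
The corrupted value still points into the same region of memory as the original, which would be consistent with this kind of bug going unnoticed until the pointer is dereferenced later.<br>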
<br>
3. I don't have a clue why the failures occur so much more
frequently on FreeBSD than on other OSes. My pure speculation is
that either the scheduler interleaves threads in a much more
"dangerous" way or some platform-specific function or calling
convention makes register management bugs more apparent.<br>
<br>
4. I'm no closer than I was yesterday to isolating a decent test
case for the low-order bit corruption bug. It's <b>probably</b> related
to clobbering the low-order bits of a 32-bit register with its 8-bit
alias, but I can't <b>prove</b> that yet.<br>
<br>
5. Since this low-order bit corruption bug is capable of corrupting
pointers, it's probably, but not provably, related to all the other
weirdness we've been seeing. For example, if the prev or next
pointers of Task get corrupted, all hell breaks loose, and
segfaults and other erratic behavior follow.<br>
<br>
6. Since I changed TaskPool.tryDeleteExecute() to have a try/catch
block (see my latest changeset), I've apparently perturbed the unit
test failures out of existence on Windows. If I remove this
try/catch block but keep all the other changes from the changeset,
the unit tests start failing again. Somehow this extra code is
perturbing a subtle race condition or codegen bug out of existence.<br>
</body>
</html>