GC buggy in windows?
tchaloupka
chalucha at gmail.com
Fri Nov 8 13:40:19 UTC 2019
We're experiencing some really strange, nasty GC behavior in our
IOCP I/O-heavy Windows app.
Sometimes it hangs with just: "Unable to load thread context"
I've spent the last three days experimenting and trying to
narrow it down to find the exact cause :(
The problem is in the GC and its stop-the-world behavior.
In the core.thread.osthread suspend function there is basically:
```
SuspendThread( t.m_hndl );
GetThreadContext( t.m_hndl, &context );
```
In some cases GetThreadContext fails with `ERROR_GEN_FAILURE(31)`
which leads to the error being thrown.
The first problem is that the application doesn't terminate after
this error, but just hangs.
That's because the thread is still suspended and somewhere down
the line `join` is called on this thread, which won't return - ever.
This is a nice blog post explaining that `SuspendThread` is
actually asynchronous:
https://devblogs.microsoft.com/oldnewthing/?p=44743
But it also states that when `GetThreadContext` is called on it,
we can be sure that it is actually already suspended.
So what could lead to the error? Searching in the Windows API
documentation - nah, nothing as usual..
Searching on the internet - sure, a lot of problems with some game
engines using GC (Unity) combined with some anticheat or
antivirus programs - not our case.
Ok, so I've tried to compile a custom druntime (what a pleasure
in itself) and found that:
* when you call Thread.yield and try to get the context again, it
doesn't help, still an error
* the only way I could work around this problem was to resume the
thread, call Thread.yield, suspend the thread and try to get the
context again; usually the first or second try succeeds - HOORAY.
Then I've spent a lot of time figuring out what is actually
causing the error, and I have a theory that the problem is some IO
operation running in kernel context that can't finish while the
thread is suspended, and so the error is returned.
I ended up with this minimized test app that causes this error
really fast.
```
import core.memory : GC;
import core.stdc.stdio;
import core.thread;
import std.random;
import std.range;

void main() {
    Thread t;
    while (true) {
        GC.collect();
        if (t is null || !t.isRunning) {
            t = new Thread(&threadProc);
            t.start();
        }
    }
}

void threadProc() {
    foreach (_; iota(uniform(0, 100))) {
        FILE* f = fopen("dummy", "a");
        scope (exit) fclose(f);
    }
}
```
Compiled with: `dmd -m64 -debug test.d`
Tested on 64-bit Windows 10.
I definitely think that this is a bug in the Windows GC
implementation.
Should I file it?
What seems to be a fix for both problems (the failing
GetThreadContext and the hang) is:
* retry the resume/suspend/get context sequence on the failing
thread a few more times - how many times?
* before returning the error, resume the thread so it can be
joined (I haven't looked at where join is called from on
termination)
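The two points above could be combined roughly like this. It is
only a sketch, not druntime code: `suspendWithContext` is a
hypothetical helper, the three delegates stand in for
SuspendThread, GetThreadContext and ResumeThread on a single
thread, and the default of 10 attempts is an arbitrary guess.

```d
import core.thread : Thread;

// Hypothetical helper, not part of druntime.
// suspend/getContext/resume stand in for the Windows calls
// SuspendThread, GetThreadContext and ResumeThread on one thread.
bool suspendWithContext(
    scope bool delegate() suspend,
    scope bool delegate() getContext,
    scope void delegate() resume,
    size_t maxAttempts = 10) // assumed limit, not a measured value
{
    foreach (attempt; 0 .. maxAttempts)
    {
        if (!suspend())
            return false; // never suspended, nothing to undo
        if (getContext())
            return true;  // success: thread stays suspended
        // Context not available: let the thread run again before
        // the next attempt, as in the workaround described above.
        resume();
        Thread.yield();
    }
    // Every attempt failed. The thread was resumed after the last
    // one, so a later join on it is able to return.
    return false;
}
```

On `false` the caller can raise the error knowing the thread is
running again, which should avoid the hang in `join`.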
For me it is also questionable if terminating the application in
this case is even the correct behavior. It might be better to
scratch the GC attempt, resume the threads and retry on next
collection? That might lead to other problems but as this occurs
pretty rarely it might have a better outcome. Ideas?
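Skipping the collection could look roughly like this. Again only
a sketch under my own assumptions: `trySuspendAll` and both
delegates are hypothetical, and `trySuspend(i)` is assumed to
leave thread i running when it returns false.

```d
// Hypothetical sketch, not druntime code.
// trySuspend(i): suspend thread i and capture its context,
//                false on failure (thread i is left running).
// resume(i):     resume thread i.
bool trySuspendAll(
    size_t threadCount,
    scope bool delegate(size_t) trySuspend,
    scope void delegate(size_t) resume)
{
    foreach (i; 0 .. threadCount)
    {
        if (!trySuspend(i))
        {
            // Roll back: resume everything suspended so far,
            // most recently suspended thread first.
            foreach_reverse (j; 0 .. i)
                resume(j);
            return false; // caller skips this collection
        }
    }
    return true; // all threads suspended, collection can proceed
}
```

The collector would then only run when this returns true and
otherwise just try again on the next collection.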
PS: I'm beginning to understand why C/C++ devs don't like GC
languages ;-)
PPS: Now I hate Windows even more.. (normally a Linux dev)
PPPS: This kind of experience would definitely drive away devs who
just need to get "shit done" and don't care about the tool
used..