GC buggy in windows?

tchaloupka chalucha at gmail.com
Fri Nov 8 13:40:19 UTC 2019


We're experiencing some really strange, nasty GC behavior in our 
IOCP I/O-heavy Windows app.

Sometimes it hangs with just: "Unable to load thread context"

I've spent the last three days experimenting and trying to 
narrow it down to find the exact cause :(

The problem is in the GC and its stop-the-world behavior.
In the `suspend` function in core.thread.osthread there is basically:

```
SuspendThread( t.m_hndl );
GetThreadContext( t.m_hndl, &context );
```

In some cases `GetThreadContext` fails with `ERROR_GEN_FAILURE` (31), 
which leads to the error being thrown.

The first problem is that the application doesn't terminate after 
this error, but just hangs.
That's because the thread is still suspended and somewhere down the 
line `join` is called on it, which never returns.

This is a nice blog post explaining that `SuspendThread` is 
actually asynchronous: 
https://devblogs.microsoft.com/oldnewthing/?p=44743

But it also states that once `GetThreadContext` has been called on 
the thread, we can be sure that it is actually suspended.

So what could lead to the error? Searching the Windows API 
documentation - nah, nothing, as usual..

Searching on the internet - sure, a lot of problems with some game 
engines using a GC (Unity) combined with some anti-cheat or 
antivirus programs - not our case.

Ok, so I've tried to compile a custom druntime (what a pleasure 
in itself) and found that:

* when you call Thread.yield and try to get the context again, it 
doesn't help; the error persists
* the only way I could work around this problem was to resume the 
thread, call Thread.yield, suspend the thread and try to get the 
context again; usually the first or second try succeeds - HOORAY.
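
The resume/yield/suspend retry described above could be sketched 
roughly like this. It is not the druntime code: `ThreadOps`, 
`suspendWithRetry` and `maxAttempts` are hypothetical names, and the 
OS calls are abstracted behind delegates (assumed to stand in for 
`SuspendThread`, `GetThreadContext`, `ResumeThread` and 
`Thread.yield`) so the control flow can be exercised without Windows.

```d
// Hypothetical sketch of the retry workaround; NOT the actual druntime code.
// The OS calls are abstracted as delegates so the logic runs on any platform.
struct ThreadOps
{
    bool delegate() suspend;     // stands in for SuspendThread
    bool delegate() loadContext; // stands in for GetThreadContext
    void delegate() resume;      // stands in for ResumeThread
    void delegate() yield;       // stands in for Thread.yield
}

/// Returns true with the thread left suspended,
/// or false with the thread left running.
bool suspendWithRetry(ThreadOps ops, size_t maxAttempts)
{
    foreach (attempt; 0 .. maxAttempts)
    {
        if (!ops.suspend())
            return false;   // could not suspend at all
        if (ops.loadContext())
            return true;    // context loaded, thread stays suspended
        // Context unavailable: let the thread run again before retrying,
        // so it is never left suspended on the failure path.
        ops.resume();
        ops.yield();
    }
    return false;
}

unittest
{
    int contextCalls, yields, suspendCount;
    ThreadOps ops;
    ops.suspend     = () { ++suspendCount; return true; };
    ops.loadContext = () => ++contextCalls >= 3; // fail twice, then succeed
    ops.resume      = () { --suspendCount; };
    ops.yield       = () { ++yields; };

    assert(suspendWithRetry(ops, 5));
    assert(contextCalls == 3 && yields == 2);
    assert(suspendCount == 1); // thread is left suspended on success
}
```

Because the thread is resumed before every retry and before giving 
up, a later `join` on it would not block on a thread that is still 
suspended.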

Then I spent a lot of time figuring out what is actually causing 
the error, and I have a theory that the problem is with some I/O 
operation running in kernel context that can't finish while the 
thread is suspended, and so the error is returned.

I ended up with this minimized test app that causes this error 
really fast.

```
import core.memory : GC;
import core.stdc.stdio;
import core.thread;
import std.random;
import std.range;

void main() {
	Thread t;
	while (true) {
		GC.collect();
		if (t is null || !t.isRunning) {
			t = new Thread(&threadProc);
			t.start();
		}
	}
}

void threadProc() {
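	// do a random number of short file operations, then exit
	foreach (_; iota(uniform(0, 100))) {
		FILE* f = fopen("dummy", "a");
		if (f is null) continue; // fopen can fail; never pass null to fclose
		scope (exit) fclose(f);
	}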
}
```

Compiled with: `dmd -m64 -debug test.d`
Tested on 64-bit Windows 10.

I definitely think that this is a bug in the Windows GC 
implementation.

Should I file it?

What seems to be a fix for both problems (the failing context load 
and the hang) is:
* retry the resume/suspend/get context sequence on the failing 
thread a few more times - how many times?
* before returning the error, resume the thread so it can be 
joined (I haven't looked at where `join` is being called from on 
termination)

To me it is also questionable whether terminating the application 
in this case is even the correct behavior. It might be better to 
scrap the GC attempt, resume the threads and retry on the next 
collection? That might lead to other problems, but as this occurs 
pretty rarely it might have a better outcome. Ideas?
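
To make the "scrap the attempt and resume" idea concrete, here is a 
minimal sketch of rolling back a partially completed stop-the-world. 
It only illustrates the control flow: `trySuspendAll` and the two 
delegates are hypothetical and are not part of druntime.

```d
// Hypothetical sketch: undo a partially completed stop-the-world.
// `trySuspend` and `resume` stand in for the per-thread OS operations.
bool trySuspendAll(T)(T[] threads,
                      bool delegate(T) trySuspend,
                      void delegate(T) resume)
{
    foreach (i, t; threads)
    {
        if (!trySuspend(t))
        {
            // Resume the threads suspended so far, in reverse order.
            foreach_reverse (s; threads[0 .. i])
                resume(s);
            return false; // caller skips this collection and retries later
        }
    }
    return true; // every thread is suspended
}

unittest
{
    bool[int] suspended; // ids of the currently "suspended" fake threads
    auto ok = trySuspendAll!int([1, 2, 3],
        (int id) {
            if (id == 3) return false; // the third thread cannot be stopped
            suspended[id] = true;
            return true;
        },
        (int id) { suspended.remove(id); });

    assert(!ok);
    assert(suspended.length == 0); // everything was resumed again
}
```

Whether skipping a collection is acceptable depends on why it was 
triggered; if the allocator needs the memory right away, the caller 
would still need a fallback.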

PS: I'm beginning to understand why C/C++ devs don't like GC 
languages ;-)
PPS: Now I hate Windows even more.. (normally a Linux dev)
PPPS: This kind of experience would definitely drive away devs who 
just need to get "shit done" and don't care about the tool 
used..

