[Issue 15939] New: GC.collect causes deadlock in multi-threaded environment

Mon Apr 18 11:58:33 PDT 2016

https://issues.dlang.org/show_bug.cgi?id=15939

          Issue ID: 15939
           Summary: GC.collect causes deadlock in multi-threaded
                    environment
           Product: D
           Version: D2
          Hardware: x86_64
                OS: Linux
            Status: NEW
          Severity: blocker
          Priority: P1
         Component: druntime
          Assignee: nobody at puremagic.com
          Reporter: apreobrazhensky at gmail.com

I have a multi-threaded application in which worker threads do memory-intensive
work and the main thread periodically cleans up garbage by calling GC.collect
manually. Sometimes GC.collect causes a deadlock. I don't have a simple example,
but I do have stack traces of the threads at the moment of the deadlock.
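
For illustration only, the usage pattern described above can be sketched
roughly as follows. This is not a confirmed reproducer: the worker function,
the stop flag, the thread count, the buffer sizes and the sleep interval are
all hypothetical. Only GC.collect, array resizing and class allocation
correspond to calls visible in the stack traces.

```d
import core.atomic : atomicLoad, atomicStore;
import core.memory : GC;
import core.thread : Thread;
import core.time : msecs;

// Hypothetical flag telling the workers to finish.
shared bool stop;

// Hypothetical worker: allocates continuously so that it frequently
// enters the GC (array growth and class allocation, as in the traces).
void worker()
{
    int[] buf;
    while (!atomicLoad(stop))
    {
        buf.length = buf.length + 1024; // array resize, may extend a GC block
        auto o = new Object;            // class allocation through the GC
        if (buf.length > (1 << 20))
            buf = null;                 // drop the buffer so it becomes garbage
    }
}

void main()
{
    // Thread count is an arbitrary assumption.
    Thread[] workers;
    foreach (i; 0 .. 8)
    {
        auto t = new Thread(&worker);
        t.start();
        workers ~= t;
    }

    // The main thread periodically forces a full collection.
    foreach (i; 0 .. 100)
    {
        GC.collect();
        Thread.sleep(10.msecs);
    }

    atomicStore(stop, true);
    foreach (t; workers)
        t.join();
}
```

If the problem is a race, a sketch like this may need many runs, or may never
hang at all, depending on timing and the druntime version.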

It happens with both dmd 2.071.0 and dmd 2.070.* (so it is not related to
the recent GC spinlock change).

This seems like a blocker to me: I suspect that if it happens when I call
GC.collect manually, it could also happen during normal collections. I'm not
familiar with the runtime code, but judging from the stack traces below I would
expect some sort of race condition.

Configuration:

dmd 2.071.0 with -O -release -inline -boundscheck=off
Linux 3.2.0-95-generic #135-Ubuntu SMP Tue Nov 10 13:33:29 UTC 2015 x86_64
x86_64 x86_64 GNU/Linux

This is the main thread's stack trace:

Thread 1 (Thread 0x7ff6653bb6c0 (LWP 6857)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x00000000007b3ff6 in thread_suspendAll ()
#2  0x000000000079980d in gc.gc.Gcx.fullcollect() ()
#3  0x000000000079c2b2 in gc.gc.GC.__T9runLockedS49_D2gc2gc2GC11fullCollectMFNbZ2goFNbPS2gc2gc3GcxZmTPS2gc2gc3GcxZ.runLocked() ()
#4  0x0000000000796535 in gc.gc.GC.fullCollect() ()
#5  0x000000000076091c in gc_collect ()
... application stack

This is what the stack trace looks like for the threads that were suspended
correctly:

Thread XX (Thread 0x7ff5c6ffd700 (LWP 6897)):
#0  0x00007ff6640e6454 in do_sigsuspend (set=0x7ff5c6ff9bc0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  __GI___sigsuspend (set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2  0x00000000007c0401 in core.thread.thread_suspendHandler() ()
#3  0x00000000007c045c in core.thread.callWithStackShell() ()
#4  0x00000000007c038f in thread_suspendHandler ()
#5  <signal handler called>
... application stack

This is what the stack trace looks like for the threads that were not suspended:

Thread YY (Thread 0x7ff5c67fc700 (LWP 6898)):
#0  0x00007ff664d9b52d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000075dfde in core.thread.Thread.sleep() ()
#2  0x00000000007b46e0 in core.internal.spinlock.SpinLock.yield() ()
#3  0x00000000007b467c in core.internal.spinlock.SpinLock.lock() ()
#4  0x000000000079bc21 in gc.gc.GC.__T9runLockedS46_D2gc2gc2GC12extendNoSyncMFNbPvmmxC8TypeInfoZmS21_D2gc2gc10extendTimelS21_D2gc2gc10numExtendslTPvTmTmTxC8TypeInfoZ.runLocked() ()
#5  0x0000000000760bcc in gc_extend ()
#6  0x0000000000763c85 in _d_arraysetlengthT ()
... application stack

Thread ZZ (Thread 0x7ff566ffd700 (LWP 6918)):
#0  0x00007ff664d9b52d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000075dfde in core.thread.Thread.sleep() ()
#2  0x00000000007b46e0 in core.internal.spinlock.SpinLock.yield() ()
#3  0x00000000007b467c in core.internal.spinlock.SpinLock.lock() ()
#4  0x000000000079ba3c in gc.gc.GC.__T9runLockedS47_D2gc2gc2GC12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS21_D2gc2gc10mallocTimelS21_D2gc2gc10numMallocslTmTkTmTxC8TypeInfoZ.runLocked() ()
#5  0x00000000007953be in gc.gc.GC.malloc() ()
#6  0x0000000000760a04 in gc_malloc ()
#7  0x0000000000762c43 in _d_newclass ()
... application stack
