Openwrt Linux Uclibc ARM GC issue

Radu void at null.pt
Wed Jan 10 00:27:47 UTC 2018


On Sunday, 17 December 2017 at 19:05:04 UTC, Joakim wrote:
> On Sunday, 17 December 2017 at 17:12:41 UTC, Radu wrote:
>> On Friday, 15 December 2017 at 14:24:08 UTC, David Nadlinger 
>> wrote:
>>> On 15 Dec 2017, at 14:06, Radu via digitalmars-d-ldc wrote:
>>>> When run, I get this error spuriously:
>>>>
>>>> ====================================
>>>> core.exception.AssertError@rt/sections_elf_shared.d(116): Assertion failure
>>>> Fatal error in EH code: _Unwind_RaiseException failed with reason code: 9
>>>> Aborted (core dumped)
>>>> ====================================
>>>
>>> The assert is inside an invariant which checks that the TLS 
>>> information has been extracted successfully. Perhaps uclibc 
>>> uses a TLS implementation that is not ABI-compatible with 
>>> glibc? (druntime needs to determine the TLS ranges to 
>>> register them with the GC, for the main thread as well as 
>>> newly spawned ones.)
>>>
>>> Where in the program lifecycle does the error occur? From the 
>>> backtrace, it looks like during C runtime startup, in which 
>>> case I am not quite seeing the connection to the GC.
>>>
>>> Why unwinding fails is another question, but not one I would 
>>> be terribly worried about – it is possible that the error 
>>> e.g. just occurs too early for the EH machinery to be 
>>> properly set up yet. Other low-level parts of druntime have 
>>> been converted to directly abort (e.g. using assert(0)) 
>>> instead. In fact, I am about to overhaul sections_elf_shared 
>>> in that respect anyway to improve error reporting when mixing 
>>> shared and non-shared builds.
>>>
>>>  — David
>>
>> My various attempts at getting it to run behaved very erratically.
>> So I changed the parameters for cross-compiling: basically, I 
>> removed all architecture specifics, leaving only 
>> `-mtriple=arm-linux-gnueabihf`, and `-mfloat-abi=hard` on the C 
>> side.
>>
>> My testing hardware is an ARM Cortex-A7, 
>> http://linux-sunxi.org/A33
>
> I believe that triple defaults to ARMv5; are you sure your 
> OpenWrt kernel is built for ARMv7?  Try running uname -m on the 
> device to check.  For example, most low- to mid-level 
> smartphones these days ship with ARMv8 chips but the kernel is 
> only built for 32-bit ARMv7, so they can only run 32-bit apps.
>
>> With the compiler switches changed I could run my test program 
>> and try the druntime test runner (albeit with some changes on 
>> math and stdio to get it linking):
>>
>> ./druntime-test-runner
>> 0.000s PASS release32 core.atomic
>> 0.000s PASS release32 core.bitop
>> 0.000s PASS release32 core.checkedint
>> 0.005s PASS release32 core.demangle
>> 0.000s PASS release32 core.exception
>> 0.002s PASS release32 core.internal.arrayop
>> 0.000s PASS release32 core.internal.convert
>> 0.000s PASS release32 core.internal.hash
>> 0.000s PASS release32 core.internal.string
>> 0.000s PASS release32 core.math
>> 0.000s PASS release32 core.memory
>> 0.002s PASS release32 core.sync.barrier
>> 0.015s PASS release32 core.sync.condition
>> 0.000s PASS release32 core.sync.config
>> 0.016s PASS release32 core.sync.mutex
>> 0.016s PASS release32 core.sync.rwmutex
>> 0.002s PASS release32 core.sync.semaphore
>> Segmentation fault (core dumped)
>>
>> The segfault comes from core.thread:1351:
>>
>> unittest
>> {
>>     auto t1 = new Thread({
>>         foreach (_; 0 .. 20)
>>             Thread.getAll;
>>     }).start;
>>     auto t2 = new Thread({
>>         foreach (_; 0 .. 20)
>>             GC.collect; // this seg faults
>>     }).start;
>>     t1.join();
>>     t2.join();
>> }
>>
>> Calling GC.collect from the main thread doesn't segfault.
>
> Try running core.thread alone and see if it makes a difference, 
> ./druntime-test-runner core.thread, as I've sometimes seen 
> tested modules interfere with each other.  I see that there are 
> a few places where Glibc is assumed in core.thread, make sure 
> those are right on Uclibc too:
>
> https://github.com/ldc-developers/druntime/blob/ldc-v1.6.0/src/core/thread.d#L3301
> https://github.com/ldc-developers/druntime/blob/ldc-v1.6.0/src/core/thread.d#L3410
>
> You can also try skipping those tests that segfault for now and 
> make sure everything else works, by adding something like 
> version(skip) before that failing unittest block, so you know 
> the extent of the test problems.
>
>> The core dump is not very helpful (the stack is garbage), but 
>> running a minimal program with the unit test under gdbserver, I 
>> can see this:
>>
>> Thread 1 "test" received signal SIGUSR1, User defined signal 1.
>> pthread_getattr_np (thread_id=0, attr=0xb6b302bc) at libpthread/nptl/pthread_getattr_np.c:47
>> 47        iattr->schedpolicy = thread->schedpolicy;
>> (gdb) step
>>
>> Thread 1 "test" received signal SIGUSR2, User defined signal 2.
>> 0xb6e50d80 in epoll_wait (epfd=-1090521272, events=0x8, maxevents=2, timeout=-1224756080) at libc/sysdeps/linux/common/epoll.c:58
>> 58      CANCELLABLE_SYSCALL(int, epoll_wait, (int epfd, struct epoll_event *events, int maxevents, int timeout),
>> (gdb) step
>>
>> Thread 1 "test" received signal SIGSEGV, Segmentation fault.
>> 0xfffffffc in ?? ()
>> (gdb)
>
> The SIGUSR1/SIGUSR2 signals mean the GC ran fine.  You'd need 
> to delve more into the code and the implementation details 
> mentioned above to track this down.
>
> On Sunday, 17 December 2017 at 17:20:32 UTC, Radu wrote:
>> Yes, the latest LDC versions make cross-compiling a breeze, so 
>> kudos to you guys for making this happen. I'm using the Windows 
>> Subsystem for Linux, btw, so for me this is even more fun as I 
>> can work in both environments natively :)
>
> Yeah, you could just use the Windows ldc too, assuming you have 
> a cross-compiler from that OS, as shown on the wiki for Windows 
> with the Android NDK.
>
>> The modifications needed, at least on the surface, are very 
>> few: some math and memory stream functions are missing.
>
> I don't know how much it differs from Glibc, but we'd always be 
> interested in a port, assuming you have the time to submit a 
> pull like this recent one for Musl:
>
> https://github.com/dlang/druntime/pull/1997
>
>> The roadblock looks to be somewhere in the GC and TLS, or in the 
>> interaction between them (at least that is my feeling ATM).
>
> Not being able to do an explicit collect there isn't that big a 
> deal: I'd skip that test for now and run everything else, then 
> come back to that one once you have an idea of the bigger 
> picture.
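David's explanation above is that druntime has to determine the TLS ranges so it can register them with the GC, and the original assert fires when that extraction fails. A minimal C sketch of the first step, locating the main executable's PT_TLS segment through dl_iterate_phdr, could be built against uClibc-ng on the device to see whether the segment is visible at all. It assumes the libc provides dl_iterate_phdr in <link.h> and visits the main program first (glibc documents that ordering; I have not checked uClibc-ng). The names demo_tls_var and main_tls_segment_size are hypothetical, and the sketch only reports the TLS template size, not the per-thread address druntime ultimately needs.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

/* Hypothetical TLS variable so the executable is guaranteed a PT_TLS segment. */
_Thread_local int demo_tls_var = 1;

/* dl_iterate_phdr callback: record the PT_TLS size of the first object. */
static int find_tls_segment(struct dl_phdr_info *info, size_t size, void *data)
{
    size_t *tls_size = data;
    (void)size;
    for (ElfW(Half) i = 0; i < info->dlpi_phnum; ++i) {
        if (info->dlpi_phdr[i].p_type == PT_TLS) {
            *tls_size = info->dlpi_phdr[i].p_memsz;
            break;
        }
    }
    /* Non-zero return stops the iteration after the first object,
       which is assumed to be the main program. */
    return 1;
}

/* Returns the size of the main executable's TLS template, or 0 if none. */
size_t main_tls_segment_size(void)
{
    size_t tls_size = 0;
    dl_iterate_phdr(find_tls_segment, &tls_size);
    return tls_size;
}
```

If this reports 0 on the device while the program clearly has thread-local variables, the problem would lie in how the libc exposes TLS, before druntime is involved at all.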

I got some time to work on this. Just to clarify: I'm developing 
against uClibc-ng 1.0.9; I noticed others suggesting this and 
wanted to make it clear.
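Because the target is uClibc-ng rather than glibc, and the gdb session quoted earlier stopped inside pthread_getattr_np, one cheap check on the device is whether that call reports a sane stack range for the calling thread. A minimal C sketch, assuming the GNU extension pthread_getattr_np is available (it needs _GNU_SOURCE); current_stack_range is a hypothetical name. On a downward-growing stack such as ARM's, the base returned by pthread_attr_getstack is the lowest address, so the usable range is [base, base + size).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores the calling thread's stack range, as reported
   by pthread_getattr_np, in [*low, *high). Returns 0 on success. */
int current_stack_range(uintptr_t *low, uintptr_t *high)
{
    pthread_attr_t attr;
    void *base = NULL;
    size_t size = 0;
    int rc;

    if (pthread_getattr_np(pthread_self(), &attr) != 0)
        return -1;
    rc = pthread_attr_getstack(&attr, &base, &size);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return -1;

    *low = (uintptr_t)base;          /* lowest address of the stack */
    *high = (uintptr_t)base + size;  /* one past the highest address */
    return 0;
}
```

A local variable's address should fall inside the reported range, both in the main thread and in a spawned thread; if it does not, stack scanning during a collection would be reading the wrong memory.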

Re. the architecture: it is an armv7a, as 'uname -a' shows:
'Linux fs 3.4.39 #249 SMP PREEMPT Wed Oct 4 12:07:05 MYT 2017 
armv7l GNU/Linux'
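The same check Joakim suggested (uname -m) can be done from inside a test program, which is handy when the binary itself is what gets deployed. A minimal C sketch using uname(2); treating any machine string that starts with "armv7" (such as the armv7l above) as ARMv7 is my assumption, and both helper names are hypothetical.

```c
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical helper: 1 if a uname machine string denotes an ARMv7 kernel. */
int is_armv7_machine(const char *machine)
{
    return strncmp(machine, "armv7", 5) == 0;
}

/* Copies the running kernel's machine string (what `uname -m` prints) into
   buf. Returns 0 on success, -1 on failure. */
int kernel_machine(char *buf, size_t len)
{
    struct utsname u;
    if (len == 0 || uname(&u) != 0)
        return -1;
    strncpy(buf, u.machine, len - 1);
    buf[len - 1] = '\0';
    return 0;
}
```

Note this reports what the kernel was built for, not which instructions the compiler emitted for the binary.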

I could not produce any working binary by specifying the armv7a 
architecture to ldc, so I used the generic arm architecture for 
gnueabihf, as previously stated.

I managed to get the druntime test runner working (minus some math 
functions and memstream), except for one specific blocking issue: 
Thread.suspend does not work; it produces a segfault.
To test this, I commented out all suspendAll/resumeAll unittests 
in core.thread and stubbed out GC.collect().

This issue is not linked to the GC: the segfault happens even 
with the GC.collect function disabled and the suspendAll/resumeAll 
unittests enabled. The GC just happens to use the suspend 
mechanism and so exposes the core issue.
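To make the suspend handshake concrete, here is a simplified C model of a signal-based suspend/resume. It is a sketch, not druntime's code: the suspender sends SIGUSR1 with pthread_kill, the target's handler posts a semaphore and parks in sigsuspend with every signal except SIGUSR2 blocked, and SIGUSR2 wakes it. The SIGUSR1/SIGUSR2 choice mirrors the signals in the gdb output; all function and variable names are hypothetical. The key design choice is filling sa_mask for the suspend handler: SIGUSR2 then stays blocked between sem_post and sigsuspend, so a resume signal that arrives early remains pending instead of being delivered before sigsuspend is reached.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define SUSPEND_SIG SIGUSR1
#define RESUME_SIG  SIGUSR2

static sem_t suspend_ack;        /* posted by the target once it is parked */
static atomic_int stop_worker;   /* tells the demo worker to exit */

static void suspend_handler(int sig)
{
    sigset_t wait_set;
    (void)sig;
    sigfillset(&wait_set);
    sigdelset(&wait_set, RESUME_SIG);  /* wake only for the resume signal */
    sem_post(&suspend_ack);            /* report "parked" to the suspender */
    /* RESUME_SIG is still blocked here (filled sa_mask below), so if it was
       already sent it stays pending; sigsuspend atomically unblocks it. */
    sigsuspend(&wait_set);
}

static void resume_handler(int sig)
{
    (void)sig;  /* exists only to interrupt sigsuspend */
}

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_worker))
        ;  /* stands in for a thread doing arbitrary work */
    return NULL;
}

/* Suspends and resumes one worker thread. Returns 0 on success. */
int suspend_resume_roundtrip(void)
{
    struct sigaction sa_suspend, sa_resume;
    pthread_t tid;
    int rc = 0;

    memset(&sa_suspend, 0, sizeof sa_suspend);
    sa_suspend.sa_handler = suspend_handler;
    sa_suspend.sa_flags = SA_RESTART;
    sigfillset(&sa_suspend.sa_mask);   /* block everything, incl. RESUME_SIG */

    memset(&sa_resume, 0, sizeof sa_resume);
    sa_resume.sa_handler = resume_handler;
    sigfillset(&sa_resume.sa_mask);

    if (sigaction(SUSPEND_SIG, &sa_suspend, NULL) != 0 ||
        sigaction(RESUME_SIG, &sa_resume, NULL) != 0)
        return -1;
    if (sem_init(&suspend_ack, 0, 0) != 0)
        return -1;

    atomic_store(&stop_worker, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        sem_destroy(&suspend_ack);
        return -1;
    }

    pthread_kill(tid, SUSPEND_SIG);            /* ask the worker to park */
    while (sem_wait(&suspend_ack) != 0) {
        if (errno != EINTR) { rc = -1; break; }
    }
    /* On success the worker is parked in sigsuspend; a GC would scan its
       stack at this point. */
    pthread_kill(tid, RESUME_SIG);             /* let it continue */

    atomic_store(&stop_worker, 1);
    pthread_join(tid, NULL);
    sem_destroy(&suspend_ack);
    return rc;
}
```

Building this same C program against uClibc-ng on the A33 board would show whether the handshake works at the libc level, independently of druntime.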

From what I can see in gdb, 'thread_resumeHandler' is to blame: it 
looks like 'sem_post( &suspendCount )' immediately triggers the 
resumeSignal, so the call to 'sigsuspend( &sigres )' is never 
made.

Like:

464                     status = sem_post( &suspendCount );
(gdb) n

Thread 2 "druntime-test-r" received signal SIGUSR2, User defined signal 2.
0x001b46d0 in core.thread.thread_suspendHandler(int).op(void*) (sp=0xb572f900 "$F\033") at thread.d:464
464                     status = sem_post( &suspendCount );
(gdb) info threads
   Id   Target Id         Frame
   1    Thread 16005.16005 "druntime-test-r" 0x001ba7a0 in _D4core6thread5Fiber5stateMxFNaNbNdNiNfZEQBnQBlQBh5State (this=0xb6d34980) at thread.d:4533
* 2    Thread 16005.16273 "druntime-test-r" 0x001b46d0 in core.thread.thread_suspendHandler(int).op(void*) (sp=0xb572f900 "$F\033") at thread.d:464
(gdb) bt
#0  0x001b46d0 in core.thread.thread_suspendHandler(int).op(void*) (sp=0xb572f900 "$F\033") at thread.d:464
#1  0x001b483c in core.thread.callWithStackShell(scope void(void*) nothrow delegate) (fn=...) at thread.d:2600
#2  0x001b45f8 in thread_suspendHandler (sig=10) at thread.d:487
#3  0xfffffffe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) n

Thread 2 "druntime-test-r" received signal SIGSEGV, Segmentation fault.
0xfffffffc in ?? ()
(gdb) bt
#0  0xfffffffc in ?? ()
#1  0xfffffffe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
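One hypothesis consistent with this trace: under ptrace, a blocked signal does not produce a signal-delivery stop until it is unblocked, so gdb reporting SIGUSR2 at the sem_post line suggests the resume signal is not actually blocked while the suspend handler runs. The C probe below checks that property in isolation: it installs a SIGUSR1 handler with a filled sa_mask, raises SIGUSR1, and records from inside the handler whether SIGUSR2 is in the blocked set. It is a sketch and the function name is hypothetical. If it returns 1 when built against uClibc-ng but the D handler still sees SIGUSR2 early, my guess (unverified) would be a mismatch between how the D side declares sigaction_t/sigset_t and what uClibc-ng's headers define.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t probe_ran;
static volatile sig_atomic_t probe_resume_blocked;

static void probe_handler(int sig)
{
    sigset_t current;
    (void)sig;
    /* With a NULL new set, pthread_sigmask only reports the current mask. */
    if (pthread_sigmask(SIG_BLOCK, NULL, &current) == 0)
        probe_resume_blocked = (sigismember(&current, SIGUSR2) == 1);
    probe_ran = 1;
}

/* Returns 1 if SIGUSR2 is blocked inside a SIGUSR1 handler installed with a
   filled sa_mask, 0 if it is not, and -1 if the probe could not run. */
int resume_signal_blocked_in_handler(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigfillset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) != 0)
        return -1;

    probe_ran = 0;
    probe_resume_blocked = 0;
    raise(SIGUSR1);                   /* handler runs before raise returns */
    sigaction(SIGUSR1, &old, NULL);   /* restore the previous disposition */

    return probe_ran ? (int)probe_resume_blocked : -1;
}
```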