OpenWrt Linux uClibc ARM GC issue
Radu
void at null.pt
Wed Jan 10 00:27:47 UTC 2018
On Sunday, 17 December 2017 at 19:05:04 UTC, Joakim wrote:
> On Sunday, 17 December 2017 at 17:12:41 UTC, Radu wrote:
>> On Friday, 15 December 2017 at 14:24:08 UTC, David Nadlinger
>> wrote:
>>> On 15 Dec 2017, at 14:06, Radu via digitalmars-d-ldc wrote:
>>>> When run, I get this error spuriously:
>>>>
>>>> ====================================
>>>> core.exception.AssertError@rt/sections_elf_shared.d(116):
>>>> Assertion failure
>>>> Fatal error in EH code: _Unwind_RaiseException failed with
>>>> reason code: 9
>>>> Aborted (core dumped)
>>>> ====================================
>>>
>>> The assert is inside an invariant which checks that the TLS
>>> information has been extracted successfully. Perhaps uClibc
>>> uses a TLS implementation that is not ABI-compatible with
>>> glibc? (druntime needs to determine the TLS ranges to
>>> register them with the GC, for the main thread as well as
>>> newly spawned ones.)
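
David's ABI-compatibility guess can be probed directly on the
device. As far as I can tell, on Linux druntime's
rt.sections_elf_shared finds the TLS ranges via dl_iterate_phdr and
each object's PT_TLS program header. The sketch below is a
hypothetical diagnostic, not druntime's code: it lists the objects
that carry a PT_TLS segment and compares the dl_phdr_info size the
C library passes to the callback with the size of druntime's D
binding. It assumes Linux and druntime's core.sys.linux.link /
core.sys.linux.elf bindings; reportTls is a made-up name.

```d
// Hypothetical diagnostic (not druntime's code): list loaded ELF objects
// with a PT_TLS segment and check whether the C library's dl_phdr_info is
// large enough to contain the dlpi_tls_modid field druntime's binding has.
// Assumes Linux and the core.sys.linux.link / core.sys.linux.elf bindings.
import core.stdc.stdio : printf;
import core.sys.linux.elf : PT_TLS;
import core.sys.linux.link : dl_iterate_phdr, dl_phdr_info;

// `size` is sizeof(struct dl_phdr_info) as the C library sees it.
extern (C) int reportTls(dl_phdr_info* info, size_t size, void* data) nothrow @nogc
{
    auto found = cast(size_t*) data;
    // The main program usually reports an empty name.
    const name = (info.dlpi_name && *info.dlpi_name)
        ? info.dlpi_name : "(main program)".ptr;

    foreach (i; 0 .. info.dlpi_phnum)
    {
        if (info.dlpi_phdr[i].p_type != PT_TLS)
            continue;
        ++*found;
        printf("%s: PT_TLS memsz=%u\n", name, cast(uint) info.dlpi_phdr[i].p_memsz);

        // Only read dlpi_tls_modid if libc says the struct extends that far.
        if (size >= dl_phdr_info.dlpi_tls_modid.offsetof + size_t.sizeof)
            printf("  TLS module id %u\n", cast(uint) info.dlpi_tls_modid);
        else
            printf("  libc's dl_phdr_info (%u bytes) has no dlpi_tls_modid\n",
                   cast(uint) size);
    }
    return 0; // 0 = keep iterating over the remaining objects
}

void main()
{
    printf("druntime's dl_phdr_info: %u bytes\n", cast(uint) dl_phdr_info.sizeof);
    size_t found;
    dl_iterate_phdr(&reportTls, &found);
    printf("%u object(s) with a PT_TLS segment\n", cast(uint) found);
}
```

If uClibc-ng reports a smaller structure than the binding expects,
that would point at the kind of ABI mismatch David suspects.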
>>>
>>> Where in the program lifecycle does the error occur? From the
>>> backtrace, it looks like during C runtime startup, in which
>>> case I am not quite seeing the connection to the GC.
>>>
>>> Why unwinding fails is another question, but not one I would
>>> be terribly worried about – it is possible that the error
>>> e.g. just occurs too early for the EH machinery to be
>>> properly set up yet. Other low-level parts of druntime have
>>> been converted to directly abort (e.g. using assert(0))
>>> instead. In fact, I am about to overhaul sections_elf_shared
>>> in that respect anyway to improve error reporting when mixing
>>> shared and non-shared builds.
>>>
>>> — David
>>
>> My various attempts at getting it to run behaved very erratically.
>> So I changed the cross-compilation parameters: basically I
>> removed all architecture specifics, leaving only
>> `-mtriple=arm-linux-gnueabihf` on the D side and
>> `-mfloat-abi=hard` on the C side.
>>
>> My testing hardware is an ARM Cortex-A7,
>> http://linux-sunxi.org/A33
>
> I believe that triple defaults to ARMv5; are you sure your
> OpenWrt kernel is built for ARMv7? Try running uname -m on the
> device to check. For example, most low- to mid-range
> smartphones these days ship with ARMv8 chips, but the kernel is
> only built for 32-bit ARMv7, so they can only run 32-bit apps.
>
>> With the compiler switches changed, I could run my test program
>> and try the druntime test runner (albeit with some changes to
>> math and stdio to get it linking):
>>
>> ./druntime-test-runner
>> 0.000s PASS release32 core.atomic
>> 0.000s PASS release32 core.bitop
>> 0.000s PASS release32 core.checkedint
>> 0.005s PASS release32 core.demangle
>> 0.000s PASS release32 core.exception
>> 0.002s PASS release32 core.internal.arrayop
>> 0.000s PASS release32 core.internal.convert
>> 0.000s PASS release32 core.internal.hash
>> 0.000s PASS release32 core.internal.string
>> 0.000s PASS release32 core.math
>> 0.000s PASS release32 core.memory
>> 0.002s PASS release32 core.sync.barrier
>> 0.015s PASS release32 core.sync.condition
>> 0.000s PASS release32 core.sync.config
>> 0.016s PASS release32 core.sync.mutex
>> 0.016s PASS release32 core.sync.rwmutex
>> 0.002s PASS release32 core.sync.semaphore
>> Segmentation fault (core dumped)
>>
>> The segfault comes from core.thread:1351:
>>
>> unittest
>> {
>>     auto t1 = new Thread({
>>         foreach (_; 0 .. 20)
>>             Thread.getAll;
>>     }).start;
>>     auto t2 = new Thread({
>>         foreach (_; 0 .. 20)
>>             GC.collect; // this seg faults
>>     }).start;
>>     t1.join();
>>     t2.join();
>> }
>>
>> Calling GC.collect from the main thread doesn't seg fault.
>
> Try running core.thread alone (./druntime-test-runner core.thread)
> and see if it makes a difference, as I've sometimes seen tested
> modules interfere with each other. There are also a few places in
> core.thread that assume Glibc; make sure those are right for
> uClibc too:
>
> https://github.com/ldc-developers/druntime/blob/ldc-v1.6.0/src/core/thread.d#L3301
> https://github.com/ldc-developers/druntime/blob/ldc-v1.6.0/src/core/thread.d#L3410
>
> You can also try skipping those tests that segfault for now and
> make sure everything else works, by adding something like
> version(skip) before that failing unittest block, so you know
> the extent of the test problems.
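
Joakim's version(skip) suggestion, sketched on the failing test.
Because no `skip` version identifier is defined, the compiler parses
the unittest but does not compile it, while the module's other
tests still build and run. The testsRun counter is hypothetical and
only there to make the effect visible.

```d
// Sketch of the version(skip) trick: the failing unittest is parsed but not
// compiled, because nothing defines the `skip` version identifier.
import core.memory : GC;
import core.thread : Thread;

__gshared int testsRun; // hypothetical counter, only to show which tests ran

version (skip) unittest
{
    ++testsRun;
    auto t1 = new Thread({
        foreach (_; 0 .. 20)
            Thread.getAll;
    }).start;
    auto t2 = new Thread({
        foreach (_; 0 .. 20)
            GC.collect; // segfaults on the uClibc target
    }).start;
    t1.join();
    t2.join();
}

unittest
{
    ++testsRun; // the remaining tests still build and run
    GC.collect();
}

void main() {}
```

Built with `ldc2 -unittest`, only the second test runs; passing
`-d-version=skip` (LDC's spelling of dmd's `-version=skip`) brings
the skipped one back. D also predefines `version (none)`, which is
never set, for exactly this purpose.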
>
>> The core dump is not very helpful (the stack is garbage), but
>> running a minimal program containing that unittest under
>> gdbserver, I can see this:
>>
>> Thread 1 "test" received signal SIGUSR1, User defined signal 1.
>> pthread_getattr_np (thread_id=0, attr=0xb6b302bc) at libpthread/nptl/pthread_getattr_np.c:47
>> 47 iattr->schedpolicy = thread->schedpolicy;
>> (gdb) step
>>
>> Thread 1 "test" received signal SIGUSR2, User defined signal 2.
>> 0xb6e50d80 in epoll_wait (epfd=-1090521272, events=0x8, maxevents=2, timeout=-1224756080) at libc/sysdeps/linux/common/epoll.c:58
>> 58 CANCELLABLE_SYSCALL(int, epoll_wait, (int epfd, struct epoll_event *events, int maxevents, int timeout),
>> (gdb) step
>>
>> Thread 1 "test" received signal SIGSEGV, Segmentation fault.
>> 0xfffffffc in ?? ()
>> (gdb)
>
> The SIGUSR1/SIGUSR2 signals are the GC suspending and resuming
> threads, so by themselves they mean the GC ran fine. You'd need
> to delve further into the code and the implementation details
> mentioned above to track this down.
>
> On Sunday, 17 December 2017 at 17:20:32 UTC, Radu wrote:
>> Yes, the latest LDC versions make cross-compiling a breeze, so
>> kudos to you guys for making this happen. I'm using the Windows
>> Subsystem for Linux, by the way, so for me this is even more fun,
>> as I can work in both environments natively :)
>
> Yeah, you could just use the Windows ldc too, assuming you have
> a cross-compiler from that OS, as shown on the wiki for Windows
> with the Android NDK.
>
>> The surface-level modifications needed are very few: some math
>> and memory-stream functions are missing.
>
> I don't know how much it differs from Glibc, but we'd always be
> interested in a port, assuming you have the time to submit a
> pull like this recent one for Musl:
>
> https://github.com/dlang/druntime/pull/1997
>
>> The roadblock seems to be somewhere in the GC and TLS, or in
>> their interaction (at least that's my feeling at the moment).
>
> Not being able to do an explicit collect there isn't that big a
> deal: I'd skip that test for now and run everything else, then
> come back to that one once you have an idea of the bigger
> picture.

I finally got some time to work on this. Just to clarify, I'm
developing against uClibc-ng 1.0.9; I noticed others suggesting
this and wanted to make it clear.

Regarding the architecture, it is an armv7a, as 'uname -a' says:
'Linux fs 3.4.39 #249 SMP PREEMPT Wed Oct 4 12:07:05 MYT 2017 armv7l GNU/Linux'
I could not produce a working binary by specifying the armv7a
architecture to LDC, so I used the generic arm architecture for
gnueabihf, as stated previously.
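
Joakim's uname check can also be built into a test binary, so the
kernel's view and the compiler's target show up side by side. A
minimal sketch: it prints the utsname machine field (what `uname -m`
shows) next to the float ABI implied by D's predefined
ARM_HardFloat / ARM_SoftFP / ARM_SoftFloat version identifiers and,
under LDC, the `__traits(targetCPU)` string. That trait is
LDC-specific and assumed to be available in the LDC version used;
floatAbi is a hypothetical helper.

```d
// Sketch: print what the running kernel reports (the `uname -m` field) next
// to what this binary was compiled for, to spot a triple/CPU mismatch.
import core.stdc.stdio : printf;
import core.sys.posix.sys.utsname : uname, utsname;

// Float ABI the compiler targeted, from D's predefined version identifiers.
string floatAbi()
{
    version (ARM_HardFloat) return "ARM hard-float (gnueabihf)";
    else version (ARM_SoftFP) return "ARM softfp";
    else version (ARM_SoftFloat) return "ARM soft-float";
    else return "not an ARM target";
}

void main()
{
    utsname u;
    if (uname(&u) == 0)
        printf("kernel machine: %s\n", u.machine.ptr);

    const abi = floatAbi();
    printf("compiled for:   %.*s\n", cast(int) abi.length, abi.ptr);

    version (LDC)
    {
        // LDC-specific trait (assumed available): the CPU LLVM generated
        // code for, e.g. the default for a bare arm-linux-gnueabihf triple.
        enum cpu = __traits(targetCPU);
        printf("LDC target CPU: %.*s\n", cast(int) cpu.length, cpu.ptr);
    }
}
```
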

I managed to get the druntime tester running (minus some math
functions and memstream), except for one blocking issue:
Thread.suspend does not work; it produces a segfault.

To test this, I commented out all the suspendAll/resumeAll
unittests in core.thread and stubbed out GC.collect(). The issue is
not linked to the GC: the segfault still happens with GC.collect
stubbed out and the suspendAll/resumeAll unittests re-enabled. The
GC just happens to use the suspend mechanism, and so exposes the
underlying issue.

From what I can see in gdb, 'thread_resumeHandler' is to blame: it
looks like 'sem_post( &suspendCount )' immediately triggers the
resume signal, and the call to 'sigsuspend( &sigres )' is never
made. For example:
464 status = sem_post( &suspendCount );
(gdb) n
Thread 2 "druntime-test-r" received signal SIGUSR2, User defined signal 2.
0x001b46d0 in core.thread.thread_suspendHandler(int).op(void*) (sp=0xb572f900 "$F\033") at thread.d:464
464 status = sem_post( &suspendCount );
(gdb) info threads
  Id Target Id Frame
  1 Thread 16005.16005 "druntime-test-r" 0x001ba7a0 in _D4core6thread5Fiber5stateMxFNaNbNdNiNfZEQBnQBlQBh5State (this=0xb6d34980) at thread.d:4533
* 2 Thread 16005.16273 "druntime-test-r" 0x001b46d0 in core.thread.thread_suspendHandler(int).op(void*) (sp=0xb572f900 "$F\033") at thread.d:464
(gdb) bt
#0 0x001b46d0 in core.thread.thread_suspendHandler(int).op(void*) (sp=0xb572f900 "$F\033") at thread.d:464
#1 0x001b483c in core.thread.callWithStackShell(scope void(void*) nothrow delegate) (fn=...) at thread.d:2600
#2 0x001b45f8 in thread_suspendHandler (sig=10) at thread.d:487
#3 0xfffffffe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) n
Thread 2 "druntime-test-r" received signal SIGSEGV, Segmentation fault.
0xfffffffc in ?? ()
(gdb) bt
#0 0xfffffffc in ?? ()
#1 0xfffffffe in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
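
The handshake described above, where the suspended thread
acknowledges with sem_post and then parks in sigsuspend until the
resume signal arrives, only works if the resume signal cannot be
handled before sigsuspend is reached. Below is a minimal,
self-contained sketch of such a handshake. It is an illustration,
not druntime's code: onSuspend, onResume, worker and runHandshake
are hypothetical names, the SIGRTMIN pair is an arbitrary choice to
stay clear of druntime's SIGUSR1/SIGUSR2, and it assumes druntime
exposes SIGRTMIN for the target C library. The key detail is that
sa_mask blocks every signal while the suspend handler runs, so a
resume signal sent right after sem_post stays pending until
sigsuspend atomically unblocks it.

```d
// Sketch of a signal-based suspend/resume handshake (illustration only).
// Assumptions: Linux, POSIX semaphores, SIGRTMIN / SIGRTMIN + 1 as
// hypothetical suspend/resume signals (not druntime's SIGUSR1/SIGUSR2).
import core.atomic : atomicLoad, atomicStore;
import core.stdc.errno;
import core.stdc.stdio : printf;
import core.sys.posix.pthread;
import core.sys.posix.semaphore;
import core.sys.posix.signal;
import core.sys.posix.sys.types : pthread_t;

__gshared sem_t ack;      // posted by the worker: once when parked, once when released
__gshared int sigSuspend; // hypothetical choice: SIGRTMIN
__gshared int sigResume;  // hypothetical choice: SIGRTMIN + 1
shared bool stop;         // tells the worker loop to exit

extern (C) void onSuspend(int) nothrow @nogc
{
    // sa_mask (set in the module constructor) blocks every signal while this
    // handler runs, so a resume signal sent right after sem_post stays pending.
    sigset_t waitMask;
    sigfillset(&waitMask);
    sigdelset(&waitMask, sigResume); // only the resume signal may wake us
    sem_post(&ack);                  // "I am parked"
    sigsuspend(&waitMask);           // atomically unblock sigResume and sleep
    sem_post(&ack);                  // "I have been released"
}

extern (C) void onResume(int) nothrow @nogc {} // only has to end sigsuspend

extern (C) void* worker(void*) nothrow @nogc
{
    while (!atomicLoad(stop)) {} // spin until told to stop
    return null;
}

shared static this()
{
    sigSuspend = SIGRTMIN;
    sigResume = SIGRTMIN + 1;
    sem_init(&ack, 0, 0);

    sigaction_t sa;
    sigfillset(&sa.sa_mask); // block everything while either handler runs
    sa.sa_flags = SA_RESTART;
    sa.sa_handler = &onSuspend;
    sigaction(sigSuspend, &sa, null);
    sa.sa_handler = &onResume;
    sigaction(sigResume, &sa, null);
}

// Retry sem_wait when a signal interrupts it.
void waitAck()
{
    while (sem_wait(&ack) != 0 && errno == EINTR) {}
}

bool runHandshake()
{
    atomicStore(stop, false);
    pthread_t tid;
    if (pthread_create(&tid, null, &worker, null) != 0)
        return false;

    pthread_kill(tid, sigSuspend);
    waitAck(); // worker is now inside sigsuspend
    printf("worker suspended\n");
    pthread_kill(tid, sigResume);
    waitAck(); // worker has left sigsuspend
    printf("worker resumed\n");

    atomicStore(stop, true);
    return pthread_join(tid, null) == 0;
}

void main()
{
    if (runHandshake())
        printf("handshake completed\n");
}
```

If a variant of this also shows the resume signal arriving at
sem_post, as in the gdb session above, the blocking mask is
apparently not taking effect. One unverified hypothesis worth
checking is whether druntime's sigset_t / sigaction_t declarations
match uClibc-ng's layout.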