Some GC and emulated TLS questions (GDC related)

Johannes Pfau via Digitalmars-d digitalmars-d at puremagic.com
Fri Jul 14 02:13:26 PDT 2017


As you might know, GDC currently doesn't properly hook up the GC to the
GCC emulated TLS support in libgcc. Because of that, TLS memory is not
scanned on target systems with emulated TLS. For GCC this includes
MinGW, Android (although Google switched to LLVM anyway) and some more
architectures. Proper integration likely needs some modifications in
the libgcc emutls code, so I need some more information about the GC
before I can propose a good solution.


The main problem is that GCC emutls does not use contiguous memory
blocks. So instead of scanning one range containing N variables we'll
have one range for every single TLS variable per thread.
So assuming we could iterate over all these variables (this would be
an extension required in libgcc), would scanTLSRanges in rt.sections
produce acceptable performance in these cases? Depending on the
number of TLS variables and threads there may be thousands of ranges
to scan.

Another solution could be to enhance libgcc emutls to allow custom
allocators, then have a special allocation function in druntime for all
D emutls variables. As far as I know there is no GC heap that is
scanned but not automatically collected? I'd need a way to manage
GC.malloc/GC.free memory completely manually, with the GC never
collecting it on its own but still scanning it for pointers. Does
something like this exist?
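To make it concrete, the kind of service I'm after looks roughly like
the sketch below. The names (range_registry, registry_add, ...) are
made up for illustration and are not druntime API; if I understand
correctly, GC.addRange/GC.removeRange in core.memory come closest to
providing this today.

```c
#include <stddef.h>

/* Hypothetical sketch: a registry of manually managed memory ranges
   that a GC would scan for pointers but never collect or free. */

#define MAX_RANGES 64

typedef struct { void *base; size_t size; } scan_range;

static scan_range range_registry[MAX_RANGES];
static int range_count;

/* Register a manually managed block so the collector treats its
   contents as roots: scanned on every collection, never freed. */
static int registry_add(void *base, size_t size)
{
    if (range_count == MAX_RANGES)
        return -1;
    range_registry[range_count].base = base;
    range_registry[range_count].size = size;
    ++range_count;
    return 0;
}

/* Unregister a block before releasing it manually. */
static void registry_remove(void *base)
{
    for (int i = 0; i < range_count; ++i)
        if (range_registry[i].base == base) {
            range_registry[i] = range_registry[--range_count];
            return;
        }
}
```

The collector would walk range_registry during marking exactly like it
walks stack and static data ranges, but the memory's lifetime would stay
entirely under the allocator's control.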

Another option is simply using the DMD-style emutls. But as far as I can
see the DMD implementation never supported dynamic loading of shared
libraries? This is something the GCC emutls support is quite good at:
It doesn't have any platform dependencies (apart from mutexes and some
way to store one thread specific pointer+destructor) and should work
with all kinds of shared library combinations. DMD style emutls also
does not allow sharing TLS variables between D and other languages.

So I was thinking, if DMD style emutls really doesn't support shared
libraries, maybe we should just clone a GCC-style, compiler and OS
agnostic emutls implementation into druntime? A D implementation could
simply allocate all internal arrays using the GC. This should be just
as efficient as the C implementation for variable access, and
interfacing with the GC is trivial. It gets somewhat more complicated
if we want to use this in betterC, though, and we also lose C/C++
compatibility by using such a custom implementation.




The rest of this post is a description of the GCC emutls code. Someone
could use this description as a specification for a clean-room D
emutls implementation.
Source code can be found here, but beware of the GPL license:
https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

Unlike DMD TLS, the GCC TLS code does not put all initialization memory
into one section. In fact, the code is completely runtime and
compile time linker agnostic so it can't use section start/stop
markers. Instead, every TLS variable is handled individually. For every
variable, an instance of __emutls_object is created in the (writeable)
data segment. __emutls_object is defined as:

struct __emutls_object
{
    word size;
    word align;
    union { pointer offset; void* ptr; };
    void* templ;
};

The void* ptr member is only used as an optimization for
single-threaded programs, so I'll ignore it in the rest of this
description.

Whenever such a variable is accessed, the compiler calls
__emutls_get_address(&(__emutls_object in data segment)). This function
first does an atomic load of the __emutls_object.offset variable. If it
is zero, this particular TLS variable has not been accessed in any
thread before.

If this is the case, we first check whether the global emutls
initialization function (emutls_init) has already been run, and run it
if not (via __gthread_once). The initialization function initializes
the mutex variable and creates a thread-local key using
__gthread_key_create, with the destructor function set to
emutls_destroy.
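On POSIX targets the __gthread_* wrappers map directly to pthreads, so
the initialization step sketched above boils down to the following (the
emutls_init_runs counter is only there for illustration, and the
destructor body is a stand-in; the real one frees each element too):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_once_t  emutls_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t emutls_mutex;
static pthread_key_t   emutls_key;
static int             emutls_init_runs;   /* illustration only */

/* Runs automatically when a thread exits with a non-NULL key value. */
static void emutls_destroy(void *arr)
{
    free(arr);   /* stand-in; see the full destroy description below */
}

/* One-time global setup: mutex plus the per-thread key. */
static void emutls_init(void)
{
    pthread_mutex_init(&emutls_mutex, NULL);
    pthread_key_create(&emutls_key, emutls_destroy);
    ++emutls_init_runs;
}

/* The slow path of __emutls_get_address starts with this. */
static void ensure_initialized(void)
{
    pthread_once(&emutls_once, emutls_init);
}
```

pthread_once guarantees emutls_init runs exactly once no matter how many
threads race into the slow path at the same time.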

Back in __emutls_get_address: if offset was zero, after running
emutls_init as required we lock the mutex. A global variable
emutls_size counts the total number of variables seen so far. We
increment emutls_size and atomically set
__emutls_object.offset = emutls_size.
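As a sketch, this slow path is classic double-checked locking. The
struct and function names below are mine, not libgcc's, and the atomic
store uses GCC's __atomic builtins:

```c
#include <pthread.h>
#include <stdint.h>

/* Stand-in for __emutls_object; only the field used here is shown. */
struct control { uintptr_t offset; };

static pthread_mutex_t emutls_mutex = PTHREAD_MUTEX_INITIALIZER;
static uintptr_t emutls_size;   /* number of variables seen so far */

static uintptr_t assign_offset(struct control *obj)
{
    pthread_mutex_lock(&emutls_mutex);
    uintptr_t offset = obj->offset;   /* re-check under the lock */
    if (offset == 0) {
        offset = ++emutls_size;       /* 1-based index, stable forever */
        /* The release store pairs with the atomic load on the fast
           path, so other threads see a fully assigned offset. */
        __atomic_store_n(&obj->offset, offset, __ATOMIC_RELEASE);
    }
    pthread_mutex_unlock(&emutls_mutex);
    return offset;
}
```

Once assigned, the offset never changes, which is why the fast path can
get away with a single atomic load and no lock.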

We now have an __emutls_object.offset index assigned, either via the
procedure described above or because offset was already non-zero when
we were called. Next we fetch a per-thread pointer using
__gthread_getspecific. This points to an __emutls_array, which is
simply a size value followed by size void* entries. If
__gthread_getspecific returns null, this is the first time we access a
TLS variable in this thread: allocate a new __emutls_array (size =
emutls_size + 32, plus 1 for the size field) and save it using
__gthread_setspecific. If we already had an array for this thread,
check whether the __emutls_object.offset index is larger than the
array size. If it is, reallocate the array (double the size; if that
is still too small, use the required index plus 32 instead; either way
add 1 for the size field) and update it using __gthread_setspecific.
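The growth rule can be sketched as a small helper (the function name is
mine; libgcc inlines this logic in __emutls_get_address):

```c
#include <stddef.h>

/* Compute the new element count for the per-thread array, given its
   current element count and the 1-based index being accessed. The
   caller allocates the result + 1 slots, because slot 0 holds the
   array's own size. */
static size_t grown_size(size_t current, size_t offset)
{
    size_t size = current * 2;   /* first try doubling */
    if (size < offset)
        size = offset + 32;      /* still too small: index plus slack */
    return size;
}
```

The extra 32 slots of slack mean that a burst of newly registered TLS
variables doesn't trigger a reallocation per variable.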

In either case we now have enough space in the thread-specific array,
so look at array[offset - 1]. If it is null, allocate a new object
(emutls_alloc) and store it there. Return the array value at index
offset - 1.

The emutls_alloc function is simple: Allocate __emutls_object.size
bytes with __emutls_object.align alignment. In order to ensure
alignment, the libgcc implementation uses malloc, then manually adjusts
the pointer. As the original pointer is needed for free, the
implementation allocates void*.sizeof more bytes and stores the
original malloc pointer at the start of the allocated data block. The
returned value is offset by void*.sizeof into the data block. Finally,
__emutls_object.templ is copied into the newly allocated data block
(initialization).
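A sketch of that allocation trick, assuming align is a power of two no
smaller than a pointer (libgcc has a simpler special case for small
alignments):

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Over-allocate, align the payload, and stash the original malloc
   pointer in the word directly before the payload so it can be
   recovered for free(). `templ` is the optional initializer image. */
static void *alloc_aligned(size_t size, size_t align, const void *templ)
{
    /* room for payload + saved pointer + worst-case alignment pad */
    void *raw = malloc(size + sizeof(void *) + align - 1);
    if (raw == NULL)
        return NULL;

    uintptr_t payload = ((uintptr_t)raw + sizeof(void *) + align - 1)
                        & ~(uintptr_t)(align - 1);
    ((void **)payload)[-1] = raw;   /* recovered by free_aligned() */

    if (templ != NULL)              /* copy the initialization image */
        memcpy((void *)payload, templ, size);
    return (void *)payload;
}

static void free_aligned(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);     /* the original malloc pointer */
}
```

Reserving sizeof(void *) before the payload is what forces the
over-allocation; posix_memalign would avoid it, but libgcc stays on
plain malloc for maximum portability.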

The last missing function is emutls_destroy. It is called by __gthread
when a thread's key is destroyed (i.e. on thread exit) and receives a
void* argument pointing to the per-thread array. The code simply
iterates over the array, recovers the original malloc pointers (stored
at index -1 of each allocated block) and frees the data.
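In code, that cleanup looks roughly like this (the freed_payloads
counter is mine, for illustration only):

```c
#include <stdlib.h>
#include <stdint.h>

static int freed_payloads;   /* illustration only */

/* Per-thread cleanup: slot 0 of the array holds its length; each
   later slot is either NULL or a payload pointer whose original
   malloc pointer was stashed in the word before the payload. */
static void emutls_destroy(void *ptr)
{
    void **arr = ptr;
    uintptr_t size = (uintptr_t)arr[0];
    for (uintptr_t i = 1; i <= size; ++i) {
        if (arr[i] != NULL) {
            free(((void **)arr[i])[-1]);   /* recover malloc pointer */
            ++freed_payloads;
        }
    }
    free(arr);   /* finally release the array itself */
}
```

NULL slots are simply skipped, since a slot stays NULL for any TLS
variable this particular thread never touched.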

-- Johannes


