[Dlang-internal] GC experts: Performance when using many small ranges?

Thu Jul 27 23:23:07 PDT 2017

I've already asked on the main newsgroup, but it seems this didn't 
catch the attention of our GC experts:
http://forum.dlang.org/thread/oka1vo$4sr$1@digitalmars.com

Basically I want to get emulated TLS working in GDC and wonder 
whether we could somehow integrate with the GCC emutls code. We'd 
need to post some patches for the libgcc emutls code, so I'm 
interested in the best way to implement the GC scanning, 
particularly regarding performance.

The main problem is that GCC emutls allocates every single TLS 
variable in every thread using a malloc call. So we have lots of 
independent memory ranges. How does the GC perform in such 
situations, assuming I add an interface to libgcc to iterate all 
allocated memory ranges and use the scanDG delegate in 
rt.sections / rt.tlsgc?
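
To make option 1 concrete, here is a rough sketch in C of what such an 
iteration interface might look like. None of these names exist in libgcc: 
tls_table, tls_entry, tls_iterate_ranges and scan_range_fn are all 
hypothetical, and the table layout is a simplification, not the real 
emutls data structure. The point is only that the GC side would get one 
callback per malloc'ed variable per thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback type: receives one [begin, end) range to scan.
   On the D side this would forward to the scanDG delegate. */
typedef void (*scan_range_fn)(void *begin, void *end, void *ctx);

/* Hypothetical, simplified per-thread table: one separately
   malloc'ed block per TLS variable (NOT the real libgcc layout). */
struct tls_entry {
    void  *ptr;   /* NULL if the variable was never touched in this thread */
    size_t size;  /* size of the variable in bytes */
};

struct tls_table {
    size_t            count;
    struct tls_entry *entries;
};

/* Hypothetical iteration hook: calls cb once per allocated variable. */
static void tls_iterate_ranges(const struct tls_table *t,
                               scan_range_fn cb, void *ctx)
{
    assert(t != NULL && cb != NULL);
    for (size_t i = 0; i < t->count; i++) {
        const struct tls_entry *e = &t->entries[i];
        if (e->ptr == NULL)
            continue; /* not yet allocated, nothing to scan */
        cb(e->ptr, (char *)e->ptr + e->size, ctx);
    }
}

/* Example callback that only counts ranges and bytes, to show how
   many separate scan calls a single thread would generate. */
struct scan_stats {
    size_t ranges;
    size_t bytes;
};

static void count_range(void *begin, void *end, void *ctx)
{
    struct scan_stats *s = ctx;
    s->ranges += 1;
    s->bytes  += (size_t)((char *)end - (char *)begin);
}
```

With N TLS variables and T threads this gives up to N * T small ranges 
per collection, which is exactly the case I'm asking about.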

An alternative could be to somehow implement support for custom 
allocators in GCC emutls and allocate all our D TLS variables 
using the GC. We'd still have to scan the per-thread TLS pointer 
array to avoid pinning all GC allocations, but this should work. 
The main drawback is considerable bloat in the data segment to store 
a pointer to the allocation function for every variable.
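
A minimal sketch of option 2 in C, again with made-up names: 
tls_control, tls_alloc_fn and tls_instantiate are hypothetical and do 
not reflect the real libgcc control object, and gc_stub_alloc is just a 
malloc-backed stand-in for a GC allocation call. It shows where the 
extra per-variable pointer would live.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator hook; NULL means "fall back to malloc". */
typedef void *(*tls_alloc_fn)(size_t size);

/* Hypothetical per-variable control object (NOT the real libgcc
   layout). One of these lives in the data segment per TLS variable. */
struct tls_control {
    size_t       size;  /* size of the variable in bytes */
    const void  *init;  /* initial image, or NULL for zero-init */
    tls_alloc_fn alloc; /* the extra pointer: this is the bloat */
};

/* Stand-in for a GC allocation function, so the sketch is
   self-contained. It only counts calls and uses malloc. */
static size_t gc_stub_calls;

static void *gc_stub_alloc(size_t size)
{
    gc_stub_calls++;
    return malloc(size);
}

/* Create this thread's copy of a variable on first access,
   using the per-variable allocator when one is set. */
static void *tls_instantiate(const struct tls_control *c)
{
    void *p = c->alloc ? c->alloc(c->size) : malloc(c->size);
    if (p == NULL)
        return NULL;
    if (c->init != NULL)
        memcpy(p, c->init, c->size);
    else
        memset(p, 0, c->size);
    return p;
}
```

D variables would set alloc to the GC hook, while C variables would 
leave it NULL and keep the current malloc behaviour.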

(FYI, more details about the GCC emutls implementation are given 
in the linked forum thread)

So what do you think is best for GC performance? Option 1 (iterating 
the malloc'ed ranges) would be a rather simple extension in libgcc; 
option 2 (custom allocators) is more intrusive.

-- Johannes