On heap segregation, GC optimization and @nogc relaxing

via Digitalmars-d digitalmars-d at puremagic.com
Wed Nov 12 00:38:13 PST 2014


On Wednesday, 12 November 2014 at 02:34:55 UTC, deadalnix wrote:
> The problem at hand here is ownership of data.

"ownership of data" is one possible solution, but not the problem.

We are facing 2 problems:

1. A performance problem: concurrent writes (multiple 
writers, a single writer, periodic locking during cleanup, etc.).

2. A structural problem: Releasing resources correctly.

I suggest focusing the ownership work on the latter, to support 
solid non-GC implementations, and relying on conventions for 
multi-threading.

>  - Being unsafe and relying on convention. This is the C++ road 
> (and a possible road in D). It allows one to implement almost 
> any desired scheme, but comes at great cost for the developer.

All performant solutions are going to be "unsafe" in the sense 
that you need to select a duplication/locking level that is 
optimal for the characteristics of the actual application. 
Copying data when there are no writers is too inefficient for 
real applications.

Hardware support for transactional memory is going to be the 
easiest approach to speeding up locking.


>  - Annotations. This is the Rust road. It also come a great

I think Rust's approach would favour an STM approach where you 
create thread-local copies for processing and then merge the 
results back into the "shared" memory.


> Immutability+GC allows one to have safety while keeping 
> interfaces simple. That is of great value. It also comes with 
> some nice goodies: it is easy and safe to share data without 
> bookkeeping, allowing one to fit more in cache and reducing 
> the amount of garbage created.

How does GC fit more data in the cache? A GC usually has overhead 
and would typically generate more cache misses, because 
unreachable but still in-cache ("hot") memory is not available 
for reallocation.


> Relying on convention has the advantage that any scheme can be 
> implemented without constraint, while keeping interfaces 
> simple. The obvious drawback is that it is time-consuming and 
> error prone. It also makes a lot of things unclear, and 
> developers choose the better-safe-than-sorry road. That means 
> excessive copying to make sure one owns the data, which is 
> wasteful (in terms of work for the copy itself, garbage 
> generation and cache pressure). While this must be an option 
> locally for system code, it doesn't seem like the right option 
> at program scale, and we do it in C++ simply because we have to.
>
> Finally, annotations are a great way to combine safety and 
> speed, but generally come at a great cost when implementing 
> uncommon ownership strategies, where you end up having to 
> express complex lifetime and ownership relations.

The core problem is that if you are unhappy with single-threaded 
performance then you are looking for high throughput through 
multi-threading. In that case, sacrificing performance by not 
using the optimal strategy becomes problematic.

The optimal strategy is entirely dependent on the application and 
the dataset.

Therefore you need to support multiple approaches:

- per-data-structure GC
- thread-local GC
- lock annotations on types or variables
- speculative lock optimisations (transactional memory)

And in the future you will also need to support the integration 
of GPUs/co-processors into mainstream CPUs. Metal and OpenCL are 
only a beginning…


> Ideally, we want to map with what the hardware does. So what 
> does the hardware do ?

That changes over time. The current focus in upcoming hardware is 
on:

1. Heterogeneous architectures with high-performance co-processors

2. Hardware support for transactional memory

Intel CPUs might have buffered transactional memory within 5 
years.


> from one core to the other. They are bad at shared writable 
> data (as effectively, the cache line will have to bounce back 
> and forth between cores, and all memory access will need to be 
> serialized instead of performed out of order).

This varies a lot. On x86 you can write a whole cache line 
(buffered) without reading it first, and it uses a convenient 
cache-coherency protocol (so that read/write ops appear in 
order). This is not true for all CPUs.

I agree with others that say that a heterogeneous approach, like 
C++, is the better alternative. If parity with C++ is important 
then D needs to look more closely at OpenMP, but that probably 
goes beyond what D can achieve in terms of implementation.


Some observations:

1. If you are not to rely on conventions for sync'ing threads 
then you need a pretty extensive framework if you want good 
performance.

2. Safety will harm performance.

3. Safety at high performance levels requires very complicated 
static analysis that will probably not work well for larger 
programs.

4. For most applications performance will come through 
co-processors (GPGPU etc).

5. If hardware progresses faster than compiler development, then 
you will never reach the performance frontier…


I think D needs to cut down on implementation complexity and 
ensure that implementation effort can keep up with hardware 
developments. The way to do it is:

1. Accept that, in general, performant multi-threaded code is 
unsafe and application/hardware-optimized.

2. Focus on making @nogc single-threaded code robust and fast. 
And I agree that ownership is key.

3. Use semantic analysis to automatically generate a tailored 
runtime with application-optimized allocators.


