On heap segregation, GC optimization and @nogc relaxing
via Digitalmars-d
digitalmars-d at puremagic.com
Wed Nov 12 00:38:13 PST 2014
On Wednesday, 12 November 2014 at 02:34:55 UTC, deadalnix wrote:
> The problem at hand here is ownership of data.
"ownership of data" is one possible solution, but not the problem.
We are facing two problems:
1. A performance problem: concurrency in writes (multiple
writers, a single writer, periodic locking during cleanup, etc.).
2. A structural problem: releasing resources correctly.
I suggest that the ownership focus be on the latter, to support
solid non-GC implementations, and then rely on conventions for
multi-threading.
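As a minimal sketch of that (the type and the names are mine, not
a proposed API): single-owner RAII already gives deterministic
release without the GC, and it works under @nogc:

import core.stdc.stdlib : malloc, free;

// Single-owner buffer: memory is released deterministically in
// the destructor, with no GC involvement.
struct UniqueBuffer
{
    @disable this(this);   // no implicit copies, so exactly one owner

    private void* ptr;
    private size_t len;

    this(size_t n) @nogc nothrow
    {
        ptr = malloc(n);
        len = ptr is null ? 0 : n;
    }

    ~this() @nogc nothrow
    {
        free(ptr);         // runs once, at the end of the owner's scope
    }
}

void useIt() @nogc nothrow
{
    auto buf = UniqueBuffer(4096);
    // ... use buf.ptr[0 .. buf.len] ...
}   // released here, no collection cycle needed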
> - Being unsafe and rely on convention. This is the C++ road
> (and a possible road in D). It allow to implement almost any
> wanted scheme, but come at great cost for the developer.
All performant solutions are going to be "unsafe" in the sense
that you need to select a duplication/locking level that is
optimal for the characteristics of the actual application.
Copying data when you have no writers is too inefficient in real
applications.
Hardware support for transactional memory is going to be the easy
approach for speeding up locking.
> - Annotations. This is the Rust road. It also come a great
I think Rust's approach would favour an STM-like approach where
you create thread-local copies for processing and then merge the
results back into the "shared" memory.
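That pattern can be expressed by convention in D today. A rough
sketch (worker and sharedTotal are illustrative names only): each
thread reduces over its own slice into a local, and only the
merge step touches shared state under a lock:

import core.sync.mutex : Mutex;
import core.thread : Thread;

__gshared long sharedTotal;    // the "shared" result
__gshared Mutex totalLock;

void worker(const(int)[] chunk)
{
    // Phase 1: work on thread-local state only.
    long localSum = 0;
    foreach (x; chunk)
        localSum += x;

    // Phase 2: merge the local result back under a lock.
    totalLock.lock();
    scope (exit) totalLock.unlock();
    sharedTotal += localSum;
}

void main()
{
    totalLock = new Mutex;

    auto data = new int[](1_000_000);
    foreach (i, ref x; data)
        x = cast(int) i;

    auto t1 = new Thread({ worker(data[0 .. $ / 2]); });
    auto t2 = new Thread({ worker(data[$ / 2 .. $]); });
    t1.start();
    t2.start();
    t1.join();
    t2.join();
}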
> Immutability+GC allow to have safety while keeping interfaces
> simple. That is of great value. It also come with some nice
> goodies, in the sense that is it easy and safe to shared data
> without bookkeeping, allowing one to fit more in cache, and
> reduce the amount of garbage created.
How does a GC fit more data in the cache? A GC usually has
overhead, and it typically generates more cache misses because
unreachable but still-cached ("hot") memory is not available for
reallocation until it has been collected.
> Relying on convention has the advantage that any scheme can be
> implemented without constraint, while keeping interface simple.
> The obvious drawback is that it is time consuming and error
> prone. It also make a lot of things unclear, and dev choose the
> better safe than sorry road. That mean excessive copying to
> make sure one own the data, which is wasteful (in term of work
> for the copy itself, garbage generation and cache pressure). If
> this must be an option locally for system code, it doesn't
> seems like this is the right option at program scale and we do
> it in C++ simply because we have to.
>
> Finally, annotations are a great way to combine safety and
> speed, but generally come at a great cost when implenting
> uncommon ownership strategies where you ends up having to
> express complex lifetime and ownership relations.
The core problem is that if you are unhappy with what a
single-threaded application can deliver, then you are looking for
high throughput through multi-threading, and in that case
sacrificing performance by not using the optimal strategy becomes
problematic.
The optimal strategy is entirely dependent on the application and
the dataset.
Therefore you need to support multiple approaches:
- per data structure GC
- thread local GC
- lock annotations of types or variables (see the sketch below)
- speculative lock optimisations (transactional memory)
And in the future you will also need to support the integration
of GPU/Co-processors into mainstream CPUs. Metal and OpenCL are
only the beginning…
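As a small aside on the lock-annotation point: D already has the
seed of this in the shared qualifier, which rejects plain
read-modify-write access and pushes you towards core.atomic or
explicit locking. A rough sketch (the names are mine):

import core.atomic : atomicLoad, atomicOp;

shared long hits;            // annotation: this data is cross-thread

void record()
{
    atomicOp!"+="(hits, 1);  // explicit atomic read-modify-write
    // hits += 1;            // rejected: read-modify-write on shared
}

long snapshot()
{
    return atomicLoad(hits);
}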
> Ideally, we want to map with what the hardware does. So what
> does the hardware do ?
That changes over time. The current focus in upcoming hardware is
on:
1. Heterogeneous architectures with high-performance co-processors
2. Hardware support for transactional memory
Intel CPUs might have buffered transactional memory within 5
years.
> from one core to the other. They are bad at shared writable
> data (as effectively, the cache line will have to bounce back
> and forth between cores, and all memory access will need to be
> serialized instead of performed out of order).
This will vary a lot. On x86 you can write to a whole cache line
(buffered) without reading it first, and it uses a convenient
cache-coherency protocol (so that read/write ops stay in order).
This is not true for all CPUs.
I agree with others who say that a heterogeneous approach, like
C++'s, is the better alternative. If parity with C++ is important,
then D needs to look more closely at OpenMP, but that probably
goes beyond what D can achieve in terms of implementation.
Some observations:
1. If you are not going to rely on conventions for synchronizing
threads, then you need a pretty extensive framework to get good
performance.
2. Safety will harm performance.
3. Safety with high performance levels requires a very
complicated static analysis that will probably not work very well
for larger programs.
4. For most applications performance will come through
co-processors (GPGPU etc).
5. If hardware progresses faster than compiler development, then
you will never reach the performance frontier…
I think D needs to cut down on implementation complexity and
ensure that the implementation can keep up with hardware
developments. The way to do it is:
1. Accept that performant multi-threaded code is generally unsafe
and application/hardware-optimized.
2. Focus on making @nogc single-threaded code robust and fast.
And I agree that ownership is key.
3. Use semantic analysis to automatically generate a tailored
runtime with application-optimized allocators.
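On point 3, as a strawman for what an application-tailored
allocator can look like (hand-written here; the names Region,
alloc and release are mine, not an existing API), a bump-pointer
region that releases everything at once is often all a hot path
needs:

import core.stdc.stdlib : malloc, free;

// Bump-pointer region: allocation is a pointer increment, and all
// allocations are released together. Alignment is ignored for
// brevity.
struct Region
{
    private ubyte* base;
    private size_t used, capacity;

    this(size_t bytes) @nogc nothrow
    {
        base = cast(ubyte*) malloc(bytes);
        capacity = base is null ? 0 : bytes;
    }

    void[] alloc(size_t n) @nogc nothrow
    {
        if (used + n > capacity)
            return null;       // out of space; caller decides what to do
        auto p = base + used;
        used += n;
        return p[0 .. n];
    }

    void release() @nogc nothrow
    {
        free(base);
        base = null;
        used = capacity = 0;
    }
}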