The GC and performance, but not what you expect

via Digitalmars-d digitalmars-d at puremagic.com
Fri May 30 08:54:57 PDT 2014


On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
> simplicity. But as soon as I added a single CAS I was already
> over the time that TCMalloc needs. That way I learned that CAS
> is not as cheap as it looks and the fastest allocators work
> thread local as long as possible.

Roughly 22 cycles of latency if it hits a valid cacheline, plus 
the overhead of going to memory otherwise?

Did you try adding an explicit prefetch? Maybe that would help.

Prefetch is expensive on Ivy Bridge (43 cycles throughput, versus 
0.5 cycles on Haswell). You need instructions to fill the pipeline 
between PREFETCH and LOCK CMPXCHG, so you probably need to go ASM 
and do a lot of testing on different CPUs. Explicit prefetching, 
lock-free strategies, etc. are tricky to get right. Get it wrong 
and it is worse than the naive implementation.
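
To make the idea concrete, here is a minimal C++ sketch of the pattern: issue a write-intent prefetch for the shared head, do work that does not depend on it, and only then attempt the CAS. Everything in it is an assumption for illustration and not code from the allocator discussed above: Node, prefetch_for_write, push and length are hypothetical names, and __builtin_prefetch is a GCC/Clang builtin, so other compilers get a no-op. Whether the prefetch helps at all has to be measured per CPU; as noted, it can easily end up slower.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical intrusive free-list node (illustration only).
struct Node {
    Node* next;
    unsigned char payload[56];
};

// Hint that the cacheline at p will be written soon.
// Uses the GCC/Clang builtin; compiles to nothing elsewhere.
inline void prefetch_for_write(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 1, 3);
#else
    (void)p;
#endif
}

// Push a node onto a shared list head with a CAS loop.
void push(std::atomic<Node*>& head, Node* n) {
    // 1. Request the head's cacheline early.
    prefetch_for_write(&head);

    // 2. Work that does not touch head, giving the prefetch time to land.
    std::memset(n->payload, 0, sizeof n->payload);

    // 3. The CAS itself (LOCK CMPXCHG on x86). On failure, expected is
    //    refreshed with the current head and the loop retries.
    Node* expected = head.load(std::memory_order_relaxed);
    do {
        n->next = expected;
    } while (!head.compare_exchange_weak(expected, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}

// Count nodes; only meaningful while no other thread modifies the list.
std::size_t length(const std::atomic<Node*>& head) {
    std::size_t count = 0;
    for (Node* p = head.load(std::memory_order_acquire); p; p = p->next)
        ++count;
    return count;
}
```

Only push is shown because a lock-free pop also needs ABA protection, which is a separate problem. The compiler is free to reorder the independent work, which is one reason a serious attempt probably ends up in ASM.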

