The GC and performance, but not what you expect

Fri May 30 20:57:48 PDT 2014

On Fri, 30 May 2014 15:54:57 +0000,
"Ola Fosheim Grøstad"
<ola.fosheim.grostad+dlang at gmail.com> wrote:

> On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
> > simplicity. But as soon as I added a single CAS I was already
> > over the time that TCMalloc needs. That way I learned that CAS
> > is not as cheap as it looks and the fastest allocators work
> > thread local as long as possible.
> 
> 22 cycles latency if on a valid cacheline?
> + overhead of going to memory
> 
> Did you try to add explicit prefetch, maybe that would help?
> 
> Prefetch is expensive on Ivy Bridge (43 cycles throughput, 0.5 
> cycles on Haswell). You need instructions to fill the pipeline 
> between PREFETCH and LOCK CMPXCHG. So you probably need to go ASM 
> and do a lot of testing on different CPUs. Explicit prefetching, 
> lock-free strategies, etc. are tricky to get right. Get it wrong 
> and it is worse than the naive implementation.

I'm on a Core 2 Duo. But this doesn't sound like something I
want to try. core.atomic is as low as I wanted to go. Anyway, I
deleted that code when I realized just how fast allocation
already is with TCMalloc. And that's a general-purpose allocator.
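
To illustrate the point about CAS versus thread-local work, here is a minimal sketch, not the deleted code and not how TCMalloc is implemented. It is written in C++ with std::atomic rather than D's core.atomic, and the names SharedBump and LocalCache, as well as the chunk size parameter, are hypothetical. It assumes a simple bump allocator that hands out offsets into a fixed-size arena: the shared version pays for a CAS on every allocation, while the cached version pays for one CAS per chunk and then bumps a plain, non-atomic counter.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical shared bump allocator: every allocation runs a CAS loop
// on a counter that all threads contend for.
struct SharedBump {
    std::atomic<std::size_t> offset{0};
    std::size_t capacity;

    explicit SharedBump(std::size_t cap) : capacity(cap) {}

    // On success, stores the start offset of the block in `out`.
    bool alloc(std::size_t n, std::size_t& out) {
        std::size_t old = offset.load(std::memory_order_relaxed);
        do {
            if (n > capacity - old) return false;  // arena exhausted
            // On failure compare_exchange_weak reloads `old`, so we retry.
        } while (!offset.compare_exchange_weak(old, old + n,
                                               std::memory_order_relaxed));
        out = old;
        return true;
    }
};

// Hypothetical per-thread cache: takes a whole chunk from the shared
// allocator with one CAS, then serves allocations without any atomics.
struct LocalCache {
    std::size_t cur = 0;
    std::size_t end = 0;

    bool alloc(SharedBump& shared, std::size_t n, std::size_t chunk,
               std::size_t& out) {
        if (n > end - cur) {
            // Slow path: refill. The unused tail of the old chunk is
            // simply dropped in this sketch.
            std::size_t base;
            if (n > chunk || !shared.alloc(chunk, base)) return false;
            cur = base;
            end = base + chunk;
        }
        out = cur;  // fast path: plain addition, no CAS
        cur += n;
        return true;
    }
};
```

In real use each thread would own its cache, for example through a `thread_local LocalCache` variable, so the fast path never touches shared state. How much this saves depends on the CPU and on contention, so it would need to be measured rather than assumed.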

-- 
Marco