Atomic Ref Counting

Tue Nov 23 17:13:31 PST 2010

On 11/21/2010 8:59 PM, dsimcha wrote:
> This message originally appeared on the Phobos list, but I've grown impatient
> waiting for responses and decided it needs to see a wider audience:
>
> Now that I have time, I did some benchmarks of atomic ref counting.  It is
> somewhat expensive, but not obscenely so.  Here's my benchmark:
>
> import std.stdio, std.datetime, std.container;
>
> void main() {
>      Array!uint arr;
>      arr.length = 4;
>
>      auto sw = StopWatch(autoStart);
>      foreach(i; 0..10_000_000) {
>          auto arr2 = arr;
>      }
>
>      writeln(sw.peek.milliseconds);
> }
>
> Here's a diff of std.typecons:
>
> Index: typecons.d
> ===================================================================
> --- typecons.d    (revision 2181)
> +++ typecons.d    (working copy)
> @@ -2328,7 +2328,16 @@
>       this(this)
>       {
>           if (!RefCounted.isInitialized) return;
> -        ++RefCounted._store._count;
> +        //++RefCounted._store._count;
> +        auto p =&(RefCounted._store._count);
> +        asm {
> +            push EAX;
> +            mov EAX, p;
> +            lock;
> +            inc dword ptr[EAX];
> +            pop EAX;
> +        }
> +
>           debug(RefCounted) if (RefCounted.debugging)
>                    writeln(typeof(this).stringof,
>                   "@", cast(void*) RefCounted._store, ": bumped refcount to ",
> @@ -2345,7 +2354,23 @@
>       {
>           if (!RefCounted._store) return;
>           assert(RefCounted._store._count>  0);
> -        if (--RefCounted._store._count)
> +
> +        // Can't just do lock; dec; here because I need the value afterwords.
> +        size_t oldVal = void;
> +        auto p =&(RefCounted._store._count);
> +        asm {
> +            push EAX;
> +            push EBX;
> +            mov EAX, size_t.max;  // + size_t.max is equivalent to - 1.
> +            mov EBX, p;
> +            lock;
> +            xadd [EBX], EAX;
> +            mov oldVal, EAX;
> +            pop EBX;
> +            pop EAX;
> +        }
> +
> +        if (oldVal>  1)
>           {
>               debug(RefCounted) if (RefCounted.debugging)
>                        writeln(typeof(this).stringof,
>
> Note the direct use of ASM.  Between function call overhead and the fact that
> core.atomic uses a CAS loop, using it instead would have about doubled the
> overhead.  I'm not sure how we'll factor this when/if this hits production code.
>
> Anyhow, the timings are:
>
> Atomic:  258 milliseconds total (258 nanoseconds per iteration)
> Non-atomic:  88 milliseconds  (88 nanoseconds per iteration)
>
> I think this is a small enough overhead that we should just make all reference
> counting atomic.  This would make the race condition pointed out by Michael
> Fortin completely moot IIUC.  It would also make it possible to safely share
> reference counted containers across threads in shared/synchronized wrappers, etc.
>
> Furthermore, the cost of atomic reference counting can be mitigated by the
> standard tricks for dealing with expensive to copy objects, though it
> shouldn't be necessary except in the most extreme conditions.

Did you try the inline ASM block without explicit register preservation? 
Walter replied to your message before 
(http://www.digitalmars.com/d/archives/digitalmars/D/Register_Preservation_in_Inline_ASM_Blocks_122470.html) 
and said the compiler will work around any registers you use in inline 
blocks. I tested it out and I can confirm he wasn't lying ;)

I'm not by a machine with dmd otherwise I'd try myself. Try removing the 
explicit push/pops of the registers and use ecx instead of ebx in the 
second diff (use of it forces dmd to push ebx in the prologue because 
its callee save). Might save a considerable amount of cycles.