Compiler optimization breaks multi-threaded code

Tue Nov 16 20:49:15 PST 2010

stephan Wrote:

> Am 16.11.2010 18:09, schrieb Sean Kelly:
> > cas() contains an asm block. Though I guess in this case the compiler
> > isn't actually optimizing across it. Does atomic!"+="(&cnt, 1) work
> > correctly?  I know the issue with shared would still have to be fixed,
> > but that code uses asm for the load as well, so it probably won't be
> > optimized the same way.
> 
> Thanks for looking into the issue around here. Just three comments from 
> my side, Sean.
> 
> Disclaimer: based on a couple of hours chasing a bug and not much D 
> experience (but some optimizing C++ compiler experience - so the issue 
> looked familiar :-) )
> 
> 1) atomicOp is not concerned. You only read memory once in the function 
> call. Whether from a local variable that was loaded from something 
> global or directly from a global, doesn't really matter (except for 
> timing, maybe).

atomicOp uses a CAS loop for the RMW operations.  The load in asm code was actually to bypass the synchronization generated by the compiler since it isn't needed in this case, but I guess it helped anyway.  I'm wondering if I should change the binary op to use asm too, so the compiler doesn't do anything evil there until shared is implemented more fully.

> 2) You are right, the compiler seems to not optimize across asm 
> statements. So, the example can be fixed with the following hack:
>      void atomicInc  ( ) {
>          uint o;
>          while ( !cas( &cnt, o, o + 1 ) ) {
>              asm { nop; } o = cnt;
>          }
>      }
> This is however more brittle than it looks, because it is not always 
> clear what "optimizing across an asm block". This version has the issue 
> again:
>      void atomicInc  ( ) {
>          uint o = cnt;
>          do {
>              asm { nop; } o = cnt;
>          } while ( !cas( &cnt, o, o + 1 ) )
>      }
> While this case might look somewhat obvious, I encountered some problems 
> in more complex code, and finally went for the all-inline-assembler 
> solution to be on the safe side.
> 
> 3) During my debugging, I believe that I saw the optimizer not only 
> re-ordering reads of shared variables, but also writes to shared 
> variables. IIRC, my Dekker example on SO (which fails for the missing 
> s/l/mfence instructions), also sports a re-ordering of the lines
>          cnt++;
>          turn2 = true; flag1 = false;
> into
>          turn2 = true;
>          cnt++;
>          flag1 = false;
> which in this case is not really important, but might introduce another 
> bug if I was prepared to live with the risk of starvation (and remove 
> turn2). If the compiler would still re-order (haven't tested), cnt++ 
> would be outside of the critical section.

I'm thinking of exposing atomicStore and atomicLoad in core.atomic so folks have something to use until the compiler work is taken care of.  At least then your code could still look somewhat normal.