Who Ordered Memory Fences on an x86?
Walter Bright
newshound1 at digitalmars.com
Wed Nov 5 22:00:31 PST 2008
Nick Sabalausky wrote:
> Call me a grumpy old fart, but I'd be happy just tossing fences in
> everywhere (when a multicore is detected) and be done with the whole mess,
> just because trying to wring every little bit of speed from, say, a 3+ GHz
> multicore processor strikes me as a highly unworthy pursuit. I'd rather
> optimize for the lower end and let the fancy overpriced crap handle it
> however it will.
>
> And that's even before tossing in the consideration that (to my dismay) most
> code these days is written in languages/platforms (ex, "Ajaxy" web-apps)
> that throw any notion of performance straight into the trash anyway (what's
> 100 extra cycles here and there, when the browser/interpreter/OS/whatever
> makes something as simple as navigation and text entry less responsive than
> it was on a 1MHz 6502?).
Bartosz, Andrei, Sean and I have discussed this at length. My personal
view is that nobody actually understands the proper use of fences (the
CPU documentation on exactly what they do is frustratingly obtuse, which
does not help at all). Then there's the issue of fences behaving very
differently on different CPUs. If you use explicit fences, you have no
hope of portability.
To address this, the idea we've been tossing about is to allow only two
operations on shared variables, read and write, implemented as compiler
intrinsics:
    shared int x;
    ...
    int y = shared_read(x);
    shared_write(x, y + 1);

which implements: int y = x++;
(Forget the names of the intrinsics for the moment.)
Yes, it's painfully explicit. But it's easy to visually verify
correctness, and one can grep for them for code review purposes. Each
shared_read and shared_write is guaranteed to be sequentially
consistent, within a thread as well as among multiple threads.
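To make that guarantee concrete, here is the classic store-buffer litmus
test, sketched in C with helper functions standing in for the intrinsics
(the names are placeholders, same as above; one possible x86 lowering for
them is sketched after the next paragraph):

    /* Store-buffer litmus test: x and y start at 0, each thread writes
       one variable and then reads the other. */
    int x = 0, y = 0;
    int r1, r2;

    void thread1(void) { shared_write(&x, 1); r1 = shared_read(&y); }
    void thread2(void) { shared_write(&y, 1); r2 = shared_read(&x); }

If the two accesses in each thread are sequentially consistent, the
outcome r1 == 0 && r2 == 0 is impossible: whichever write comes first in
the single global order is visible to the other thread's read. With plain
x86 MOVs that outcome is allowed and does show up in practice, because
each core's store can sit in its store buffer while the following load
goes to memory. That store-then-load reordering is exactly what an MFENCE
is for.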
How they are implemented is up to the compiler. The compiler can take the
naive approach and lard them up with airtight fences, or a more advanced
compiler can do data flow analysis and compute a reasonable minimum
number of fences required.
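For the naive end of that spectrum, here is one plausible x86 lowering,
sketched as C helpers with GCC-style inline assembly (an assumption about
what a back end could emit, not what any existing compiler does). x86
already keeps loads ordered with later loads and stores, and stores
ordered with later stores; the one reordering it permits is a store
slipping past a later load, so only the write side needs a fence:

    /* Hypothetical naive lowering of the intrinsics for x86.
       GCC-style inline assembly assumed. */

    static inline int shared_read(volatile int *p)
    {
        /* A plain MOV load is never reordered with the loads and
           stores that follow it on x86; the empty asm just keeps the
           compiler from moving other memory accesses across it. */
        int v = *p;
        __asm__ __volatile__("" ::: "memory");
        return v;
    }

    static inline void shared_write(volatile int *p, int v)
    {
        /* The store can be delayed in the store buffer past a later
           load, which breaks sequential consistency; MFENCE drains
           the store buffer before any subsequent load executes. */
        *p = v;
        __asm__ __volatile__("mfence" ::: "memory");
    }

A LOCK-prefixed XCHG for the write is a common alternative with the same
effect, and this is where the smarter compiler earns its keep: data flow
analysis can prove many of those fences redundant and drop them.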
The point here is that only *one* person needs to know how the fences
actually work on the target CPU, the person who writes the compiler back
end. And even that person only needs to solve the problem once. I think
that's a far more tractable problem than trying to educate every
programmer out there on the subtleties of fences for every CPU variant.
Yes, this screws down very tightly what can be done with shared
variables. Once we get this done, and get it right, we'll be able to see
much more clearly where the right places are to loosen those screws.