Who Ordered Memory Fences on an x86?

Nick Sabalausky a at a.a
Wed Nov 5 22:25:10 PST 2008


"Walter Bright" <newshound1 at digitalmars.com> wrote in message 
news:geu161$91s$1 at digitalmars.com...
> Nick Sabalausky wrote:
>> Call me a grumpy old fart, but I'd be happy just tossing fences in 
>> everywhere (when a multicore is detected) and be done with the whole 
>> mess, just because trying to wring every little bit of speed from, say, a 
>> 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd 
>> rather optimize for the lower end and let the fancy overpriced crap 
>> handle it however it will.
>>
>> And that's even before tossing in the consideration that (to my dismay) 
>> most code these days is written in languages/platforms (ex, "Ajaxy" 
>> web-apps) that throw any notion of performance straight into the trash 
>> anyway (what's 100 extra cycles here and there, when the 
>> browser/interpreter/OS/whatever makes something as simple as navigation 
>> and text entry less responsive than it was on a 1MHz 6502?).
>
> Bartosz, Andrei, Sean and I have discussed this at length. My personal 
> view is that nobody actually understands the proper use of fences (the CPU 
> documentation on exactly what they do is frustratingly obtuse, which does 
> not help at all). Then there's the issue of fences behaving very 
> differently on different CPUs. If you use explicit fences, you have no 
> hope of portability.
>

From reading the article, I was under the impression that not using explicit 
fences leads to CPUs inevitably making false assumptions and thus spitting 
out erroneous results. So it sounds like explicit fences are a case of 
"damned if you do, damned if you don't": i.e., "Use explicit fences everywhere 
and you get unportable machine code. Don't use explicit fences and you get 
errors." Is this accurate? (If so, what a mess!) Also, one thing I'm a little 
unclear on: is this whole mess only applicable when multiple cores are in 
use, or do the same problems crop up on single-core chips?
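
To make my question concrete, here's a rough sketch -- my own, not from the 
article -- of the kind of reordering I mean. The names (x, y, r1, r2, 
threadA, threadB) are made up, and it assumes a D compiler with core.thread 
available:

import core.thread;

__gshared int x = 0, y = 0;    // plain globals, no synchronization at all
__gshared int r1 = 0, r2 = 0;  // results observed by each thread

void threadA() { x = 1; r1 = y; }   // store to x, then load y
void threadB() { y = 1; r2 = x; }   // store to y, then load x

void main()
{
    auto a = new Thread(&threadA);
    auto b = new Thread(&threadB);
    a.start(); b.start();
    a.join();  b.join();
    // With no fences, each core may let its load slip ahead of its
    // store (x86 store buffering), so r1 == 0 && r2 == 0 is a legal
    // outcome -- the kind of "erroneous result" I mean above.
}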

> To address this, the idea we've been tossing about is to allow the only 
> operations on shared variables to be read and write, implemented as 
> compiler intrinsics:
>
> shared int x;
> ...
> int y = shared_read(x);
> shared_write(x, y + 1);
>
> which implements: int y = x++;
>
> (Forget the names of the intrinsics for the moment.)
>
> Yes, it's painfully explicit. But it's easy to visually verify 
> correctness, and one can grep for them for code review purposes. Each 
> shared_read and shared_write is guaranteed to be sequentially consistent, 
> within a thread as well as among multiple threads.
>

I volunteer to respond to the inevitable "Why does D's shared memory access 
syntax suck so badly?" inquiries with "You can thank the CPU vendors for 
that" ;)  (Or do I misunderstand the root issue?)

> How they are implemented is up to the compiler. The compiler can do the 
> naive approach and lard them up with airtight fences, or a more advanced 
> compiler can do data flow analysis and compute a reasonable minimum number 
> of fences required.
>
> The point here is that only *one* person needs to know how the fences 
> actually work on the target CPU, the person who writes the compiler back 
> end. And even that person only needs to solve the problem once. I think 
> that's a far more tractable problem than trying to educate every 
> programmer out there on the subtleties of fences for every CPU variant.
>
> Yes, this screws down very tightly what can be done with shared variables. 
> Once we get this done, and get it right, we'll be able to see much more 
> clearly where the right places are to loosen those screws.

Seems to make sense.
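
Out of curiosity, here's my guess at what the "naive approach" might look 
like: bracket every shared access with a full fence. This is purely 
illustrative (the function names, the cast, and the use of core.atomic's 
atomicFence are mine, not actual compiler output); a smarter backend would 
notice that, roughly speaking, the only reordering x86 does on ordinary 
memory is letting a store drift past a later load, so far fewer fences are 
needed:

import core.atomic : atomicFence;

int naiveSharedRead(ref shared int v)
{
    atomicFence();                // full barrier before the load
    int result = *cast(int*) &v;  // plain load of the location
    atomicFence();                // full barrier after the load
    return result;
}

void naiveSharedWrite(ref shared int v, int newVal)
{
    atomicFence();                // full barrier before the store
    *cast(int*) &v = newVal;      // plain store to the location
    atomicFence();                // full barrier after the store
}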

Maybe I'm being naive, but would it have made more sense for CPUs to assume 
memory accesses *cannot* be reordered unless told otherwise, instead of the 
other way around? 




