Who Ordered Memory Fences on an x86?
Walter Bright
newshound1 at digitalmars.com
Wed Nov 5 22:00:31 PST 2008
Nick Sabalausky wrote:
> Call me a grumpy old fart, but I'd be happy just tossing fences in
> everywhere (when a multicore is detected) and be done with the whole mess,
> just because trying to wring every little bit of speed from, say, a 3+ GHz
> multicore processor strikes me as a highly unworthy pursuit. I'd rather
> optimize for the lower end and let the fancy overpriced crap handle it
> however it will.
>
> And that's even before tossing in the consideration that (to my dismay) most
> code these days is written in languages/platforms (ex, "Ajaxy" web-apps)
> that throw any notion of performance straight into the trash anyway (what's
> 100 extra cycles here and there, when the browser/interpreter/OS/whatever
> makes something as simple as navigation and text entry less responsive than
> it was on a 1MHz 6502?).
Bartosz, Andrei, Sean and I have discussed this at length. My personal
view is that nobody actually understands the proper use of fences (the
CPU documentation on exactly what they do is frustratingly obtuse, which
does not help at all). Then there's the issue of fences behaving very
differently on different CPUs. If you use explicit fences, you have no
hope of portability.
To address this, the idea we've been tossing about is to allow only two
operations on shared variables, read and write, implemented as compiler
intrinsics:
    shared int x;
    ...
    int y = shared_read(x);
    shared_write(x, y + 1);

which implements: int y = x++;
(Forget the names of the intrinsics for the moment.)
Yes, it's painfully explicit. But it's easy to visually verify
correctness, and one can grep for them for code review purposes. Each
shared_read and shared_write is guaranteed to be sequentially
consistent, within a thread as well as among multiple threads.
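To make that guarantee concrete, here is the classic store-buffer litmus
test, sketched in C with helper functions standing in for the intrinsics
(the names are placeholders, same as above; one possible x86 lowering for
them is sketched after the next paragraph):

    /* Store-buffer litmus test: x and y start at 0, each thread writes
       one variable and then reads the other. */
    int x = 0, y = 0;
    int r1, r2;

    void thread1(void) { shared_write(&x, 1); r1 = shared_read(&y); }
    void thread2(void) { shared_write(&y, 1); r2 = shared_read(&x); }

If the two accesses in each thread are sequentially consistent, the
outcome r1 == 0 && r2 == 0 is impossible: whichever write comes first in
the single global order is visible to the other thread's read. With plain
x86 MOVs that outcome is allowed and does show up in practice, because
each core's store can sit in its store buffer while the following load
goes to memory. That store-then-load reordering is exactly what an MFENCE
is for.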
How they are implemented is up to the compiler. The compiler can take the
naive approach and lard them up with airtight fences, or a more advanced
compiler can do data flow analysis and compute a reasonable minimum
number of fences required.
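For the naive end of that spectrum, here is one plausible x86 lowering,
sketched as C helpers with GCC-style inline assembly (an assumption about
what a back end could emit, not what any existing compiler does). x86
already keeps loads ordered with later loads and stores, and stores
ordered with later stores; the one reordering it permits is a store
slipping past a later load, so only the write side needs a fence:

    /* Hypothetical naive lowering of the intrinsics for x86.
       GCC-style inline assembly assumed. */

    static inline int shared_read(volatile int *p)
    {
        /* A plain MOV load is never reordered with the loads and
           stores that follow it on x86; the empty asm just keeps the
           compiler from moving other memory accesses across it. */
        int v = *p;
        __asm__ __volatile__("" ::: "memory");
        return v;
    }

    static inline void shared_write(volatile int *p, int v)
    {
        /* The store can be delayed in the store buffer past a later
           load, which breaks sequential consistency; MFENCE drains
           the store buffer before any subsequent load executes. */
        *p = v;
        __asm__ __volatile__("mfence" ::: "memory");
    }

A LOCK-prefixed XCHG for the write is a common alternative with the same
effect, and this is where the smarter compiler earns its keep: data flow
analysis can prove many of those fences redundant and drop them.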
The point here is that only *one* person needs to know how the fences
actually work on the target CPU, the person who writes the compiler back
end. And even that person only needs to solve the problem once. I think
that's a far more tractable problem than trying to educate every
programmer out there on the subtleties of fences for every CPU variant.
Yes, this screws down very tightly what can be done with shared
variables. Once we get this done, and get it right, we'll be able to see
much more clearly where the right places are to loosen those screws.