[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

Mon Dec 1 20:40:51 UTC 2014

Hans,

(Thanks for your excellent work on C/C++ 11 and your eternal patience)

On Tue, Nov 25, 2014 at 11:15 AM, Hans Boehm <boehm at acm.org> wrote:
> It seems to me that a (dubiously named) loadFence is intended to have
> essentially the same semantics as the (perhaps slightly less dubiously
> named) C++ atomic_thread_fence(memory_order_acquire), and a storeFence
> matches atomic_thread_fence(memory_order_release).  The C++ standard and,
> even more so, Mark Batty's work have a precise definition of what those mean
> in terms of implied "synchronizes with" relationships.
>
> It looks to me like this whole implementation model for volatiles in terms
> of fences is fundamentally doomed, and it probably makes more sense to get
> rid of it rather than spending time on renaming it (though we just did the
> latter in Android to avoid similar confusion about semantics).  It's

I would also like to see us align with C11, to leverage the technical
and cultural work already done there.  I would like to see Unsafe get
load-acquire and store-release methods, and these should be used in
preference to fences where possible.  I'd like to see the C11 wording
reused as much as possible.  The meanings of the words "acquire" and
"release" are now "owned" by the C11 community and we should tag
along.

A better API for Unsafe would be

putOrdered -> storeRelease
put -> storeRelaxed
(ordinary volatile write) -> store (default is sequentially consistent)

etc ...

but the high cost of renaming methods in Unsafe probably makes this a
no-go, even though Unsafe is not a public API in theory.
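
For reference, the proposed names line up with C++11 stores roughly as
follows.  This is only a sketch: storeRelease, storeRelaxed and store
are hypothetical C++ helpers named after the suggestion above, not
existing Unsafe methods, and treating a plain put as a relaxed atomic
store is an approximation.

```cpp
#include <atomic>

// Hypothetical helpers named after the proposed Unsafe renaming.
// They are illustrative only; none of them exists in sun.misc.Unsafe.

// putOrdered -> storeRelease: earlier writes in this thread become visible
// to any thread that later acquire-loads the value written here.
inline void storeRelease(std::atomic<int>& a, int v) {
    a.store(v, std::memory_order_release);
}

// put -> storeRelaxed (approximation): atomicity only, no ordering guarantees.
inline void storeRelaxed(std::atomic<int>& a, int v) {
    a.store(v, std::memory_order_relaxed);
}

// ordinary volatile write -> store: the default order is
// std::memory_order_seq_cst, i.e. sequentially consistent.
inline void store(std::atomic<int>& a, int v) {
    a.store(v);
}
```

All three write the same value; they differ only in the ordering they
promise relative to surrounding memory accesses.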

At least the documentation of all the methods should indicate what the
memory effects and the corresponding C++11 memory model interpretation
are.

E.g. Unsafe.compareAndSwap should document the memory effects, i.e.
sequential consistency.
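
In C++11 terms, the documentation I have in mind would amount to
something like the sketch below.  The compareAndSwap function is a
hypothetical stand-in written for this mail, assuming the intended
semantics really are sequentially consistent on both success and
failure; it is not the Unsafe implementation.

```cpp
#include <atomic>

// Hypothetical analogue of Unsafe.compareAndSwapInt, written to make the
// memory effects explicit. Returns true if the value was `expected` and
// has been replaced by `desired`.
//
// Memory effects (assumed): sequentially consistent. The ordering is
// spelled out here even though seq_cst is already the C++11 default.
inline bool compareAndSwap(std::atomic<int>& a, int expected, int desired) {
    // `expected` is a by-value copy, so the caller's variable is never
    // overwritten on failure (matching the Java-style boolean result).
    return a.compare_exchange_strong(expected, desired,
                                     std::memory_order_seq_cst);
}
```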

Unsafe doesn't currently have a readAcquire method (mirror of
putOrdered) probably because volatile read is _almost_ the same (but
not on ppc!).
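
To make the pairing concrete, here is a small C++11 sketch of a
store-release publishing data that a load-acquire then consumes.
Mailbox, publish and tryConsume are names invented for this example;
the acquire load plays the role a readAcquire method would have.

```cpp
#include <atomic>

// Hypothetical one-slot mailbox used only to illustrate the pairing.
struct Mailbox {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};   // publication flag
};

// Writer side: the plain write to payload is ordered before the
// release store, so it cannot become visible after the flag.
inline void publish(Mailbox& m, int value) {
    m.payload = value;
    m.ready.store(true, std::memory_order_release);
}

// Reader side: if the acquire load observes `true`, it synchronizes with
// the release store above and the read of payload sees the published value.
// Returns false (leaving `out` untouched) if nothing was published yet.
inline bool tryConsume(Mailbox& m, int& out) {
    if (!m.ready.load(std::memory_order_acquire)) {
        return false;
    }
    out = m.payload;
    return true;
}
```

Neither side needs a standalone fence; the ordering is attached to the
one store and the one load that actually matter.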

> fundamentally incompatible with the way volatiles/atomics are intended to be
> implemented on ARMv8 (and Itanium).  Which I think fundamentally get this
> much closer to right than traditional fence-based ISAs.
>
> I'm no hardware architect, but fundamentally it seems to me that
>
> load x
> acquire_fence
>
> imposes a much more stringent constraint than
>
> load_acquire x
>
> Consider the case in which the load from x is an L1 hit, but a preceding
> load (from say y) is a long-latency miss.  If we enforce ordering by just
> waiting for completion of prior operations, the former has to wait for the
> load from y to complete; while the latter doesn't.  I find it hard to
> believe that this doesn't leave an appreciable amount of performance on the
> table, at least for some interesting microarchitectures.

I agree.  Fences should be used rarely.
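
The two forms Hans compares look like this in C++11.  This is a sketch
with function names of my own choosing; the comments restate his
argument and make no claim about any particular microarchitecture.

```cpp
#include <atomic>

// Form 1: "load x; acquire_fence".
// The standalone fence applies to every earlier load in this thread,
// including the possibly slow load from y.
inline int fenceForm(const std::atomic<int>& y, const std::atomic<int>& x) {
    int a = y.load(std::memory_order_relaxed);  // may be a long-latency miss
    int b = x.load(std::memory_order_relaxed);  // may be an L1 hit
    std::atomic_thread_fence(std::memory_order_acquire);
    return a + b;
}

// Form 2: "load_acquire x".
// Only the load from x carries the acquire ordering; the load from y
// is not constrained by it.
inline int acquireForm(const std::atomic<int>& y, const std::atomic<int>& x) {
    int a = y.load(std::memory_order_relaxed);
    int b = x.load(std::memory_order_acquire);
    return a + b;
}
```

Both return the same value; the difference is only in how much ordering
the hardware is asked to enforce.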