Memory ordering properties of Atomic::r-m-w operations

Andrew Haley aph at redhat.com
Sun Nov 6 10:54:53 UTC 2016


On 05/11/16 18:43, David Holmes wrote:
> Forking new discussion from:
> 
> RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
> 
> On 1/11/2016 7:44 PM, Andrew Haley wrote:
>> On 31/10/16 21:30, David Holmes wrote:
>>>
>>>
>>> On 31/10/2016 7:32 PM, Andrew Haley wrote:
>>>> On 30/10/16 21:26, David Holmes wrote:
>>>>> On 31/10/2016 4:36 AM, Andrew Haley wrote:
>>>
>>>    // All of the atomic operations that imply a read-modify-write
>>>    // action guarantee a two-way memory barrier across that
>>>    // operation. Historically these semantics reflect the strength
>>>    // of atomic operations that are provided on SPARC/X86. We assume
>>>    // that strength is necessary unless we can prove that a weaker
>>>    // form is sufficiently safe.
>>
>> Mmmm, but that doesn't say anything about a CAS that fails.  But fair
>> enough, I accept your interpretation.
> 
> Granted the above was not written with load-linked/store-conditional
> style implementations in mind; and the historical behaviour on sparc
> and x86 is not affected by failure of the cas, so it isn't called
> out. I should fix that.
> 
>>> But there is some contention as to whether the actual implementations
>>> obey this completely.
>>
>> Linux/AArch64 uses GCC's __sync_val_compare_and_swap, which is specified
>> as a
>>
>>   "full barrier".  That is, no memory operand is moved across the
>>   operation, either forward or backward.  Further, instructions are
>>   issued as necessary to prevent the processor from speculating loads
>>   across the operation and from queuing stores after the operation.
>>
>> ... which reads the same as the language you quoted above, but looking
>> at the assembly code I'm sure that it's really no stronger than a seq
>> cst load followed by a seq cst store.
> 
> Are you saying that a seq_cst load followed by a seq_cst store is weaker 
> than a full barrier?

Probably.  I'm saying that when someone says "full barrier" it isn't
exactly clear what they mean.  I know what sequential consistency is,
but not "full barrier", because the term is used inconsistently.

For example, the above says that no memory operand is moved across the
barrier, but if you have

store_relaxed(a)
load_seq_cst(b)
store_seq_cst(c)
load_relaxed(d)

there's nothing to prevent

load_seq_cst(b)
load_relaxed(d)
store_relaxed(a)
store_seq_cst(c)

It is true that neither the store to a nor the load from d has moved
across the whole operation, but they have exchanged places with each
other.  As far as GCC is concerned this is a correct implementation,
and it meets the requirement of sequential consistency as defined in
the C++ memory model.
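
To make that concrete, here is the same sequence as a minimal C++11
sketch (variable names are mine, purely illustrative):

    #include <atomic>

    std::atomic<int> a, b, c, d;

    void example() {
        a.store(1, std::memory_order_relaxed);       // store_relaxed(a)
        int rb = b.load(std::memory_order_seq_cst);  // load_seq_cst(b)
        c.store(1, std::memory_order_seq_cst);       // store_seq_cst(c)
        int rd = d.load(std::memory_order_relaxed);  // load_relaxed(d)
        // The compiler and CPU may legally perform the relaxed store
        // to a after the seq_cst load of b, and the relaxed load of d
        // before the seq_cst store to c: a and d exchange places even
        // though neither crosses the whole seq_cst pair.
        (void)rb; (void)rd;
    }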

>> I guess maybe I could give up fighting this and implement all AArch64
>> CAS sequences as
>>
>>    CAS(seq_cst); full fence
>>
>> or, even more extremely,
>>
>>    full fence; CAS(relaxed); full fence
>>
>> but it all seems unreasonably heavyweight.
> 
> Indeed. A couple of issues here. If you are thinking in terms of 
> orderAccess::fence() then it needs to guarantee visibility as well as 
> ordering - see this bug I just filed:
> 
> https://bugs.openjdk.java.net/browse/JDK-8169193

Ouch.  Yes, I agree that something needs fixing.  That comment:

// Use release_store_fence to update values like the thread state,
// where we don't want the current thread to continue until all our
// prior memory accesses (including the new thread state) are visible
// to other threads.

... seems very unhelpful, not least because a release fence (in
conventional terminology) does not have that property: a release
fence is only LoadStore|StoreStore.
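
In C++11 terms, a release_store_fence-style update presumably needs
to look something like this sketch (hypothetical names, not the
actual orderAccess code) to get the property the comment describes:

    #include <atomic>

    std::atomic<int> thread_state;

    void release_store_fence_state(int s) {
        // The release part orders prior loads and stores before the
        // store (LoadStore|StoreStore) ...
        thread_state.store(s, std::memory_order_release);
        // ... but only a full fence adds the StoreLoad part, so that
        // later loads cannot be satisfied before the store has become
        // visible to other threads.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }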

> So would be heavier than a "full barrier" that simply combined all
> four membar variants (loadload, loadstore, storeload, storestore).
> Though of course the actual
> implementation on a given architecture may be just as
> heavyweight. And of course the Atomic op must guarantee visibility
> of the successful store (else the atomicity aspect would not be
> present).

I don't think that's exactly right.  As I understand the ARMv8 memory
model, it's possible to have a CAS which imposes no memory ordering or
visibility guarantees at all: it is just a relaxed load and a relaxed
store.  Other threads can still see stale values after the store
unless they themselves attempt a CAS on the location.  This is really
good: it's exactly what you want for some shared counters.
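
For example, a statistics counter that only has to be atomic can be
updated with a fully relaxed CAS loop, along these lines (a C++11
sketch; the name is mine):

    #include <atomic>

    std::atomic<long> counter{0};

    void increment() {
        long old = counter.load(std::memory_order_relaxed);
        // Relaxed CAS: atomic, but imposes no ordering on surrounding
        // memory accesses and promises no visibility beyond the
        // atomicity of the update itself.  On failure,
        // compare_exchange_weak reloads the current value into old.
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed,
                                              std::memory_order_relaxed))
            ;
    }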

> That aside we do not need two "fences" surrounding the atomic
> op. For platforms where the atomic op is a single instruction which
> combines load and store then conceptually all we need is:
> 
> loadload|storeload; op; storeload|storestore
>
> Note this is at odds with the commentary in atomic.hpp which says things 
> like:
> 
>    // <fence> add-value-to-dest <membar StoreLoad|StoreStore>
> 
> I need to check why we settled on the above formulation - I suspect it 
> was conservatism. And of course for the cmpxchg it fails to account for 
> the fact there may not be a store to order with.
> 
> For load-linked/store-conditional based operations that would expand to 
> (assume a retry loop for unrelated store failures):
> 
> loadload|storeload
> temp = ld-linked &val
> cmp temp, expected
> jmp ne
> st-cond &val, newVal
> storeload|storestore
> 
> which is fine if we actually store, but if we find the wrong value
> there is no store for those final barriers to sync with. That then
> raises the question: can subsequent loads and stores move into the
> ld-linked/st-cond region? The general context-free answer would be
> yes, but the actual details may be architecture specific and also
> context dependent - ie the subsequent loads/stores may be dependent
> on the CAS succeeding (or on it failing).  So without further
> knowledge you would need to use a "full-barrier" after the st-cond.

On most (all?) architectures a StoreLoad fence is a full barrier, so
this formulation is equivalent to what I was saying anyway.
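
So in GCC-builtin terms the heavyweight formulation I mentioned above
would come out something like this (a sketch only, not what the
AArch64 port actually emits; the function name is mine):

    #include <stdint.h>

    intptr_t cmpxchg_heavy(volatile intptr_t* dest,
                           intptr_t expected, intptr_t newval) {
        intptr_t old = expected;
        // CAS(seq_cst): on AArch64 I'd expect an LDAXR/STLXR loop.
        __atomic_compare_exchange_n(dest, &old, newval,
                                    /*weak=*/false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        // Trailing full fence (DMB ISH), which also covers the case
        // where the CAS fails and no store is performed.
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        return old;
    }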

Andrew.

