Memory ordering properties of Atomic::r-m-w operations

Tue Nov 8 01:11:57 UTC 2016

On 6/11/2016 8:54 PM, Andrew Haley wrote:
> On 05/11/16 18:43, David Holmes wrote:
>> Forking new discussion from:
>>
>> RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
>>
>> On 1/11/2016 7:44 PM, Andrew Haley wrote:
>>> On 31/10/16 21:30, David Holmes wrote:
>>>>
>>>>
>>>> On 31/10/2016 7:32 PM, Andrew Haley wrote:
>>>>> On 30/10/16 21:26, David Holmes wrote:
>>>>>> On 31/10/2016 4:36 AM, Andrew Haley wrote:
>>>>
>>>>    // All of the atomic operations that imply a read-modify-write
>>>>    // action guarantee a two-way memory barrier across that
>>>>    // operation. Historically these semantics reflect the strength
>>>>    // of atomic operations that are provided on SPARC/X86. We assume
>>>>    // that strength is necessary unless we can prove that a weaker
>>>>    // form is sufficiently safe.
>>>
>>> Mmmm, but that doesn't say anything about a CAS that fails.  But fair
>>> enough, I accept your interpretation.
>>
>> Granted the above was not written with load-linked/store-conditional
>> style implementations in mind; and the historical behaviour on sparc
>> and x86 is not affected by failure of the cas, so it isn't called
>> out. I should fix that.
>>
>>>> But there is some contention as to whether the actual implementations
>>>> obey this completely.
>>>
>>> Linux/AArch64 uses GCC's __sync_val_compare_and_swap, which is specified
>>> as a
>>>
>>>   "full barrier".  That is, no memory operand is moved across the
>>>   operation, either forward or backward.  Further, instructions are
>>>   issued as necessary to prevent the processor from speculating loads
>>>   across the operation and from queuing stores after the operation.
>>>
>>> ... which reads the same as the language you quoted above, but looking
>>> at the assembly code I'm sure that it's really no stronger than a seq
>>> cst load followed by a seq cst store.
>>
>> Are you saying that a seq_cst load followed by a seq_cst store is weaker
>> than a full barrier?
>
> Probably.  I'm saying that when someone says "full barrier" they
> aren't exactly clear what that means.  I know what sequential
> consistency is, but not "full barrier" because it's used
> inconsistently.

Agreed it is not a term that has a common definition - it may just
relate to no reordering of any loads or stores, or it may also imply
visibility guarantees. That said, while I know what "sequential
consistency" is, I do not know exactly what it means to implement an
operation with seq_cst semantics.

> For example, the above says that no memory operand is moved across the
> barrier, but if you have
>
> store_relaxed(a)
> load_seq_cst(b)
> store_seq_cst(c)
> load_relaxed(d)
>
> there's nothing to prevent
>
> load_seq_cst(b)
> load_relaxed(d)
> store_relaxed(a)
> store_seq_cst(c)
>
> It is true that neither store a nor load d have moved across this
> operation, but they have exchanged places.  As far as GCC is concerned
> this is a correct implementation, and it does meet the requirement of
> sequential consistency as defined in the C++ memory model.

It does? Then it emphasises what I just said about not knowing what it 
means to implement an operation with seq_cst semantics. I would have 
expected full ordering of all loads and stores to get "sequential 
consistency".
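
For concreteness, the four-access sequence above can be written with
C++11 atomics roughly as follows. This is only a sketch: the variable
names a, b, c, d are taken from the example, the function name is made
up for illustration, and a single-threaded run cannot show the
reordering; it only pins down which accesses are seq_cst and which are
relaxed.

```cpp
#include <atomic>

std::atomic<int> a{0}, b{0}, c{0}, d{0};

// Hypothetical helper: the four accesses from the example above.
// Only the accesses to b and c take part in the single total order of
// seq_cst operations. Nothing here orders the relaxed store to a
// against the relaxed load of d as seen by other threads.
int four_accesses() {
  a.store(1, std::memory_order_relaxed);        // store_relaxed(a)
  int rb = b.load(std::memory_order_seq_cst);   // load_seq_cst(b)
  c.store(1, std::memory_order_seq_cst);        // store_seq_cst(c)
  int rd = d.load(std::memory_order_relaxed);   // load_relaxed(d)
  return rb + rd;
}
```

If the intent is that a and d must not exchange places, the C++ model
seems to want an explicit seq_cst fence between them rather than
relying on the seq_cst accesses to b and c.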

>>> I guess maybe I could give up fighting this and implement all AArch64
>>> CAS sequences as
>>>
>>>    CAS(seq_cst); full fence
>>>
>>> or, even more extremely,
>>>
>>>    full fence; CAS(relaxed); full fence
>>>
>>> but it all seems unreasonably heavyweight.
>>
>> Indeed. A couple of issues here. If you are thinking in terms of
>> orderAccess::fence() then it needs to guarantee visibility as well as
>> ordering - see this bug I just filed:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8169193
>
> Ouch.  Yes, I agree that something needs fixing.  That comment:
>
> // Use release_store_fence to update values like the thread state,
> // where we don't want the current thread to continue until all our
> // prior memory accesses (including the new thread state) are visible
> // to other threads.
>
> ... seems very unhelpful, at least because a release fence (using
> conventional terminology) does not have that property: a release
> fence is only LoadStore|StoreStore.

In release_store_fence the release and fence are distinct memory 
ordering components. It is not a store combined with a "release fence" 
but a store between a "release" and a "fence". And critically in hotspot 
that "fence" must have visibility guarantees to ensure correctness of 
Dekker-duality algorithms.

Note the equivalence of release() with LoadStore|StoreStore is a 
definition within orderAccess.hpp, it is not a general equivalence.
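
Read that way, the shape is roughly the following. This is a sketch in
C++11 atomics, not the actual OrderAccess code: the names
release_store_fence_sketch, flag_a, flag_b, side_a and side_b are
invented for illustration, and std::atomic_thread_fence(seq_cst) is
only standing in for hotspot's fence(), on the assumption that it is
strong enough for the Dekker-style pattern shown.

```cpp
#include <atomic>

std::atomic<int> flag_a{0}, flag_b{0};

// "release", then the store, then "fence", as three separate steps.
inline void release_store_fence_sketch(std::atomic<int>& dest, int v) {
  std::atomic_thread_fence(std::memory_order_release);  // "release"
  dest.store(v, std::memory_order_relaxed);             // the store
  std::atomic_thread_fence(std::memory_order_seq_cst);  // "fence"
}

// Dekker duality: each side publishes its own flag, then reads the
// other's. With the trailing fence on both sides, the two sides
// cannot both read 0.
int side_a() {
  release_store_fence_sketch(flag_a, 1);
  return flag_b.load(std::memory_order_relaxed);
}

int side_b() {
  release_store_fence_sketch(flag_b, 1);
  return flag_a.load(std::memory_order_relaxed);
}
```

Dropping the trailing fence from either side would allow both loads to
return 0, which is the case the thread-state protocol cannot tolerate.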

>> So would be heavier than a "full barrier" that simply combined all
>> four storeload membar variants. Though of course the actual
>> implementation on a given architecture may be just as
>> heavyweight. And of course the Atomic op must guarantee visibility
>> of the successful store (else the atomicity aspect would not be
>> present).
>
> I don't think that's exactly right. As I understand the ARMv8 memory
> model, it's possible to have a CAS which imposes no memory ordering or
> visibility at all: it's a relaxed load and a relaxed store.  Other
> threads can still see stale values of the store unless they attempt a
> CAS.  This is really good: it's exactly what you want for some shared
> counters.

Okay - yes - a naked "relaxed" load need not see the result of a recent 
successful "CAS". But the load-with-reservation within a "CAS" must see 
such a store I would think, to ensure things work correctly - though I 
suppose that could also be handled at the store-with-reservation point. 
Which suggests that a CAS with a "full two-way memory barrier" on ARMv8 
does indeed need a fairly heavy pre- and post-op memory barrier (which 
makes me wonder whether the reservation using ld.acq and st.rel can be 
efficiently strengthened as needed, or whether plain ld and st would be 
more efficient within the overall sequence).
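
For reference, the kind of relaxed CAS being described for shared
counters would look something like this in C++11. It is a sketch under
the assumption that the counter's value is the only thing
communicated; relaxed_increment is a made-up name, not an existing
hotspot function.

```cpp
#include <atomic>

// Hypothetical helper: increment via a CAS loop that imposes no
// ordering on surrounding accesses. Atomicity of the update itself
// is still guaranteed; only ordering and timely visibility are not.
int relaxed_increment(std::atomic<int>& counter) {
  int old_value = counter.load(std::memory_order_relaxed);
  // On failure compare_exchange_weak reloads old_value, so the loop
  // simply retries with the freshly observed value.
  while (!counter.compare_exchange_weak(old_value, old_value + 1,
                                        std::memory_order_relaxed,
                                        std::memory_order_relaxed)) {
  }
  return old_value + 1;
}
```

No increment is lost, but a plain relaxed load elsewhere may still
observe an older value for some time.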

>> That aside we do not need two "fences" surrounding the atomic
>> op. For platforms where the atomic op is a single instruction which
>> combines load and store then conceptually all we need is:
>>
>> loadload|storeload; op; storeload|storestore
>>
>> Note this is at odds with the commentary in atomic.hpp which says things
>> like:
>>
>>    // <fence> add-value-to-dest <membar StoreLoad|StoreStore>
>>
>> I need to check why we settled on the above formulation - I suspect it
>> was conservatism. And of course for the cmpxchg it fails to account for
>> the fact there may not be a store to order with.

Just a note that, for example, SPARC does not require a CAS to succeed
for a subsequent membar to treat the CAS as a load+store.
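
In C++11 terms, a formulation whose trailing barrier does not depend on
the store actually happening might look like the sketch below. It
follows the "full fence; CAS(relaxed); full fence" shape mentioned
earlier; conservative_cmpxchg is a hypothetical name, and whether a
seq_cst fence matches what hotspot means by a two-way barrier is
exactly the open question, so treat that as an assumption.

```cpp
#include <atomic>

// Hypothetical helper: returns the value observed in dest, whether or
// not the exchange happened. Both fences execute on the failure path
// too, so ordering does not rely on a store having taken place.
int conservative_cmpxchg(std::atomic<int>& dest, int expected, int desired) {
  std::atomic_thread_fence(std::memory_order_seq_cst);   // leading barrier
  dest.compare_exchange_strong(expected, desired,
                               std::memory_order_relaxed,
                               std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);   // trailing barrier
  // On failure 'expected' now holds the observed value; on success it
  // is unchanged and equals the old value.
  return expected;
}
```

This is the heavyweight end of the spectrum; the point is only that the
failing path gets the same barriers as the succeeding one.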

>>
>> For load-linked/store-conditional based operations that would expand to
>> (assume a retry loop for unrelated store failures):
>>
>> loadLoad|storeLoad
>> temp = ld-linked &val
>> cmp temp, expected
>> jmp ne
>> st-cond &val, newVal
>> storeload|storestore
>>
>> which is fine if we actually store, but if we find the wrong value
>> there is no store for those final barriers to sync with. That then
>> raises the question: can subsequent loads and stores move into the
>> ld-linked/st-cond region? The general context-free answer would be
>> yes, but the actual details may be architecture specific and also
>> context dependent - ie the subsequent loads/stores may be dependent
>> on the CAS succeeding (or on it failing).  So without further
>> knowledge you would need to use a "full-barrier" after the st-cond.
>
> On most (all?) architectures a StoreLoad fence is a full barrier, so
> this formulation is equivalent to what I was saying anyway.

I'm trying to distinguish the desired semantics from any actual
implementation mechanism. The fact that, for example, on SPARC and x86
the only explicit barrier needed is storeLoad, so that having it
effectively gives you a "full barrier" because the other three are
implicit, is incidental.

Cheers,
David

> Andrew.
>