Memory ordering properties of Atomic::r-m-w operations

David Holmes david.holmes at oracle.com
Sat Nov 5 18:43:52 UTC 2016


Forking new discussion from:

RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64

On 1/11/2016 7:44 PM, Andrew Haley wrote:
> On 31/10/16 21:30, David Holmes wrote:
>>
>>
>> On 31/10/2016 7:32 PM, Andrew Haley wrote:
>>> On 30/10/16 21:26, David Holmes wrote:
>>>> On 31/10/2016 4:36 AM, Andrew Haley wrote:
>>>>>
>>>>> And, while we're on the subject, is memory_order_conservative actually
>>>>> defined anywhere?
>>>>
>>>> No. It was chosen to represent the current status quo that the Atomic::
>>>> ops should all be (by default) full bi-directional fences.
>>>
>>> Does that mean that a CAS is actually stronger than a load acquire
>>> followed by a store release?  And that a CAS is a release fence even
>>> when it fails and no store happens?
>>
>> Yes. Yes.
>>
>>    // All of the atomic operations that imply a read-modify-write
>>    // action guarantee a two-way memory barrier across that
>>    // operation. Historically these semantics reflect the strength
>>    // of atomic operations that are provided on SPARC/X86. We assume
>>    // that strength is necessary unless we can prove that a weaker
>>    // form is sufficiently safe.
>
> Mmmm, but that doesn't say anything about a CAS that fails.  But fair
> enough, I accept your interpretation.

Granted, the above was not written with load-linked/store-conditional 
style implementations in mind; and the historical behaviour on SPARC and 
x86 is not affected by failure of the CAS, so that case isn't called 
out. I should fix that.
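To make the intended contract concrete in C++11 terms (purely illustrative - HotSpot's cmpxchg is per-platform code and the names here are made up), a "conservative" cmpxchg is one where the full fences are issued whether or not the exchange succeeds:

```cpp
#include <atomic>

// Hypothetical sketch of the conservative contract: a cmpxchg that is a
// two-way memory barrier even when the compare fails and no store happens.
template <typename T>
T cmpxchg_conservative(std::atomic<T>& dest, T exchange_value, T compare_value) {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // leading full fence
    T observed = compare_value;
    dest.compare_exchange_strong(observed, exchange_value,
                                 std::memory_order_relaxed,
                                 std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // trailing full fence,
                                                          // issued on failure too
    return observed;  // previous value, as HotSpot's cmpxchg returns
}
```

Note the trailing fence is unconditional: nothing in the failure path is allowed to skip it.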

>> But there is some contention as to whether the actual implementations
>> obey this completely.
>
> Linux/AArch64 uses GCC's __sync_val_compare_and_swap, which is specified
> as a
>
>   "full barrier".  That is, no memory operand is moved across the
>   operation, either forward or backward.  Further, instructions are
>   issued as necessary to prevent the processor from speculating loads
>   across the operation and from queuing stores after the operation.
>
> ... which reads the same as the language you quoted above, but looking
> at the assembly code I'm sure that it's really no stronger than a seq
> cst load followed by a seq cst store.

Are you saying that a seq_cst load followed by a seq_cst store is weaker 
than a full barrier?

> I guess maybe I could give up fighting this and implement all AArch64
> CAS sequences as
>
>    CAS(seq_cst); full fence
>
> or, even more extremely,
>
>    full fence; CAS(relaxed); full fence
>
> but it all seems unreasonably heavyweight.

Indeed. A couple of issues here. If you are thinking in terms of 
orderAccess::fence() then it needs to guarantee visibility as well as 
ordering - see this bug I just filed:

https://bugs.openjdk.java.net/browse/JDK-8169193

So it would be heavier than a "full barrier" that simply combined all 
four membar variants (loadload, loadstore, storeload, storestore). 
Though of course the actual implementation on a given architecture may 
be just as heavyweight. And of course the Atomic op must guarantee 
visibility of the successful store (else the atomicity aspect would not 
be present).

That aside, we do not need two "fences" surrounding the atomic op. For 
platforms where the atomic op is a single instruction which combines the 
load and store, conceptually all we need is:

loadload|storeload; op; storeload|storestore
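In C++11 fence terms that bracketing could be sketched roughly as follows (illustrative only: seq_cst fences over-approximate the individual membar variants named above, and the function name is hypothetical):

```cpp
#include <atomic>

// Sketch of the minimal bracketing for a single-instruction RMW such as
// add-and-fetch. A seq_cst fence is at least as strong as the
// loadload|storeload and storeload|storestore combinations respectively.
int add_and_fetch_fenced(std::atomic<int>& dest, int add_value) {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // >= loadload|storeload
    int result = dest.fetch_add(add_value, std::memory_order_relaxed) + add_value;
    std::atomic_thread_fence(std::memory_order_seq_cst);  // >= storeload|storestore
    return result;  // new value, as Atomic::add returns
}
```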

Note this is at odds with the commentary in atomic.hpp which says things 
like:

   // <fence> add-value-to-dest <membar StoreLoad|StoreStore>

I need to check why we settled on the above formulation - I suspect it 
was conservatism. And of course for the cmpxchg it fails to account for 
the fact that there may not be a store to order with.

For load-linked/store-conditional based operations that would expand to 
(assuming a retry loop for spurious store-conditional failures):

loadload|storeload
temp = ld-linked &val
cmp temp, expected
branch-if-ne done        ; wrong value: no store is executed
st-cond &val, newVal     ; (retry on spurious failure)
done:
storeload|storestore

which is fine if we actually store, but if we find the wrong value there 
is no store for those final barriers to pair with. That then raises the 
question: can subsequent loads and stores move into the 
ld-linked/st-cond region? The general, context-free answer would be yes, 
but the actual details may be architecture-specific and also 
context-dependent - i.e. the subsequent loads/stores may be dependent on 
the CAS succeeding (or on it failing). So without further knowledge you 
would need to use a "full barrier" after the st-cond.
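A C++ sketch of that shape (hypothetical: compare_exchange_weak stands in for the ld-linked/st-cond loop, including its spurious failures, and the trailing fence is issued unconditionally so the failure path is covered too):

```cpp
#include <atomic>

template <typename T>
T cmpxchg_ll_sc(std::atomic<T>& dest, T exchange_value, T compare_value) {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // loadload|storeload side
    T observed = compare_value;
    // compare_exchange_weak models st-cond: it may fail spuriously, in which
    // case observed still equals compare_value and we retry the ld-linked/st-cond.
    while (!dest.compare_exchange_weak(observed, exchange_value,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed)) {
        if (observed != compare_value)
            break;  // wrong value: branch out with no store executed
    }
    // Without knowing whether later accesses depend on success or failure,
    // the trailing barrier must be issued unconditionally.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return observed;
}
```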

David
-----


>>> And that a conservative load is a *store* barrier?
>>
>> Not sure what you mean. Atomic::load is not a r-m-w action so not
>> expected to be a two-way memory barrier.
>
> OK.
>
> Thanks,
>
> Andrew.
>

