RFR(M): 8202080: Introduce ordering semantics for Atomic::add

Wed Apr 25 22:03:42 UTC 2018

Hi Martin,

On 25/04/2018 11:13 PM, Doerr, Martin wrote:
> Hi David,
> 
>> Correctness comes first, performance comes second.
> Agreed.
> 
> What exactly do you expect to "demonstrate the correctness"?
> A mathematical proof would need a complete precise description which is not available because the set of involved variables is not specified.
> I think a feasible and sufficient way is to have some developers who are familiar with the code check all related variables they are aware of. AFAIK this was done to some extend during the earlier review in which the missing release barrier was found and the findings were addressed.

That process was completely ad-hoc and relied on other people to try and 
imagine what the relaxed semantics might allow that would be wrong in 
practice! That is not the way to approach these kinds of changes.

A mathematical proof is of course not realistic but there has to be a 
demonstration that the code involved has been analysed in a robust 
manner to establish that required properties of the code still hold 
under the relaxed semantics. And the owners of the code involved (e.g. 
GC for cmpxchg changes) need to be the ones to sign-off on this - they 
are the people who need to be persuaded that the change is in fact 
safe/correct.

The bugs in this kind of code are very subtle and rarely exposed through 
normal levels of testing.

David
-----

> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Mittwoch, 25. April 2018 14:56
> To: Doerr, Martin <martin.doerr at sap.com>; Erik Österlund <erik.osterlund at oracle.com>; hotspot-runtime-dev at openjdk.java.net; Aleksey Shipilev <shade at redhat.com>; Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Michihiro Horie (HORIE at jp.ibm.com) <HORIE at jp.ibm.com>
> Subject: Re: RFR(M): 8202080: Introduce ordering semantics for Atomic::add
> 
> Hi Martin,
> 
> On 25/04/2018 12:30 AM, Doerr, Martin wrote:
>> Hi David,
>>
>> I have good news for you. I've asked around here if anybody has strong objections against using memory_order_conservative by default and there were none.
>>
>> We can accept using it by default for all atomics as long as we get a little support to get fixes for performance bottlenecks reviewed. One example is Michihiro's RFE JDK-8154736.
> 
> My position on this is very simple and very clear. If you want to
> introduce changes to memory ordering semantics in shared code then you
> need to demonstrate the correctness of such a change. Correctness comes
> first, performance comes second.
> 
> David
> -----
> 
> 
>> If everybody is fine with that, I'll create a new webrev and use memory_order_conservative with double-fence for all atomics on PPC64 and s390 by default.
>>
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Montag, 23. April 2018 13:36
>> To: Erik Österlund <erik.osterlund at oracle.com>; Doerr, Martin <martin.doerr at sap.com>; hotspot-runtime-dev at openjdk.java.net; Aleksey Shipilev <shade at redhat.com>; Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Michihiro Horie (HORIE at jp.ibm.com) <HORIE at jp.ibm.com>
>> Subject: Re: RFR(M): 8202080: Introduce ordering semantics for Atomic::add
>>
>> Hi Erik,
>>
>> On 23/04/2018 7:44 PM, Erik Österlund wrote:
>>> Hi David,
>>>
>>> On 2018-04-22 14:30, David Holmes wrote:
>>>> Erik,
>>>>
>>>> On 21/04/2018 5:53 PM, Erik Österlund wrote:
>>>>> Hi Martin and David,
>>>>>
>>>>> I for one welcome using C++11 atomic semantics, and would like to
>>>>> outline why.
>>>>>
>>>>> 1) It is compliant with the JMM. Work has been done to unify the two.
>>>>> I think I outlined for my access decorators which JMM property links
>>>>> to each MO_ decorator, which corresponds to a C++11 semantic, that I
>>>>> also have a short description of.
>>>>> 2) It has a well defined meaning, as lots of people with more time on
>>>>> their hands than us have spent many hours defining what it means.
>>>>> 3) It is well known, and new developers that know either of C++11
>>>>> atomics or the JMM can immediately start working with it.
>>>>> 4) Since it is so well-known, compilers know about it too. For
>>>>> platforms that don't offer a whole bunch of ABIs for these semantics,
>>>>> like x86, we have the opportunity to use intrinsics, and let the
>>>>> compiler know what we are trying to do. This allows more freedom such
>>>>> as using consume, that we can not represent with volatiles and inline
>>>>> assembly.
>>>>> 5) For similar reasons, the hardware know about it too. If we had a
>>>>> notion of seq_cst (which is what this counter needs), the AArch64
>>>>> hardware could provide the most appropriate instructions that map
>>>>> directly to that.
>>>>> 6) For our TSO platforms, it will make little difference. But for SAP
>>>>> and RedHat, this makes a much bigger difference, to the extent that
>>>>> PPC can not pragmatically use the conservative model we have
>>>>> provided. Therefore we have ended up in a situation where we have a
>>>>> contract that is not followed, and people like me writing workarounds
>>>>> out of politeness. I would rather have a weaker contract I can rely
>>>>> on, that is actually implemented properly in all backends, than have
>>>>> a strong contract that looks great, but is too unpragmatic to be
>>>>> implemented in the very backends where it matters the most, and hence
>>>>> is unreliable in practice.
>>>>
>>>> It has not been established that the weaker contract can be relied
>>>> upon. Simply saying "it seems to work fine" is not good enough. Plenty
>>>> of code with missing memory barriers seemed to work fine, right up
>>>> until it stopped working fine.
>>>>
>>>> I have no issues with expanding our support for memory access modes -
>>>> as long as we clearly document exactly what semantics they have - but
>>>> before applying a relaxed mode to an existing conservative operation
>>>> you have to "prove" that it is correct to do so. As I mentioned the
>>>> last time this was attempted with a relaxed cmpxchg, it was shown to
>>>> not be a trivial exercise to get right.
>>>
>>> Just to be clear, I was talking about how I welcome being *able* to use
>>> weaker semantics in Atomic to accommodate the platforms that need it. I
>>> was not talking about whether the existing callsites of Atomic::add
>>> should be downgraded to a weaker ordering without analysis what the
>>> consequences would be. I think that is what you are skeptical against. I
>>> am too, and do not know how I feel about that. But let's analyze what
>>> would happen in terms of fencing for the platforms in our repo if that
>>> was the case.
>>>
>>> I think the proposal is to change Atomic::add to have acq_rel semantics
>>> by default (which is what in practice PPC has been using), which would
>>> downgrade the expected memory ordering requirements on existing
>>> callsites to Atomic::add. Let's have a look at what would happen on
>>> those call sites on a selection of platforms:
>>>
>>> 1) x86/x64: No difference. Was a lock xadd, will continue to be so.
>>> 2) SPARC: No difference. Was load phi; cas phi; loop. Will continue be so.
>>> 3) AArch64: No difference. The GCC bindings expand to exclusive
>>> load-acquire/release-store loop. Will continue to be so (as the AArch64
>>> interpretation of acquire/release is sequentially consistent, and the
>>> AArch64 port apparently decided to stop at sequential consistency).
>>> 4) ARM: The ARM AArch64 port uses explicit ldaxr/stlxr bindings. Same
>>> storyas above. The non-AArch64 case uses a ldrex/strex loop with both
>>> leading and trailing dmb sy, when LL/SC is available and otherwise uses
>>> a not fenced at all loop, presumably because we already know there is no
>>> concurrency in such systems.
>>> 4) S390: No difference. Used atomic load and ALU or CAS loop, depending
>>> on hardware availability, and will continue to do so.
>>> 5) PPC: Can now remove bidirectional sync, and replace with leading
>>> lwsync and trailing branch + isync. This is seemingly the only platform
>>> yet that can take advantage of this and generate better code. Therefore,
>>> it is the only platform where things would look difference. But they
>>> already skipped the official contract and used fencing corresponding to
>>> acq_rel since a relatively long time. Therefore, in practice, they would
>>> not change either.
>>
>> The implementation is not the issue, it is the semantics. If the call
>> sites expect full bidirectional fences then that is what should be
>> provided unless it can be shown that acquire-release semantics are
>> sufficient for those callsites. If a new x86 chip came out tomorrow that
>> allowed acq-rel weaker than use of "lock", and we coded to that, then
>> code may start to break if it actually requires the existing stronger
>> fences.
>>
>>> So I guess for the ports in our repo, we will continue to generate the
>>> same code as we did before, whether we have changed to a weaker contract
>>> or not. So it seems to me that whatever risk we take by blindly
>>> downgrading our Atomic::add callsites to be acq_rel, from the expected
>>> "conservative", is a risk that will only affect the PPC port (in our
>>> repo). And it is a risk that they have already taken.
>>
>> It's a hole that other platforms may unwittingly fall into in the
>> future. And it's a hole we've fallen into in the past when sledge-hammer
>> barriers hid the fact the wrong semantic barrier was being used.
>>
>>> Having said that, it would of course feel better if somebody had gone
>>> through all the call sites and checked that acq_rel is indeed
>>> sufficient. Perhaps SAP has done such analysis already?
>>
>> If they have then that should be shared. Otherwise shared code should
>> remain using the existing conservative barrier, until it is shown to be
>> safe to do otherwise.
>>
>> Cheers,
>> David
>>
>>> Thanks,
>>> /Erik
>>>
>>>> David
>>>> -----
>>>>
>>>>> 7) The weaker contract forces us to think more about our lock-free
>>>>> interactions and why it works. But I would argue we should not write
>>>>> lock-free code if we do not understand why it works anyway.
>>>>>
>>>>> Thanks,
>>>>> /Erik
>>>>>
>>>>> On 2018-04-21 01:08, David Holmes wrote:
>>>>>> Hi Martin,
>>>>>>
>>>>>> You've added:
>>>>>>
>>>>>>     enum cmpxchg_memory_order {
>>>>>>       memory_order_relaxed = 0,
>>>>>> !   memory_order_acquire = 2,
>>>>>> !   memory_order_release = 3,
>>>>>> !   memory_order_acq_rel = 4,
>>>>>>
>>>>>> but not defined them. What do they mean? Are they the same as
>>>>>> C++11/17? Do we actually want that? There was a discussion on all
>>>>>> this when the enum was introduced. Also note the enum is named
>>>>>> cmpxchg_memory_order - if we're going to generalise this then it
>>>>>> should be renamed and behaviour clearly specified.
>>>>>>
>>>>>> !   inline static D add(I add_value, D volatile* dest,
>>>>>> !                       cmpxchg_memory_order order =
>>>>>> memory_order_acq_rel);
>>>>>>
>>>>>> The default should be conservative - to match what is stated in
>>>>>> atomic.hpp regarding expectations for all atomic r-m-w operations.
>>>>>>
>>>>>>    63   // All of the atomic operations that imply a read-modify-write
>>>>>> action
>>>>>>     64   // guarantee a two-way memory barrier across that operation.
>>>>>> Historically
>>>>>>     65   // these semantics reflect the strength of atomic operations
>>>>>> that are
>>>>>>     66   // provided on SPARC/X86. We assume that strength is
>>>>>> necessary unless
>>>>>>     67   // we can prove that a weaker form is sufficiently safe.
>>>>>>
>>>>>> That's how "conservative" was defined when this was introduced. Some
>>>>>> of the ports chose to not follow this "contract" and that's up to
>>>>>> them, but only those ports should be utilising weaker operations
>>>>>> unless it can be proven that a weaker form is always correct. The
>>>>>> onus is on the person changing the code. (There was long discussion
>>>>>> when a relaxed cmpxchg was attempted to be introduced and it was
>>>>>> shown to not be correct.)
>>>>>>
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>> On 21/04/2018 2:04 AM, Doerr, Martin wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> JDK-8195099 "Concurrent safe-memory-reclamation mechanism"
>>>>>>> introduced a usage of Atomic::add for which we're not using the
>>>>>>> correct memory barriers on PPC64 and s390.
>>>>>>>
>>>>>>> Some more description can be found in the bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8202080
>>>>>>>
>>>>>>> We can fix this by the following change:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~mdoerr/8202080_atomic_add/webrev.00/
>>>>>>>
>>>>>>> Please note that I only implemented the platform code for linux
>>>>>>> x86, PPC64 and s390 so far. The remaining ones should be trivial
>>>>>>> and I'll take care of them after I got some feedback.
>>>>>>>
>>>>>>> I can also do the same change for other atomic functions like sub,
>>>>>>> inc and dec if you like.
>>>>>>>
>>>>>>> Intention of this change is only to fix this new code and possibly
>>>>>>> other future enhancements, not to change the behavior of any older
>>>>>>> usages of Atomic::add.
>>>>>>>
>>>>>>> Feedback is welcome.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>
>>>