Hi Kim, Erik, and Martin, Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM. Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites. I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough. Best regards, -- Michihiro, IBM Research - Tokyo From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Erik, the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists. PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC. But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume. Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha. Thanks and best regards, Martin -----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" ( https://dl.acm.org/citation.cfm?id=1993520 ), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me. Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.