<html><body><p><tt><font size="2">>> On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:<br>>> <br>>> Hi Kim, Erik, and Martin,<br>>> <br>>> Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. <br>>> <br>>> I uploaded a new webrev : </font></tt><tt><font size="2"><a href="http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/">http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/</a></font></tt><tt><font size="2"><br>>> This change uses forwardee_acquire(), which would generate better code on ARM. <br>>> <br>>> Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.<br>>> <br>>> I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.<br>><br>>Looks good.</font></tt><br><br><font size="2">Thanks a lot, Kim!</font><br><br><br><font size="2">Best regards,</font><br><font size="2">--</font><br><font size="2">Michihiro,</font><br><font size="2">IBM Research - Tokyo</font><br><br><font size="2" color="#5F5F5F">From: </font><font size="2">Kim Barrett <kim.barrett@oracle.com></font><br><font size="2" color="#5F5F5F">To: </font><font size="2">Michihiro Horie <HORIE@jp.ibm.com></font><br><font size="2" color="#5F5F5F">Cc: </font><font size="2">"Doerr, Martin" <martin.doerr@sap.com>, "Andrew
Haley (aph@redhat.com)" <aph@redhat.com>, "david.holmes@oracle.com" <david.holmes@oracle.com>, "Erik Österlund" <erik.osterlund@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net></font><br><font size="2" color="#5F5F5F">Date: </font><font size="2">2018/06/05 05:08</font><br><font size="2" color="#5F5F5F">Subject: </font><font size="2">Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br><br><tt><font size="2">> On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:<br>> <br>> Hi Kim, Erik, and Martin,<br>> <br>> Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. <br>> <br>> I uploaded a new webrev : </font></tt><tt><font size="2"><a href="http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/">http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/</a></font></tt><tt><font size="2"><br>> This change uses forwardee_acquire(), which would generate better code on ARM. <br>> <br>> Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.<br>> <br>> I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. 
Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.<br><br>Looks good.<br><br>> <br>> <br>> Best regards,<br>> --<br>> Michihiro,<br>> IBM Research - Tokyo<br>> <br>> From: "Doerr, Martin" <martin.doerr@sap.com><br>> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com><br>> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net><br>> Date: 2018/05/30 16:18<br>> Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64<br>> <br>> <br>> <br>> <br>> Hi Erik,<br>> <br>> the current implementation works on PPC because of "MP+sync+addr".<br>> So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using in the ~10 years that the PPC implementation has existed.<br>> <br>> PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC.<br>> <br>> But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models.<br>> ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. 
If it is not acceptable we may need to introduce explicit consume.<br>> <br>> Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha.<br>> <br>> Thanks and best regards,<br>> Martin<br>> <br>> <br>> -----Original Message-----<br>> From: Erik Österlund [</font></tt><tt><font size="2"><a href="mailto:erik.osterlund@oracle.com">mailto:erik.osterlund@oracle.com</a></font></tt><tt><font size="2">] <br>> Sent: Dienstag, 29. Mai 2018 14:01<br>> To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com><br>> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net<br>> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64<br>> <br>> Hi Martin and Michihiro,<br>> <br>> On 2018-05-29 12:30, Doerr, Martin wrote:<br>> > Hi Kim,<br>> ><br>> > I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS.<br>> > So it sounds like a request to fix the current implementation in addition to what his original intent was.<br>> <br>> I think we are just trying to nail down the correct fencing and just go <br>> for that. And yes, this is arguably a pre-existing problem, but in a <br>> race involving the very same accesses that we are changing the fencing <br>> for. So it is not completely unrelated I suppose.<br>> <br>> In particular, hotspot has code that assumes that if you on the writer <br>> side issue a full fence before publishing a pointer to newly initialized <br>> data, then the initializing stores and their side effects should be <br>> globally "visible" across the system before the pointer to it is <br>> published, and hence elide the need for acquire on the loading side, <br>> without relying on retained data dependencies on the loader side. 
I <br>> believe this code falls under that category. It is assumed that the <br>> leading fence of the CAS publishing the forwarding pointer makes the <br>> initializing stores globally observable before publishing a pointer to <br>> the initialized data, hence assuming that any loads able to observe the <br>> new pointer would not rely on acquire or data dependent loads to <br>> correctly read the initialized data.<br>> <br>> Unfortunately, this is not reliable in the IRIW case, as per the litmus <br>> test "MP+sync+ctrl" as described in "Understanding POWER <br>> multiprocessors" (</font></tt><tt><font size="2"><a href="https://dl.acm.org/citation.cfm?id=1993520">https://dl.acm.org/citation.cfm?id=1993520</a></font></tt><tt><font size="2">), as <br>> opposed to "MP+sync+addr" that gets away with it because of the data <br>> dependency (not IRIW). Similarly, an isync does the job too on the <br>> reader side as shown in MP+sync+ctrlisync. So while I believe the <br>> previous reasoning was that the leading sync of the CAS would elide the <br>> necessity for acquire on the reader side without relying on data <br>> dependent loads (implicit consume), I think that assumption was wrong in <br>> the first place and that we do indeed need explicit acquire (even with <br>> the previous conservative CAS fencing) in this context to not rely on <br>> implicit consume semantics generating the required data dependent loads <br>> on the reader side. In practice though, the leading sync of the CAS has <br>> been enough to generate the correct machine code. Now, with the leading <br>> sync removed, we are increasing the possible holes in the generated <br>> machine code due to this flawed reasoning. So it would be nice to do <br>> something more sound instead that does not make such assumptions.<br>> <br>> > Anyway, I agree that implicit consume is not good. 
And I think it would be good to treat both o->forwardee() the same way.<br>> > What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()?<br>> > The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.<br>> <br>> Sure, that sounds good to me.<br>> <br>> Thanks,<br>> /Erik<br>> <br>> > Thanks and best regards,<br>> > Martin<br>> ><br>> ><br>> > -----Original Message-----<br>> > From: Kim Barrett [</font></tt><tt><font size="2"><a href="mailto:kim.barrett@oracle.com">mailto:kim.barrett@oracle.com</a></font></tt><tt><font size="2">]<br>> > Sent: Dienstag, 29. Mai 2018 01:54<br>> > To: Michihiro Horie <HORIE@jp.ibm.com><br>> > Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com><br>> > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64<br>> ><br>> >> On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:<br>> >><br>> >> Hi Erik,<br>> >><br>> >> Thank you very much for your review.<br>> >><br>> >> I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.<br>> >><br>> >> New webrev uses memory_order_acq_rel: </font></tt><tt><font size="2"><a href="http://cr.openjdk.java.net/~mhorie/8154736/webrev.10">http://cr.openjdk.java.net/~mhorie/8154736/webrev.10</a></font></tt><tt><font size="2"><br>> > This is missing the acquire barrier on the else branch for the initial test, so fails to meet<br>> > the previously described minimal requirements for even possibly being sufficient. 
Any<br>> > analysis of weakening the CAS barriers must consider that test and successor code.<br>> ><br>> > In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be<br>> > considered; the forwardee is being returned to caller(s) that will presumably do something<br>> > with that object.<br>> ><br>> > Since the whole point of this discussion is performance, any proposed change should come<br>> > with performance information.<br>> ><br><br><br><br></font></tt><br><br><BR>
</body></html>