RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Kim Barrett
kim.barrett at oracle.com
Mon Jun 4 20:08:40 UTC 2018
> On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE at jp.ibm.com> wrote:
>
> Hi Kim, Erik, and Martin,
>
> Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
>
> I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/
> This change uses forwardee_acquire(), which would generate better code on ARM.
>
> Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
>
> I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Looks good.
>
>
> Best regards,
> --
> Michihiro,
> IBM Research - Tokyo
>
> "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr".
>
> From: "Doerr, Martin" <martin.doerr at sap.com>
> To: "Erik Österlund" <erik.osterlund at oracle.com>, Kim Barrett <kim.barrett at oracle.com>, Michihiro Horie <HORIE at jp.ibm.com>, "Andrew Haley (aph at redhat.com)" <aph at redhat.com>
> Cc: "david.holmes at oracle.com" <david.holmes at oracle.com>, "hotspot-gc-dev at openjdk.java.net" <hotspot-gc-dev at openjdk.java.net>, "ppc-aix-port-dev at openjdk.java.net" <ppc-aix-port-dev at openjdk.java.net>
> Date: 2018/05/30 16:18
> Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
>
>
>
>
> Hi Erik,
>
> the current implementation works on PPC because of "MP+sync+addr".
> So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation has existed.
>
> PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC.
>
> But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models.
> ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume.
>
> Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha.
>
> Thanks and best regards,
> Martin
>
>
> -----Original Message-----
> From: Erik Österlund [mailto:erik.osterlund at oracle.com]
> Sent: Dienstag, 29. Mai 2018 14:01
> To: Doerr, Martin <martin.doerr at sap.com>; Kim Barrett <kim.barrett at oracle.com>; Michihiro Horie <HORIE at jp.ibm.com>
> Cc: david.holmes at oracle.com; Gustavo Bueno Romero <gromero at br.ibm.com>; hotspot-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net
> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
>
> Hi Martin and Michihiro,
>
> On 2018-05-29 12:30, Doerr, Martin wrote:
> > Hi Kim,
> >
> > I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS.
> > So it sounds like a request to fix the current implementation in addition to what his original intent was.
>
> I think we are just trying to nail down the correct fencing and just go
> for that. And yes, this is arguably a pre-existing problem, but in a
> race involving the very same accesses that we are changing the fencing
> for. So it is not completely unrelated I suppose.
>
> In particular, hotspot has code that assumes that if you on the writer
> side issue a full fence before publishing a pointer to newly initialized
> data, then the initializing stores and their side effects should be
> globally "visible" across the system before the pointer to it is
> published, and hence elide the need for acquire on the loading side,
> without relying on retained data dependencies on the loader side. I
> believe this code falls under that category. It is assumed that the
> leading fence of the CAS publishing the forwarding pointer makes the
> initializing stores globally observable before publishing a pointer to
> the initialized data, hence assuming that any loads able to observe the
> new pointer would not rely on acquire or data dependent loads to
> correctly read the initialized data.
>
> Unfortunately, this is not reliable in the IRIW case, as per the litmus
> test "MP+sync+ctrl" as described in "Understanding POWER
> multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as
> opposed to "MP+sync+addr" that gets away with it because of the data
> dependency (not IRIW). Similarly, an isync does the job too on the
> reader side as shown in MP+sync+ctrlisync. So while what I believe was
> the previous reasoning that the leading sync of the CAS would elide the
> necessity for acquire on the reader side without relying on data
> dependent loads (implicit consume), I think that assumption was wrong in
> the first place and that we do indeed need explicit acquire (even with
> the previous conservative CAS fencing) in this context to not rely on
> implicit consume semantics generating the required data dependent loads
> on the reader side. In practice though, the leading sync of the CAS has
> been enough to generate the correct machine code. Now, with the leading
> sync removed, we are increasing the possible holes in the generated
> machine code due to this flawed reasoning. So it would be nice to do
> something more sound instead that does not make such assumptions.
>
> > Anyway, I agree that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way.
> > What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()?
> > The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
>
> Sure, that sounds good to me.
>
> Thanks,
> /Erik
>
> > Thanks and best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Kim Barrett [mailto:kim.barrett at oracle.com]
> > Sent: Dienstag, 29. Mai 2018 01:54
> > To: Michihiro Horie <HORIE at jp.ibm.com>
> > Cc: Erik Osterlund <erik.osterlund at oracle.com>; david.holmes at oracle.com; Gustavo Bueno Romero <gromero at br.ibm.com>; hotspot-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net; Doerr, Martin <martin.doerr at sap.com>
> > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
> >
> >> On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE at jp.ibm.com> wrote:
> >>
> >> Hi Erik,
> >>
> >> Thank you very much for your review.
> >>
> >> I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
> >>
> >> New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
> > This is missing the acquire barrier on the else branch for the initial test, so fails to meet
> > the previously described minimal requirements for even possibly being sufficient. Any
> > analysis of weakening the CAS barriers must consider that test and successor code.
> >
> > In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be
> > considered; the forwardee is being returned to caller(s) that will presumably do something
> > with that object.
> >
> > Since the whole point of this discussion is performance, any proposed change should come
> > with performance information.
> >
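For readers following the thread, the fencing pattern being converged on (a release CAS to publish the forwarding pointer, plus an acquire load of the forwardee on every path that did not create the copy itself) can be sketched with standard C++ atomics. This is only an illustration under stated assumptions, not the HotSpot code or the webrev: Obj, forward_to, and the payload field are hypothetical names, and std::atomic stands in for HotSpot's own Atomic/OrderAccess layer. Only the name forwardee_acquire is borrowed from the webrev description.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a heap object; not HotSpot's oopDesc.
struct Obj {
    int payload = 0;                       // plain field, initialized before publication
    std::atomic<Obj*> forwardee{nullptr};  // stands in for the forwarding pointer
};

// Reader side: the acquire load pairs with the release CAS in forward_to(),
// so the initializing stores to the copy are visible without relying on
// an address dependency (implicit consume).
Obj* forwardee_acquire(Obj& old_obj) {
    return old_obj.forwardee.load(std::memory_order_acquire);
}

// Try to publish 'copy' as the forwardee of 'old_obj'.
// Returns whichever copy ended up installed.
Obj* forward_to(Obj& old_obj, Obj* copy) {
    assert(copy != nullptr);
    Obj* expected = nullptr;
    if (old_obj.forwardee.compare_exchange_strong(
            expected, copy,
            std::memory_order_release,    // success: publish our initialized copy
            std::memory_order_relaxed)) { // failure: no ordering yet, see below
        // We created 'copy' ourselves, so no barrier is needed to use it.
        return copy;
    }
    // Lost the race: re-read with acquire before touching the winner's copy.
    return forwardee_acquire(old_obj);
}
```

In this sketch a thread that wins the CAS returns its own copy, while a thread that loses (or that finds the object already forwarded) only dereferences a pointer obtained through forwardee_acquire(), which is the property the returned new_obj needs for the callers.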