RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64

Fri Jun 1 15:37:04 UTC 2018

>Hi Michihiro,
>
>Looks good to me.

Thanks a lot, Erik!

Best regards,
--
Michihiro,
IBM Research - Tokyo

From:	"Erik Österlund" <erik.osterlund at oracle.com>
To:	Michihiro Horie <HORIE at jp.ibm.com>, "Doerr, Martin"
            <martin.doerr at sap.com>
Cc:	"Andrew Haley (aph at redhat.com)" <aph at redhat.com>,
            "david.holmes at oracle.com" <david.holmes at oracle.com>,
            "hotspot-gc-dev at openjdk.java.net"
            <hotspot-gc-dev at openjdk.java.net>, Kim Barrett
            <kim.barrett at oracle.com>, "ppc-aix-port-dev at openjdk.java.net"
            <ppc-aix-port-dev at openjdk.java.net>
Date:	2018/06/02 00:15
Subject:	Re: RFR(M): 8154736: enhancement of cmpxchg and
            copy_to_survivor for ppc64

Hi Michihiro,

Looks good to me.

Thanks,
/Erik

On 2018-06-01 17:08, Michihiro Horie wrote:

      Hi Kim, Erik, and Martin,

      Thank you very much for reminding me that an acquire barrier in the
      else-statement for “!test_mark->is_marked()” is necessary under the
      criteria of not relying on the consume.

      I uploaded a new webrev :
      http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/
      This change uses forwardee_acquire(), which would generate better
      code on ARM.

      Necessary barriers are located in all the paths in
      copy_to_survivor_space, and the returned new_obj can be safely
      handled in the caller sites.

      I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved
      by 5%. Since my previous measurement with implicit consume showed 6%
      improvement, adding acquire barriers degraded the performance a
      little, but 5% is still good enough.

      Best regards,
      --
      Michihiro,
      IBM Research - Tokyo

      Inactive
          hide details for "Doerr, Martin" ---2018/05/30
      16:18:09---Hi Erik, the current implementation works on PPC
      because of "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the
      current implementation works on PPC because of "MP+sync+addr".

      From: "Doerr, Martin" <martin.doerr at sap.com>
      To: "Erik Österlund" <erik.osterlund at oracle.com>, Kim Barrett
      <kim.barrett at oracle.com>, Michihiro Horie <HORIE at jp.ibm.com>, "Andrew
      Haley (aph at redhat.com)" <aph at redhat.com>
      Cc: "david.holmes at oracle.com" <david.holmes at oracle.com>,
      "hotspot-gc-dev at openjdk.java.net" <hotspot-gc-dev at openjdk.java.net>,
      "ppc-aix-port-dev at openjdk.java.net"
      <ppc-aix-port-dev at openjdk.java.net>
      Date: 2018/05/30 16:18
      Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and
      copy_to_survivor for ppc64

      Hi Erik,

      the current implementation works on PPC because of "MP+sync+addr".
      So we already rely on ordering of "load volatile field" + "implicit
      consume" on the reader's side. We have never seen any issues related
      to this with the compilers we have been using during the ~10 years
      the PPC implementation exists.

      PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal
      doesn't make it unreliable for PPC.

      But I'm ok with evaluating acquire barriers although they are not
      required by the PPC/ARM memory models.
      ARM/aarch64 will also be affected when the o->forwardee uses
      load_acquire. So somebody should check the impact. If it is not
      acceptable we may need to introduce explicit consume.

      Implicit consume is also bad in shared code because somebody may want
      to run it on DEC Alpha.

      Thanks and best regards,
      Martin

      -----Original Message-----
      From: Erik Österlund [mailto:erik.osterlund at oracle.com]
      Sent: Dienstag, 29. Mai 2018 14:01
      To: Doerr, Martin <martin.doerr at sap.com>; Kim Barrett
      <kim.barrett at oracle.com>; Michihiro Horie <HORIE at jp.ibm.com>
      Cc: david.holmes at oracle.com; Gustavo Bueno Romero
      <gromero at br.ibm.com>; hotspot-dev at openjdk.java.net;
      hotspot-gc-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net
      Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and
      copy_to_survivor for ppc64

      Hi Martin and Michihiro,

      On 2018-05-29 12:30, Doerr, Martin wrote:
      > Hi Kim,
      >
      > I'm trying to understand how this is related to Michihiro's change.
      The else path of the initial test is not affected by it AFAICS.
      > So it sounds like a request to fix the current implementation in
      addition to what his original intend was.

      I think we are just trying to nail down the correct fencing and just
      go
      for that. And yes, this is arguably a pre-existing problem, but in a
      race involving the very same accesses that we are changing the
      fencing
      for. So it is not completely unrelated I suppose.

      In particular, hotspot has code that assumes that if you on the
      writer
      side issue a full fence before publishing a pointer to newly
      initialized
      data, then the initializing stores and their side effects should be
      globally "visible" across the system before the pointer to it is
      published, and hence elide the need for acquire on the loading side,
      without relying on retained data dependencies on the loader side. I
      believe this code falls under that category. It is assumed that the
      leading fence of the CAS publishing the forwarding pointer makes the
      initializing stores globally observable before publishing a pointer
      to
      the initialized data, hence assuming that any loads able to observe
      the
      new pointer would not rely on acquire or data dependent loads to
      correctly read the initialized data.

      Unfortunately, this is not reliable in the IRIW case, as per the
      litmus
      test "MP+sync+ctrl" as described in "Understanding POWER
      multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as
      opposed to "MP+sync+addr" that gets away with it because of the data
      dependency (not IRIW). Similarly, an isync does the job too on the
      reader side as shown in MP+sync+ctrlisync. So while what I believe
      was
      the previous reasoning that the leading sync of the CAS would elide
      the
      necessity for acquire on the reader side without relying on data
      dependent loads (implicit consume), I think that assumption was wrong
      in
      the first place and that we do indeed need explicit acquire (even
      with
      the precious conservative CAS fencing) in this context to not rely on

      implicit consume semantics generating the required data dependent
      loads
      on the reader side. In practice though, the leading sync of the CAS
      has
      been enough to generate the correct machine code. Now, with the
      leading
      sync removed, we are increasing the possible holes in the generated
      machine code due to this flawed reasoning. So it would be nice to do
      something more sound instead that does not make such assumptions.

      > Anyway, I agree with that implicit consume is not good. And I think
      it would be good to treat both o->forwardee() the same way.
      > What about keeping memory_order_release for the CAS and using
      acquire for both o->forwardee()?
      > The case in which the CAS succeeds is safe because the current
      thread has created new_obj so it doesn't need memory barriers to
      access it.

      Sure, that sounds good to me.

      Thanks,
      /Erik

      > Thanks and best regards,
      > Martin
      >
      >
      > -----Original Message-----
      > From: Kim Barrett [mailto:kim.barrett at oracle.com]
      > Sent: Dienstag, 29. Mai 2018 01:54
      > To: Michihiro Horie <HORIE at jp.ibm.com>
      > Cc: Erik Osterlund <erik.osterlund at oracle.com>;
      david.holmes at oracle.com; Gustavo Bueno Romero <gromero at br.ibm.com>;
      hotspot-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net;
      ppc-aix-port-dev at openjdk.java.net; Doerr, Martin
      <martin.doerr at sap.com>
      > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and
      copy_to_survivor for ppc64
      >
      >> On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE at jp.ibm.com>
      wrote:
      >>
      >> Hi Erik,
      >>
      >> Thank you very much for your review.
      >>
      >> I understood that implicit consume should not be used in the
      shared code. Also, I believe performance degradation would be
      negligible even if we use acquire.
      >>
      >> New webrev uses memory_order_acq_rel:
      http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
      > This is missing the acquire barrier on the else branch for the
      initial test, so fails to meet
      > the previously described minimal requirements for even possibly
      being sufficient.  Any
      > analysis of weakening the CAS barriers must consider that test and
      successor code.
      >
      > In the analysis, it’s not just the lexically nearby debugging /
      logging code that needs to be
      > considered; the forwardee is being returned to caller(s) that will
      presumably do something
      > with that object.
      >
      > Since the whole point of this discussion is performance, any
      proposed change should come
      > with performance information.
      >

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20180602/6e7e4414/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20180602/6e7e4414/graycol.gif>