RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64

Michihiro Horie HORIE at jp.ibm.com
Mon May 28 08:12:47 UTC 2018


Hi Erik,

Thank you very much for your review.

I understood that implicit consume should not be used in the shared code.
Also, I believe performance degradation would be negligible even if we use
acquire.

New webrev uses memory_order_acq_rel:
http://cr.openjdk.java.net/~mhorie/8154736/webrev.10


Best regards,
--
Michihiro,
IBM Research - Tokyo



From:	Erik Osterlund <erik.osterlund at oracle.com>
To:	Michihiro Horie <HORIE at jp.ibm.com>
Cc:	Kim Barrett <kim.barrett at oracle.com>,
            "hotspot-dev at openjdk.java.net" <hotspot-dev at openjdk.java.net>,
            "ppc-aix-port-dev at openjdk.java.net"
            <ppc-aix-port-dev at openjdk.java.net>, Gustavo Bueno Romero
            <gromero at br.ibm.com>, "david.holmes at oracle.com"
            <david.holmes at oracle.com>, "hotspot-gc-dev at openjdk.java.net"
            <hotspot-gc-dev at openjdk.java.net>
Date:	2018/05/28 15:48
Subject:	Re: RFR(M): 8154736: enhancement of cmpxchg and
            copy_to_survivor for ppc64



Hi Michihiro,

In your analysis, you state that the failing CAS path today already relies
on implicit consume ordering as reading forwardee() after the failed CAS is
missing acquire and hence accesses into the new reloaded forwardee would
rely on (implicit) data dependencies to  the reloaded forwardee.

That part of the analysis seems wrong to me. Since today even a failed CAS
has acquire semantics (and stronger), and the reloaded forwardee always has
the same value as was observed in the failed cas (in this context), all
data dependency requirements to the reloaded forwardee are therefore no
longer needed or relied upon.

We do not use implicit consume in the shared C++ code. If you find any
instances of that, it is a bug and should be purged with fire. Even
explicit consume is currently strongly discouraged. Implicit consume is
unreliable, especially in a project with many platforms.

If you insist on using more fragile semantics that are known to be
unreliable, I would like to at least know what measurable performance
difference you observe between the semantics Kim proposed, compared to the
elided acquire variant you insist on. My gut feeling tells me that double
sync is very intrusive, but an isync scheduled almost immediately after an
lwsync, should be significantly less intrusive.

Thanks,
/Erik

On 28 May 2018, at 03:28, Michihiro Horie <HORIE at jp.ibm.com> wrote:



      Hi Kim,

      >I've discussed this with others on the GC team; we think the minimal
      >required barriers are CAS with memory_order_acq_rel, plus an acquire
      >barrier on the else branch of
      >
      > 122 if (!test_mark->is_marked()) {
      >...
      > 261 } else {
      > 262 assert(o->is_forwarded(), "Sanity");
      > 263 new_obj = o->forwardee();
      > 264 }
      >
      >We've not done enough analysis to show this is sufficient, but we
      >think anything weaker is not sufficient for shared code.

      Thank you for the discussions on your side with the GC team.
      I summarized the point on why my change works as follows. Hope we are
      on the same page with this.


      1. Current implementation

      PSPromotionManager::copy_to_survivor_space is used to move live
      objects to a different location. It uses a forwarding technique and
      allows multiple threads to compete for performing the copy step.

      The first thread succeeds in installing its copy in the old object as
      forwardee. Other threads may need to discard their copy and use the
      one generated by the first thread which has won the race.

      Written program order:
      (1) create new_obj as copy of obj
      (2) full fence
      (3) CAS to set the forwardee with new_obj
      (4) full fence
      (5) access to the new_obj's field if CAS succeeds
      (6) access to the forwardee with "o->forwardee()" if CAS fails
      (7) access to the forwardee's field if debugging is on

      When thread0 succeeds in CAS at (3), the copied new_obj by thread0
      must be accessible from thread1 at (6). (2) guarantees the order of
      (1) and (3), although it is stronger than needed for the purpose of
      ensuring a consistent view of copied new_obj from thread1.

      (5), (6), and (7) must be executed after (3). Apparently, (4) looks
      guranteeing the order, although it is redundant.

      The order of (6) and (7) is guaranteed by consume.
      (5) and (6) are on different control paths.
      (5) and (7): Thread0 owns new_obj when CAS succeeded and can access
      it
      without barrier.


      2. Proposed change

      Written program order:
      (1) create new_obj as copy of obj
      (2) release fence
      (3) CAS to set the forwardee with new_obj
      (4) no fence
      (5) access to the new_obj's field if CAS succeeds
      (6) access to the forwardee with "o->forwardee()" if CAS fails
      (7) access to the forwardee's field if debugging is on

      Release fence at (2) is sufficient to make the copied new_obj
      accessible from a thread that fails in CAS.

      No fence at (4) is acceptable because it is redundant.

      The order of (5), (6), and (7) is the same as the current
      implementation. It is not affected by the proposed change.


      3. Reason why this is sufficient

      Memory coherence guarantees that all the threads share a consistent
      view on the access to the same memory location, which is "_mark" in
      the target code. Thread0 writes the "_mark" when it succeeds in
      CAS at (3) and thread1 reads the "_mark" when it failes in CAS at
      (3).
      Thread1 also reads the "_mark" by invoking "o->forwardee()" at (6).
      (See CoRR1 in Section 8 of
      https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf)

      Also, compilers do not speculatively load "o->forwardee()" at (6)
      before the CAS at (3). This is ensured by the integrated compiler
      barriers (clobber "memory" in the volatile inline asm code). And it
      is
      also prevented because "_mark" is declared volatile.



      Best regards,
      --
      Michihiro,
      IBM Research - Tokyo

      <graycol.gif>Kim Barrett ---2018/05/26 01:01:40---> On May 22, 2018,
      at 12:16 PM, Doerr, Martin <martin.doerr at sap.com> wrote: >

      From: Kim Barrett <kim.barrett at oracle.com>
      To: "Doerr, Martin" <martin.doerr at sap.com>
      Cc: Michihiro Horie <HORIE at jp.ibm.com>, "hotspot-dev at openjdk.java.net
      " <hotspot-dev at openjdk.java.net>, "hotspot-gc-dev at openjdk.java.net" <
      hotspot-gc-dev at openjdk.java.net>, Gustavo Bueno Romero <
      gromero at br.ibm.com>, "ppc-aix-port-dev at openjdk.java.net" <
      ppc-aix-port-dev at openjdk.java.net>, "david.holmes at oracle.com" <
      david.holmes at oracle.com>
      Date: 2018/05/26 01:01
      Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and
      copy_to_survivor for ppc64





      > On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr at sap.com>
      wrote:
      >
      > Hi Kim,
      >
      > I can't see how a new implicit consume is introduced by Michihiro's
      change. He just explained how the existing code works.
      >
      > If implicit consume has been rejected the current code is wrong:
      > "new_obj = o->forwardee();" would need some kind of barrier before
      using the new_obj.
      >
      > But this issue is not related to what Michihiro wants to change
      AFAICS.

      The current full-fence CAS guarantees the stores into the new
      forwardee installed by the CAS happen before loads from that object
      after the CAS.  Algorithmically, o->forwardee() is guaranteed to be
      the same object as was returned by the CAS.  Hence, loads from the
      forwardee are being ordered by the fenced CAS.

      I've discussed this with others on the GC team; we think the minimal
      required barriers are CAS with memory_order_acq_rel, plus an acquire
      barrier on the else branch of

      122   if (!test_mark->is_marked()) {
      ...
      261   } else {
      262     assert(o->is_forwarded(), "Sanity");
      263     new_obj = o->forwardee();
      264   }

      We've not done enough analysis to show this is sufficient, but we
      think anything weaker is not sufficient for shared code.


      >
      > Best regards,
      > Martin
      >
      >
      > -----Original Message-----
      > From: ppc-aix-port-dev [
      mailto:ppc-aix-port-dev-bounces at openjdk.java.net] On Behalf Of Kim
      Barrett
      > Sent: Montag, 21. Mai 2018 06:00
      > To: Michihiro Horie <HORIE at jp.ibm.com>
      > Cc: hotspot-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net;
      Gustavo Bueno Romero <gromero at br.ibm.com>;
      ppc-aix-port-dev at openjdk.java.net; david.holmes at oracle.com
      > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and
      copy_to_survivor for ppc64
      >
      >> On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE at jp.ibm.com>
      wrote:
      >>
      >> Dear all,
      >>
      >> I update the webrev:
      http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
      >>
      >> With the release barrier before the CAS, new_obj can be observed
      from other threads. If the CAS succeeds, the current thread can use
      new_obj without barriers. If the CAS fails, "o->forwardee()" is
      ordered with respect to CAS by accessing the same memory location
      "_mark", so no barriers needed. The order of (1) access to the
      forwardee and (2) access to forwardee's fields is preserved due to
      Release-Consume ordering on supported platforms. (The ordering
      between "new_obj = o->forwardee();" and logging or other usages is
      not changed.)
      >>
      >> Regarding the maintainability, the requirement is CAS
      (memory_order_release) as Release-Consume to be consistent with C+
      +11. This requirement is necessary when a weaker platform like DEC
      Alpha is to be supported. On currently supported platforms, code
      change can be safe if the code meats this requirement (and the order
      of (1) access to the forwardee and (2) access to forwardee's fields
      is the natural way of coding).
      >
      > Relying on implicit consume has been been discussed and rejected,
      in
      > the earlier thread on this topic and I think elsewhere too.
      >
      >
      http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021538.html








-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/attachments/20180528/141b07d0/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/attachments/20180528/141b07d0/graycol-0001.gif>


More information about the ppc-aix-port-dev mailing list