<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 15 (filtered medium)">


<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}


o\:* {behavior:url(#default#VML);}


w\:* {behavior:url(#default#VML);}


.shape {behavior:url(#default#VML);}


</style><![endif]--><style><!--


/* Font Definitions */


@font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0cm;


        margin-bottom:.0001pt;


        font-size:11.0pt;


        font-family:"Calibri",sans-serif;}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:blue;


        text-decoration:underline;}


a:visited, span.MsoHyperlinkFollowed


        {mso-style-priority:99;


        color:purple;


        text-decoration:underline;}


tt


        {mso-style-priority:99;


        font-family:"Courier New";}


p.msonormal0, li.msonormal0, div.msonormal0


        {mso-style-name:msonormal;


        mso-margin-top-alt:auto;


        margin-right:0cm;


        mso-margin-bottom-alt:auto;


        margin-left:0cm;


        font-size:11.0pt;


        font-family:"Calibri",sans-serif;}


span.EmailStyle20


        {mso-style-type:personal-reply;


        font-family:"Calibri",sans-serif;


        color:windowtext;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-size:10.0pt;}


@page WordSection1


        {size:612.0pt 792.0pt;


        margin:70.85pt 70.85pt 2.0cm 70.85pt;}


div.WordSection1


        {page:WordSection1;}


--></style><!--[if gte mso 9]><xml>


<o:shapedefaults v:ext="edit" spidmax="1026" />


</xml><![endif]--><!--[if gte mso 9]><xml>


<o:shapelayout v:ext="edit">


<o:idmap v:ext="edit" data="1" />


</o:shapelayout></xml><![endif]-->


</head>


<body lang="DE" link="blue" vlink="purple">


<div class="WordSection1">


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Hi everybody,<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">thank you very much for discussing this issue and helping to fix the PPC64 performance bottleneck.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Thanks, Kim and Erik, for explaining the trick by which the current code makes use of the acquire barrier of the CAS.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Thanks, Michihiro, for explaining your current proposal, which relies on consume. I’m convinced that it works on PPC and ARM (and, of course, on platforms with a strong memory model).<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">If I understand correctly, the acquire is wanted for the compiler’s benefit, and the hardware barrier itself is not needed.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">The current implementation uses our handwritten inline asm code, so it makes no difference to the compiler whether we specify acquire or not.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">However, if we want to use C++11 atomics instead of our inline asm code in the future, I think memory_order_acq_rel will be recommended to avoid compiler problems.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Was this the reason for the acquire or did I miss anything?<o:p></o:p></span></p>
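<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">To make the C++11 variant concrete, here is a minimal sketch of a forwarding CAS using std::atomic. It is only an illustration under simplifying assumptions: <tt>Obj</tt>, its <tt>forwardee</tt> slot and <tt>install_forwardee</tt> are hypothetical names, and the forwardee is kept in a separate pointer rather than encoded in the mark word as in HotSpot. It uses memory_order_acq_rel on success and memory_order_acquire on failure.<o:p></o:p></span></p>

<pre>
```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified stand-in for an object with a forwarding slot.
// Real HotSpot encodes the forwardee in the mark word; this sketch does not.
struct Obj {
  int payload;
  std::atomic<Obj*> forwardee;
  explicit Obj(int p) : payload(p), forwardee(nullptr) {}
};

// Try to install `copy` as the forwardee of `o`.
// Returns the object that ends up being the forwardee: our copy if we won
// the race, otherwise the copy installed by the winning thread.
Obj* install_forwardee(Obj* o, Obj* copy) {
  Obj* expected = nullptr;
  // Success: acq_rel publishes the contents of `copy` (release part).
  // Failure: acquire, so later loads from the winner's copy are ordered
  // after the CAS observed the winner's pointer.
  if (o->forwardee.compare_exchange_strong(expected, copy,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
    return copy;
  }
  return expected;  // on failure, `expected` holds the value the CAS observed
}
```
</pre>

<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">In this sketch, a thread that loses the race gets the winner’s copy back and, because of the acquire on the failure path, can read its fields without an additional barrier.<o:p></o:p></span></p>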


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">From a performance point of view, I think we can live with an unnecessary acquire barrier. It’s much cheaper than a full fence. So if this is the only remaining issue, I think we could just add it.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">By the way, there are places in shared code where we rely on consume. I found one some time ago and added a comment (compiledMethod.hpp):<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">“Note: _exception_cache may be read concurrently. We rely on memory_order_consume here.”<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Seems to work correctly with volatile fields.<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Best regards,<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US">Martin<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>


<div>


<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">


<p class="MsoNormal"><b><span lang="EN-US">From:</span></b><span lang="EN-US"> ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net]


<b>On Behalf Of </b>Erik Osterlund<br>


<b>Sent:</b> Monday, 28 May 2018 08:48<br>


<b>To:</b> Michihiro Horie &lt;HORIE@jp.ibm.com&gt;<br>


<b>Cc:</b> hotspot-dev@openjdk.java.net; Kim Barrett &lt;kim.barrett@oracle.com&gt;; Gustavo Bueno Romero &lt;gromero@br.ibm.com&gt;; ppc-aix-port-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; david.holmes@oracle.com<br>


<b>Subject:</b> Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64<o:p></o:p></span></p>


</div>


</div>


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<p class="MsoNormal">Hi Michihiro,<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">In your analysis, you state that the failing CAS path today already relies on implicit consume ordering: reading forwardee() after the failed CAS lacks an acquire, and hence accesses into the newly reloaded forwardee would rely on (implicit) data dependencies on the reloaded forwardee.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">That part of the analysis seems wrong to me. Today even a failed CAS has acquire semantics (and stronger), and the reloaded forwardee always has the same value as was observed in the failed CAS (in this context). Data dependencies on the reloaded forwardee are therefore neither needed nor relied upon.<o:p></o:p></p>


</div>
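<div>

<p class="MsoNormal">As a rough sketch of the acquire-based alternative (not the actual HotSpot code; <tt>Node</tt>, <tt>publish</tt> and <tt>read_forwardee_field</tt> are hypothetical names), a release store paired with an acquire load orders the reads of the forwardee's fields without relying on any data dependency:<o:p></o:p></p>

<pre>
```cpp
#include <atomic>
#include <cassert>

// Hypothetical object with one data field and a forwarding pointer.
struct Node {
  int field;
  std::atomic<Node*> fwd;
  Node() : field(0), fwd(nullptr) {}
};

// Writer side: initialize the copy, then publish it with release semantics
// so that the store to `field` becomes visible no later than the pointer.
void publish(Node* o, Node* copy, int value) {
  copy->field = value;
  o->fwd.store(copy, std::memory_order_release);
}

// Reader side: the acquire load orders the later read of `field` after the
// load of the pointer, independent of any address dependency.
// Returns -1 if `o` has not been forwarded yet.
int read_forwardee_field(Node* o) {
  Node* f = o->fwd.load(std::memory_order_acquire);
  return f != nullptr ? f->field : -1;
}
```
</pre>

</div>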


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">We do not use implicit consume in the shared C++ code. If you find any instances of that, it is a bug and should be purged with fire. Even explicit consume is currently strongly discouraged. Implicit consume is unreliable, especially in


 a project with many platforms.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">If you insist on using more fragile semantics that are known to be unreliable, I would like to at least know what measurable performance difference you observe between the semantics Kim proposed and the elided-acquire variant you insist on. My gut feeling tells me that a double sync is very intrusive, but an isync scheduled almost immediately after an lwsync should be significantly less intrusive.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">Thanks,<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">/Erik<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal" style="margin-bottom:12.0pt"><br>


On 28 May 2018, at 03:28, Michihiro Horie <<a href="mailto:HORIE@jp.ibm.com">HORIE@jp.ibm.com</a>> wrote:<o:p></o:p></p>


</div>


<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">


<div>


<p><span style="font-size:10.0pt">Hi Kim,</span><br>


<br>


<span style="font-size:10.0pt">>I've discussed this with others on the GC team; we think the minimal</span><br>


<span style="font-size:10.0pt">>required barriers are CAS with memory_order_acq_rel, plus an acquire</span><br>


<span style="font-size:10.0pt">>barrier on the else branch of</span><br>


<span style="font-size:10.0pt">></span><br>


<span style="font-size:10.0pt">> 122 if (!test_mark->is_marked()) {</span><br>


<span style="font-size:10.0pt">>...</span><br>


<span style="font-size:10.0pt">> 261 } else {</span><br>


<span style="font-size:10.0pt">> 262 assert(o->is_forwarded(), "Sanity");</span><br>


<span style="font-size:10.0pt">> 263 new_obj = o->forwardee();</span><br>


<span style="font-size:10.0pt">> 264 }</span><br>


<span style="font-size:10.0pt">></span><br>


<span style="font-size:10.0pt">>We've not done enough analysis to show this is sufficient, but we</span><br>


<span style="font-size:10.0pt">>think anything weaker is not sufficient for shared code.</span><br>


<br>


<span style="font-size:10.0pt">Thank you for the discussions on your side with the GC team.</span><br>


<span style="font-size:10.0pt">I have summarized below why my change works. I hope we are on the same page on this.</span><br>


<br>


<br>


<span style="font-size:10.0pt">1. Current implementation</span><br>


<br>


<span style="font-size:10.0pt">PSPromotionManager::copy_to_survivor_space is used to move live</span><br>


<span style="font-size:10.0pt">objects to a different location. It uses a forwarding technique and</span><br>


<span style="font-size:10.0pt">allows multiple threads to compete for performing the copy step.</span><br>


<br>


<span style="font-size:10.0pt">The first thread succeeds in installing its copy in the old object as</span><br>


<span style="font-size:10.0pt">forwardee. Other threads may need to discard their copy and use the</span><br>


<span style="font-size:10.0pt">one generated by the first thread which has won the race.</span><br>


<br>


<span style="font-size:10.0pt">Written program order:</span><br>


<span style="font-size:10.0pt">(1) create new_obj as copy of obj</span><br>


<span style="font-size:10.0pt">(2) full fence</span><br>


<span style="font-size:10.0pt">(3) CAS to set the forwardee with new_obj</span><br>


<span style="font-size:10.0pt">(4) full fence</span><br>


<span style="font-size:10.0pt">(5) access to the new_obj's field if CAS succeeds</span><br>


<span style="font-size:10.0pt">(6) access to the forwardee with "o->forwardee()" if CAS fails</span><br>


<span style="font-size:10.0pt">(7) access to the forwardee's field if debugging is on</span><br>


<br>


<span style="font-size:10.0pt">When thread0 succeeds in the CAS at (3), the new_obj copied by thread0</span><br>


<span style="font-size:10.0pt">must be accessible from thread1 at (6). (2) guarantees the order of</span><br>


<span style="font-size:10.0pt">(1) and (3), although it is stronger than needed for the purpose of</span><br>


<span style="font-size:10.0pt">ensuring a consistent view of copied new_obj from thread1.</span><br>


<br>


<span style="font-size:10.0pt">(5), (6), and (7) must be executed after (3). (4) appears to</span><br>


<span style="font-size:10.0pt">guarantee this order, although it is redundant.</span><br>


<br>


<span style="font-size:10.0pt">The order of (6) and (7) is guaranteed by consume.</span><br>


<span style="font-size:10.0pt">(5) and (6) are on different control paths.</span><br>


<span style="font-size:10.0pt">(5) and (7): Thread0 owns new_obj when CAS succeeded and can access it</span><br>


<span style="font-size:10.0pt">without barrier.</span><br>


<br>


<br>


<span style="font-size:10.0pt">2. Proposed change</span><br>


<br>


<span style="font-size:10.0pt">Written program order:</span><br>


<span style="font-size:10.0pt">(1) create new_obj as copy of obj</span><br>


<span style="font-size:10.0pt">(2) release fence</span><br>


<span style="font-size:10.0pt">(3) CAS to set the forwardee with new_obj</span><br>


<span style="font-size:10.0pt">(4) no fence</span><br>


<span style="font-size:10.0pt">(5) access to the new_obj's field if CAS succeeds</span><br>


<span style="font-size:10.0pt">(6) access to the forwardee with "o->forwardee()" if CAS fails</span><br>


<span style="font-size:10.0pt">(7) access to the forwardee's field if debugging is on</span><br>


<br>


<span style="font-size:10.0pt">Release fence at (2) is sufficient to make the copied new_obj</span><br>


<span style="font-size:10.0pt">accessible from a thread that fails in CAS.</span><br>


<br>


<span style="font-size:10.0pt">No fence at (4) is acceptable because it is redundant.</span><br>


<br>


<span style="font-size:10.0pt">The order of (5), (6), and (7) is the same as the current</span><br>


<span style="font-size:10.0pt">implementation. It is not affected by the proposed change.</span><br>


<br>


<br>


<span style="font-size:10.0pt">3. Reason why this is sufficient</span><br>


<br>


<span style="font-size:10.0pt">Memory coherence guarantees that all the threads share a consistent</span><br>


<span style="font-size:10.0pt">view on the access to the same memory location, which is "_mark" in</span><br>


<span style="font-size:10.0pt">the target code. Thread0 writes the "_mark" when it succeeds in</span><br>


<span style="font-size:10.0pt">CAS at (3) and thread1 reads the "_mark" when it fails in CAS at (3).</span><br>


<span style="font-size:10.0pt">Thread1 also reads the "_mark" by invoking "o->forwardee()" at (6).</span><br>


<span style="font-size:10.0pt">(See CoRR1 in Section 8 of</span><br>


<span style="font-size:10.0pt"><a href="https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf">https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf</a>)</span><br>


<br>


<span style="font-size:10.0pt">Also, compilers do not speculatively load "o->forwardee()" at (6)</span><br>


<span style="font-size:10.0pt">before the CAS at (3). This is ensured by the integrated compiler</span><br>


<span style="font-size:10.0pt">barriers (clobber "memory" in the volatile inline asm code). And it is</span><br>


<span style="font-size:10.0pt">also prevented because "_mark" is declared volatile.</span><br>


<br>


<br>


<br>


<span style="font-size:10.0pt">Best regards,</span><br>


<span style="font-size:10.0pt">--</span><br>


<span style="font-size:10.0pt">Michihiro,</span><br>


<span style="font-size:10.0pt">IBM Research - Tokyo</span><br>


<br>


<span style="font-size:10.0pt;color:#424282">Kim Barrett ---2018/05/26 01:01:40---> On May 22, 2018, at 12:16 PM, Doerr, Martin <<a href="mailto:martin.doerr@sap.com">martin.doerr@sap.com</a>> wrote: ></span><br>


<br>


<span style="font-size:10.0pt;color:#5F5F5F">From: </span><span style="font-size:10.0pt">Kim Barrett <<a href="mailto:kim.barrett@oracle.com">kim.barrett@oracle.com</a>></span><br>


<span style="font-size:10.0pt;color:#5F5F5F">To: </span><span style="font-size:10.0pt">"Doerr, Martin" <<a href="mailto:martin.doerr@sap.com">martin.doerr@sap.com</a>></span><br>


<span style="font-size:10.0pt;color:#5F5F5F">Cc: </span><span style="font-size:10.0pt">Michihiro Horie <<a href="mailto:HORIE@jp.ibm.com">HORIE@jp.ibm.com</a>>, "<a href="mailto:hotspot-dev@openjdk.java.net">hotspot-dev@openjdk.java.net</a>" <<a href="mailto:hotspot-dev@openjdk.java.net">hotspot-dev@openjdk.java.net</a>>,


 "<a href="mailto:hotspot-gc-dev@openjdk.java.net">hotspot-gc-dev@openjdk.java.net</a>" <<a href="mailto:hotspot-gc-dev@openjdk.java.net">hotspot-gc-dev@openjdk.java.net</a>>, Gustavo Bueno Romero <<a href="mailto:gromero@br.ibm.com">gromero@br.ibm.com</a>>,


 "<a href="mailto:ppc-aix-port-dev@openjdk.java.net">ppc-aix-port-dev@openjdk.java.net</a>" <<a href="mailto:ppc-aix-port-dev@openjdk.java.net">ppc-aix-port-dev@openjdk.java.net</a>>, "<a href="mailto:david.holmes@oracle.com">david.holmes@oracle.com</a>" <<a href="mailto:david.holmes@oracle.com">david.holmes@oracle.com</a>></span><br>


<span style="font-size:10.0pt;color:#5F5F5F">Date: </span><span style="font-size:10.0pt">2018/05/26 01:01</span><br>


<span style="font-size:10.0pt;color:#5F5F5F">Subject: </span><span style="font-size:10.0pt">Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64</span><o:p></o:p></p>


<div class="MsoNormal">


<hr size="2" width="100%" noshade="" style="color:#8091A5" align="left">


</div>


<p class="MsoNormal" style="margin-bottom:12.0pt"><br>


<br>


<br>


<tt><span style="font-size:10.0pt">> On May 22, 2018, at 12:16 PM, Doerr, Martin <<a href="mailto:martin.doerr@sap.com">martin.doerr@sap.com</a>> wrote:</span></tt><span style="font-size:10.0pt;font-family:"Courier New""><br>


<tt>> </tt><br>


<tt>> Hi Kim,</tt><br>


<tt>> </tt><br>


<tt>> I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.</tt><br>


<tt>> </tt><br>


<tt>> If implicit consume has been rejected the current code is wrong:</tt><br>


<tt>> "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.</tt><br>


<tt>> </tt><br>


<tt>> But this issue is not related to what Michihiro wants to change AFAICS.</tt><br>


<br>


<tt>The current full-fence CAS guarantees the stores into the new</tt><br>


<tt>forwardee installed by the CAS happen before loads from that object</tt><br>


<tt>after the CAS.  Algorithmically, o->forwardee() is guaranteed to be</tt><br>


<tt>the same object as was returned by the CAS.  Hence, loads from the</tt><br>


<tt>forwardee are being ordered by the fenced CAS.</tt><br>


<br>


<tt>I've discussed this with others on the GC team; we think the minimal</tt><br>


<tt>required barriers are CAS with memory_order_acq_rel, plus an acquire</tt><br>


<tt>barrier on the else branch of</tt><br>


<br>


<tt>122   if (!test_mark->is_marked()) {</tt><br>


<tt>...</tt><br>


<tt>261   } else {</tt><br>


<tt>262     assert(o->is_forwarded(), "Sanity");</tt><br>


<tt>263     new_obj = o->forwardee();</tt><br>


<tt>264   }</tt><br>


<br>


<tt>We've not done enough analysis to show this is sufficient, but we</tt><br>


<tt>think anything weaker is not sufficient for shared code.</tt><br>


<br>


<br>


<tt>> </tt><br>


<tt>> Best regards,</tt><br>


<tt>> Martin</tt><br>


<tt>> </tt><br>


<tt>> </tt><br>


<tt>> -----Original Message-----</tt><br>


<tt>> From: ppc-aix-port-dev [<a href="mailto:ppc-aix-port-dev-bounces@openjdk.java.net">mailto:ppc-aix-port-dev-bounces@openjdk.java.net</a>] On Behalf Of Kim Barrett</tt><br>


<tt>> Sent: Monday, 21 May 2018 06:00</tt><br>


<tt>> To: Michihiro Horie <<a href="mailto:HORIE@jp.ibm.com">HORIE@jp.ibm.com</a>></tt><br>


<tt>> Cc: <a href="mailto:hotspot-dev@openjdk.java.net">hotspot-dev@openjdk.java.net</a>;


<a href="mailto:hotspot-gc-dev@openjdk.java.net">hotspot-gc-dev@openjdk.java.net</a>; Gustavo Bueno Romero <<a href="mailto:gromero@br.ibm.com">gromero@br.ibm.com</a>>;


<a href="mailto:ppc-aix-port-dev@openjdk.java.net">ppc-aix-port-dev@openjdk.java.net</a>;


<a href="mailto:david.holmes@oracle.com">david.holmes@oracle.com</a></tt><br>


<tt>> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64</tt><br>


<tt>> </tt><br>


<tt>>> On May 18, 2018, at 5:12 PM, Michihiro Horie <<a href="mailto:HORIE@jp.ibm.com">HORIE@jp.ibm.com</a>> wrote:</tt><br>


<tt>>> </tt><br>


<tt>>> Dear all,</tt><br>


<tt>>> </tt><br>


<tt>>> I updated the webrev: <a href="http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/">


http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/</a></tt><br>


<tt>>> </tt><br>


<tt>>> With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory


 location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other


 usages is not changed.)</tt><br>


<tt>>> </tt><br>


<tt>>> Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code


 change can be safe if the code meets this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).</tt><br>


<tt>> </tt><br>


<tt>> Relying on implicit consume has been discussed and rejected, in</tt><br>


<tt>> the earlier thread on this topic and I think elsewhere too.</tt><br>


<tt>> </tt><br>


<tt>> <a href="http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021538.html">


http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021538.html</a></tt><br>


<br>


<br>


<br>


<br>


</span><o:p></o:p></p>


</div>


</blockquote>


</div>


</body>


</html>