<font size=2 face="sans-serif">Hi David,</font><br><br><font size=2 face="sans-serif">Thank you for your comments and questions.</font><br><br><font size=2 face="sans-serif">> 1. Are the current cmpxchg semantics
exactly the same as </font><br><font size=2 face="sans-serif">> memory_order_seq_cst?</font><br><br><font size=2 face="sans-serif">This is a very good question.</font><br><br><font size=2 face="sans-serif">I guess cmpxchg needs a more conservative
memory-ordering constraint</font><br><font size=2 face="sans-serif">than C++11's memory_order_seq_cst, which is why it adds a sync after the compare-and-exchange
operation. </font><br><br><font size=2 face="sans-serif">Could someone comment or share thoughts on this?</font><br><br><font size=2 face="sans-serif">memory_order_seq_cst is defined as </font><br><font size=2 face="sans-serif"> "Any operation with
this memory order is both an acquire operation and </font><br><font size=2 face="sans-serif"> a release operation,
plus a single total order exists in which all threads</font><br><font size=2 face="sans-serif"> observe all modifications
(see below) in the same order."</font><br><font size=2 face="sans-serif">(</font><a href=http://en.cppreference.com/w/cpp/atomic/memory_order><font size=2 color=blue face="sans-serif">http://en.cppreference.com/w/cpp/atomic/memory_order</font></a><font size=2 face="sans-serif">)</font><br><br><font size=2 face="sans-serif">In my environment, g++ and xlc generate
the following assembly on ppc64le.</font><br><font size=2 face="sans-serif">(Interestingly, they generate the same
assembly for any memory_order.)</font><br><br><font size=2 face="sans-serif">g++ (4.9.2)</font><br><font size=2 face="sans-serif"> 100008a4: ac 04
00 7c sync </font><br><font size=2 face="sans-serif"> 100008a8: 28 50
20 7d lwarx r9,0,r10</font><br><font size=2 face="sans-serif"> 100008ac: 00 18
09 7c cmpw r9,r3</font><br><font size=2 face="sans-serif"> 100008b0: 0c 00
c2 40 bne- 100008bc</font><br><font size=2 face="sans-serif"> 100008b4: 2d 51
80 7c stwcx. r4,0,r10</font><br><font size=2 face="sans-serif"> 100008b8: f0 ff
c2 40 bne- 100008a8</font><br><font size=2 face="sans-serif"> 100008bc: 2c 01
00 4c isync</font><br><br><font size=2 face="sans-serif">xlc (13.1.3)</font><br><font size=2 face="sans-serif"> 10000888: ac 04
00 7c sync </font><br><font size=2 face="sans-serif"> 1000088c: 28 28
c0 7c lwarx r6,0,r5</font><br><font size=2 face="sans-serif"> 10000890: 40 00
26 7c cmpld r6,r0</font><br><font size=2 face="sans-serif"> 10000894: 0c 00
82 40 bne 100008a0</font><br><font size=2 face="sans-serif"> 10000898: 2d 29
80 7c stwcx. r4,0,r5</font><br><font size=2 face="sans-serif"> 1000089c: f0 ff
e2 40 bne+ 1000088c</font><br><font size=2 face="sans-serif"> 100008a0: 2c 01
00 4c isync</font><br><br><font size=2 face="sans-serif">On the other hand, the current OpenJDK
generates the following assembly.</font><br><br><font size=2 face="sans-serif"> 508: ac 04 00 7c
sync </font><br><font size=2 face="sans-serif"> 50c: 00 00 5c e9
ld r10,0(r28)</font><br><font size=2 face="sans-serif"> 510: 00 50 3b 7c
cmpd r27,r10</font><br><font size=2 face="sans-serif"> 514: 1c 00 c2 40
bne- 530</font><br><font size=2 face="sans-serif"> 518: a8 40 5c 7d
ldarx r10,r28,r8</font><br><font size=2 face="sans-serif"> 51c: 00 50 3b 7c
cmpd r27,r10</font><br><font size=2 face="sans-serif"> 520: 10 00 c2 40
bne- 530</font><br><font size=2 face="sans-serif"> 524: ad 41 3c 7d
stdcx. r9,r28,r8</font><br><font size=2 face="sans-serif"> 528: f0 ff c2 40
bne- 518</font><br><font size=2 face="sans-serif"> 52c: ac 04 00 7c
sync </font><br><font size=2 face="sans-serif"> 530: 00 50 bb 7f
...</font><br><br><font size=2 face="sans-serif">We can ignore 50c-514 (they are
a duplicated guard condition), </font><br><font size=2 face="sans-serif">but the last sync instruction (52c) makes
cmpxchg stricter than </font><br><font size=2 face="sans-serif">memory_order_seq_cst as compiled by g++ and xlc, whose sequences end with isync instead.</font><br><br><font size=2 face="sans-serif">In some cases the last sync is necessary:
namely, when this thread must be able to read</font><br><font size=2 face="sans-serif">all of the changes made by the other threads
while it was executing 508 to 530 </font><br><font size=2 face="sans-serif">(the compare-and-exchange sequence).</font><br><br><font size=2 face="sans-serif">> 2. Has there been a discussion
already, establishing that the modified </font><br><font size=2 face="sans-serif">> GC code can indeed use memory_order_relaxed?
Otherwise who is </font><br><font size=2 face="sans-serif">> postulating that and based on what
evidence?</font><br><br><font size=2 face="sans-serif">Volker and his colleagues have investigated
the current GC code on this point, as described here:</font><br><a href="http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-April/019079.html"><font size=2 color=blue face="sans-serif">http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-April/019079.html</font></a><br><font size=2 face="sans-serif">However, I believe we need comments
from other GC experts before changing </font><br><font size=2 face="sans-serif">the shared code.</font><br><br><font size=2 face="sans-serif">Regards,<br>Hiroshi<br>-----------------------<br>Hiroshi Horii, Ph.D.<br>IBM Research - Tokyo<br></font><br><br><tt><font size=2>David Holmes <david.holmes@oracle.com> wrote
on 04/22/2016 21:57:07:<br><br>> From: David Holmes <david.holmes@oracle.com></font></tt><br><tt><font size=2>> To: Hiroshi H Horii/Japan/IBM@IBMJP, hotspot-runtime-<br>> dev@openjdk.java.net, hotspot-gc-dev@openjdk.java.net</font></tt><br><tt><font size=2>> Cc: Tim Ellison <Tim_Ellison@uk.ibm.com>,
ppc-aix-port-dev@openjdk.java.net</font></tt><br><tt><font size=2>> Date: 04/22/2016 21:58</font></tt><br><tt><font size=2>> Subject: Re: RFR(M): 8154736: enhancement of
cmpxchg and <br>> copy_to_survivor for ppc64</font></tt><br><tt><font size=2>> <br>> Hi Hiroshi,<br>> <br>> Two initial questions:<br>> <br>> 1. Are the current cmpxchg semantics exactly the same as <br>> memory_order_seq_cst?<br>> <br>> 2. Has there been a discussion already, establishing that the modified
<br>> GC code can indeed use memory_order_relaxed? Otherwise who is <br>> postulating that and based on what evidence?<br>> <br>> Missing memory barriers have caused very difficult to track down bugs
in <br>> the past - very rare race conditions. So any relaxation here has to
be <br>> done with extreme confidence.<br>> <br>> Thanks,<br>> David<br>> <br>> On 22/04/2016 10:28 PM, Hiroshi H Horii wrote:<br>> > Dear all:<br>> ><br>> > Can I please request reviews for the following change?<br>> ><br>> > Code change:<br>> > </font></tt><a href=http://cr.openjdk.java.net/~mdoerr/8154736_copy_to_survivor/webrev.00/><tt><font size=2>http://cr.openjdk.java.net/~mdoerr/8154736_copy_to_survivor/webrev.00/</font></tt></a><tt><font size=2><br>> > (I initially created and Martin enhanced so much)<br>> ><br>> > This change follows the discussion started from this mail.<br>> > </font></tt><a href="http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-"><tt><font size=2>http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-</font></tt></a><tt><font size=2><br>> April/018960.html<br>> ><br>> > Description:<br>> > This change provides relaxed compare-and-exchange by introducing<br>> > similar semantics of C++ atomic memory operators, enum memory_order.<br>> > As described in atomic_linux_ppc.inline.hpp, the current implementation
of<br>> > cmpxchg is fence_cmpxchg_acquire. This implementation is useful
for<br>> > general purposes because twice calls of sync before and after
cmpxchg will<br>> > provide strict consistency. However, they sometimes cause overheads<br>> > because<br>> > sync instructions are very expensive in the current POWER chip
design.<br>> > In addition, for the other platforms, such as aarch64, this strict<br>> > semantics<br>> > may cause some overheads (according to the Andrew's mail).<br>> > </font></tt><a href="http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-"><tt><font size=2>http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-</font></tt></a><tt><font size=2><br>> April/019073.html<br>> ><br>> > With this change, callers can explicitly specify constraints
of memory<br>> > ordering<br>> > for cmpxchg with an additional parameter, memory_order order.<br>> ><br>> > typedef enum memory_order {<br>> > memory_order_relaxed,<br>> > memory_order_consume,<br>> > memory_order_acquire,<br>> > memory_order_release,<br>> > memory_order_acq_rel,<br>> > memory_order_seq_cst<br>> > } memory_order;<br>> ><br>> > Because the default value of the parameter is memory_order_seq_cst,<br>> > existing codes can use the same semantics of cmpxchg without
any<br>> > modification. The relaxed cmpxchg is implemented only on ppc<br>> > in this changeset. Therefore, the behavior on the other platforms
will<br>> > not be changed with this changeset.<br>> ><br>> > In addition, with the new parameter of cmpxchg, this change improves<br>> > performance of copy_to_survivor in the parallel GC.<br>> > copy_to_survivor changes forward pointers by using cmpxchg. This<br>> > operation doesn't require any sync instructions. A pointer
is changed<br>> > at most once in a GC and when cmpxchg fails, the latest pointer
is<br>> > available for the caller. cas_set_mark and cas_forward_to are
extended<br>> > with an additional memory_order parameter as cmpxchg and copy_to_survivor<br>> > uses memory_order_relaxed to modify the forward pointers.<br>> ><br>> > Summary of source code changes:<br>> ><br>> > * src/share/vm/runtime/atomic.hpp<br>> > - Defines enum memory_order and adds a parameter
to cmpxchg.<br>> ><br>> > * src/share/vm/runtime/atomic.cpp<br>> > * src/os_cpu/bsd_x86/vm/atomic_bsd_x86.inline.hpp<br>> > * src/os_cpu/bsd_zero/vm/atomic_bsd_zero.inline.hpp<br>> > * src/os_cpu/linux_aarch64/vm/atomic_linux_aarch64.inline.hpp<br>> > * src/os_cpu/linux_sparc/vm/atomic_linux_sparc.inline.hpp<br>> > * src/os_cpu/linux_x86/vm/atomic_linux_x86.inline.hpp<br>> > * src/os_cpu/linux_zero/vm/atomic_linux_zero.inline.hpp<br>> > * src/os_cpu/solaris_sparc/vm/atomic_solaris_sparc.inline.hpp<br>> > * src/os_cpu/solaris_x86/vm/atomic_solaris_x86.inline.hpp<br>> > * src/os_cpu/windows_x86/vm/atomic_windows_x86.inline.hpp<br>> > - Added a parameter for each cmpxchg function
to follow<br>> > the change of atomic.hpp. Their
implementations are not changed.<br>> ><br>> > * src/os_cpu/aix_ppc/vm/atomic_aix_ppc.inline.hpp<br>> > * src/os_cpu/linux_ppc/vm/atomic_linux_ppc.inline.hpp<br>> > - Added a parameter for each cmpxchg function
to follow<br>> > the change of atomic.hpp. In
addition, implementations<br>> > are changed corresponding to
the specified memory_order.<br>> ><br>> > * src/share/vm/oops/oop.hpp<br>> > * src/share/vm/oops/oop.inline.hpp<br>> > - Add a memory_order parameter to use relaxed
cmpxchg in<br>> > cas_set_mark and cas_forward_to.<br>> ><br>> > * src/share/vm/gc/parallel/psPromotionManager.cpp<br>> > * src/share/vm/gc/parallel/psPromotionManager.inline.hpp<br>> ><br>> > Martin tested this changeset on linuxx86_64, linuxppc64le
and<br>> > darwinintel64.<br>> > Though more time is needed to test on the other platform, we
would like to<br>> > ask<br>> > reviews and start discussion on this changeset.<br>> > I also tested this changeset with SPECjbb2013 and confirmed that
gc pause<br>> > time<br>> > is reduced.<br>> ><br>> > Regards,<br>> > Hiroshi<br>> > -----------------------<br>> > Hiroshi Horii, Ph.D.<br>> > IBM Research - Tokyo<br>> ><br>> ><br>> <br></font></tt><BR>
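<br><font size=2 face="sans-serif">For readers following the quoted RFR, the interface it describes (a cmpxchg with an extra memory_order parameter that defaults to memory_order_seq_cst, and a forward-pointer update that passes memory_order_relaxed) can be sketched in standard C++ roughly as follows. This is only an illustration built on std::atomic, not the HotSpot implementation, which uses platform-specific assembly. The names Obj, forwardee, forward_to and failure_order are hypothetical. The sketch shows the calling pattern only; it does not settle whether relaxed ordering is sufficient for the GC, which is the question under discussion.</font><br>

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: C++11 does not allow release or acq_rel as the
// failure ordering of a compare-exchange, so derive a load-only ordering.
inline std::memory_order failure_order(std::memory_order order) {
  if (order == std::memory_order_acq_rel) return std::memory_order_acquire;
  if (order == std::memory_order_release) return std::memory_order_relaxed;
  return order;
}

// Illustrative stand-in for the proposed interface (assumption: built on
// std::atomic here). Returns the value observed in *dest before the
// operation; the caller compares it with compare_value to detect success.
template <typename T>
T cmpxchg(T exchange_value, std::atomic<T>* dest, T compare_value,
          std::memory_order order = std::memory_order_seq_cst) {
  T observed = compare_value;
  // On failure, compare_exchange_strong stores the current value in observed.
  dest->compare_exchange_strong(observed, exchange_value,
                                order, failure_order(order));
  return observed;
}

// Hypothetical object with a forwarding pointer that is set at most once.
struct Obj {
  std::atomic<Obj*> forwardee{nullptr};
};

// Try to install new_copy as the forwarding pointer with a relaxed CAS.
// Returns the pointer that ends up installed: new_copy if this caller won,
// otherwise the pointer installed earlier by another caller.
Obj* forward_to(Obj* obj, Obj* new_copy) {
  Obj* old = cmpxchg<Obj*>(new_copy, &obj->forwardee, nullptr,
                           std::memory_order_relaxed);
  return old == nullptr ? new_copy : old;
}
```

<font size=2 face="sans-serif">Existing callers that omit the last argument keep the seq_cst behavior. A caller such as forward_to opts in to the weaker ordering explicitly and, when the CAS fails, still receives the pointer that was installed first.</font><br>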