RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Dear all, I would like to ask reviews on 8154736 “enhancement of cmpxchg and copy_to_survivor”. The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64. Although discussions stopped at http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002718.... , I would like to restart the review by taking over Hiroshi's work if the discussion is still open. Bug: https://bugs.openjdk.java.net/browse/JDK-8154736 Webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.08/ Previous review had discussions on improper log output ( http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-September/00267... ). Logs can be generated properly with this change, but I would like to ask if we should use “if(log) OrderAccess:acquire()” as is in webrev or more general approach with a call to OrderAccess:consume() with empty implementation on all supported platforms. Also, there were discussions on the problem of unawareness of copied obj ( http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002696.... ). This change adds “release” in cmpxchg_pre_membar. This was discussed in http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002698.... . I measured SPECjbb2015 with its multi JVMs mode on a POWER8 node (for JDK11 , I modified MANIFEST in specjbb2015.jar to specify locations of JAXB related libraries). As a result, critical-jOPS improved by 6% due to this change. Best regards, -- Michihiro, IBM Research - Tokyo
Hi Michihiro, thanks for posting the new webrev. I think this looks much better, now. I don't like the "if log_is_enabled() acquire()" code so much, but it looks correct to me. I'd prefer your OrderAccess:consume() proposal, but I can live with it if other reviewers prefer it. I couldn't find any store-load pattern which may miss ordering, but I'd highly appreciate if another reviewer double-checked that the barriers are right this time. If needed, we could use memory_order_acq_rel which I'm planning to add with JDK-8202080. I guess this wouldn't really impact performance. I think 6% performance gain is really worth doing this change. Best regards, Martin From: Michihiro Horie [mailto:HORIE@jp.ibm.com] Sent: Montag, 23. April 2018 12:34 To: ppc-aix-port-dev@openjdk.java.net; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net Cc: Hiroshi H Horii <HORII@jp.ibm.com>; Gustavo Romero <gromero@linux.vnet.ibm.com>; Kazunori Ogata <OGATAK@jp.ibm.com>; shade@redhat.com; aph@redhat.com; Doerr, Martin <martin.doerr@sap.com> Subject: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Dear all, I would like to ask reviews on 8154736 "enhancement of cmpxchg and copy_to_survivor". The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64. Although discussions stopped at http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002718...., I would like to restart the review by taking over Hiroshi's work if the discussion is still open. Bug: https://bugs.openjdk.java.net/browse/JDK-8154736 Webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.08/ Previous review had discussions on improper log output (http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-September/00267...). Logs can be generated properly with this change, but I would like to ask if we should use "if(log) OrderAccess:acquire()" as is in webrev or more general approach with a call to OrderAccess:consume() with empty implementation on all supported platforms. Also, there were discussions on the problem of unawareness of copied obj (http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002696....). This change adds "release" in cmpxchg_pre_membar. This was discussed in http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002698..... I measured SPECjbb2015 with its multi JVMs mode on a POWER8 node (for JDK11, I modified MANIFEST in specjbb2015.jar to specify locations of JAXB related libraries). As a result, critical-jOPS improved by 6% due to this change. Best regards, -- Michihiro, IBM Research - Tokyo
Hi! On 04/24/2018 04:20 PM, Doerr, Martin wrote:
I think 6% performance gain is really worth doing this change.
Quick question: Does this pose any assumptions on the instruction set baseline on ppc64? I haven't build-tested this yet, but I want to avoid it doesn't break on POWER5, for example, which is used in Debian's big-endian ppc64 port. Thanks, Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer - glaubitz@debian.org `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Hi John, On 04/24/2018 11:42 AM, John Paul Adrian Glaubitz wrote:
On 04/24/2018 04:20 PM, Doerr, Martin wrote:
I think 6% performance gain is really worth doing this change.
Quick question: Does this pose any assumptions on the instruction set baseline on ppc64? I haven't build-tested this yet, but I want to avoid it doesn't break on POWER5, for example, which is used in Debian's big-endian ppc64 port.
Yes, it does. But POWER5 and above are ok. Lightweight sync (lwsync) was introduced with Power ISA v2.03 which, by its turn, maps to POWER5. Thanks for taking care of Power + Debian :-) Regards, Gustavo
Hi Gustavo! On 04/24/2018 05:54 PM, Gustavo Romero wrote:
Quick question: Does this pose any assumptions on the instruction set baseline on ppc64? I haven't build-tested this yet, but I want to avoid it doesn't break on POWER5, for example, which is used in Debian's big-endian ppc64 port.
Yes, it does. But POWER5 and above are ok. Lightweight sync (lwsync) was introduced with Power ISA v2.03 which, by its turn, maps to POWER5.
We are primarily interested in PPC970 and compatible which - according to Wikipedia - is compliant with POWER ISA 2.03:
https://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.03
It's sometimes a bit confusing to differentiate all the different POWER ISAs because there are so many. That's why I usually ask in such cases.
Thanks for taking care of Power + Debian :-)
You're welcome. Debian has both a big-endian (POWER5/970) and little-endian (POWER8) ppc64 port as well as powerpc (32-bit) and powerpcspe (e500v2) ports, so PowerPC is quite popular within Debian ;). Thanks, Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer - glaubitz@debian.org `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Hi Michihiro, On 23/04/2018 8:33 PM, Michihiro Horie wrote:
Dear all,
I would like to ask reviews on 8154736 “enhancement of cmpxchg and copy_to_survivor”. The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64.
Although discussions stopped at http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002718.... , I would like to restart the review by taking over Hiroshi's work if the discussion is still open.
So the very last comment there was about not implicitly assuming memory_order_consume, yet that has not been addressed in the proposal. Further the discussion on hotspot-runtime-dev through September and October was far more illuminating. I think my post here: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0216... and the closely following one from Thomas Schatzl summed up the concerns about the proposed changes. AFAICS the restarted proposal addresses none of those concerns but simply takes up where the previous implementation suggestion left off. This is a proposal to change the memory ordering semantics of part of the shared GC code _not_ just the PPC64 implementation, but I have seen no analysis to demonstrate the correctness of such a proposal. David -----
Bug: https://bugs.openjdk.java.net/browse/JDK-8154736 Webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.08/
Previous review had discussions on improper log output ( http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-September/00267... ). Logs can be generated properly with this change, but I would like to ask if we should use “if(log) OrderAccess:acquire()” as is in webrev or more general approach with a call to OrderAccess:consume() with empty implementation on all supported platforms.
Also, there were discussions on the problem of unawareness of copied obj ( http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002696.... ). This change adds “release” in cmpxchg_pre_membar. This was discussed in http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002698.... .
I measured SPECjbb2015 with its multi JVMs mode on a POWER8 node (for JDK11 , I modified MANIFEST in specjbb2015.jar to specify locations of JAXB related libraries). As a result, critical-jOPS improved by 6% due to this change.
Best regards, -- Michihiro, IBM Research - Tokyo
Hi David,
So the very last comment there was about not implicitly assuming memory_order_consume, yet that has not been addressed in the proposal.
Further the discussion on hotspot-runtime-dev through September and October was far more illuminating. I think my post here:
https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pi...
and the closely following one from Thomas Schatzl summed up the concerns about the proposed changes. Thank you very much for pointing out the missing items I need to take into account.
This is a proposal to change the memory ordering semantics of part of the shared GC code _not_ just the PPC64 implementation, but I have seen no analysis to demonstrate the correctness of such a proposal. I do agree the necessity of demonstrating the correctness. I would try my best for this.
Best regards, -- Michihiro, IBM Research - Tokyo From: David Holmes <david.holmes@oracle.com> To: Michihiro Horie <HORIE@jp.ibm.com>, ppc-aix-port-dev@openjdk.java.net, hotspot-dev@openjdk.java.net, hotspot-gc-dev@openjdk.java.net Cc: Hiroshi H Horii <HORII@jp.ibm.com> Date: 2018/04/25 21:45 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Michihiro, On 23/04/2018 8:33 PM, Michihiro Horie wrote:
Dear all,
I would like to ask reviews on 8154736 “enhancement of cmpxchg and copy_to_survivor”. The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64.
Although discussions stopped at
https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pi...
, I would like to restart the review by taking over Hiroshi's work if the discussion is still open.
So the very last comment there was about not implicitly assuming memory_order_consume, yet that has not been addressed in the proposal. Further the discussion on hotspot-runtime-dev through September and October was far more illuminating. I think my post here: https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pi... and the closely following one from Thomas Schatzl summed up the concerns about the proposed changes. AFAICS the restarted proposal addresses none of those concerns but simply takes up where the previous implementation suggestion left off. This is a proposal to change the memory ordering semantics of part of the shared GC code _not_ just the PPC64 implementation, but I have seen no analysis to demonstrate the correctness of such a proposal. David -----
Bug: https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.openjdk.java.net_b...
Webrev: https://urldefense.proofpoint.com/v2/url?u=http-3A__cr.openjdk.java.net_-7Em...
Previous review had discussions on improper log output (
https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pi...
). Logs can be generated properly with this change, but I would like to ask if we should use “if(log) OrderAccess:acquire()” as is in webrev or more general approach with a call to OrderAccess:consume() with empty implementation on all supported platforms.
Also, there were discussions on the problem of unawareness of copied obj (
https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pi...
). This change adds “release” in cmpxchg_pre_membar. This was discussed in
https://urldefense.proofpoint.com/v2/url?u=http-3A__mail.openjdk.java.net_pi...
.
I measured SPECjbb2015 with its multi JVMs mode on a POWER8 node (for JDK11 , I modified MANIFEST in specjbb2015.jar to specify locations of JAXB related libraries). As a result, critical-jOPS improved by 6% due to this change.
Best regards, -- Michihiro, IBM Research - Tokyo
On Apr 25, 2018, at 8:45 AM, David Holmes <david.holmes@oracle.com> wrote:
Hi Michihiro,
On 23/04/2018 8:33 PM, Michihiro Horie wrote:
Dear all, I would like to ask reviews on 8154736 “enhancement of cmpxchg and copy_to_survivor”. The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64. Although discussions stopped at http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002718.... , I would like to restart the review by taking over Hiroshi's work if the discussion is still open.
So the very last comment there was about not implicitly assuming memory_order_consume, yet that has not been addressed in the proposal.
Further the discussion on hotspot-runtime-dev through September and October was far more illuminating. I think my post here:
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0216...
and the closely following one from Thomas Schatzl summed up the concerns about the proposed changes.
AFAICS the restarted proposal addresses none of those concerns but simply takes up where the previous implementation suggestion left off.
This is a proposal to change the memory ordering semantics of part of the shared GC code _not_ just the PPC64 implementation, but I have seen no analysis to demonstrate the correctness of such a proposal.
I agree with David here. So far we've seen no such analysis. All we have seen is a series of proposed changes and non-failing test results, all of which have then been shown to have holes. (Among other things, this suggests the set of tests being applied is inadequate.) Part of the author's job is to convince reviewers that the proposed change is correct. I'm not expecting a formal proof, but I am expecting a lot more than has been provided so far. In this latest proposal, the conditional acquire doesn't look right to me. If the logging is disabled so there is no acquire, the object is then returned to the callers, who might do what with it? Is that safe? For all callers? And is it stable through future maintenance? This is not to say that I think making those acquires unconditional is sufficient.
Dear all, I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/ With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.) Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding). oop PSPromotionManager::copy_to_survivor_space(oop o) { oop new_obj = NULL; markOop test_mark = o->mark_raw(); if (!test_mark->is_marked()) { // The same test as "o->is_forwarded()" : Copy::aligned_disjoint_words((HeapWord*)o, (HeapWord*)new_obj, ...); if (o->cas_forward_to(new_obj, test_mark)) { // CAS succeeds : new_obj->push_contents(this); } else { // CAS fails : new_obj = o->forwardee(); } } else { : new_obj = o->forwardee(); } log_develop_trace(gc, scavenge)(..., new_obj->klass()->internal_name(), p2i((void *)o), p2i((void *)new_obj), new_obj->size()); return new_obj; } Best regards, -- Michihiro, IBM Research - Tokyo ----- Original message ----- From: Kim Barrett <kim.barrett@oracle.com> To: David Holmes <david.holmes@oracle.com> Cc: Michihiro Horie <HORIE@jp.ibm.com>, ppc-aix-port-dev@openjdk.java.net, hotspot-dev@openjdk.java.net, hotspot-gc-dev@openjdk.java.net, Hiroshi H Horii <HORII@jp.ibm.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Date: Fri, Apr 27, 2018 6:03 AM
On Apr 25, 2018, at 8:45 AM, David Holmes <david.holmes@oracle.com> wrote:
Hi Michihiro,
On 23/04/2018 8:33 PM, Michihiro Horie wrote:
Dear all, I would like to ask reviews on 8154736 “enhancement of cmpxchg and copy_to_survivor”. The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64. Although discussions stopped at
http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002718....
, I would like to restart the review by taking over Hiroshi's work if the discussion is still open.
So the very last comment there was about not implicitly assuming memory_order_consume, yet that has not been addressed in the proposal.
Further the discussion on hotspot-runtime-dev through September and October was far more illuminating. I think my post here:
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0216...
and the closely following one from Thomas Schatzl summed up the concerns
about the proposed changes.
AFAICS the restarted proposal addresses none of those concerns but
simply takes up where the previous implementation suggestion left off.
This is a proposal to change the memory ordering semantics of part of
the shared GC code _not_ just the PPC64 implementation, but I have seen no analysis to demonstrate the correctness of such a proposal. I agree with David here. So far we've seen no such analysis. All we have seen is a series of proposed changes and non-failing test results, all of which have then been shown to have holes. (Among other things, this suggests the set of tests being applied is inadequate.) Part of the author's job is to convince reviewers that the proposed change is correct. I'm not expecting a formal proof, but I am expecting a lot more than has been provided so far. In this latest proposal, the conditional acquire doesn't look right to me. If the logging is disabled so there is no acquire, the object is then returned to the callers, who might do what with it? Is that safe? For all callers? And is it stable through future maintenance? This is not to say that I think making those acquires unconditional is sufficient.
Hi Michihiro, An initial question ... Can you clarify the assumptions/expectations with regard to Release-Consume given it has been deprecated by the C++ committee? memory_order_release is expected to be C++ compatible. Thanks, David On 19/05/2018 1:12 AM, Michihiro Horie wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
oop PSPromotionManager::copy_to_survivor_space(oop o) { oop new_obj = NULL; markOop test_mark = o->mark_raw();
if (!test_mark->is_marked()) { // The same test as "o->is_forwarded()" : Copy::aligned_disjoint_words((HeapWord*)o, (HeapWord*)new_obj, ...);
if (o->cas_forward_to(new_obj, test_mark)) { // CAS succeeds : new_obj->push_contents(this);
} else { // CAS fails : new_obj = o->forwardee(); } } else { : new_obj = o->forwardee(); }
log_develop_trace(gc, scavenge)(..., new_obj->klass()->internal_name(), p2i((void *)o), p2i((void *)new_obj), new_obj->size()); return new_obj; }
Best regards, -- Michihiro, IBM Research - Tokyo
----- Original message ----- From: Kim Barrett <kim.barrett@oracle.com> To: David Holmes <david.holmes@oracle.com> Cc: Michihiro Horie <HORIE@jp.ibm.com>, ppc-aix-port-dev@openjdk.java.net, hotspot-dev@openjdk.java.net, hotspot-gc-dev@openjdk.java.net, Hiroshi H Horii <HORII@jp.ibm.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Date: Fri, Apr 27, 2018 6:03 AM
On Apr 25, 2018, at 8:45 AM, David Holmes <david.holmes@oracle.com> wrote:
Hi Michihiro,
On 23/04/2018 8:33 PM, Michihiro Horie wrote:
Dear all, I would like to ask reviews on 8154736 “enhancement of cmpxchg and copy_to_survivor”. The change adds options to avoid expensive syncs with compare-and-exchange. An experiment by using SPECjbb2015 showed 6% improvement in critical-jOPS. This change focuses on ppc64 but would be potentially beneficial for aarch64. Although discussions stopped at
_http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-October/002718....
, I would like to restart the review by taking over Hiroshi's work if the discussion is still open.
So the very last comment there was about not implicitly assuming memory_order_consume, yet that has not been addressed in the proposal.
Further the discussion on hotspot-runtime-dev through September and October was far more illuminating. I think my post here:
_http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0216...
and the closely following one from Thomas Schatzl summed up the
concerns about the proposed changes.
AFAICS the restarted proposal addresses none of those concerns but
simply takes up where the previous implementation suggestion left off.
This is a proposal to change the memory ordering semantics of part of
the shared GC code _not_ just the PPC64 implementation, but I have seen no analysis to demonstrate the correctness of such a proposal.
I agree with David here. So far we've seen no such analysis. All we have seen is a series of proposed changes and non-failing test results, all of which have then been shown to have holes. (Among other things, this suggests the set of tests being applied is inadequate.) Part of the author's job is to convince reviewers that the proposed change is correct. I'm not expecting a formal proof, but I am expecting a lot more than has been provided so far.
In this latest proposal, the conditional acquire doesn't look right to me. If the logging is disabled so there is no acquire, the object is then returned to the callers, who might do what with it? Is that safe? For all callers? And is it stable through future maintenance? This is not to say that I think making those acquires unconditional is sufficient.
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too. http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
Hi Kim, I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works. If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj. But this issue is not related to what Michihiro wants to change AFAICS. Best regards, Martin -----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too. http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
David, Kim, and Marin, Thank you for your question and comments. The important point I should specify earlier is that CAS contains a compiler barrier that prevents moving code around it, e.g. by clobber "memory" in inline asm code. Thus, release with a two-way compiler barrier (CAS) is the prerequisite in my change. Regarding the Release-Consume ordering, I think making use of implicit consume is enough and it is already realized with the existing code (release and CAS). Best regards, -- Michihiro, IBM Research - Tokyo From: "Doerr, Martin" <martin.doerr@sap.com> To: Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com> Cc: "hotspot-dev@openjdk.java.net" <hotspot-dev@openjdk.java.net>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "Gustavo Bueno Romero" <gromero@br.ibm.com>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net>, "david.holmes@oracle.com" <david.holmes@oracle.com> Date: 2018/05/22 19:17 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Kim, I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works. If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj. But this issue is not related to what Michihiro wants to change AFAICS. Best regards, Martin -----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from
other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS
(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding). Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too. http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
On 22/05/2018 8:16 PM, Doerr, Martin wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But if forwardee is set by a CAS with (default) full bi-directional fence we have that barrier. The argument is that such a strong barrier can be relaxed yet maintain correctness - isn't it? David
But this issue is not related to what Michihiro wants to change AFAICS.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
Hi David, Michihiro's change does not rely on Consume AFAICS. Only existing code does. The CAS is not in the path between "new_obj = o->forwardee();" and the usage of new_obj. So the CAS' barrier won't help. There's just ordering by "Consume". The other thread's CAS (which has set the _mark field) currently has a post-barrier which won't help the reading thread, either. Michihiro's change relies on the ordering of the CAS with respect to "new_obj = o->forwardee();" by accessing the same memory location "_mark" which is ensured by memory coherency. And it relies on that compilers don't speculatively load o->forwardee() before the CAS which is ensured by the integrated compiler barriers (clobber "memory" in the volatile inline asm code). And it is also prevented because _mark is declared volatile. Best regards, Martin -----Original Message----- From: David Holmes [mailto:david.holmes@oracle.com] Sent: Mittwoch, 23. Mai 2018 10:36 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 On 22/05/2018 8:16 PM, Doerr, Martin wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But if forwardee is set by a CAS with (default) full bi-directional fence we have that barrier. The argument is that such a strong barrier can be relaxed yet maintain correctness - isn't it? David
But this issue is not related to what Michihiro wants to change AFAICS.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
Hi David, I shouldn’t have written Release-Consume even though this wording is known from C++11. Consume is currently used for accessing the forwardee: One thread releases, another thread consumes new_obj. The place where I would like to change the barriers is the CAS. We don't need to rely on consume there. The release is the important part. Best regards, -- Michihiro, IBM Research - Tokyo From: "Doerr, Martin" <martin.doerr@sap.com> To: David Holmes <david.holmes@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com> Cc: "hotspot-dev@openjdk.java.net" <hotspot-dev@openjdk.java.net>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "Gustavo Bueno Romero" <gromero@br.ibm.com>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/23 18:13 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi David, Michihiro's change does not rely on Consume AFAICS. Only existing code does. The CAS is not in the path between "new_obj = o->forwardee();" and the usage of new_obj. So the CAS' barrier won't help. There's just ordering by "Consume". The other thread's CAS (which has set the _mark field) currently has a post-barrier which won't help the reading thread, either. Michihiro's change relies on the ordering of the CAS with respect to "new_obj = o->forwardee();" by accessing the same memory location "_mark" which is ensured by memory coherency. And it relies on that compilers don't speculatively load o->forwardee() before the CAS which is ensured by the integrated compiler barriers (clobber "memory" in the volatile inline asm code). And it is also prevented because _mark is declared volatile. Best regards, Martin -----Original Message----- From: David Holmes [mailto:david.holmes@oracle.com] Sent: Mittwoch, 23. Mai 2018 10:36 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 On 22/05/2018 8:16 PM, Doerr, Martin wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But if forwardee is set by a CAS with (default) full bi-directional fence we have that barrier. The argument is that such a strong barrier can be relaxed yet maintain correctness - isn't it? David
But this issue is not related to what Michihiro wants to change AFAICS.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from
other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS
(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com> wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But this issue is not related to what Michihiro wants to change AFAICS.
The current full-fence CAS guarantees the stores into the new forwardee installed by the CAS happen before loads from that object after the CAS. Algorithmically, o->forwardee() is guaranteed to be the same object as was returned by the CAS. Hence, loads from the forwardee are being ordered by the fenced CAS. I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of 122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 } We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
Hi Kim,
I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of
122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 }
We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Thank you for the discussions on your side with the GC team. I summarized the point on why my change works as follows. Hope we are on the same page with this. 1. Current implementation PSPromotionManager::copy_to_survivor_space is used to move live objects to a different location. It uses a forwarding technique and allows multiple threads to compete for performing the copy step. The first thread succeeds in installing its copy in the old object as forwardee. Other threads may need to discard their copy and use the one generated by the first thread which has won the race. Written program order: (1) create new_obj as copy of obj (2) full fence (3) CAS to set the forwardee with new_obj (4) full fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on When thread0 succeeds in CAS at (3), the copied new_obj by thread0 must be accessible from thread1 at (6). (2) guarantees the order of (1) and (3), although it is stronger than needed for the purpose of ensuring a consistent view of copied new_obj from thread1. (5), (6), and (7) must be executed after (3). Apparently, (4) looks guranteeing the order, although it is redundant. The order of (6) and (7) is guaranteed by consume. (5) and (6) are on different control paths. (5) and (7): Thread0 owns new_obj when CAS succeeded and can access it without barrier. 2. Proposed change Written program order: (1) create new_obj as copy of obj (2) release fence (3) CAS to set the forwardee with new_obj (4) no fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on Release fence at (2) is sufficient to make the copied new_obj accessible from a thread that fails in CAS. No fence at (4) is acceptable because it is redundant. The order of (5), (6), and (7) is the same as the current implementation. It is not affected by the proposed change. 3. Reason why this is sufficient Memory coherence guarantees that all the threads share a consistent view on the access to the same memory location, which is "_mark" in the target code. Thread0 writes the "_mark" when it succeeds in CAS at (3) and thread1 reads the "_mark" when it failes in CAS at (3). Thread1 also reads the "_mark" by invoking "o->forwardee()" at (6). (See CoRR1 in Section 8 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf) Also, compilers do not speculatively load "o->forwardee()" at (6) before the CAS at (3). This is ensured by the integrated compiler barriers (clobber "memory" in the volatile inline asm code). And it is also prevented because "_mark" is declared volatile. Best regards, -- Michihiro, IBM Research - Tokyo From: Kim Barrett <kim.barrett@oracle.com> To: "Doerr, Martin" <martin.doerr@sap.com> Cc: Michihiro Horie <HORIE@jp.ibm.com>, "hotspot-dev@openjdk.java.net" <hotspot-dev@openjdk.java.net>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, Gustavo Bueno Romero <gromero@br.ibm.com>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net>, "david.holmes@oracle.com" <david.holmes@oracle.com> Date: 2018/05/26 01:01 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com> wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But this issue is not related to what Michihiro wants to change AFAICS.
The current full-fence CAS guarantees the stores into the new forwardee installed by the CAS happen before loads from that object after the CAS. Algorithmically, o->forwardee() is guaranteed to be the same object as was returned by the CAS. Hence, loads from the forwardee are being ordered by the fenced CAS. I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of 122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 } We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net]
On Behalf Of Kim Barrett
Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from
other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS
(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
Hi Michihiro, In your analysis, you state that the failing CAS path today already relies on implicit consume ordering as reading forwardee() after the failed CAS is missing acquire and hence accesses into the new reloaded forwardee would rely on (implicit) data dependencies to the reloaded forwardee. That part of the analysis seems wrong to me. Since today even a failed CAS has acquire semantics (and stronger), and the reloaded forwardee always has the same value as was observed in the failed cas (in this context), all data dependency requirements to the reloaded forwardee are therefore no longer needed or relied upon. We do not use implicit consume in the shared C++ code. If you find any instances of that, it is a bug and should be purged with fire. Even explicit consume is currently strongly discouraged. Implicit consume is unreliable, especially in a project with many platforms. If you insist on using more fragile semantics that are known to be unreliable, I would like to at least know what measurable performance difference you observe between the semantics Kim proposed, compared to the elided acquire variant you insist on. My gut feeling tells me that double sync is very intrusive, but an isync scheduled almost immediately after an lwsync, should be significantly less intrusive. Thanks, /Erik
On 28 May 2018, at 03:28, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Kim,
I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of
122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 }
We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Thank you for the discussions on your side with the GC team. I summarized the point on why my change works as follows. Hope we are on the same page with this.
1. Current implementation
PSPromotionManager::copy_to_survivor_space is used to move live objects to a different location. It uses a forwarding technique and allows multiple threads to compete for performing the copy step.
The first thread succeeds in installing its copy in the old object as forwardee. Other threads may need to discard their copy and use the one generated by the first thread which has won the race.
Written program order: (1) create new_obj as copy of obj (2) full fence (3) CAS to set the forwardee with new_obj (4) full fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on
When thread0 succeeds in CAS at (3), the copied new_obj by thread0 must be accessible from thread1 at (6). (2) guarantees the order of (1) and (3), although it is stronger than needed for the purpose of ensuring a consistent view of copied new_obj from thread1.
(5), (6), and (7) must be executed after (3). Apparently, (4) looks guranteeing the order, although it is redundant.
The order of (6) and (7) is guaranteed by consume. (5) and (6) are on different control paths. (5) and (7): Thread0 owns new_obj when CAS succeeded and can access it without barrier.
2. Proposed change
Written program order: (1) create new_obj as copy of obj (2) release fence (3) CAS to set the forwardee with new_obj (4) no fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on
Release fence at (2) is sufficient to make the copied new_obj accessible from a thread that fails in CAS.
No fence at (4) is acceptable because it is redundant.
The order of (5), (6), and (7) is the same as the current implementation. It is not affected by the proposed change.
3. Reason why this is sufficient
Memory coherence guarantees that all the threads share a consistent view on the access to the same memory location, which is "_mark" in the target code. Thread0 writes the "_mark" when it succeeds in CAS at (3) and thread1 reads the "_mark" when it failes in CAS at (3). Thread1 also reads the "_mark" by invoking "o->forwardee()" at (6). (See CoRR1 in Section 8 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf)
Also, compilers do not speculatively load "o->forwardee()" at (6) before the CAS at (3). This is ensured by the integrated compiler barriers (clobber "memory" in the volatile inline asm code). And it is also prevented because "_mark" is declared volatile.
Best regards, -- Michihiro, IBM Research - Tokyo
<graycol.gif>Kim Barrett ---2018/05/26 01:01:40---> On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com> wrote: >
From: Kim Barrett <kim.barrett@oracle.com> To: "Doerr, Martin" <martin.doerr@sap.com> Cc: Michihiro Horie <HORIE@jp.ibm.com>, "hotspot-dev@openjdk.java.net" <hotspot-dev@openjdk.java.net>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, Gustavo Bueno Romero <gromero@br.ibm.com>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net>, "david.holmes@oracle.com" <david.holmes@oracle.com> Date: 2018/05/26 01:01 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com> wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But this issue is not related to what Michihiro wants to change AFAICS.
The current full-fence CAS guarantees the stores into the new forwardee installed by the CAS happen before loads from that object after the CAS. Algorithmically, o->forwardee() is guaranteed to be the same object as was returned by the CAS. Hence, loads from the forwardee are being ordered by the fenced CAS.
I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of
122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 }
We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
Hi Erik, Thank you very much for your review. I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire. New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 Best regards, -- Michihiro, IBM Research - Tokyo From: Erik Osterlund <erik.osterlund@oracle.com> To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Kim Barrett <kim.barrett@oracle.com>, "hotspot-dev@openjdk.java.net" <hotspot-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net>, Gustavo Bueno Romero <gromero@br.ibm.com>, "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net> Date: 2018/05/28 15:48 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Michihiro, In your analysis, you state that the failing CAS path today already relies on implicit consume ordering as reading forwardee() after the failed CAS is missing acquire and hence accesses into the new reloaded forwardee would rely on (implicit) data dependencies to the reloaded forwardee. That part of the analysis seems wrong to me. Since today even a failed CAS has acquire semantics (and stronger), and the reloaded forwardee always has the same value as was observed in the failed cas (in this context), all data dependency requirements to the reloaded forwardee are therefore no longer needed or relied upon. We do not use implicit consume in the shared C++ code. If you find any instances of that, it is a bug and should be purged with fire. Even explicit consume is currently strongly discouraged. Implicit consume is unreliable, especially in a project with many platforms. If you insist on using more fragile semantics that are known to be unreliable, I would like to at least know what measurable performance difference you observe between the semantics Kim proposed, compared to the elided acquire variant you insist on. My gut feeling tells me that double sync is very intrusive, but an isync scheduled almost immediately after an lwsync, should be significantly less intrusive. Thanks, /Erik On 28 May 2018, at 03:28, Michihiro Horie <HORIE@jp.ibm.com> wrote: Hi Kim, >I've discussed this with others on the GC team; we think the minimal >required barriers are CAS with memory_order_acq_rel, plus an acquire >barrier on the else branch of > > 122 if (!test_mark->is_marked()) { >... > 261 } else { > 262 assert(o->is_forwarded(), "Sanity"); > 263 new_obj = o->forwardee(); > 264 } > >We've not done enough analysis to show this is sufficient, but we >think anything weaker is not sufficient for shared code. Thank you for the discussions on your side with the GC team. I summarized the point on why my change works as follows. Hope we are on the same page with this. 1. Current implementation PSPromotionManager::copy_to_survivor_space is used to move live objects to a different location. It uses a forwarding technique and allows multiple threads to compete for performing the copy step. The first thread succeeds in installing its copy in the old object as forwardee. Other threads may need to discard their copy and use the one generated by the first thread which has won the race. Written program order: (1) create new_obj as copy of obj (2) full fence (3) CAS to set the forwardee with new_obj (4) full fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on When thread0 succeeds in CAS at (3), the copied new_obj by thread0 must be accessible from thread1 at (6). (2) guarantees the order of (1) and (3), although it is stronger than needed for the purpose of ensuring a consistent view of copied new_obj from thread1. (5), (6), and (7) must be executed after (3). Apparently, (4) looks guranteeing the order, although it is redundant. The order of (6) and (7) is guaranteed by consume. (5) and (6) are on different control paths. (5) and (7): Thread0 owns new_obj when CAS succeeded and can access it without barrier. 2. Proposed change Written program order: (1) create new_obj as copy of obj (2) release fence (3) CAS to set the forwardee with new_obj (4) no fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on Release fence at (2) is sufficient to make the copied new_obj accessible from a thread that fails in CAS. No fence at (4) is acceptable because it is redundant. The order of (5), (6), and (7) is the same as the current implementation. It is not affected by the proposed change. 3. Reason why this is sufficient Memory coherence guarantees that all the threads share a consistent view on the access to the same memory location, which is "_mark" in the target code. Thread0 writes the "_mark" when it succeeds in CAS at (3) and thread1 reads the "_mark" when it failes in CAS at (3). Thread1 also reads the "_mark" by invoking "o->forwardee()" at (6). (See CoRR1 in Section 8 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf) Also, compilers do not speculatively load "o->forwardee()" at (6) before the CAS at (3). This is ensured by the integrated compiler barriers (clobber "memory" in the volatile inline asm code). And it is also prevented because "_mark" is declared volatile. Best regards, -- Michihiro, IBM Research - Tokyo <graycol.gif>Kim Barrett ---2018/05/26 01:01:40---> On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com> wrote: > From: Kim Barrett <kim.barrett@oracle.com> To: "Doerr, Martin" <martin.doerr@sap.com> Cc: Michihiro Horie <HORIE@jp.ibm.com>, "hotspot-dev@openjdk.java.net " <hotspot-dev@openjdk.java.net>, "hotspot-gc-dev@openjdk.java.net" < hotspot-gc-dev@openjdk.java.net>, Gustavo Bueno Romero < gromero@br.ibm.com>, "ppc-aix-port-dev@openjdk.java.net" < ppc-aix-port-dev@openjdk.java.net>, "david.holmes@oracle.com" < david.holmes@oracle.com> Date: 2018/05/26 01:01 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 > On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com> wrote: > > Hi Kim, > > I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works. > > If implicit consume has been rejected the current code is wrong: > "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj. > > But this issue is not related to what Michihiro wants to change AFAICS. The current full-fence CAS guarantees the stores into the new forwardee installed by the CAS happen before loads from that object after the CAS. Algorithmically, o->forwardee() is guaranteed to be the same object as was returned by the CAS. Hence, loads from the forwardee are being ordered by the fenced CAS. I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of 122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 } We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code. > > Best regards, > Martin > > > -----Original Message----- > From: ppc-aix-port-dev [ mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett > Sent: Montag, 21. Mai 2018 06:00 > To: Michihiro Horie <HORIE@jp.ibm.com> > Cc: hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; david.holmes@oracle.com > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 > >> On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com> wrote: >> >> Dear all, >> >> I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/ >> >> With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.) >> >> Regarding the maintainability, the requirement is CAS (memory_order_release) as Release-Consume to be consistent with C+ +11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding). > > Relying on implicit consume has been been discussed and rejected, in > the earlier thread on this topic and I think elsewhere too. > > http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code. In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object. Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Kim, I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was. Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it. Thanks and best regards, Martin -----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code. In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object. Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me. Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Erik, the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists. PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC. But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume. Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha. Thanks and best regards, Martin -----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me. Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Kim, Erik, and Martin, Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM. Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites. I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough. Best regards, -- Michihiro, IBM Research - Tokyo From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Erik, the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists. PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC. But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume. Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha. Thanks and best regards, Martin -----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" ( https://dl.acm.org/citation.cfm?id=1993520 ), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me. Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Michihiro, Looks good to me. Thanks, /Erik On 2018-06-01 17:08, Michihiro Horie wrote:
Hi Kim, Erik, and Martin,
Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ <http://cr.openjdk.java.net/%7Emhorie/8154736/webrev.13/> This change uses forwardee_acquire(), which would generate better code on ARM.
Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Best regards, -- Michihiro, IBM Research - Tokyo
Inactive hide details for "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr".
From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
------------------------------------------------------------------------
Hi Erik,
the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists.
PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC.
But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume.
Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha.
Thanks and best regards, Martin
-----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Martin and Michihiro,
On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose.
In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data.
Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me.
Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 <http://cr.openjdk.java.net/%7Emhorie/8154736/webrev.10> This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Michihiro,
Looks good to me.
Thanks a lot, Erik! Best regards, -- Michihiro, IBM Research - Tokyo From: "Erik Österlund" <erik.osterlund@oracle.com> To: Michihiro Horie <HORIE@jp.ibm.com>, "Doerr, Martin" <martin.doerr@sap.com> Cc: "Andrew Haley (aph@redhat.com)" <aph@redhat.com>, "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, Kim Barrett <kim.barrett@oracle.com>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/06/02 00:15 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Michihiro, Looks good to me. Thanks, /Erik On 2018-06-01 17:08, Michihiro Horie wrote: Hi Kim, Erik, and Martin, Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM. Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites. I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough. Best regards, -- Michihiro, IBM Research - Tokyo Inactive hide details for "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr". From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Erik, the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists. PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC. But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume. Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha. Thanks and best regards, Martin -----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote: > Hi Kim, > > I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. > So it sounds like a request to fix the current implementation in addition to what his original intend was. I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions. > Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. > What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? > The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it. Sure, that sounds good to me. Thanks, /Erik > Thanks and best regards, > Martin > > > -----Original Message----- > From: Kim Barrett [mailto:kim.barrett@oracle.com] > Sent: Dienstag, 29. Mai 2018 01:54 > To: Michihiro Horie <HORIE@jp.ibm.com> > Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 > >> On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote: >> >> Hi Erik, >> >> Thank you very much for your review. >> >> I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire. >> >> New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 > This is missing the acquire barrier on the else branch for the initial test, so fails to meet > the previously described minimal requirements for even possibly being sufficient. Any > analysis of weakening the CAS barriers must consider that test and successor code. > > In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be > considered; the forwardee is being returned to caller(s) that will presumably do something > with that object. > > Since the whole point of this discussion is performance, any proposed change should come > with performance information. >
Hi Michihiro, looks good to me, too. Thanks for adding comments where you have changed memory barriers. Pushed to submission repo and our internal testing. Best regards, Martin From: Michihiro Horie [mailto:HORIE@jp.ibm.com] Sent: Freitag, 1. Juni 2018 17:37 To: Erik Österlund <erik.osterlund@oracle.com> Cc: Andrew Haley (aph@redhat.com) <aph@redhat.com>; david.holmes@oracle.com; hotspot-gc-dev@openjdk.java.net; Kim Barrett <kim.barrett@oracle.com>; Doerr, Martin <martin.doerr@sap.com>; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Michihiro,
Looks good to me.
Thanks a lot, Erik! Best regards, -- Michihiro, IBM Research - Tokyo [Inactive hide details for "Erik Österlund" ---2018/06/02 00:15:15---Hi Michihiro, Looks good to me.]"Erik Österlund" ---2018/06/02 00:15:15---Hi Michihiro, Looks good to me. From: "Erik Österlund" <erik.osterlund@oracle.com<mailto:erik.osterlund@oracle.com>> To: Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>>, "Doerr, Martin" <martin.doerr@sap.com<mailto:martin.doerr@sap.com>> Cc: "Andrew Haley (aph@redhat.com<mailto:aph@redhat.com>)" <aph@redhat.com<mailto:aph@redhat.com>>, "david.holmes@oracle.com<mailto:david.holmes@oracle.com>" <david.holmes@oracle.com<mailto:david.holmes@oracle.com>>, "hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>" <hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>>, Kim Barrett <kim.barrett@oracle.com<mailto:kim.barrett@oracle.com>>, "ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>" <ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>> Date: 2018/06/02 00:15 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 ________________________________ Hi Michihiro, Looks good to me. Thanks, /Erik On 2018-06-01 17:08, Michihiro Horie wrote: Hi Kim, Erik, and Martin, Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM. Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites. I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough. Best regards, -- Michihiro, IBM Research - Tokyo [Inactive hide details for "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of]"Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr". From: "Doerr, Martin" <martin.doerr@sap.com><mailto:martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com><mailto:erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com><mailto:kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com><mailto:HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com<mailto:aph@redhat.com>)" <aph@redhat.com><mailto:aph@redhat.com> Cc: "david.holmes@oracle.com"<mailto:david.holmes@oracle.com> <david.holmes@oracle.com><mailto:david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net"<mailto:hotspot-gc-dev@openjdk.java.net> <hotspot-gc-dev@openjdk.java.net><mailto:hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net"<mailto:ppc-aix-port-dev@openjdk.java.net> <ppc-aix-port-dev@openjdk.java.net><mailto:ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 ________________________________ Hi Erik, the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists. PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC. But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume. Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha. Thanks and best regards, Martin -----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com><mailto:martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com><mailto:kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com><mailto:HORIE@jp.ibm.com> Cc: david.holmes@oracle.com<mailto:david.holmes@oracle.com>; Gustavo Bueno Romero <gromero@br.ibm.com><mailto:gromero@br.ibm.com>; hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>; hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>; ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me. Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com><mailto:HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com><mailto:erik.osterlund@oracle.com>; david.holmes@oracle.com<mailto:david.holmes@oracle.com>; Gustavo Bueno Romero <gromero@br.ibm.com><mailto:gromero@br.ibm.com>; hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>; hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>; ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>; Doerr, Martin <martin.doerr@sap.com><mailto:martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com><mailto:HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10<http://cr.openjdk.java.net/%7Emhorie/8154736/webrev.10> This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
Hi Michihiro,
looks good to me, too. Thanks for adding comments where you have changed memory barriers. Pushed to submission repo and our internal testing.
Thanks a lot, Martin! Best regards, -- Michihiro, IBM Research - Tokyo From: "Doerr, Martin" <martin.doerr@sap.com> To: Michihiro Horie <HORIE@jp.ibm.com>, "Erik Österlund" <erik.osterlund@oracle.com> Cc: "Andrew Haley (aph@redhat.com)" <aph@redhat.com>, "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "Kim Barrett" <kim.barrett@oracle.com>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/06/04 19:17 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Michihiro, looks good to me, too. Thanks for adding comments where you have changed memory barriers. Pushed to submission repo and our internal testing. Best regards, Martin From: Michihiro Horie [mailto:HORIE@jp.ibm.com] Sent: Freitag, 1. Juni 2018 17:37 To: Erik Österlund <erik.osterlund@oracle.com> Cc: Andrew Haley (aph@redhat.com) <aph@redhat.com>; david.holmes@oracle.com; hotspot-gc-dev@openjdk.java.net; Kim Barrett <kim.barrett@oracle.com>; Doerr, Martin <martin.doerr@sap.com>; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Michihiro,
Looks good to me.
Thanks a lot, Erik! Best regards, -- Michihiro, IBM Research - Tokyo Inactive hide details for "Erik Österlund" ---2018/06/02 00:15:15---Hi Michihiro, Looks good to me."Erik Österlund" ---2018/06/02 00:15:15---Hi Michihiro, Looks good to me. From: "Erik Österlund" <erik.osterlund@oracle.com> To: Michihiro Horie <HORIE@jp.ibm.com>, "Doerr, Martin" < martin.doerr@sap.com> Cc: "Andrew Haley (aph@redhat.com)" <aph@redhat.com>, " david.holmes@oracle.com" <david.holmes@oracle.com>, " hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, Kim Barrett <kim.barrett@oracle.com>, "ppc-aix-port-dev@openjdk.java.net" < ppc-aix-port-dev@openjdk.java.net> Date: 2018/06/02 00:15 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Michihiro, Looks good to me. Thanks, /Erik On 2018-06-01 17:08, Michihiro Horie wrote: Hi Kim, Erik, and Martin, Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume. I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM. Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites. I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough. Best regards, -- Michihiro, IBM Research - Tokyo Inactive hide details for "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr". From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Erik, the current implementation works on PPC because of "MP+sync +addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists. PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC. But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume. Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha. Thanks and best regards, Martin -----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Martin and Michihiro, On 2018-05-29 12:30, Doerr, Martin wrote: > Hi Kim, > > I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. > So it sounds like a request to fix the current implementation in addition to what his original intend was. I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose. In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data. Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions. > Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. > What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? > The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it. Sure, that sounds good to me. Thanks, /Erik > Thanks and best regards, > Martin > > > -----Original Message----- > From: Kim Barrett [mailto:kim.barrett@oracle.com] > Sent: Dienstag, 29. Mai 2018 01:54 > To: Michihiro Horie <HORIE@jp.ibm.com> > Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> > Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 > >> On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote: >> >> Hi Erik, >> >> Thank you very much for your review. >> >> I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire. >> >> New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 > This is missing the acquire barrier on the else branch for the initial test, so fails to meet > the previously described minimal requirements for even possibly being sufficient. Any > analysis of weakening the CAS barriers must consider that test and successor code. > > In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be > considered; the forwardee is being returned to caller(s) that will presumably do something > with that object. > > Since the whole point of this discussion is performance, any proposed change should come > with performance information. >
On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Kim, Erik, and Martin,
Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM.
Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Looks good.
Best regards, -- Michihiro, IBM Research - Tokyo
"Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr".
From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett <kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com> Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Erik,
the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists.
PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC.
But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume.
Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha.
Thanks and best regards, Martin
-----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Martin and Michihiro,
On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose.
In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data.
Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me.
Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Kim, Erik, and Martin,
Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/
This change uses forwardee_acquire(), which would generate better code on ARM.
Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Looks good.
Thanks a lot, Kim! Best regards, -- Michihiro, IBM Research - Tokyo From: Kim Barrett <kim.barrett@oracle.com> To: Michihiro Horie <HORIE@jp.ibm.com> Cc: "Doerr, Martin" <martin.doerr@sap.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com>, "david.holmes@oracle.com" <david.holmes@oracle.com>, "Erik Österlund" <erik.osterlund@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/06/05 05:08 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Kim, Erik, and Martin,
Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/
This change uses forwardee_acquire(), which would generate better code on ARM.
Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Looks good.
Best regards, -- Michihiro, IBM Research - Tokyo
"Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current
implementation works on PPC because of "MP+sync+addr".
From: "Doerr, Martin" <martin.doerr@sap.com> To: "Erik Österlund" <erik.osterlund@oracle.com>, Kim Barrett
<kim.barrett@oracle.com>, Michihiro Horie <HORIE@jp.ibm.com>, "Andrew Haley (aph@redhat.com)" <aph@redhat.com>
Cc: "david.holmes@oracle.com" <david.holmes@oracle.com>, "hotspot-gc-dev@openjdk.java.net" <hotspot-gc-dev@openjdk.java.net>, "ppc-aix-port-dev@openjdk.java.net" <ppc-aix-port-dev@openjdk.java.net> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Erik,
the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists.
PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC.
But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume.
Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha.
Thanks and best regards, Martin
-----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com>; Kim Barrett <kim.barrett@oracle.com>; Michihiro Horie <HORIE@jp.ibm.com> Cc: david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Martin and Michihiro,
On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose.
In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data.
Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" ( https://dl.acm.org/citation.cfm?id=1993520 ), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me.
Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: Erik Osterlund <erik.osterlund@oracle.com>; david.holmes@oracle.com; Gustavo Bueno Romero <gromero@br.ibm.com>; hotspot-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; ppc-aix-port-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10
This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
Thanks for the contribution, and thanks everybody for discussing and reviewing. I’ve pushed it. Best regards, Martin From: Michihiro Horie [mailto:HORIE@jp.ibm.com] Sent: Dienstag, 5. Juni 2018 00:42 To: Kim Barrett <kim.barrett@oracle.com> Cc: Andrew Haley (aph@redhat.com) <aph@redhat.com>; david.holmes@oracle.com; Erik Österlund <erik.osterlund@oracle.com>; hotspot-gc-dev@openjdk.java.net; Doerr, Martin <martin.doerr@sap.com>; ppc-aix-port-dev@openjdk.java.net Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> wrote:
Hi Kim, Erik, and Martin,
Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM.
Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Looks good.
Thanks a lot, Kim! Best regards, -- Michihiro, IBM Research - Tokyo [Inactive hide details for Kim Barrett ---2018/06/05 05:08:48---> On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com]Kim Barrett ---2018/06/05 05:08:48---> On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> wrote: > From: Kim Barrett <kim.barrett@oracle.com<mailto:kim.barrett@oracle.com>> To: Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> Cc: "Doerr, Martin" <martin.doerr@sap.com<mailto:martin.doerr@sap.com>>, "Andrew Haley (aph@redhat.com<mailto:aph@redhat.com>)" <aph@redhat.com<mailto:aph@redhat.com>>, "david.holmes@oracle.com<mailto:david.holmes@oracle.com>" <david.holmes@oracle.com<mailto:david.holmes@oracle.com>>, "Erik Österlund" <erik.osterlund@oracle.com<mailto:erik.osterlund@oracle.com>>, "hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>" <hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>>, "ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>" <ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>> Date: 2018/06/05 05:08 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 ________________________________
On Jun 1, 2018, at 11:08 AM, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> wrote:
Hi Kim, Erik, and Martin,
Thank you very much for reminding me that an acquire barrier in the else-statement for “!test_mark->is_marked()” is necessary under the criteria of not relying on the consume.
I uploaded a new webrev : http://cr.openjdk.java.net/~mhorie/8154736/webrev.13/ This change uses forwardee_acquire(), which would generate better code on ARM.
Necessary barriers are located in all the paths in copy_to_survivor_space, and the returned new_obj can be safely handled in the caller sites.
I measured SPECjbb2015 with the latest webrev. Critical-jOPS improved by 5%. Since my previous measurement with implicit consume showed 6% improvement, adding acquire barriers degraded the performance a little, but 5% is still good enough.
Looks good.
Best regards, -- Michihiro, IBM Research - Tokyo
"Doerr, Martin" ---2018/05/30 16:18:09---Hi Erik, the current implementation works on PPC because of "MP+sync+addr".
From: "Doerr, Martin" <martin.doerr@sap.com<mailto:martin.doerr@sap.com>> To: "Erik Österlund" <erik.osterlund@oracle.com<mailto:erik.osterlund@oracle.com>>, Kim Barrett <kim.barrett@oracle.com<mailto:kim.barrett@oracle.com>>, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>>, "Andrew Haley (aph@redhat.com<mailto:aph@redhat.com>)" <aph@redhat.com<mailto:aph@redhat.com>> Cc: "david.holmes@oracle.com<mailto:david.holmes@oracle.com>" <david.holmes@oracle.com<mailto:david.holmes@oracle.com>>, "hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>" <hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>>, "ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>" <ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>> Date: 2018/05/30 16:18 Subject: RE: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Erik,
the current implementation works on PPC because of "MP+sync+addr". So we already rely on ordering of "load volatile field" + "implicit consume" on the reader's side. We have never seen any issues related to this with the compilers we have been using during the ~10 years the PPC implementation exists.
PPC supports "MP+lwsync+addr" the same way, so Michihiro's proposal doesn't make it unreliable for PPC.
But I'm ok with evaluating acquire barriers although they are not required by the PPC/ARM memory models. ARM/aarch64 will also be affected when the o->forwardee uses load_acquire. So somebody should check the impact. If it is not acceptable we may need to introduce explicit consume.
Implicit consume is also bad in shared code because somebody may want to run it on DEC Alpha.
Thanks and best regards, Martin
-----Original Message----- From: Erik Österlund [mailto:erik.osterlund@oracle.com] Sent: Dienstag, 29. Mai 2018 14:01 To: Doerr, Martin <martin.doerr@sap.com<mailto:martin.doerr@sap.com>>; Kim Barrett <kim.barrett@oracle.com<mailto:kim.barrett@oracle.com>>; Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> Cc: david.holmes@oracle.com<mailto:david.holmes@oracle.com>; Gustavo Bueno Romero <gromero@br.ibm.com<mailto:gromero@br.ibm.com>>; hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>; hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>; ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
Hi Martin and Michihiro,
On 2018-05-29 12:30, Doerr, Martin wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS. So it sounds like a request to fix the current implementation in addition to what his original intend was.
I think we are just trying to nail down the correct fencing and just go for that. And yes, this is arguably a pre-existing problem, but in a race involving the very same accesses that we are changing the fencing for. So it is not completely unrelated I suppose.
In particular, hotspot has code that assumes that if you on the writer side issue a full fence before publishing a pointer to newly initialized data, then the initializing stores and their side effects should be globally "visible" across the system before the pointer to it is published, and hence elide the need for acquire on the loading side, without relying on retained data dependencies on the loader side. I believe this code falls under that category. It is assumed that the leading fence of the CAS publishing the forwarding pointer makes the initializing stores globally observable before publishing a pointer to the initialized data, hence assuming that any loads able to observe the new pointer would not rely on acquire or data dependent loads to correctly read the initialized data.
Unfortunately, this is not reliable in the IRIW case, as per the litmus test "MP+sync+ctrl" as described in "Understanding POWER multiprocessors" (https://dl.acm.org/citation.cfm?id=1993520), as opposed to "MP+sync+addr" that gets away with it because of the data dependency (not IRIW). Similarly, an isync does the job too on the reader side as shown in MP+sync+ctrlisync. So while what I believe was the previous reasoning that the leading sync of the CAS would elide the necessity for acquire on the reader side without relying on data dependent loads (implicit consume), I think that assumption was wrong in the first place and that we do indeed need explicit acquire (even with the precious conservative CAS fencing) in this context to not rely on implicit consume semantics generating the required data dependent loads on the reader side. In practice though, the leading sync of the CAS has been enough to generate the correct machine code. Now, with the leading sync removed, we are increasing the possible holes in the generated machine code due to this flawed reasoning. So it would be nice to do something more sound instead that does not make such assumptions.
Anyway, I agree with that implicit consume is not good. And I think it would be good to treat both o->forwardee() the same way. What about keeping memory_order_release for the CAS and using acquire for both o->forwardee()? The case in which the CAS succeeds is safe because the current thread has created new_obj so it doesn't need memory barriers to access it.
Sure, that sounds good to me.
Thanks, /Erik
Thanks and best regards, Martin
-----Original Message----- From: Kim Barrett [mailto:kim.barrett@oracle.com] Sent: Dienstag, 29. Mai 2018 01:54 To: Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> Cc: Erik Osterlund <erik.osterlund@oracle.com<mailto:erik.osterlund@oracle.com>>; david.holmes@oracle.com<mailto:david.holmes@oracle.com>; Gustavo Bueno Romero <gromero@br.ibm.com<mailto:gromero@br.ibm.com>>; hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>; hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>; ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>; Doerr, Martin <martin.doerr@sap.com<mailto:martin.doerr@sap.com>> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 28, 2018, at 4:12 AM, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> wrote:
Hi Erik,
Thank you very much for your review.
I understood that implicit consume should not be used in the shared code. Also, I believe performance degradation would be negligible even if we use acquire.
New webrev uses memory_order_acq_rel: http://cr.openjdk.java.net/~mhorie/8154736/webrev.10 This is missing the acquire barrier on the else branch for the initial test, so fails to meet the previously described minimal requirements for even possibly being sufficient. Any analysis of weakening the CAS barriers must consider that test and successor code.
In the analysis, it’s not just the lexically nearby debugging / logging code that needs to be considered; the forwardee is being returned to caller(s) that will presumably do something with that object.
Since the whole point of this discussion is performance, any proposed change should come with performance information.
On May 29, 2018, at 6:30 AM, Doerr, Martin <martin.doerr@sap.com> wrote:
Hi Kim,
I'm trying to understand how this is related to Michihiro's change. The else path of the initial test is not affected by it AFAICS.
The value being tested and the forwardee obtained in the else branch are the result of some earlier CAS, e.g. some CAS (in this thread or some other) has already won the race and the result is seen here. Weakening the CAS fencing applies to that earlier CAS too, and so affects this code path.
Hi everybody, thank you very much for discussing this issue and helping to fix the PPC64 performance bottleneck. Thanks, Kim and Erik, for explaining by which trick we’re using the CAS’ acquire barrier in current code. Thanks, Michihiro, for explaining your current proposal which relies on consume. I’m convinced that it works for PPC and ARM (and of course for the strong memory model platforms). If I understand it correctly, acquire is desired to help compilers and the hardware barrier is not needed. The current implementation just uses our handmade inline asm code, so it’s pointless for the compilers if we use acquire or not. However, if we want to use C++11 atomics instead of our inline asm code in the future, I think memory_order_acq_rel will be recommended to avoid compiler problems. Was this the reason for the acquire or did I miss anything? From performance point of view, I think we can live with an unnecessary acquire barrier. It’s so much cheaper than a full fence. So if this is the only remaining issue, I think we could just add it. Btw. there are places in shared code where we rely on consume. I had found one and added a comment some time ago (compiledMethod.hpp): “Note: _exception_cache may be read concurrently. We rely on memory_order_consume here.” Seems to work correctly with volatile fields. Best regards, Martin From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Erik Osterlund Sent: Montag, 28. Mai 2018 08:48 To: Michihiro Horie <HORIE@jp.ibm.com> Cc: hotspot-dev@openjdk.java.net; Kim Barrett <kim.barrett@oracle.com>; Gustavo Bueno Romero <gromero@br.ibm.com>; ppc-aix-port-dev@openjdk.java.net; hotspot-gc-dev@openjdk.java.net; david.holmes@oracle.com Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 Hi Michihiro, In your analysis, you state that the failing CAS path today already relies on implicit consume ordering as reading forwardee() after the failed CAS is missing acquire and hence accesses into the new reloaded forwardee would rely on (implicit) data dependencies to the reloaded forwardee. That part of the analysis seems wrong to me. Since today even a failed CAS has acquire semantics (and stronger), and the reloaded forwardee always has the same value as was observed in the failed cas (in this context), all data dependency requirements to the reloaded forwardee are therefore no longer needed or relied upon. We do not use implicit consume in the shared C++ code. If you find any instances of that, it is a bug and should be purged with fire. Even explicit consume is currently strongly discouraged. Implicit consume is unreliable, especially in a project with many platforms. If you insist on using more fragile semantics that are known to be unreliable, I would like to at least know what measurable performance difference you observe between the semantics Kim proposed, compared to the elided acquire variant you insist on. My gut feeling tells me that double sync is very intrusive, but an isync scheduled almost immediately after an lwsync, should be significantly less intrusive. Thanks, /Erik On 28 May 2018, at 03:28, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> wrote: Hi Kim,
I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of
122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 }
We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Thank you for the discussions on your side with the GC team. I summarized the point on why my change works as follows. Hope we are on the same page with this. 1. Current implementation PSPromotionManager::copy_to_survivor_space is used to move live objects to a different location. It uses a forwarding technique and allows multiple threads to compete for performing the copy step. The first thread succeeds in installing its copy in the old object as forwardee. Other threads may need to discard their copy and use the one generated by the first thread which has won the race. Written program order: (1) create new_obj as copy of obj (2) full fence (3) CAS to set the forwardee with new_obj (4) full fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on When thread0 succeeds in CAS at (3), the copied new_obj by thread0 must be accessible from thread1 at (6). (2) guarantees the order of (1) and (3), although it is stronger than needed for the purpose of ensuring a consistent view of copied new_obj from thread1. (5), (6), and (7) must be executed after (3). Apparently, (4) looks guranteeing the order, although it is redundant. The order of (6) and (7) is guaranteed by consume. (5) and (6) are on different control paths. (5) and (7): Thread0 owns new_obj when CAS succeeded and can access it without barrier. 2. Proposed change Written program order: (1) create new_obj as copy of obj (2) release fence (3) CAS to set the forwardee with new_obj (4) no fence (5) access to the new_obj's field if CAS succeeds (6) access to the forwardee with "o->forwardee()" if CAS fails (7) access to the forwardee's field if debugging is on Release fence at (2) is sufficient to make the copied new_obj accessible from a thread that fails in CAS. No fence at (4) is acceptable because it is redundant. The order of (5), (6), and (7) is the same as the current implementation. It is not affected by the proposed change. 3. Reason why this is sufficient Memory coherence guarantees that all the threads share a consistent view on the access to the same memory location, which is "_mark" in the target code. Thread0 writes the "_mark" when it succeeds in CAS at (3) and thread1 reads the "_mark" when it failes in CAS at (3). Thread1 also reads the "_mark" by invoking "o->forwardee()" at (6). (See CoRR1 in Section 8 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf) Also, compilers do not speculatively load "o->forwardee()" at (6) before the CAS at (3). This is ensured by the integrated compiler barriers (clobber "memory" in the volatile inline asm code). And it is also prevented because "_mark" is declared volatile. Best regards, -- Michihiro, IBM Research - Tokyo <graycol.gif>Kim Barrett ---2018/05/26 01:01:40---> On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com<mailto:martin.doerr@sap.com>> wrote: > From: Kim Barrett <kim.barrett@oracle.com<mailto:kim.barrett@oracle.com>> To: "Doerr, Martin" <martin.doerr@sap.com<mailto:martin.doerr@sap.com>> Cc: Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>>, "hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>" <hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>>, "hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>" <hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>>, Gustavo Bueno Romero <gromero@br.ibm.com<mailto:gromero@br.ibm.com>>, "ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>" <ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>>, "david.holmes@oracle.com<mailto:david.holmes@oracle.com>" <david.holmes@oracle.com<mailto:david.holmes@oracle.com>> Date: 2018/05/26 01:01 Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64 ________________________________
On May 22, 2018, at 12:16 PM, Doerr, Martin <martin.doerr@sap.com<mailto:martin.doerr@sap.com>> wrote:
Hi Kim,
I can't see how a new implicit consume is introduced by Michihiro's change. He just explained how the existing code works.
If implicit consume has been rejected the current code is wrong: "new_obj = o->forwardee();" would need some kind of barrier before using the new_obj.
But this issue is not related to what Michihiro wants to change AFAICS.
The current full-fence CAS guarantees the stores into the new forwardee installed by the CAS happen before loads from that object after the CAS. Algorithmically, o->forwardee() is guaranteed to be the same object as was returned by the CAS. Hence, loads from the forwardee are being ordered by the fenced CAS. I've discussed this with others on the GC team; we think the minimal required barriers are CAS with memory_order_acq_rel, plus an acquire barrier on the else branch of 122 if (!test_mark->is_marked()) { ... 261 } else { 262 assert(o->is_forwarded(), "Sanity"); 263 new_obj = o->forwardee(); 264 } We've not done enough analysis to show this is sufficient, but we think anything weaker is not sufficient for shared code.
Best regards, Martin
-----Original Message----- From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces@openjdk.java.net] On Behalf Of Kim Barrett Sent: Montag, 21. Mai 2018 06:00 To: Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> Cc: hotspot-dev@openjdk.java.net<mailto:hotspot-dev@openjdk.java.net>; hotspot-gc-dev@openjdk.java.net<mailto:hotspot-gc-dev@openjdk.java.net>; Gustavo Bueno Romero <gromero@br.ibm.com<mailto:gromero@br.ibm.com>>; ppc-aix-port-dev@openjdk.java.net<mailto:ppc-aix-port-dev@openjdk.java.net>; david.holmes@oracle.com<mailto:david.holmes@oracle.com> Subject: Re: RFR(M): 8154736: enhancement of cmpxchg and copy_to_survivor for ppc64
On May 18, 2018, at 5:12 PM, Michihiro Horie <HORIE@jp.ibm.com<mailto:HORIE@jp.ibm.com>> wrote:
Dear all,
I update the webrev: http://cr.openjdk.java.net/~mhorie/8154736/webrev.09/
With the release barrier before the CAS, new_obj can be observed from other threads. If the CAS succeeds, the current thread can use new_obj without barriers. If the CAS fails, "o->forwardee()" is ordered with respect to CAS by accessing the same memory location "_mark", so no barriers needed. The order of (1) access to the forwardee and (2) access to forwardee's fields is preserved due to Release-Consume ordering on supported platforms. (The ordering between "new_obj = o->forwardee();" and logging or other usages is not changed.)
Regarding the maintainability, the requirement is CAS(memory_order_release) as Release-Consume to be consistent with C++11. This requirement is necessary when a weaker platform like DEC Alpha is to be supported. On currently supported platforms, code change can be safe if the code meats this requirement (and the order of (1) access to the forwardee and (2) access to forwardee's fields is the natural way of coding).
Relying on implicit consume has been been discussed and rejected, in the earlier thread on this topic and I think elsewhere too.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/0215...
participants (7)
-
David Holmes
-
Doerr, Martin
-
Erik Osterlund
-
Gustavo Romero
-
John Paul Adrian Glaubitz
-
Kim Barrett
-
Michihiro Horie