RFR(XL): 8224675: Late GC barrier insertion for ZGC

Stuart Monteith stuart.monteith at linaro.org
Wed May 29 16:06:42 UTC 2019


Hello Nils,
   I've tested with JTReg and JCStress, with and without
-XX:+UseBarriersForVolatile. I found one error, which was a typo on
my part in z_compareAndExchangeP. I've fixed it and JCStress is now
clean.
Your changes are good as far as my testing is concerned. I'm
continuing to look at the code that is generated.
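
For reference, the shape I'm looking for around each oop load is
conceptually the following. This is only a C++-style sketch: the mask and
slow-path names below are placeholders rather than the real HotSpot
symbols, and the actual output is of course aarch64 assembly.

  #include <cstdint>

  // Sketch of the ZGC load barrier fast path (placeholder names only).
  extern uintptr_t bad_mask;                      // thread-local address bad mask
  uintptr_t load_barrier_slow_path(uintptr_t ref, uintptr_t* addr);

  uintptr_t barriered_load(uintptr_t* addr) {
    uintptr_t ref = *addr;                        // the load itself
    if ((ref & bad_mask) != 0) {                  // fast-path check
      ref = load_barrier_slow_path(ref, addr);    // out-of-line slow path,
                                                  // may self-heal *addr
    }
    return ref;                                   // now a good reference
  }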


diff -r d48227fa72cf src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad
--- a/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Wed May 29 14:53:47 2019 +0100
+++ b/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Wed May 29 16:39:59 2019 +0100
@@ -56,6 +56,8 @@
 //
 instruct loadBarrierSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
   match(Set dst (LoadBarrierSlowReg mem));
+  predicate(!n->as_LoadBarrierSlowReg()->is_weak());
+
   effect(DEF dst, KILL cr);

   format %{"LoadBarrierSlowReg $dst, $mem" %}
@@ -70,7 +72,8 @@
 // Execute ZGC load barrier (weak) slow path
 //
 instruct loadBarrierWeakSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
-  match(Set dst (LoadBarrierWeakSlowReg mem));
+  match(Set dst (LoadBarrierSlowReg mem));
+  predicate(n->as_LoadBarrierSlowReg()->is_weak());

   effect(DEF dst, KILL cr);

@@ -81,3 +84,60 @@
   %}
   ins_pipe(pipe_slow);
 %}
+
+
+// Specialized versions of compareAndExchangeP that add a keepalive that is consumed
+// but doesn't affect output.
+
+instruct z_compareAndExchangeP(iRegPNoSp res, indirect mem,
+                               iRegP oldval, iRegP newval, iRegP keepalive,
+                               rFlagsReg cr) %{
+  match(Set res (ZCompareAndExchangeP (Binary mem keepalive) (Binary oldval newval)));
+  ins_cost(2 * VOLATILE_REF_COST);
+  effect(TEMP_DEF res, KILL cr);
+  format %{
+    "cmpxchg $res = $mem, $oldval, $newval\t# (ptr, weak) if $mem ==
$oldval then $mem <-- $newval"
+  %}
+  ins_encode %{
+    __ cmpxchg($mem$$Register, $oldval$$Register, $newval$$Register,
+               Assembler::xword, /*acquire*/ false, /*release*/ true,
+               /*weak*/ false, $res$$Register);
+  %}
+  ins_pipe(pipe_slow);
+%}
+
+instruct z_compareAndSwapP(iRegINoSp res,
+                           indirect mem,
+                           iRegP oldval, iRegP newval, iRegP keepalive,
+                            rFlagsReg cr) %{
+
+  match(Set res (ZCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
+  match(Set res (ZWeakCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
+
+  ins_cost(2 * VOLATILE_REF_COST);
+
+  effect(KILL cr);
+
+ format %{
+    "cmpxchg $mem, $oldval, $newval\t# (ptr) if $mem == $oldval then
$mem <-- $newval"
+    "cset $res, EQ\t# $res <-- (EQ ? 1 : 0)"
+ %}
+
+ ins_encode(aarch64_enc_cmpxchg(mem, oldval, newval),
+            aarch64_enc_cset_eq(res));
+
+  ins_pipe(pipe_slow);
+%}
+
+
+instruct z_get_and_setP(indirect mem, iRegP newv, iRegPNoSp prev,
+                        iRegP keepalive) %{
+  match(Set prev (ZGetAndSetP mem (Binary newv keepalive)));
+
+  ins_cost(2 * VOLATILE_REF_COST);
+  format %{ "atomic_xchg  $prev, $newv, [$mem]" %}
+  ins_encode %{
+    __ atomic_xchg($prev$$Register, $newv$$Register, as_Register($mem$$base));
+  %}
+  ins_pipe(pipe_serial);
+%}

On Fri, 24 May 2019 at 15:37, Stuart Monteith
<stuart.monteith at linaro.org> wrote:
>
> That's interesting, and seems beneficial for ZGC on aarch64, where
> before your patch the ZGC load barriers broke assumptions the
> memory-fence optimisation code was making.
>
> I'm currently testing your patch, with the following put on top for aarch64:
>
> diff -r ead187ebe684 src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad
> --- a/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Fri May 24 13:11:48 2019 +0100
> +++ b/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Fri May 24 15:34:17 2019 +0100
> @@ -56,6 +56,8 @@
>  //
>  instruct loadBarrierSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
>    match(Set dst (LoadBarrierSlowReg mem));
> +  predicate(!n->as_LoadBarrierSlowReg()->is_weak());
> +
>    effect(DEF dst, KILL cr);
>
>    format %{"LoadBarrierSlowReg $dst, $mem" %}
> @@ -70,7 +72,8 @@
>  // Execute ZGC load barrier (weak) slow path
>  //
>  instruct loadBarrierWeakSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
> -  match(Set dst (LoadBarrierWeakSlowReg mem));
> +  match(Set dst (LoadBarrierSlowReg mem));
> +  predicate(n->as_LoadBarrierSlowReg()->is_weak());
>
>    effect(DEF dst, KILL cr);
>
> @@ -81,3 +84,60 @@
>    %}
>    ins_pipe(pipe_slow);
>  %}
> +
> +
> +// Specialized versions of compareAndExchangeP that add a keepalive that is consumed
> +// but doesn't affect output.
> +
> +instruct z_compareAndExchangeP(iRegPNoSp res, indirect mem,
> +                               iRegP oldval, iRegP newval, iRegP keepalive,
> +                               rFlagsReg cr) %{
> +  match(Set oldval (ZCompareAndExchangeP (Binary mem keepalive) (Binary oldval newval)));
> +  ins_cost(2 * VOLATILE_REF_COST);
> +  effect(TEMP_DEF res, KILL cr);
> +  format %{
> +    "cmpxchg $res = $mem, $oldval, $newval\t# (ptr, weak) if $mem ==
> $oldval then $mem <-- $newval"
> +  %}
> +  ins_encode %{
> +    __ cmpxchg($mem$$Register, $oldval$$Register, $newval$$Register,
> +               Assembler::xword, /*acquire*/ false, /*release*/ true,
> +               /*weak*/ false, $res$$Register);
> +  %}
> +  ins_pipe(pipe_slow);
> +%}
> +
> +instruct z_compareAndSwapP(iRegINoSp res,
> +                           indirect mem,
> +                           iRegP oldval, iRegP newval, iRegP keepalive,
> +                            rFlagsReg cr) %{
> +
> +  match(Set res (ZCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
> +  match(Set res (ZWeakCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
> +
> +  ins_cost(2 * VOLATILE_REF_COST);
> +
> +  effect(KILL cr);
> +
> + format %{
> +    "cmpxchg $mem, $oldval, $newval\t# (ptr) if $mem == $oldval then
> $mem <-- $newval"
> +    "cset $res, EQ\t# $res <-- (EQ ? 1 : 0)"
> + %}
> +
> + ins_encode(aarch64_enc_cmpxchg(mem, oldval, newval),
> +            aarch64_enc_cset_eq(res));
> +
> +  ins_pipe(pipe_slow);
> +%}
> +
> +
> +instruct z_get_and_setP(indirect mem, iRegP newv, iRegPNoSp prev,
> +                        iRegP keepalive) %{
> +  match(Set prev (ZGetAndSetP mem (Binary newv keepalive)));
> +
> +  ins_cost(2 * VOLATILE_REF_COST);
> +  format %{ "atomic_xchg  $prev, $newv, [$mem]" %}
> +  ins_encode %{
> +    __ atomic_xchg($prev$$Register, $newv$$Register, as_Register($mem$$base));
> +  %}
> +  ins_pipe(pipe_serial);
> +%}
> \ No newline at end of file
>
> On Thu, 23 May 2019 at 15:38, Nils Eliasson <nils.eliasson at oracle.com> wrote:
> >
> > Hi,
> >
> > In ZGC we use load barriers on references. In the original
> > implementation these were added as macro nodes at parse time. The load
> > barrier node consumes and produces control flow so that it can be
> > lowered into a check with a slow path late. The load barrier nodes are
> > fixed in the control flow, and extensions to different optimizations
> > are needed to move the barriers out of loops and past other unrelated
> > control flow.
> >
> > With this patch the barriers are instead added after the loop
> > optimizations, before macro node expansion. This makes the entire
> > pipeline until that point oblivious to the barriers. A dump of the IR
> > with ZGC or EpsilonGC will be basically identical at that point, and
> > the difference compared to SerialGC or ParallelGC, which use write
> > barriers, is really small.
> >
> > Benefits
> >
> > - A major complexity reduction. One can reason about and implement loop
> > optimization without caring about the barriers. The escape analysis
> > doesn't need to know about the barriers. Loads float freely like they
> > are supposed to.
> >
> > - Fewer nodes early. Inlining becomes more deterministic. A
> > barrier-heavy GC will not run into node limits earlier. Also,
> > node-limit-bounded optimizations like unrolling and peeling will not
> > be penalized by barriers.
> >
> > - Better test coverage, or reduced testing cost, since the same
> > optimization doesn't need to be verified with every GC.
> >
> > - Better control over where barriers end up. It is trivial to
> > guarantee that a load and its barrier are not separated by a safepoint.
> >
> > Design
> >
> > The implementation uses an extra phase that piggy-backs on
> > PhaseIdealLoop, which provides control and dominator information for
> > all loads. This extra phase is needed because we need to splice the
> > control flow when adding the load barriers.
> >
> > Barriers are inserted on the load nodes in post order (any successor
> > first). This guarantees that the dominator information above every
> > insertion point is correct. This is also important within blocks: two
> > loads in the same block can float in relation to each other, and the
> > addition of barriers serializes their order. Any def-use relationship
> > is upheld by expanding them in post order.
> >
> > Barrier insertion is done in stages. In the first stage a single macro
> > node that represents the barrier is added, with all the dependencies
> > that are required. In the macro expansion phase the barrier nodes are
> > expanded into their final shape, adding nodes that represent the
> > conditional load barrier check. (Write barriers in other GCs could
> > possibly be expanded here directly.)
> >
> > All the barriers that are needed for unsafe reference operations (cas,
> > swap, cmpx) are also expanded late. They already have control flow, so
> > the expansion is straightforward.
> >
> > The barriers for the unsafe reference operations (cas, getandset,
> > cmpx) have also been simplified. The cas-load-cas dance has been
> > replaced by a pre-load. The pre-load is a load with a barrier that is
> > kept alive by an extra (required) edge on the unsafe primitive nodes
> > (specialized as ZCompareAndSwap, ZGetAndSet, ZCompareAndExchange).
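> >
> > Conceptually (a C++-style sketch only; the names below are
> > placeholders, not the actual C2 nodes or runtime entry points) the new
> > shape for a reference cas is:
> >
> >   #include <atomic>
> >   #include <cstdint>
> >
> >   // Placeholder for the barriered pre-load; the real barrier tests the
> >   // loaded value against the thread-local bad mask and may self-heal
> >   // the field through a slow path.
> >   intptr_t barriered_pre_load(std::atomic<intptr_t>* field) {
> >     return field->load(std::memory_order_relaxed);
> >   }
> >
> >   // Model of ZCompareAndSwapP: a pre-load whose result is only used as
> >   // a keepalive, followed by a plain cas on the (now healed) field.
> >   bool z_compare_and_swap(std::atomic<intptr_t>* field,
> >                           intptr_t expected, intptr_t newval) {
> >     intptr_t keepalive = barriered_pre_load(field);  // consumed, but
> >     (void)keepalive;                                 // doesn't affect output
> >     return field->compare_exchange_strong(expected, newval);
> >   }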
> >
> > One challenge that was encountered early and that has caused
> > considerable work is that nodes (like loads) can end up between calls
> > and their catch projections. This is usually handled after matching, in
> > PhaseCFG::call_catch_cleanup, where the nodes after the call are cloned
> > to all catch blocks. At this stage they are in an ordered list, so that
> > is a straightforward process. For late barrier insertion we need to
> > splice in control earlier, before matching, and control flow between
> > calls and catches is not allowed. This requires us to add a
> > transformation pass where all loads and their dependent instructions are
> > cloned out to the catch blocks before we can start splicing in control
> > flow. This transformation doesn't replace the legacy call_catch_cleanup
> > fully, but it could be a future goal.
> >
> > In the original barrier implementation there were two different load
> > barrier implementations: the basic and the optimized. With the new
> > approach to barriers on unsafe, the basic is no longer required and has
> > been removed. (It provided options for skipping the self-healing, and
> > passed the ref in a register, guaranteeing that the oop wasn't reloaded.)
> >
> > The wart that was fixup_partial_loads in zHeap has also been made
> > redundant.
> >
> > Dominating barriers are no longer removed on weak loads. Weak barriers
> > don't guarantee self-healing.
> >
> > Follow up work:
> >
> > - Consolidate all uses of GrowableArray::insert_sorted to use the new
> > version
> >
> > - Refactor the phases. There is a lot of simplification and
> > verification that can be done with more well-defined phases.
> >
> > - Simplify the remaining barrier optimizations. There might still be
> > code paths that are no longer needed.
> >
> >
> > Testing:
> >
> > Hotspot tier 1-6, CTW, jcstress, micros, runthese, kitchensink, and then
> > some. All with -XX:+ZVerifyViews.
> >
> > Bug: https://bugs.openjdk.java.net/browse/JDK-8224675
> >
> > Webrev: http://cr.openjdk.java.net/~neliasso/8224675/webrev.01/
> >
> >
> > Please review,
> >
> > Regards,
> >
> > Nils
> >

