RFR(XL): 8224675: Late GC barrier insertion for ZGC

Nils Eliasson nils.eliasson at oracle.com
Fri May 31 14:40:24 UTC 2019


Great!

Nice to see that the patch is working well for you.

The latest webrev is webrev.03. It has all the fixes mentioned in this 
thread, and is based upon jdk/jdk today after my fix for "8224538: 
LoadBarrierNode::common_barrier must check address"

Best regards,

Nils

On 2019-05-31 16:31, Stuart Monteith wrote:
> Incidentally, my plan is to incorporate the changes necessary for
> aarch64 for "Late GC barrier insertion for ZGC" into my patch for
> implementing ZGC on aarch64, and hopefully get that merged for JDK13.
>
> BR,
>     Stuart
>
> On Wed, 29 May 2019 at 17:06, Stuart Monteith
> <stuart.monteith at linaro.org> wrote:
>> Hello Nils,
>>     I've tested with JTReg and JCStress, with and without
>> -XX:+UseBarriersForVolatile. I found one error, which was a typo on
>> my part in z_compareAndExchangeP. I've fixed it and JCStress is now
>> clean.
>> Your changes are good as far as my testing is concerned. I'm
>> continuing to look at the code that is generated.
>>
>>
>> diff -r d48227fa72cf src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad
>> --- a/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Wed May 29 14:53:47 2019 +0100
>> +++ b/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Wed May 29 16:39:59 2019 +0100
>> @@ -56,6 +56,8 @@
>>   //
>>   instruct loadBarrierSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
>>     match(Set dst (LoadBarrierSlowReg mem));
>> +  predicate(!n->as_LoadBarrierSlowReg()->is_weak());
>> +
>>     effect(DEF dst, KILL cr);
>>
>>     format %{"LoadBarrierSlowReg $dst, $mem" %}
>> @@ -70,7 +72,8 @@
>>   // Execute ZGC load barrier (weak) slow path
>>   //
>>   instruct loadBarrierWeakSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
>> -  match(Set dst (LoadBarrierWeakSlowReg mem));
>> +  match(Set dst (LoadBarrierSlowReg mem));
>> +  predicate(n->as_LoadBarrierSlowReg()->is_weak());
>>
>>     effect(DEF dst, KILL cr);
>>
>> @@ -81,3 +84,60 @@
>>     %}
>>     ins_pipe(pipe_slow);
>>   %}
>> +
>> +
>> +// Specialized versions of compareAndExchangeP that adds a keepalive that is consumed
>> +// but doesn't affect output.
>> +
>> +instruct z_compareAndExchangeP(iRegPNoSp res, indirect mem,
>> +                               iRegP oldval, iRegP newval, iRegP keepalive,
>> +                               rFlagsReg cr) %{
>> +  match(Set res (ZCompareAndExchangeP (Binary mem keepalive) (Binary oldval newval)));
>> +  ins_cost(2 * VOLATILE_REF_COST);
>> +  effect(TEMP_DEF res, KILL cr);
>> +  format %{
>> +    "cmpxchg $res = $mem, $oldval, $newval\t# (ptr, weak) if $mem ==
>> $oldval then $mem <-- $newval"
>> +  %}
>> +  ins_encode %{
>> +    __ cmpxchg($mem$$Register, $oldval$$Register, $newval$$Register,
>> +               Assembler::xword, /*acquire*/ false, /*release*/ true,
>> +               /*weak*/ false, $res$$Register);
>> +  %}
>> +  ins_pipe(pipe_slow);
>> +%}
>> +
>> +instruct z_compareAndSwapP(iRegINoSp res,
>> +                           indirect mem,
>> +                           iRegP oldval, iRegP newval, iRegP keepalive,
>> +                            rFlagsReg cr) %{
>> +
>> +  match(Set res (ZCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
>> +  match(Set res (ZWeakCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
>> +
>> +  ins_cost(2 * VOLATILE_REF_COST);
>> +
>> +  effect(KILL cr);
>> +
>> + format %{
>> +    "cmpxchg $mem, $oldval, $newval\t# (ptr) if $mem == $oldval then
>> $mem <-- $newval"
>> +    "cset $res, EQ\t# $res <-- (EQ ? 1 : 0)"
>> + %}
>> +
>> + ins_encode(aarch64_enc_cmpxchg(mem, oldval, newval),
>> +            aarch64_enc_cset_eq(res));
>> +
>> +  ins_pipe(pipe_slow);
>> +%}
>> +
>> +
>> +instruct z_get_and_setP(indirect mem, iRegP newv, iRegPNoSp prev,
>> +                        iRegP keepalive) %{
>> +  match(Set prev (ZGetAndSetP mem (Binary newv keepalive)));
>> +
>> +  ins_cost(2 * VOLATILE_REF_COST);
>> +  format %{ "atomic_xchg  $prev, $newv, [$mem]" %}
>> +  ins_encode %{
>> +    __ atomic_xchg($prev$$Register, $newv$$Register, as_Register($mem$$base));
>> +  %}
>> +  ins_pipe(pipe_serial);
>> +%}
>>
>> On Fri, 24 May 2019 at 15:37, Stuart Monteith
>> <stuart.monteith at linaro.org> wrote:
>>> That's interesting, and seems beneficial for ZGC on aarch64, where
>>> before your patch the ZGC load barriers broke assumptions the
>>> memory-fence optimisation code was making.
>>>
>>> I'm currently testing your patch, with the following put on top for aarch64:
>>>
>>> diff -r ead187ebe684 src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad
>>> --- a/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Fri May 24 13:11:48 2019 +0100
>>> +++ b/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad Fri May 24 15:34:17 2019 +0100
>>> @@ -56,6 +56,8 @@
>>>   //
>>>   instruct loadBarrierSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
>>>     match(Set dst (LoadBarrierSlowReg mem));
>>> +  predicate(!n->as_LoadBarrierSlowReg()->is_weak());
>>> +
>>>     effect(DEF dst, KILL cr);
>>>
>>>     format %{"LoadBarrierSlowReg $dst, $mem" %}
>>> @@ -70,7 +72,8 @@
>>>   // Execute ZGC load barrier (weak) slow path
>>>   //
>>>   instruct loadBarrierWeakSlowReg(iRegP dst, memory mem, rFlagsReg cr) %{
>>> -  match(Set dst (LoadBarrierWeakSlowReg mem));
>>> +  match(Set dst (LoadBarrierSlowReg mem));
>>> +  predicate(n->as_LoadBarrierSlowReg()->is_weak());
>>>
>>>     effect(DEF dst, KILL cr);
>>>
>>> @@ -81,3 +84,60 @@
>>>     %}
>>>     ins_pipe(pipe_slow);
>>>   %}
>>> +
>>> +
>>> +// Specialized versions of compareAndExchangeP that adds a keepalive that is consumed
>>> +// but doesn't affect output.
>>> +
>>> +instruct z_compareAndExchangeP(iRegPNoSp res, indirect mem,
>>> +                               iRegP oldval, iRegP newval, iRegP keepalive,
>>> +                               rFlagsReg cr) %{
>>> +  match(Set oldval (ZCompareAndExchangeP (Binary mem keepalive) (Binary oldval newval)));
>>> +  ins_cost(2 * VOLATILE_REF_COST);
>>> +  effect(TEMP_DEF res, KILL cr);
>>> +  format %{
>>> +    "cmpxchg $res = $mem, $oldval, $newval\t# (ptr, weak) if $mem ==
>>> $oldval then $mem <-- $newval"
>>> +  %}
>>> +  ins_encode %{
>>> +    __ cmpxchg($mem$$Register, $oldval$$Register, $newval$$Register,
>>> +               Assembler::xword, /*acquire*/ false, /*release*/ true,
>>> +               /*weak*/ false, $res$$Register);
>>> +  %}
>>> +  ins_pipe(pipe_slow);
>>> +%}
>>> +
>>> +instruct z_compareAndSwapP(iRegINoSp res,
>>> +                           indirect mem,
>>> +                           iRegP oldval, iRegP newval, iRegP keepalive,
>>> +                            rFlagsReg cr) %{
>>> +
>>> +  match(Set res (ZCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
>>> +  match(Set res (ZWeakCompareAndSwapP (Binary mem keepalive) (Binary oldval newval)));
>>> +
>>> +  ins_cost(2 * VOLATILE_REF_COST);
>>> +
>>> +  effect(KILL cr);
>>> +
>>> + format %{
>>> +    "cmpxchg $mem, $oldval, $newval\t# (ptr) if $mem == $oldval then
>>> $mem <-- $newval"
>>> +    "cset $res, EQ\t# $res <-- (EQ ? 1 : 0)"
>>> + %}
>>> +
>>> + ins_encode(aarch64_enc_cmpxchg(mem, oldval, newval),
>>> +            aarch64_enc_cset_eq(res));
>>> +
>>> +  ins_pipe(pipe_slow);
>>> +%}
>>> +
>>> +
>>> +instruct z_get_and_setP(indirect mem, iRegP newv, iRegPNoSp prev,
>>> +                        iRegP keepalive) %{
>>> +  match(Set prev (ZGetAndSetP mem (Binary newv keepalive)));
>>> +
>>> +  ins_cost(2 * VOLATILE_REF_COST);
>>> +  format %{ "atomic_xchg  $prev, $newv, [$mem]" %}
>>> +  ins_encode %{
>>> +    __ atomic_xchg($prev$$Register, $newv$$Register, as_Register($mem$$base));
>>> +  %}
>>> +  ins_pipe(pipe_serial);
>>> +%}
>>> \ No newline at end of file
>>>
>>> On Thu, 23 May 2019 at 15:38, Nils Eliasson <nils.eliasson at oracle.com> wrote:
>>>> Hi,
>>>>
>>>> In ZGC we use load barriers on references. In the original
>>>> implementation these were added as macro nodes at parse time. The load
>>>> barrier node consumes and produces control flow so that it can be
>>>> lowered late into a check with a slow path. The load barrier nodes are
>>>> fixed in the control flow, and extensions to several optimizations are
>>>> needed to move the barriers out of loops and past other unrelated
>>>> control flow.
>>>>
>>>> With this patch the barriers are instead added after the loop
>>>> optimizations, before macro node expansion. This makes the entire
>>>> pipeline up to that point oblivious to the barriers. A dump of the IR
>>>> with ZGC or EpsilonGC will be basically identical at that point, and
>>>> the diff compared to SerialGC or ParallelGC, which use write barriers,
>>>> is really small.
>>>>
>>>> Benefits
>>>>
>>>> - A major complexity reduction. One can reason about and implement loop
>>>> optimizations without caring about the barriers. The escape analysis
>>>> doesn't need to know about the barriers. Loads float freely, as they
>>>> are supposed to.
>>>>
>>>> - Fewer nodes early. Inlining becomes more deterministic. A
>>>> barrier-heavy GC will not run into node limits earlier. Also,
>>>> node-limit-bounded optimizations like unrolling and peeling will not be
>>>> penalized by barriers.
>>>>
>>>> - Better test coverage, or reduced testing cost, since the same
>>>> optimization doesn't need to be verified with every GC.
>>>>
>>>> - Better control over where barriers end up. It is trivial to guarantee
>>>> that a load and its barrier are not separated by a safepoint.
>>>>
>>>> Design
>>>>
>>>> The implementation uses an extra phase that piggybacks on
>>>> PhaseIdealLoop, which provides control and dominator information for
>>>> all loads. This extra phase is needed because we must splice the
>>>> control flow when adding the load barriers.
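>>>>
>>>> (Illustrative aside, not part of the patch: a minimal stand-alone C++
>>>> model of why splicing needs up-to-date control and dominator info. The
>>>> type and function names are made up; in C2 the spliced-in node would be
>>>> the If/Region diamond of the barrier check.)
>>>>
>>>> struct ToyCtrlNode { ToyCtrlNode* ctrl_in; };  // stand-in for a control node
>>>>
>>>> // Splice a new control node (the barrier check, simplified to a single
>>>> // node) between a load and its old control. Everything that used to be
>>>> // immediately dominated by old_ctrl may now be dominated by the new
>>>> // node instead, which is why the pass keeps PhaseIdealLoop-style
>>>> // dominator information current while it inserts barriers.
>>>> ToyCtrlNode* splice_barrier_above(ToyCtrlNode* load) {
>>>>   ToyCtrlNode* old_ctrl = load->ctrl_in;
>>>>   ToyCtrlNode* check = new ToyCtrlNode{old_ctrl};  // check inherits old control
>>>>   load->ctrl_in = check;                           // load now hangs off the check
>>>>   return check;
>>>> }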
>>>>
>>>> Barriers are inserted on the load nodes in post order (any successor
>>>> first). This guarantees that the dominator information above every
>>>> insertion point is correct. This is also important within blocks: two
>>>> loads in the same block can float in relation to each other, and the
>>>> addition of barriers serializes their order. Any def-use relationship
>>>> is upheld by expanding them in post order.
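>>>>
>>>> (Illustrative aside, not from the patch: a small self-contained C++
>>>> sketch of what "post order, any successor first" means for the
>>>> insertion order. ToyNode and the helper names are made up.)
>>>>
>>>> #include <unordered_set>
>>>> #include <vector>
>>>>
>>>> struct ToyNode {                        // stand-in for a C2 node
>>>>   std::vector<ToyNode*> uses;           // def-use successors
>>>>   bool is_ref_load = false;
>>>> };
>>>>
>>>> // Visit all uses before the node itself, so that when a barrier is
>>>> // spliced above a load, everything depending on that load has already
>>>> // been handled and the dominator info above the insertion point stays
>>>> // valid.
>>>> static void post_order(ToyNode* n, std::unordered_set<ToyNode*>& seen,
>>>>                        std::vector<ToyNode*>& out) {
>>>>   if (!seen.insert(n).second) return;   // already visited
>>>>   for (ToyNode* u : n->uses) post_order(u, seen, out);
>>>>   if (n->is_ref_load) out.push_back(n); // successors are already in 'out'
>>>> }
>>>>
>>>> std::vector<ToyNode*> barrier_insertion_order(const std::vector<ToyNode*>& roots) {
>>>>   std::unordered_set<ToyNode*> seen;
>>>>   std::vector<ToyNode*> order;
>>>>   for (ToyNode* r : roots) post_order(r, seen, order);
>>>>   return order;                         // process loads in this order
>>>> }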
>>>>
>>>> Barrier insertion is done in stages. In the first stage a single macro
>>>> node that represents the barrier is added, with all the dependencies
>>>> that are required. In the macro expansion phase the barrier nodes are
>>>> expanded into their final shape, adding the nodes that represent the
>>>> conditional load barrier check. (Write barriers in other GCs could
>>>> possibly be expanded here directly.)
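>>>>
>>>> (Illustrative aside: the final shape corresponds, roughly, to the
>>>> following C-style sketch of the conditional load barrier check. This is
>>>> not the actual IR or generated code; barrier_slow_path is a made-up
>>>> placeholder for the runtime stub, while the bad-mask test is the real
>>>> ZGC concept.)
>>>>
>>>> #include <cstdint>
>>>>
>>>> typedef void* oop;                       // stand-in for HotSpot's oop
>>>> extern uintptr_t ZAddressBadMask;        // current bad-bit mask
>>>> oop barrier_slow_path(oop ref, volatile oop* addr);  // placeholder stub
>>>>
>>>> // Shape of the expanded barrier: load, test against the bad mask, and
>>>> // only call the slow path when the test fires. A strong barrier also
>>>> // self-heals *addr in the slow path.
>>>> oop load_with_barrier(volatile oop* addr) {
>>>>   oop ref = *addr;                             // the original load
>>>>   if ((uintptr_t)ref & ZAddressBadMask) {      // conditional check
>>>>     ref = barrier_slow_path(ref, addr);        // slow path, heals field
>>>>   }
>>>>   return ref;
>>>> }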
>>>>
>>>> All the barriers that are needed for unsafe reference operations (cas,
>>>> swap, cmpx) are also expanded late. They already have control flow, so
>>>> the expansion is straightforward.
>>>>
>>>> The barriers for the unsafe reference operations (cas, getandset, cmpx)
>>>> have also been simplified. The cas-load-cas dance has been replaced by
>>>> a pre-load. The pre-load is a load with a barrier that is kept alive by
>>>> an extra (required) edge on the unsafe primitive nodes (specialized as
>>>> ZCompareAndSwap, ZGetAndSet, ZCompareAndExchange).
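>>>>
>>>> (Illustrative aside: semantically the pre-load scheme amounts to
>>>> something like the sketch below, reusing load_with_barrier from the
>>>> sketch above. atomic_cas is a made-up placeholder; the "keepalive"
>>>> value corresponds to the extra edge that the Z* nodes, and the aarch64
>>>> AD rules earlier in this thread, consume without it affecting the
>>>> result.)
>>>>
>>>> bool atomic_cas(volatile oop* addr, oop expected, oop new_value);  // placeholder
>>>>
>>>> // Conceptual semantics of ZCompareAndSwapP with a pre-load: the field
>>>> // is loaded through the barrier first (healing it if needed), and the
>>>> // barriered value is kept alive across the CAS so the load cannot be
>>>> // optimized away or separated from it.
>>>> bool z_compare_and_swap(volatile oop* addr, oop expected, oop new_value) {
>>>>   oop keepalive = load_with_barrier(addr);  // pre-load with barrier
>>>>   (void)keepalive;                          // consumed, not part of the result
>>>>   return atomic_cas(addr, expected, new_value);
>>>> }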
>>>>
>>>> One challenge that was encountered early, and that has caused
>>>> considerable work, is that nodes (like loads) can end up between calls
>>>> and their catch projections. This is usually handled after matching, in
>>>> PhaseCFG::call_catch_cleanup, where the nodes after the call are cloned
>>>> to all catch blocks. At that stage they are in an ordered list, so it
>>>> is a straightforward process. For late barrier insertion we need to
>>>> splice in control earlier, before matching, and control flow between
>>>> calls and catches is not allowed. This requires us to add a
>>>> transformation pass where all loads and their dependent instructions
>>>> are cloned out to the catch blocks before we can start splicing in
>>>> control flow. This transformation doesn't replace the legacy
>>>> call_catch_cleanup fully, but that could be a future goal.
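>>>>
>>>> (Illustrative aside: a rough stand-alone model of that cloning step,
>>>> with made-up types; it is not PhaseCFG::call_catch_cleanup. Each load
>>>> pinned between a call and its catch projections gets one clone per
>>>> catch block, and the uses reachable from that catch block are rewired
>>>> to the local clone, so control can later be spliced inside the catch
>>>> block itself.)
>>>>
>>>> #include <utility>
>>>> #include <vector>
>>>>
>>>> struct ToyBlock;                        // stand-in for a basic block
>>>> struct ToyLoad {
>>>>   ToyBlock* block;                      // current placement
>>>>   std::vector<std::pair<ToyLoad**, ToyBlock*>> uses;  // (use slot, block of use)
>>>> };
>>>>
>>>> // Clone 'load' into each catch block and rewire the uses that live
>>>> // under that catch block to the clone. Afterwards nothing sits between
>>>> // the call and its catch projections, so each clone can get its
>>>> // barrier spliced in independently.
>>>> std::vector<ToyLoad*> clone_load_to_catches(
>>>>     ToyLoad* load, const std::vector<ToyBlock*>& catch_blocks,
>>>>     bool (*dominated_by)(ToyBlock* use_block, ToyBlock* catch_block)) {
>>>>   std::vector<ToyLoad*> clones;
>>>>   for (ToyBlock* cb : catch_blocks) {
>>>>     ToyLoad* clone = new ToyLoad{cb, {}};
>>>>     for (auto& use : load->uses) {
>>>>       if (dominated_by(use.second, cb)) {
>>>>         *use.first = clone;             // this use now consumes the local clone
>>>>       }
>>>>     }
>>>>     clones.push_back(clone);
>>>>   }
>>>>   return clones;
>>>> }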
>>>>
>>>> In the original barrier implementation there were two different load
>>>> barrier implementations: the basic and the optimized. With the new
>>>> approach to barriers on unsafe, the basic one is no longer required and
>>>> has been removed. (It provided options for skipping the self-healing,
>>>> and passed the ref in a register, guaranteeing that the oop wasn't
>>>> reloaded.)
>>>>
>>>> The wart that was fixup_partial_loads in zHeap has also been made
>>>> redundant.
>>>>
>>>> Dominating barriers are no longer removed on weak loads, since weak
>>>> barriers don't guarantee self-healing.
>>>>
>>>> Follow up work:
>>>>
>>>> - Consolidate all uses of GrowableArray::insert_sorted to use the new
>>>> version
>>>>
>>>> - Refactor the phases. There are a lot of simplifications and
>>>> verifications that can be done with more well-defined phases.
>>>>
>>>> - Simplify the remaining barrier optimizations. There might still be
>>>> code paths that are no longer needed.
>>>>
>>>>
>>>> Testing:
>>>>
>>>> Hotspot tier 1-6, CTW, jcstress, micros, runthese, kitchensink, and then
>>>> some. All with -XX:+ZVerifyViews.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8224675
>>>>
>>>> Webrev: http://cr.openjdk.java.net/~neliasso/8224675/webrev.01/
>>>>
>>>>
>>>> Please review,
>>>>
>>>> Regards,
>>>>
>>>> Nils
>>>>

