RFR: 8227407: ZGC: C2 loads and load barriers can get separated by safepoints

Erik Österlund erik.osterlund at oracle.com
Fri Jul 19 13:21:37 UTC 2019


Hi,

As promised, I wrote some verification code triggered right before code 
gen.

Updated webrev:
http://cr.openjdk.java.net/~eosterlund/8227407/webrev.02/

I verified that this verification code catches the issues I identified 
before my fix is applied, and doesn't trigger after my fix is applied.
This type of verification won't show you where you went wrong, but it 
will show you that there is a problem (somewhere) when it arises.
Managed to get through hs-tier1-6 with this patch (and verification enabled)

Thanks,
/Erik

On 2019-07-19 09:41, Erik Österlund wrote:
> Hi Stuart,
>
> Thank you for taking this for a spin. And yes, I think the appropriate 
> place to put in such verification code would be in
>
> bs->verify_gc_barriers(this, BarrierSetC2::BeforeCodeGen);
>
> which is called right after Compile::Optimize() and before lowering to 
> machine nodes.
> I'll give it a shot to add some verification code there.
>
> Thanks,
> /Erik
>
> On 2019-07-18 17:34, Stuart Monteith wrote:
>> Hello Erik,
>>     I'll try this against Aarch64 and let you know what I get. I'm
>> thinking the safepoints could be examined in the assembler to see if
>> they are splitting the loads and their barriers.
>> Is there a pass where we could enforce this as an invariant?
>>
>> BR,
>>     Stuart
>>
>> On Thu, 18 Jul 2019 at 15:46, Erik Österlund 
>> <erik.osterlund at oracle.com> wrote:
>>> Hi,
>>>
>>> The idea of "8224675: Late GC barrier insertion for ZGC" was to expand
>>> load barriers much later, when loads have stopped floating around. The
>>> reason we wanted to do that is so that when we expand load barriers, we
>>> won't get safepoints between loads and their load barrier.
>>>
>>> Turns out though that we still get safepoints between loads and their
>>> corresponding load barriers. This happens when a load with an expanded
>>> barrier state ends up in the same block as a safepoint.
>>>
>>> Here are reasons why this can happen that I found:
>>>
>>> 1) We don't pin the loads to their block. By that I mean make sure that
>>> it has the block as in(MemNode::Control), as well as have their
>>> _control_dependency == Pinned. Because of that, e.g. final graph
>>> reshaping of CastPP nodes, move their memory (load) uses up to the
>>> CastPP control block. That causes the load to move in to a block that
>>> may have safepoints in it, breaking the invariant. I now explicitly pin
>>> the loads to their block during late barrier injection to prevent such
>>> floating.
>>>
>>> 2) Despite pinning the loads after barrier injection, there is an IGVN
>>> transformation that explicitly moves loads above safepoints, without
>>> respecting if they are pinned. I changed the LoadNode::Ideal 
>>> function to
>>> only move loads past safepoints if they are not pinned, to prevent 
>>> that.
>>> This is the only change affecting non-ZGC code. But it seems to me that
>>> floating a pinned load above a safepoint is illegal in any execution
>>> mode and is not a ZGC-related problem. Moving for example the G1 loads
>>> that checks if marking is active past a safepoint will crash the VM.
>>>
>>> 3) Late barrier expansion isn't quite late enough. Barriers are 
>>> expanded
>>> before macro expansion. Macro nodes such as locking macro nodes may
>>> expand into a bunch of nodes, including safepoints. I moved late 
>>> barrier
>>> expansion to after macro expansion to solve that. It also should be
>>> after macro expansion for yet another reason: macro nodes my expand to
>>> loads that will require barriers. For example the "array copy" basic
>>> clone (i.e. java.lang.Object.clone() intrinsic) node does that. Now we
>>> have cloning disabled in ZGC right now, but should be able to enable it
>>> after this. That will be a later fix though for 14.
>>>
>>> 4) The original block of the load where we inject the LoadBarrier and
>>> splice in a successor control block (which will end up after the
>>> barrier), may already have safepoints in it. I solved this issue by
>>> moving the safepoint node to the successor control block that we
>>> injected as part of barrier injection. This ensures that the load and
>>> the safepoint don't end up in the same block.
>>>
>>> I found these issues by injecting quite intrusive verification code 
>>> into
>>> C2. Unfortunately I think the verification code I wrote is too 
>>> intrusive
>>> to upstream, so I would like to upstream just the fix to 13, and rework
>>> less intrusive verification code for 14 when there is more time to 
>>> think
>>> about how to do that, if that is okay. I've done what I believe is the
>>> minimal fix to the observed problem, with the minimal risk.
>>>
>>> The test that detected the issue originally,
>>> "Compact_NonbranchyTree_ArrayOf", failed with root verification errors.
>>> Probability of failure after Stefan Karlsson's improved verification
>>> code seemed to be about 25% per run. With this patch, I have run 100
>>> iterations without failure. I have also run this through gc-test-suite
>>> with -Xcomp, and I am currently running through tier-1-6.
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8227407
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~eosterlund/8227407/webrev.01/
>>>
>>> Thanks,
>>> /Erik
>



More information about the hotspot-compiler-dev mailing list