RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level
Nils Eliasson
nils.eliasson at oracle.com
Wed Sep 4 15:22:58 UTC 2019
Hi Erik,
In hotspot/cpu/x86/gc/z/z_x86_64.ad this pattern is quite common:
// Load Pointer
instruct ZLoadP(rRegP dst, memory mem, rFlagsReg cr)
%{
  predicate(UseZGC && n->as_Load()->barrier_data() == ZLoadBarrierStrong);
  ...
  ins_encode %{
    __ movptr($dst$$Register, $mem$$Address);
    if (barrier_data() != ZLoadBarrierElided) {
      z_load_barrier(_masm, this, $mem$$Address, $dst$$Register, noreg /* tmp */, false /* weak */);
    }
  %}
If you predicate on "n->as_Load()->barrier_data() == ZLoadBarrierStrong",
how can "barrier_data() != ZLoadBarrierElided" ever be false? When
barrier_data() == ZLoadBarrierElided, the load will have matched the
non-Z rule instead.
// Nils
On 2019-09-04 14:58, Erik Österlund wrote:
> Hi,
>
> For many years we have expanded load barriers in the C2 sea of nodes.
> It has been a constant struggle to keep up with bugs due to
> optimizations breaking our barriers. It has never truly worked. One
> particular pain point that has never been handled quite right up until
> now is dealing with safepoints ending up between a load and its load
> barrier. We have had workarounds for that before, but they have never
> really been enough.
>
> In the end, our barrier is only a conditional branch to a slow path,
> so there is really not much that the optimizer can do to help us make
> that better. But it has many creative ways of breaking our GC invariants.
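>
> To illustrate (a minimal standalone C++ model, not the actual HotSpot
> code; bad_mask and slow_path are illustrative stand-ins):
>
>   #include <cstdint>
>
>   // In real ZGC the mask is thread-local and flips between GC phases,
>   // and the slow path is a runtime stub that heals the reference.
>   static uintptr_t bad_mask = 0;
>
>   static uintptr_t slow_path(uintptr_t* addr) {
>     return *addr;  // placeholder: mark/remap, store back the good pointer
>   }
>
>   inline uintptr_t load_with_barrier(uintptr_t* addr) {
>     uintptr_t ref = *addr;    // the load itself
>     if (ref & bad_mask) {     // the barrier: a single conditional branch
>       ref = slow_path(addr);  // taken only for "bad" colored pointers
>     }
>     return ref;
>   }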
>
> I think we have finally had enough of this, and want to move the
> barrier expansion to the MachNode level instead. This way, we can
> finally put an end to the load and its load barrier being separated
> (and related even more complicated issues for atomics).
>
> Our new solution is to tag accesses that want load barriers during
> parsing, and then let C2 optimize whatever it wants to, independently of
> GC barriers. Then it will match mach nodes, and perform global code
> motion and scheduling. Only right before dumping the machine code of
> the resulting graph do we call into the barrier set to perform a
> last-second analysis of the barriers, and then during machine code
> dumping, we
> inject our load barriers. After the normal insts() are dumped, we
> inject slow path stubs for our barriers.
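>
> Roughly, the emission-time flow looks like this (a sketch with
> hypothetical names, not the actual compiler interfaces):
>
>   #include <vector>
>
>   struct MachNode { bool needs_load_barrier; };
>   struct Block    { std::vector<MachNode*> nodes; };
>
>   // Hypothetical emit helpers standing in for the real code emission.
>   void emit(MachNode*) {}
>   void emit_barrier_fast_path(MachNode*) {}   // test + branch after load
>   void emit_barrier_slow_path_stub(MachNode*) {}
>   void analyze_dominating_accesses(std::vector<Block>&) {}
>
>   void emit_code(std::vector<Block>& cfg) {
>     analyze_dominating_accesses(cfg);   // last-second barrier analysis
>     std::vector<MachNode*> stubs;
>     for (Block& b : cfg) {
>       for (MachNode* n : b.nodes) {
>         emit(n);                        // the normal insts()
>         if (n->needs_load_barrier) {
>           emit_barrier_fast_path(n);    // injected right after the load
>           stubs.push_back(n);
>         }
>       }
>     }
>     for (MachNode* n : stubs) {
>       emit_barrier_slow_path_stub(n);   // stubs after the normal insts()
>     }
>   }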
>
> There are two optimizations that we would like to retain in this scheme.
>
> Optimization 1: Dominating barrier analysis
> Previously, we instantiated a PhaseIdealLoop instance to analyze
> dominating barriers. That was convenient because a dominator tree is
> available for finding load barriers that dominate other load barriers
> in the CFG. I built a new more precise analysis on the PhaseCFG level
> instead, happening after the matching to mach nodes. The analysis now
> looks for dominating accesses instead of dominating load barriers,
> because any dominating access, including stores, makes sure that what
> is left behind in memory is "good". Another thing that makes the
> analysis more precise is that it doesn't require strict dominance in
> the CFG. If the earlier access is in the same Block as an access with
> barriers, we now also utilize knowledge about the scheduling of the
> instructions, which has completed at this point, so we can safely
> remove such pointless load barriers within a single block as well. The
> analysis is performed right before machine code is emitted, so we can
> trust that the graph won't change after the analysis due to further
> optimization.
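>
> The same-block case can be modelled like this (illustrative types, not
> the real PhaseCFG code; note that a real analysis must also give up at
> safepoints, where the bad mask can change):
>
>   #include <cstddef>
>   #include <vector>
>
>   struct Access {
>     int  addr_id;      // stands in for a (base, index, offset) address
>     bool has_barrier;  // load tagged with barrier_data
>     bool is_safepoint; // safepoints invalidate earlier accesses
>   };
>
>   // Within one scheduled block, a load barrier is redundant if an
>   // earlier access (load or store) touched the same address with no
>   // safepoint in between: that access left a "good" pointer in memory.
>   void elide_same_block_barriers(std::vector<Access>& block) {
>     for (size_t i = 0; i < block.size(); i++) {
>       if (!block[i].has_barrier) continue;
>       for (size_t j = i; j-- > 0; ) {
>         if (block[j].is_safepoint) break;  // bad mask may have changed
>         if (block[j].addr_id == block[i].addr_id) {
>           block[i].has_barrier = false;    // dominated: elide barrier
>           break;
>         }
>       }
>     }
>   }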
>
> Optimization 2: Tight register spilling
> When we call the slow path of our barriers, we want to spill only the
> registers that are live. Previously, we had a special
> LoadBarrierSlowReg node corresponding to the slow path, that killed
> all XMM registers, and then all general-purpose registers were saved
> in the slow path. Now we instead perform explicit liveness analysis of our
> registers on MachNodes, including how large chunks of vector registers
> are being used, and spill only exactly the registers that are live
> (and only the part of the register that is live for XMM/YMM/ZMM
> registers).
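>
> As a sketch of the idea (an illustrative model, not the actual RegMask
> code; save_register is a hypothetical emit helper):
>
>   #include <vector>
>
>   struct RegLiveness {
>     int  reg;        // register number
>     bool live;       // live across the barrier slow call?
>     int  live_bytes; // 16/32/64: how much of an XMM/YMM/ZMM is in use
>   };
>
>   void save_register(int /*reg*/, int /*bytes*/) {}  // hypothetical
>
>   // Spill exactly the live registers around the slow call, and only
>   // the live part of each vector register, instead of unconditionally
>   // killing every XMM register as the old LoadBarrierSlowReg node did.
>   void spill_around_slow_call(const std::vector<RegLiveness>& regs) {
>     for (const RegLiveness& r : regs) {
>       if (!r.live) continue;              // dead: no spill needed
>       save_register(r.reg, r.live_bytes); // partial vector spill
>     }
>     // ... call the slow path, then restore in reverse order ...
>   }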
>
> Zooming out a bit, all the complexity of pulling the barriers in the
> sea of nodes through various interesting phases while retaining GC
> invariants such as "don't put safepoints in my barrier" becomes
> trivial and is no longer an issue. We simply tag our loads as needing
> barriers, and let C2 do whatever it wants to in the sea of nodes. Once
> all scheduling is done, we do our thing. Hopefully this will make the
> barriers as stable and resilient as our C1 barriers, which cause
> trouble extremely rarely.
>
> We have run a number of benchmarks. We have observed a number of
> improvements, but never any regressions. There have been countless runs
> through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>
> Finally, I would like to thank Per and StefanK for the many hours
> spent on helping me with this patch, both in terms of spotting flaws
> in my prototypes, benchmarking, testing, and refactoring so the code
> looks nice and much more understandable. I will add both to the
> Contributed-by line.
>
> @Stuart: It would be awesome if you could provide some AArch64 bits
> for this patch so we do things the same way (ish).
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8230565
>
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>
> Thanks,
> /Erik