RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level
Stefan Karlsson
stefan.karlsson at oracle.com
Mon Sep 9 11:37:27 UTC 2019
Hi Erik,
To me this looks good, but this is only a best-effort review since I'm
not a C2 expert.
If this is going to compile on AArch64, we would need to do something
about the usage of the x86-specific is_XMMRegister in zBarrierSetC2.cpp.
There's also a type mismatch on this line:
  stub_req += bs->estimate_stub_size();
stub_req is an int and estimate_stub_size() returns a size_t.
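
A minimal sketch of one way to resolve it, assuming stub_req should
stay an int (the helper name below is made up; this is only meant to
illustrate an explicit checked conversion, not a suggested final form):

  #include <cassert>
  #include <climits>
  #include <cstddef>

  // Illustrative only: check that the size_t estimate fits in the
  // int accumulator before converting.
  static int add_stub_estimate(int stub_req, std::size_t estimate) {
    assert(estimate <= (std::size_t)INT_MAX && "estimate overflows int");
    return stub_req + (int)estimate;
  }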
Thanks,
StefanK
On 2019-09-04 14:58, Erik Österlund wrote:
> Hi,
>
> For many years we have expanded load barriers in the C2 sea of nodes. It
> has been a constant struggle to keep up with bugs due to optimizations
> breaking our barriers. It has never truly worked. One particular pain
> point that has never been handled quite right until now is dealing
> with safepoints ending up between a load and its load barrier. We have
> had workarounds for that before, but they have never really been enough.
>
> In the end, our barrier is only a conditional branch to a slow path, so
> there is really not much that the optimizer can do to help us make that
> better. But it has many creative ways of breaking our GC invariants.
>
> I think we have finally had enough of this, and want to move the barrier
> expansion to the MachNode level instead. This way, we can finally put an
> end to the load and its load barrier being separated (and to related,
> even more complicated issues for atomics).
>
> Our new solution is to tag accesses that want load barriers during
> parsing, and then let C2 optimize whatever it wants to, independently
> of GC barriers. Then it matches mach nodes, and performs global code
> motion and scheduling. Only right before dumping the machine code of
> the resulting graph do we call into the barrier set to perform
> last-second analysis of the barriers, and then, during machine code
> dumping, we inject our load barriers. After the normal insts() are
> dumped, we inject the slow path stubs for our barriers.
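>
> To make the ordering concrete, here is a highly simplified,
> self-contained sketch of the phase ordering described above (all names
> are invented for illustration; this is not the actual patch code):
>
>   #include <cstdio>
>
>   // Toy model: the barrier is only a bit of metadata on the load
>   // until final code emission, so the optimizer cannot break it.
>   struct ToyLoad {
>     bool needs_barrier;  // tagged during parsing
>   };
>
>   static void optimize(ToyLoad&)           { /* sea-of-nodes phases */ }
>   static void match_and_schedule(ToyLoad&) { /* matching, GCM, scheduling */ }
>
>   static void emit(const ToyLoad& ld) {
>     // Barrier analysis and injection happen here, after all
>     // scheduling, so nothing can move between load and barrier.
>     std::printf("emit load\n");
>     if (ld.needs_barrier) {
>       std::printf("emit barrier test + branch to slow path\n");
>     }
>   }
>
>   int main() {
>     ToyLoad ld{true};        // tagged at parse time
>     optimize(ld);            // C2 optimizes, barrier-agnostic
>     match_and_schedule(ld);
>     emit(ld);                // inject barrier during code dumping
>     std::printf("emit slow path stubs after normal insts()\n");
>     return 0;
>   }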
>
> There are two optimizations that we would like to retain in this scheme.
>
> Optimization 1: Dominating barrier analysis
> Previously, we instantiated a PhaseIdealLoop instance to analyze
> dominating barriers. That was convenient because a dominator tree is
> available for finding load barriers that dominate other load barriers in
> the CFG. I built a new, more precise analysis on the PhaseCFG level
> instead, happening after the matching to mach nodes. The analysis now
> looks for dominating accesses instead of dominating load barriers,
> because any dominating access, including stores, ensures that what is
> left behind in memory is "good". Another thing that makes the analysis
> more precise is that it doesn't require strict dominance in the CFG:
> if the earlier access is in the same Block as an access with barriers,
> we also utilize knowledge about the scheduling of the instructions,
> which has completed at this point, and can safely remove such
> redundant load barriers within the block. The analysis is performed
> right before machine code is emitted, so we can trust that the graph
> won't change due to optimizations after the analysis.
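>
> To illustrate the shape of the analysis, here is a simplified,
> self-contained sketch of the same-block case (names and structure
> invented for illustration; the real analysis also handles dominating
> accesses across blocks in the CFG):
>
>   #include <cstdio>
>   #include <set>
>   #include <vector>
>
>   struct Access {
>     int addr;      // toy stand-in for base + offset
>     bool barrier;  // load tagged as needing a barrier
>   };
>
>   // Any earlier access to the same address, load or store, leaves a
>   // "good" pointer behind, so a later barrier on that address is
>   // redundant once the instruction schedule is fixed.
>   static void elide_in_block(std::vector<Access>& block) {
>     std::set<int> seen;
>     for (Access& a : block) {
>       if (a.barrier && seen.count(a.addr)) {
>         a.barrier = false;
>         std::printf("elided barrier for addr %d\n", a.addr);
>       }
>       seen.insert(a.addr);
>     }
>   }
>
>   int main() {
>     std::vector<Access> block = {
>       {42, false},  // store to addr 42 leaves a good pointer
>       {42, true},   // dominated load: barrier elided
>       {7,  true},   // first access to addr 7: barrier kept
>     };
>     elide_in_block(block);
>     return 0;
>   }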
>
> Optimization 2: Tight register spilling
> When we call the slow path of our barriers, we want to spill only the
> registers that are live. Previously, we had a special LoadBarrierSlowReg
> node corresponding to the slow path that killed all XMM registers, and
> then all general purpose registers were saved in the slow path. Now
> we instead perform explicit liveness analysis of our registers on
> MachNodes, including how large a chunk of each vector register is
> being used, and spill exactly the registers that are live (and only
> the live part of each XMM/YMM/ZMM register).
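>
> As a simplified, self-contained sketch of the idea (register names and
> widths chosen for illustration; not the actual patch code):
>
>   #include <cstdio>
>
>   // Models one register at the slow path call site; a live width of
>   // 0 means the register is dead and need not be saved.
>   struct LiveReg {
>     const char* name;
>     int live_bytes;  // 8 for a GPR, 16/32/64 for XMM/YMM/ZMM use
>   };
>
>   static void spill_for_slow_path(const LiveReg* regs, int n) {
>     for (int i = 0; i < n; i++) {
>       if (regs[i].live_bytes > 0) {
>         std::printf("save %s: %d bytes\n", regs[i].name, regs[i].live_bytes);
>       }
>     }
>   }
>
>   int main() {
>     const LiveReg regs[] = {
>       {"rax",  8},   // live GPR: saved
>       {"rbx",  0},   // dead: skipped
>       {"xmm0", 16},  // only the XMM part is live: save 16 bytes
>       {"zmm1", 0},   // dead vector register: skipped
>     };
>     spill_for_slow_path(regs, 4);
>     return 0;
>   }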
>
> Zooming out a bit, all the complexity of pulling the barriers through
> various interesting phases in the sea of nodes while retaining GC
> invariants such as "don't put safepoints in my barrier" becomes
> trivial and is no longer an issue. We simply tag our loads as needing
> barriers, and let C2 do
> whatever it wants to in the sea of nodes. Once all scheduling is done,
> we do our thing. Hopefully this will make the barriers as stable and
> resilient as our C1 barriers, which cause trouble extremely rarely.
>
> We have run a number of benchmarks and have observed a number of
> improvements, but never any regressions. There have been countless runs
> through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>
> Finally, I would like to thank Per and StefanK for the many hours spent
> on helping me with this patch: spotting flaws in my prototypes,
> benchmarking, testing, and refactoring so that the code looks nice
> and is much more understandable. I will add both to the Contributed-by
> line.
>
> @Stuart: It would be awesome if you could provide some AArch64 bits for
> this patch so we do things the same way (ish).
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8230565
>
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>
> Thanks,
> /Erik