RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Wed Sep 4 12:58:18 UTC 2019

Hi,

For many years we have expanded load barriers in the C2 sea of nodes. It 
has been a constant struggle to keep up with bugs due to optimizations 
breaking our barriers. It has never truly worked. One particular pain
point that has never been handled quite right until now is dealing
with safepoints ending up between a load and its load barrier. We have
had workarounds for that before, but they have never really been enough.

In the end, our barrier is only a conditional branch to a slow path, so 
there is really not much that the optimizer can do to help us make that 
better. But it has many creative ways of breaking our GC invariants.

I think we have finally had enough of this, and want to move the barrier 
expansion to the MachNode level instead. This way, we can finally put an 
end to the load and its load barrier being separated (and related even 
more complicated issues for atomics).

Our new solution is to tag accesses that want load barriers during 
parsing, and then let C2 optimize whatever it wants to, independently of
GC barriers. Then it will match mach nodes, and perform global code
motion and scheduling. Only right before dumping the machine code of the 
resulting graph do we call into the barrier set to perform last second 
analysis of barriers, and then during machine code dumping, we inject 
our load barriers. After the normal insts() are dumped, we inject slow 
path stubs for our barriers.
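
As a rough illustration of that flow, here is a toy model in Python
(not HotSpot code). The names MachLoad, needs_barrier, Emitter and the
pseudo-assembly strings are all made up for this sketch; the only
things taken from the description above are the ordering (tag at parse
time, inject the barrier while emitting, append slow path stubs after
the normal instructions) and the fact that the barrier is a conditional
branch to a slow path.

```python
from dataclasses import dataclass, field


@dataclass
class MachLoad:
    dst: str                     # destination register
    addr: str                    # address expression
    needs_barrier: bool = False  # tag set during parsing (hypothetical name)


@dataclass
class Emitter:
    insts: list = field(default_factory=list)  # main instruction stream
    stubs: list = field(default_factory=list)  # pending slow path stubs

    def emit_load(self, node, index):
        self.insts.append(f"mov {node.dst}, {node.addr}")
        if node.needs_barrier:
            # The barrier is emitted together with the load, so nothing
            # can be scheduled between the two.
            self.insts.append(f"test {node.dst}, [bad_mask]")
            self.insts.append(f"jnz stub_{index}")
            self.insts.append(f"cont_{index}:")
            self.stubs.append((index, node))

    def emit_stubs(self):
        out = []
        for index, node in self.stubs:
            out.append(f"stub_{index}:")
            out.append(f"call load_barrier_slow({node.dst}, {node.addr})")
            out.append(f"jmp cont_{index}")
        return out


def emit(nodes):
    """Emit all loads first, then the slow path stubs after them."""
    emitter = Emitter()
    for index, node in enumerate(nodes):
        emitter.emit_load(node, index)
    return emitter.insts + emitter.emit_stubs()
```

Calling emit() on one tagged and one untagged load yields the two loads
with the test-and-branch directly after the tagged one, followed by a
single stub at the end of the instruction list.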

There are two optimizations that we would like to retain in this scheme.

Optimization 1: Dominating barrier analysis
Previously, we instantiated a PhaseIdealLoop instance to analyze 
dominating barriers. That was convenient because a dominator tree is 
available for finding load barriers that dominate other load barriers in 
the CFG. I built a new more precise analysis on the PhaseCFG level 
instead, happening after the matching to mach nodes. The analysis now
looks for dominating accesses instead of dominating load barriers,
because any dominating access, including stores, will make sure that
what is left behind in memory is "good". Another thing that makes the
analysis more precise is that it doesn't require strict dominance in
the CFG. If the earlier access is in the same Block as an access with
barriers, we now also utilize knowledge about the scheduling of the 
instructions, which has completed at this point. So we can safely remove 
such pointless load barriers in the same block now. The analysis is 
performed right before machine code is emitted, so we can trust that it 
won't change after the analysis due to optimizations.
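
The rule described above can be sketched with a small Python model.
This is not the code in the webrev; Access, the idom map and the
symbolic address strings are assumptions made for illustration. The
sketch also ignores everything except dominance and scheduling order
(for example, it does not model safepoints between the two accesses,
which a real analysis would presumably have to consider).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    block: str     # name of the basic block holding the access
    index: int     # position in the block's final schedule
    addr: str      # symbolic address (hypothetical encoding)
    barrier: bool  # True if the access was tagged as needing a barrier


def dominates(idom, a, b):
    """True if block a dominates block b (a block dominates itself)."""
    while b is not None:
        if a == b:
            return True
        b = idom.get(b)  # walk up the immediate dominator chain
    return False


def elidable(idom, accesses):
    """Return tagged accesses made redundant by an earlier access."""
    result = []
    for acc in accesses:
        if not acc.barrier:
            continue
        for other in accesses:
            if other is acc or other.addr != acc.addr:
                continue
            if other.block == acc.block:
                # Same block: rely on the completed schedule.
                covered = other.index < acc.index
            else:
                covered = dominates(idom, other.block, acc.block)
            if covered:
                result.append(acc)
                break
    return result
```

For a diamond-shaped CFG (B1 branching to B2 and B3, both joining in
B4), a store to a field in B1 makes a later load of the same field in
B1 or B4 elidable, while an access in B2 does nothing for a load in B4
because B2 does not dominate B4.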

Optimization 2: Tight register spilling
When we call the slow path of our barriers, we want to spill only the 
registers that are live. Previously, we had a special LoadBarrierSlowReg 
node corresponding to the slow path, that killed all XMM registers, and 
then all general purpose registers were saved in the slow path. Now we
instead perform explicit live analysis of our registers on MachNodes, 
including how large chunks of vector registers are being used, and spill 
only exactly the registers that are live (and only the part of the 
register that is live for XMM/YMM/ZMM registers).
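
To make the saving concrete, here is a small Python sketch of how a
spill plan could be derived from liveness information. The function
names, the register name strings and the 16-byte frame alignment are
assumptions for this example, not taken from the patch. The idea it
shows is the one described above: save only live registers, and for
vector registers only the live number of bytes (16 for XMM, 32 for YMM,
64 for ZMM).

```python
GP_BYTES = 8  # size of a general purpose register on x86_64


def spill_plan(live_gp, live_vec_bytes):
    """Return (register, bytes) pairs to save around the slow path call.

    live_gp:        set of live general purpose register names
    live_vec_bytes: map from vector register name to live width in bytes
    """
    plan = [(reg, GP_BYTES) for reg in sorted(live_gp)]
    for reg, nbytes in sorted(live_vec_bytes.items()):
        if nbytes > 0:  # a width of 0 means the register is dead
            plan.append((reg, nbytes))
    return plan


def frame_size(plan, alignment=16):
    """Stack bytes needed for the plan, rounded up to the alignment."""
    total = sum(nbytes for _, nbytes in plan)
    return (total + alignment - 1) // alignment * alignment
```

With two live general purpose registers, one vector register live as
XMM and one live as YMM, the plan needs 64 bytes of stack, compared
with saving every general purpose and vector register at full width.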

Zooming out a bit, all the complexity of pulling the barriers in the sea
of nodes through various interesting phases while retaining GC
invariants such as "don't put safepoints in my barrier" becomes trivial
and no longer an issue. We simply tag our loads to need barriers, and let C2 do
whatever it wants to in the sea of nodes. Once all scheduling is done, 
we do our thing. Hopefully this will make the barriers as stable and 
resilient as our C1 barriers, which cause trouble extremely rarely.

We have run a number of benchmarks. We have observed a number of
improvements, but never any regressions. There have been countless runs
through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.

Finally, I would like to thank Per and StefanK for the many hours spent
on helping me with this patch: spotting flaws in my prototypes,
benchmarking, testing, and refactoring so the code looks nice and is
much more understandable. I will add both to the Contributed-by
line.

@Stuart: It would be awesome if you could provide some AArch64 bits for 
this patch so we do things the same way (ish).

Bug:
https://bugs.openjdk.java.net/browse/JDK-8230565

Webrev:
http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/

Thanks,
/Erik