RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Per Liden per.liden at oracle.com
Wed Sep 4 21:13:24 UTC 2019


On 9/4/19 2:58 PM, Erik Österlund wrote:
> Hi,
> 
> For many years we have expanded load barriers in the C2 sea of nodes. It 
> has been a constant struggle to keep up with bugs due to optimizations 
> breaking our barriers. It has never truly worked. One particular pain 
> point that has never been handled quite right up until now, is dealing 
> with safepoints ending up between a load and its load barrier. We have 
> had workarounds for that before, but they have never really been enough.
> 
> In the end, our barrier is only a conditional branch to a slow path, so 
> there is really not much that the optimizer can do to help us make that 
> better. But it has many creative ways of breaking our GC invariants.
> 
> I think we have finally had enough of this, and want to move the barrier 
> expansion to the MachNode level instead. This way, we can finally put an 
> end to the load and its load barrier being separated (and to related,
> even more complicated issues for atomics).
> 
> Our new solution is to tag accesses that want load barriers during 
> parsing, and then let C2 optimize whatever it wants to, independently of 
> GC barriers. Then it will match mach nodes, and perform global code 
> motion and scheduling. Only right before dumping the machine code of the 
> resulting graph do we call into the barrier set to perform last-second 
> analysis of barriers, and then during machine code dumping, we inject 
> our load barriers. After the normal insts() are dumped, we inject slow 
> path stubs for our barriers.
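
For anyone reading along who hasn't looked at the patch yet, the scheme
can be sketched roughly like this (a toy model with made-up names, not
the actual zBarrierSetC2 code):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the expansion scheme: loads are tagged with barrier data
// at parse time, C2 optimizes freely, and only at code emission is the
// barrier test emitted inline, with its slow path appended as a stub
// after the normal instructions. All names here are invented.
struct MachLoad {
  std::string name;
  bool        needs_barrier; // tagged during parsing
};

static std::vector<std::string> emit(const std::vector<MachLoad>& loads) {
  std::vector<std::string> code, stubs;
  for (const MachLoad& l : loads) {
    code.push_back("load " + l.name);
    if (l.needs_barrier) {
      // Inline part: test the loaded oop and branch to a per-load stub.
      code.push_back("test_and_branch " + l.name + "_stub");
      stubs.push_back(l.name + "_stub: call_barrier; ret");
    }
  }
  // Slow-path stubs go after the normal insts().
  code.insert(code.end(), stubs.begin(), stubs.end());
  return code;
}
```

The real code of course works on MachNodes and the code buffer rather
than strings, but the shape is the same: barriers are invisible to the
optimizer and only materialize at emission time.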
> 
> There are two optimizations that we would like to retain in this scheme.
> 
> Optimization 1: Dominating barrier analysis
> Previously, we instantiated a PhaseIdealLoop instance to analyze 
> dominating barriers. That was convenient because a dominator tree is 
> available for finding load barriers that dominate other load barriers in 
> the CFG. I built a new more precise analysis on the PhaseCFG level 
> instead, which runs after matching to mach nodes. The analysis now 
> looks for dominating accesses instead of dominating load barriers, 
> because any dominating access, including stores, ensures that what is 
> left behind in memory is "good". Another thing that makes the 
> analysis more precise, is that it doesn't require strict dominance in 
> the CFG. If the earlier access is in the same Block as an access with 
> barriers, we now also utilize knowledge about the scheduling of the 
> instructions, which has completed at this point. So we can safely remove 
> such pointless load barriers in the same block now. The analysis is 
> performed right before machine code is emitted, so we can trust that it 
> won't change after the analysis due to optimizations.
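
The same-block part of that analysis can be sketched like this
(hypothetical types and names, just to illustrate the idea):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy model of a scheduled basic block: each access touches an address;
// loads may carry a barrier. Any earlier access (load or store) to the
// same address guarantees the oop left behind in memory is "good", so a
// later load's barrier in the same block can be elided.
struct Access {
  uintptr_t addr;
  bool      is_load;
  bool      needs_barrier; // set on barriered loads at parse time
};

// Walk the block in schedule order and clear barriers dominated by an
// earlier access to the same address. Returns the number elided.
static int elide_dominated_barriers(std::vector<Access>& block) {
  std::unordered_set<uintptr_t> seen; // addresses already accessed
  int elided = 0;
  for (Access& a : block) {
    if (a.needs_barrier && seen.count(a.addr)) {
      a.needs_barrier = false; // dominated by an earlier access
      elided++;
    }
    seen.insert(a.addr);
  }
  return elided;
}
```

The real analysis naturally works on mach nodes and memory slices rather
than raw addresses, but the point stands: any earlier access, load or
store, makes the later barrier redundant.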
> 
> Optimization 2: Tight register spilling
> When we call the slow path of our barriers, we want to spill only the 
> registers that are live. Previously, we had a special LoadBarrierSlowReg 
> node corresponding to the slow path, that killed all XMM registers, and 
> then all general purpose registers were spilled in the slow path. Now we 
> instead perform explicit live analysis of our registers on MachNodes, 
> including how large chunks of vector registers are being used, and spill 
> only exactly the registers that are live (and only the part of the 
> register that is live for XMM/YMM/ZMM registers).
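
In other words, something along these lines (a toy 32-register model
with invented names; the real code uses C2's register masks and also
tracks how much of each vector register is live):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy 32-register model: a live mask computed per slow-path call site
// by liveness analysis over the MachNodes. Instead of killing a fixed
// set of registers, the stub saves/restores exactly the live ones.
using RegMask = uint32_t;

static std::vector<int> registers_to_spill(RegMask live, int result_reg) {
  std::vector<int> spill;
  for (int r = 0; r < 32; r++) {
    // The barrier's result register is redefined by the slow path,
    // so it does not need to be preserved.
    if ((live & (1u << r)) != 0 && r != result_reg) {
      spill.push_back(r);
    }
  }
  return spill;
}
```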
> 
> Zooming out a bit, all complexity of pulling the barriers in the sea of 
> nodes through various interesting phases while retaining GC invariants 
> such as "don't put safepoints in my barrier", becomes trivial and is no 
> longer an issue. We simply tag our loads as needing barriers, and let C2 do 
> whatever it wants to in the sea of nodes. Once all scheduling is done, 
> we do our thing. Hopefully this will make the barriers as stable and 
> resilient as our C1 barriers, which cause trouble extremely rarely.
> 
> We have run a number of benchmarks. We have observed a number of 
> improvements, but never any regressions. There has been countless runs 
> through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
> 
> Finally, I would like to thank Per and StefanK for the many hours spent 
> on helping me with this patch, both in terms of spotting flaws in my 
> prototypes, benchmarking, testing, and refactoring so the code looks 
> nice and much more understandable. I will add both to the Contributed-by 
> line.
> 
> @Stuart: It would be awesome if you could provide some AArch64 bits for 
> this patch so we do things the same way (ish).
> 
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8230565
> 
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/

Since I've been looking at this patch quite a bit over the past few 
weeks, all my comments and suggestions have already been taken care of.

I just noticed one tiny thing:

src/hotspot/share/gc/z/c2/zBarrierSetC2.hpp:

const uint8_t ZLoadBarrierStrong = 1;
const uint8_t ZLoadBarrierWeak   = 2;
const uint8_t ZLoadBarrierElided = 4;

Given that the barrier_data isn't a set of bits/flags (anymore), just 
discrete states, I'd suggest we make these 1, 2, and 3 to avoid giving 
that impression. I.e.:

const uint8_t ZLoadBarrierStrong = 1;
const uint8_t ZLoadBarrierWeak   = 2;
const uint8_t ZLoadBarrierElided = 3;
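
With discrete states, the checks in the barrier code then read as plain
equality comparisons rather than bit tests, e.g. (hypothetical helper,
just to illustrate):

```cpp
#include <cassert>
#include <cstdint>

// Discrete barrier states (not bit flags), as suggested above.
const uint8_t ZLoadBarrierStrong = 1;
const uint8_t ZLoadBarrierWeak   = 2;
const uint8_t ZLoadBarrierElided = 3;

// With discrete states, tests are equality comparisons, not bit tests.
static bool needs_slow_path(uint8_t barrier_data) {
  return barrier_data == ZLoadBarrierStrong ||
         barrier_data == ZLoadBarrierWeak;
}
```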

Other than that, this patch looks good.

I can also add that I've been benchmarking this patch quite a bit, and 
can confirm Erik's statement about performance. I've also done numerous 
passes over gc-test-suite and tier1-6, without seeing any issues.

With this redesign, the way we inject load barriers becomes 
straightforward and inherently robust. No more safepoints sneaking into the 
wrong place, no more broken optimization passes, etc. This will bring 
order to the galaxy^Wbarriers!

cheers,
Per


More information about the hotspot-compiler-dev mailing list