RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Thu Sep 5 07:03:48 UTC 2019

Hi Per,

Thanks for the review.

On 2019-09-04 23:13, Per Liden wrote:
> On 9/4/19 2:58 PM, Erik Österlund wrote:
>> Hi,
>>
>> For many years we have expanded load barriers in the C2 sea of nodes. 
>> It has been a constant struggle to keep up with bugs due to 
>> optimizations breaking our barriers. It has never truly worked. One 
>> particular pain point that has never been handled quite right up until 
>> now, is dealing with safepoints ending up between a load and its load 
>> barrier. We have had workarounds for that before, but they have never 
>> really been enough.
>>
>> In the end, our barrier is only a conditional branch to a slow path, 
>> so there is really not much that the optimizer can do to help us make 
>> that better. But it has many creative ways of breaking our GC invariants.
>>
>> I think we have finally had enough of this, and want to move the 
>> barrier expansion to the MachNode level instead. This way, we can 
>> finally put an end to the load and its load barrier being separated 
>> (and related even more complicated issues for atomics).
>>
>> Our new solution is to tag accesses that want load barriers during 
>> parsing, and then let C2 optimize whatever it wants to, invariantly of 
>> GC barriers. Then it will match mach nodes, and perform global code 
>> motion and scheduling. Only right before dumping the machine code of 
>> the resulting graph do we call into the barrier set to perform last 
>> second analysis of barriers, and then during machine code dumping, we 
>> inject our load barriers. After the normal insts() are dumped, we 
>> inject slow path stubs for our barriers.
>>
>> There are two optimizations that we would like to retain in this scheme.
>>
>> Optimization 1: Dominating barrier analysis
>> Previously, we instantiated a PhaseIdealLoop instance to analyze 
>> dominating barriers. That was convenient because a dominator tree is 
>> available for finding load barriers that dominate other load barriers 
>> in the CFG. I built a new more precise analysis on the PhaseCFG level 
>> instead, happening after the matching to mach nodes. The analysis is 
>> now looking for dominating accesses, instead of dominating load 
>> barriers. Because any dominating access, including stores, will make 
>> sure that what is left behind in memory is "good". Another thing that 
>> makes the analysis more precise, is that it doesn't require strict 
>> dominance in the CFG. If the earlier access is in the same Block as an 
>> access with barriers, we now also utilize knowledge about the 
>> scheduling of the instructions, which has completed at this point. So 
>> we can safely remove such pointless load barriers in the same block 
>> now. The analysis is performed right before machine code is emitted, 
>> so we can trust that it won't change after the analysis due to 
>> optimizations.
>>
>> Optimization 2: Tight register spilling
>> When we call the slow path of our barriers, we want to spill only the 
>> registers that are live. Previously, we had a special 
>> LoadBarrierSlowReg node corresponding to the slow path, that killed 
>> all XMM registers, and then all general purpose registers were called 
>> in the slow path. Now we instead perform explicit live analysis of our 
>> registers on MachNodes, including how large chunks of vector registers 
>> are being used, and spill only exactly the registers that are live 
>> (and only the part of the register that is live for XMM/YMM/ZMM 
>> registers).
>>
>> Zooming out a bit, all complexity of pulling the barriers in the sea 
>> of nodes through various interesting phases while retaining GC 
>> invariants such as "don't put safepoints in my barrier", become 
>> trivial and no longer an issue. We simply tag our loads to need 
>> barriers, and let C2 do whatever it wants to in the sea of nodes. Once 
>> all scheduling is done, we do our thing. Hopefully this will make the 
>> barriers as stable and resilient as our C1 barriers, which cause 
>> trouble extremely rarely.
>>
>> We have run a number of benchmarks. We have observed a number of 
>> improvements, but never any regressions. There has been countless runs 
>> through gc-test-suite, and a few hs-tier1-6 and his tier1-7 runs.
>>
>> Finally, I would like to thank Per and StefanK for the many hours 
>> spent on helping me with this patch, both in terms of spotting flaws 
>> in my prototypes, benchmarking, testing, and refactoring so the code 
>> looks nice and much more understandable. I will add both to the 
>> Contributed-by line.
>>
>> @Stuart: It would be awesome if you could provide some AArch64 bits 
>> for this patch so we do things the same way (ish).
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>
>> Webrev:
>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
> 
> As I've been looking at this patch quite a bit over the past few weeks, 
> all my comments and suggestions have already been taken care of.
> 
> I just noticed one tiny thing:
> 
> src/hotspot/share/gc/z/c2/zBarrierSetC2.hpp:
> 
> const uint8_t ZLoadBarrierStrong = 1;
> const uint8_t ZLoadBarrierWeak   = 2;
> const uint8_t ZLoadBarrierElided = 4;
> 
> Given that the barrier_data isn't a set of bits/flags (anymore), just 
> discrete states, I'd suggest we make these 1, 2, and 3 to avoid giving 
> that impression. I.e.:
> 
> const uint8_t ZLoadBarrierStrong = 1;
> const uint8_t ZLoadBarrierWeak   = 2;
> const uint8_t ZLoadBarrierElided = 3;

Sure. Made the 4->3 change in-place.

> Other than that, this patch looks good.

Thanks!

/Erik

> I can also add that I've been benchmarking this patch quite a bit, and 
> can confirm Erik's statement about performance. I've also done numerous 
> passed over gc-test-suite and tier1-6, without seeing any issues.
> 
> With this redesign, the way we inject load barriers becomes straight 
> forward and inherently robust. No more safepoints sneaking into the 
> wrong place, no more broken optimization passes, etc. This will bring 
> order to the galaxy^Wbarriers!
> 
> cheers,
> Per