RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Per Liden per.liden at oracle.com
Sat Sep 21 08:11:35 UTC 2019


Thanks Nils!

Stuart, could you take this for a spin and help Nils track down the last 
few issues?

We're eager to push this as soon as we feel comfortable that the aarch64 
bits are good enough.

cheers,
Per

On 9/20/19 4:50 PM, Nils Eliasson wrote:
> Hi,
> 
> This is an attempt at porting the redesigned C2 barriers to aarch64.
> 
> Status
> 
> * Mostly complete - the ad-files should be done, although not yet 
> fully tested
> 
> * Spilling is implemented but not optimized
> 
> * Runs a bit, but far from stable.
> 
> 
> Patch: http://cr.openjdk.java.net/~neliasso/aarch_barriers/
> 
> Regards,
> 
> Nils
> 
> 
> 
> On 2019-09-04 14:58, Erik Österlund wrote:
> 
>> Hi,
>>
>> For many years we have expanded load barriers in the C2 sea of nodes. 
>> It has been a constant struggle to keep up with bugs caused by 
>> optimizations breaking our barriers. It has never truly worked. One 
>> particular pain point that has never been handled quite right until 
>> now is dealing with safepoints ending up between a load and its load 
>> barrier. We have had workarounds for that before, but they have never 
>> really been enough.
>>
>> In the end, our barrier is only a conditional branch to a slow path, 
>> so there is really not much that the optimizer can do to help us make 
>> that better. But it has many creative ways of breaking our GC invariants.
>>
>> I think we have finally had enough of this, and want to move the 
>> barrier expansion to the MachNode level instead. This way, we can 
>> finally put an end to the load and its load barrier being separated 
>> (and to related, even more complicated issues for atomics).
>>
>> Our new solution is to tag accesses that want load barriers during 
>> parsing, and then let C2 optimize whatever it wants to, independently 
>> of GC barriers. C2 then matches mach nodes and performs global code 
>> motion and scheduling. Only right before dumping the machine code of 
>> the resulting graph do we call into the barrier set to perform a 
>> last-second analysis of the barriers, and then, during machine code 
>> dumping, we inject our load barriers. After the normal insts() are 
>> dumped, we inject slow-path stubs for our barriers.
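>>
>> To make the flow concrete, here is a rough, self-contained C++ sketch 
>> (illustrative only, not code from the webrev; MachLoad, BarrierStub 
>> and emit_code are made-up names) of how tagged loads get their test 
>> and out-of-line stub injected only at code emission time:
>>
>> #include <cstdio>
>> #include <vector>
>>
>> struct MachLoad {
>>   int  dst_reg;        // destination register of the load
>>   bool needs_barrier;  // tagged at parse time, ignored by the optimizer
>> };
>>
>> struct BarrierStub {
>>   int reg;             // register holding the loaded oop
>> };
>>
>> // Barriers are injected only here, after all optimization, matching
>> // and scheduling have completed.
>> void emit_code(const std::vector<MachLoad>& loads) {
>>   std::vector<BarrierStub> stubs;
>>   for (const MachLoad& ld : loads) {
>>     printf("  load  r%d, [mem]\n", ld.dst_reg);
>>     if (ld.needs_barrier) {
>>       // Test the loaded pointer and branch to an out-of-line stub.
>>       printf("  test  r%d, bad_mask\n", ld.dst_reg);
>>       printf("  jnz   stub_%zu\n", stubs.size());
>>       stubs.push_back(BarrierStub{ld.dst_reg});
>>     }
>>   }
>>   // Slow-path stubs are dumped after the normal instructions.
>>   for (size_t i = 0; i < stubs.size(); i++) {
>>     printf("stub_%zu:\n", i);
>>     printf("  call  barrier_slow_path(r%d)\n", stubs[i].reg);
>>     printf("  jmp   continue\n");
>>   }
>> }
>>
>> int main() {
>>   emit_code({{1, true}, {2, false}});
>>   return 0;
>> }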
>>
>> There are two optimizations that we would like to retain in this scheme.
>>
>> Optimization 1: Dominating barrier analysis
>> Previously, we instantiated a PhaseIdealLoop instance to analyze 
>> dominating barriers. That was convenient because a dominator tree is 
>> available for finding load barriers that dominate other load barriers 
>> in the CFG. I have built a new, more precise analysis on the PhaseCFG 
>> level instead, running after the matching to mach nodes. The analysis 
>> now looks for dominating accesses instead of dominating load barriers, 
>> because any dominating access, including a store, ensures that what is 
>> left behind in memory is "good". Another thing that makes the analysis 
>> more precise is that it does not require strict dominance in the CFG: 
>> if the earlier access is in the same Block as an access with barriers, 
>> we also utilize knowledge about the scheduling of the instructions, 
>> which has completed at this point, so we can now safely remove such 
>> redundant load barriers within the same block. The analysis is 
>> performed right before machine code is emitted, so we can trust that 
>> the code won't change after the analysis due to further optimization.
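>>
>> A hedged sketch of the idea (again illustrative only; Access, Block, 
>> dominates and elide_dominated_barriers are hypothetical stand-ins for 
>> the real PhaseCFG machinery):
>>
>> #include <vector>
>>
>> struct Block { Block* idom; };  // immediate dominator link
>>
>> // True if a dominates b in the CFG (walk b's idom chain).
>> static bool dominates(const Block* a, const Block* b) {
>>   for (const Block* c = b; c != nullptr; c = c->idom) {
>>     if (c == a) return true;
>>   }
>>   return false;
>> }
>>
>> struct Access {
>>   Block* block;
>>   int    index_in_block;  // final schedule position, fixed by now
>>   int    address;         // simplified memory address identity
>>   bool   needs_barrier;   // loads start out tagged; stores do not
>> };
>>
>> // Elide a load barrier when an earlier access (load or store) to the
>> // same address dominates it: that access left a "good" oop in memory.
>> void elide_dominated_barriers(std::vector<Access>& accesses) {
>>   for (Access& later : accesses) {
>>     if (!later.needs_barrier) continue;
>>     for (const Access& earlier : accesses) {
>>       if (&earlier == &later || earlier.address != later.address) {
>>         continue;
>>       }
>>       bool strictly_dominated = earlier.block != later.block &&
>>                                 dominates(earlier.block, later.block);
>>       bool same_block_earlier = earlier.block == later.block &&
>>           earlier.index_in_block < later.index_in_block;
>>       if (strictly_dominated || same_block_earlier) {
>>         later.needs_barrier = false;  // barrier is redundant
>>         break;
>>       }
>>     }
>>   }
>> }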
>>
>> Optimization 2: Tight register spilling
>> When we call the slow path of our barriers, we want to spill only the 
>> registers that are live. Previously, we had a special 
>> LoadBarrierSlowReg node corresponding to the slow path that killed 
>> all XMM registers, while all general-purpose registers were saved 
>> in the slow path. Now we instead perform an explicit liveness 
>> analysis of registers on MachNodes, including how large a chunk of 
>> each vector register is in use, and spill exactly the registers that 
>> are live (and only the live part of XMM/YMM/ZMM registers).
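>>
>> As a toy model of that liveness-driven spilling (hypothetical names, 
>> not the patch's code), the slow-path call site could save exactly the 
>> live registers, and only the live width of each vector register:
>>
>> #include <cstdio>
>>
>> struct RegLive {
>>   unsigned gpr_live;       // bitmask of live general-purpose registers
>>   unsigned vec_live;       // bitmask of live vector registers
>>   int      vec_bytes[32];  // live width per vector reg: 16 (XMM),
>>                            // 32 (YMM) or 64 (ZMM) bytes
>> };
>>
>> // Spill exactly the live registers before the slow-path call; the
>> // matching restores in reverse order are elided for brevity.
>> void spill_around_call(const RegLive& live) {
>>   for (int r = 0; r < 32; r++) {
>>     if (live.gpr_live & (1u << r)) {
>>       printf("  push  r%d\n", r);
>>     }
>>   }
>>   for (int v = 0; v < 32; v++) {
>>     if (live.vec_live & (1u << v)) {
>>       printf("  store v%d, %d bytes\n", v, live.vec_bytes[v]);
>>     }
>>   }
>>   printf("  call  barrier_slow_path\n");
>> }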
>>
>> Zooming out a bit, all the complexity of pulling the barriers in the 
>> sea of nodes through various interesting phases while retaining GC 
>> invariants such as "don't put safepoints in my barrier" becomes 
>> trivial and is no longer an issue. We simply tag our loads as needing 
>> barriers and let C2 do whatever it wants to in the sea of nodes. Once 
>> all scheduling is done, we do our thing. Hopefully this will make the 
>> barriers as stable and resilient as our C1 barriers, which very 
>> rarely cause trouble.
>>
>> We have run a number of benchmarks. We have observed a number of 
>> improvements, but no regressions. There have been countless runs 
>> through the gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>>
>> Finally, I would like to thank Per and StefanK for the many hours 
>> spent helping me with this patch: spotting flaws in my prototypes, 
>> benchmarking, testing, and refactoring so that the code looks nicer 
>> and is much easier to understand. I will add both to the 
>> Contributed-by line.
>>
>> @Stuart: It would be awesome if you could provide some AArch64 bits 
>> for this patch so we do things the same way (ish).
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>
>> Webrev:
>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>
>> Thanks,
>> /Erik

