RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Nils Eliasson nils.eliasson at oracle.com
Wed Sep 4 17:10:12 UTC 2019


On 2019-09-04 18:36, Per Liden wrote:
> On 9/4/19 5:22 PM, Nils Eliasson wrote:
>> Hi Erik,
>>
>> In hotspot/cpu/x86/gc/z/z_x86_64.ad this pattern is quite common:
>>
>> // Load Pointer
>> instruct ZLoadP(rRegP dst, memory mem, rFlagsReg cr)
>> %{
>>   predicate(UseZGC && n->as_Load()->barrier_data() == ZLoadBarrierStrong);
>>   ...
>>
>>   ins_encode %{
>>     __ movptr($dst$$Register, $mem$$Address);
>>     if (barrier_data() != ZLoadBarrierElided) {
>>       z_load_barrier(_masm, this, $mem$$Address, $dst$$Register,
>>                      noreg /* tmp */, false /* weak */);
>>     }
>>   %}
>>
>> If you predicate on "n->as_Load()->barrier_data() == 
>> ZLoadBarrierStrong", how can "barrier_data() != ZLoadBarrierElided" 
>> ever be false? When barrier_data() == ZLoadBarrierElided, the load 
>> would have matched the non-Z rule.
>
> The context (the "this" pointer) inside ins_encode is MachNode-level. 
> So the original LoadNode had a barrier attached to it when the rule 
> matched, but the barrier elision optimization may later have elided it 
> on the MachNode.

Right, the elision runs during code buffer init.


In phaseX.cpp there are two instances of:

  BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
  bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
  ...
  if (has_load_barrier_nodes) {

I know it was there before, but since you refactored the names anyway, 
why not change it to:

  BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
  ...
  if (bs->has_load_barrier_nodes()) {

// N

>
> cheers,
> Per
>
>>
>> // Nils
>>
>>
>> On 2019-09-04 14:58, Erik Österlund wrote:
>>> Hi,
>>>
>>> For many years we have expanded load barriers in the C2 sea of 
>>> nodes. It has been a constant struggle to keep up with bugs caused 
>>> by optimizations breaking our barriers, and it has never truly 
>>> worked. One particular pain point that has never been handled quite 
>>> right until now is dealing with safepoints ending up between a load 
>>> and its load barrier. We have had workarounds for that before, but 
>>> they have never really been enough.
>>>
>>> In the end, our barrier is only a conditional branch to a slow path, 
>>> so there is really not much that the optimizer can do to help us 
>>> make that better. But it has many creative ways of breaking our GC 
>>> invariants.
>>>
>>> I think we have finally had enough of this, and want to move the 
>>> barrier expansion to the MachNode level instead. This way, we can 
>>> finally put an end to the load and its load barrier being separated 
>>> (and related even more complicated issues for atomics).
>>>
>>> Our new solution is to tag accesses that want load barriers during 
>>> parsing, and then let C2 optimize whatever it wants to, independently 
>>> of GC barriers. C2 then matches mach nodes and performs global code 
>>> motion and scheduling. Only right before dumping the machine code of 
>>> the resulting graph do we call into the barrier set to perform a 
>>> last-second analysis of the barriers, and then, during machine code 
>>> dumping, we inject our load barriers. After the normal insts() are 
>>> dumped, we inject the slow-path stubs for our barriers.
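>>>
>>> To illustrate the flow, here is a minimal standalone sketch (made-up 
>>> names and types for illustration, not the actual HotSpot code): the 
>>> tag is attached at parse time, ignored by every optimization, and 
>>> only consulted when the instruction is emitted.
>>>
>>>   #include <cstdint>
>>>   #include <cstdio>
>>>
>>>   // Barrier metadata attached to a load at parse time. The optimizer
>>>   // never inspects it; only the emit step does.
>>>   enum BarrierData : uint8_t {
>>>     BarrierElided = 0,  // analysis proved the barrier unnecessary
>>>     BarrierStrong = 1
>>>   };
>>>
>>>   struct LoadInsn {
>>>     const char* dst;          // destination register
>>>     BarrierData barrier_data; // set while parsing, maybe cleared later
>>>   };
>>>
>>>   // Emit step: the load itself, then (only if still tagged) a
>>>   // conditional branch to an out-of-line slow-path stub that is
>>>   // dumped after the normal instructions.
>>>   static void emit(const LoadInsn& load) {
>>>     printf("  movptr %s, [mem]\n", load.dst);
>>>     if (load.barrier_data != BarrierElided) {
>>>       printf("  test %s, bad_mask; jnz slow_path_stub\n", load.dst);
>>>     }
>>>   }
>>>
>>>   int main() {
>>>     LoadInsn strong = { "rax", BarrierStrong };
>>>     LoadInsn elided = { "rbx", BarrierElided };
>>>     emit(strong);
>>>     emit(elided);  // no barrier emitted for the elided load
>>>     return 0;
>>>   }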
>>>
>>> There are two optimizations that we would like to retain in this 
>>> scheme.
>>>
>>> Optimization 1: Dominating barrier analysis
>>> Previously, we instantiated a PhaseIdealLoop instance to analyze 
>>> dominating barriers. That was convenient because a dominator tree is 
>>> available for finding load barriers that dominate other load 
>>> barriers in the CFG. I built a new, more precise analysis at the 
>>> PhaseCFG level instead, which runs after matching to mach nodes. The 
>>> analysis now looks for dominating accesses instead of dominating 
>>> load barriers, because any dominating access, including stores, 
>>> makes sure that what is left behind in memory is "good". Another 
>>> thing that makes the analysis more precise is that it doesn't 
>>> require strict dominance in the CFG. If the earlier access is in the 
>>> same Block as an access with barriers, we now also utilize knowledge 
>>> about the scheduling of the instructions, which has completed at 
>>> this point, so we can safely remove such pointless load barriers 
>>> within a single block too. The analysis is performed right before 
>>> machine code is emitted, so we can trust that the result won't be 
>>> invalidated by later optimizations.
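>>>
>>> As a rough standalone model of the same-Block case (simplified 
>>> types, not the real PhaseCFG code): once scheduling has fixed the 
>>> order of instructions within a block, a load barrier is redundant 
>>> whenever any earlier access in the block, load or store, touched the 
>>> same address.
>>>
>>>   #include <cstdio>
>>>   #include <vector>
>>>
>>>   struct Access {
>>>     int  address;       // stand-in for a (base, offset) address
>>>     bool is_store;
>>>     bool needs_barrier;
>>>   };
>>>
>>>   // Within one scheduled block, elide the barrier of any access that
>>>   // is preceded by an access to the same address -- a store also
>>>   // leaves a "good" pointer behind in memory.
>>>   static void elide_dominated_barriers(std::vector<Access>& block) {
>>>     for (size_t i = 0; i < block.size(); i++) {
>>>       if (!block[i].needs_barrier) continue;
>>>       for (size_t j = 0; j < i; j++) {
>>>         if (block[j].address == block[i].address) {
>>>           block[i].needs_barrier = false;  // dominated by earlier access
>>>           break;
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>>   int main() {
>>>     std::vector<Access> block = {
>>>       { 0x10, true,  false },  // store to 0x10
>>>       { 0x10, false, true  },  // load from 0x10: barrier elided
>>>       { 0x20, false, true  }   // load from 0x20: barrier kept
>>>     };
>>>     elide_dominated_barriers(block);
>>>     for (const Access& a : block) {
>>>       printf("addr=0x%x barrier=%d\n", (unsigned)a.address,
>>>              (int)a.needs_barrier);
>>>     }
>>>     return 0;
>>>   }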
>>>
>>> Optimization 2: Tight register spilling
>>> When we call the slow path of our barriers, we want to spill only 
>>> the registers that are live. Previously, we had a special 
>>> LoadBarrierSlowReg node corresponding to the slow path that killed 
>>> all XMM registers, and then all general-purpose registers were saved 
>>> in the slow path. Now we instead perform explicit liveness analysis 
>>> of our registers on MachNodes, including how large a chunk of each 
>>> vector register is in use, and spill exactly the registers that are 
>>> live (and only the part of the register that is live for XMM/YMM/ZMM 
>>> registers).
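>>>
>>> A simplified standalone model of the idea (illustrative names only, 
>>> not the actual register allocator interface): given the computed 
>>> live set, the stub spills only the live registers, and for vector 
>>> registers only the part that is actually live.
>>>
>>>   #include <cstdio>
>>>   #include <vector>
>>>
>>>   struct RegState {
>>>     const char* name;
>>>     bool live;        // live across the slow-path call?
>>>     int  live_bytes;  // 8 for a GPR; 16/32/64 for XMM/YMM/ZMM width
>>>   };
>>>
>>>   // Spill only what is live, and only as many bytes as are live.
>>>   static void spill_live(const std::vector<RegState>& regs) {
>>>     for (const RegState& r : regs) {
>>>       if (r.live) {
>>>         printf("spill %s (%d bytes)\n", r.name, r.live_bytes);
>>>       }
>>>     }
>>>   }
>>>
>>>   int main() {
>>>     std::vector<RegState> regs = {
>>>       { "rax",  true,  8  },
>>>       { "rbx",  false, 8  },  // dead: not spilled at all
>>>       { "xmm0", true,  16 },  // only the XMM part is live
>>>       { "xmm1", true,  64 }   // full ZMM width live
>>>     };
>>>     spill_live(regs);
>>>     return 0;
>>>   }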
>>>
>>> Zooming out a bit, all the complexity of pulling the barriers in the 
>>> sea of nodes through various interesting phases while retaining GC 
>>> invariants such as "don't put safepoints in my barrier" becomes 
>>> trivial and is no longer an issue. We simply tag our loads as 
>>> needing barriers, and let C2 do whatever it wants to in the sea of 
>>> nodes. Once all scheduling is done, we do our thing. Hopefully this 
>>> will make the barriers as stable and resilient as our C1 barriers, 
>>> which cause trouble extremely rarely.
>>>
>>> We have run a number of benchmarks. We have observed a number of 
>>> improvements, but never any regressions. There have been countless 
>>> runs through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>>>
>>> Finally, I would like to thank Per and StefanK for the many hours 
>>> they spent helping me with this patch: spotting flaws in my 
>>> prototypes, benchmarking, testing, and refactoring so the code looks 
>>> nice and is much more understandable. I will add both to the 
>>> Contributed-by line.
>>>
>>> @Stuart: It would be awesome if you could provide some AArch64 bits 
>>> for this patch so we do things the same way (ish).
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>
>>> Thanks,
>>> /Erik

