RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level
Nils Eliasson
nils.eliasson at oracle.com
Wed Sep 4 19:10:32 UTC 2019
On 2019-09-04 19:58, Erik Osterlund wrote:
> Hi Nils,
>
>> On 4 Sep 2019, at 19:10, Nils Eliasson <nils.eliasson at oracle.com> wrote:
>>
>>
>>> On 2019-09-04 18:36, Per Liden wrote:
>>>> On 9/4/19 5:22 PM, Nils Eliasson wrote:
>>>> Hi Erik,
>>>>
>>>> In hotspot/cpu/x86/gc/z/z_x86_64.ad this pattern is quite common:
>>>>
>>>> // Load Pointer
>>>> instruct ZLoadP(rRegP dst, memory mem, rFlagsReg cr)
>>>> %{
>>>>   predicate(UseZGC && n->as_Load()->barrier_data() == ZLoadBarrierStrong);
>>>>   ...
>>>>
>>>>   ins_encode %{
>>>>     __ movptr($dst$$Register, $mem$$Address);
>>>>     if (barrier_data() != ZLoadBarrierElided) {
>>>>       z_load_barrier(_masm, this, $mem$$Address, $dst$$Register, noreg /* tmp */, false /* weak */);
>>>>     }
>>>>   %}
>>>>
>>>> If you predicate on "n->as_Load()->barrier_data() == ZLoadBarrierStrong" - how can "barrier_data() != ZLoadBarrierElided" ever be false? When barrier_data() == ZLoadBarrierElided it will have matched the non-Z rule.
>>> The context (the "this" pointer) inside ins_encode is MachNode-level. So, the original LoadNode had a barrier attached to it, but later the barrier elision optimization elided it on the MachNode.
>> Right, the elision runs during codebuffer init.
> Yup.
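
[For illustration, a minimal standalone sketch of that interplay. The types and helper names below are invented for the example and are not the actual HotSpot classes; they only model how barrier data checked at matching time can be cleared later on the mach node, so the emitted code must re-check it.]

#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the real LoadNode/MachNode barrier data.
const uint8_t ZLoadBarrierElided = 0;
const uint8_t ZLoadBarrierStrong = 1;

struct IdealLoad { uint8_t barrier_data; };  // what the instruct predicate sees
struct MachLoad  { uint8_t barrier_data; };  // what "this" in ins_encode sees

// Matching: the predicate is evaluated on the ideal node, and its barrier
// data is carried over to the mach node that will emit the code.
MachLoad match(const IdealLoad& n) {
  // predicate(UseZGC && n->as_Load()->barrier_data() == ZLoadBarrierStrong)
  return MachLoad{ n.barrier_data };
}

// Late elision: if a dominating access already guarantees a good oop, the
// barrier is downgraded on the mach node -- after matching has happened.
void maybe_elide(MachLoad& m, bool has_dominating_access) {
  if (has_dominating_access) {
    m.barrier_data = ZLoadBarrierElided;
  }
}

// Emission: this is why ins_encode still tests barrier_data() even though
// the predicate required ZLoadBarrierStrong when the rule was matched.
void emit(const MachLoad& m) {
  printf("movptr dst, mem\n");
  if (m.barrier_data != ZLoadBarrierElided) {
    printf("z_load_barrier(...)\n");
  }
}

int main() {
  IdealLoad n{ ZLoadBarrierStrong };
  MachLoad m = match(n);
  maybe_elide(m, true /* dominating access found */);
  emit(m);  // emits only the load; the barrier was elided after matching
  return 0;
}
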
>
>> In PhaseX.cpp there are two cases of:
>>
>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>> bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
>> ...
>> if (has_load_barrier_nodes) {"
>>
>> I know it was there before, but since you refactored the names, why not change it to:
>>
>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>> ...
>> if (bs->has_load_barrier_nodes()) {"
> I’m guessing the author of that code deliberately wanted to perform the virtual call outside of the loops instead of inside, which doesn’t bother me. Don’t like that?
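
[For what it's worth, the two variants under discussion differ only in whether the virtual call is hoisted by hand. A minimal sketch with an invented interface, purely to show the shape of the code in question; the real BarrierSetC2 API is not reproduced here:]

// Hypothetical interface; not the real BarrierSetC2 class.
struct BarrierSetC2 {
  virtual bool has_load_barrier_nodes() const = 0;
  virtual ~BarrierSetC2() {}
};

// Variant kept in the patch: the virtual call is made once, outside the loop.
void transform_hoisted(BarrierSetC2* bs, int node_count) {
  bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
  for (int i = 0; i < node_count; i++) {
    if (has_load_barrier_nodes) {
      // ... per-node barrier bookkeeping ...
    }
  }
}

// Variant suggested in the review: relies on the virtual call being cheap or
// on the compiler proving it loop-invariant, which it generally cannot.
void transform_inline(BarrierSetC2* bs, int node_count) {
  for (int i = 0; i < node_count; i++) {
    if (bs->has_load_barrier_nodes()) {
      // ... per-node barrier bookkeeping ...
    }
  }
}
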
>
> Thanks for reviewing!
>
> /Erik
Right, my mind had folded in the facts that the barrier set doesn't
change and that has_load_barrier_nodes() is pure. I should check the
generated code before you change it.
//N
>
>> // N
>>
>>> cheers,
>>> Per
>>>
>>>> // Nils
>>>>
>>>>
>>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>> Hi,
>>>>>
>>>>> For many years we have expanded load barriers in the C2 sea of nodes. It has been a constant struggle to keep up with bugs due to optimizations breaking our barriers. It has never truly worked. One particular pain point that has never been handled quite right up until now, is dealing with safepoints ending up between a load and its load barrier. We have had workarounds for that before, but they have never really been enough.
>>>>>
>>>>> In the end, our barrier is only a conditional branch to a slow path, so there is really not much that the optimizer can do to help us make that better. But it has many creative ways of breaking our GC invariants.
>>>>>
>>>>> I think we have finally had enough of this, and want to move the barrier expansion to the MachNode level instead. This way, we can finally put an end to the load and its load barrier being separated (and related even more complicated issues for atomics).
>>>>>
>>>>> Our new solution is to tag accesses that want load barriers during parsing, and then let C2 optimize whatever it wants to, independently of GC barriers. Then it matches mach nodes and performs global code motion and scheduling. Only right before dumping the machine code of the resulting graph do we call into the barrier set to perform last-second analysis of barriers, and then during machine code dumping, we inject our load barriers. After the normal insts() are dumped, we inject slow-path stubs for our barriers.
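
[A rough sketch of that ordering, with invented phase names that do not correspond to the actual HotSpot functions, just to make the sequence concrete:]

struct CompileTask {};

// All of the following are hypothetical placeholders for the steps above.
static void parse_and_tag_barrier_loads(CompileTask&)      {}  // barrier_data set at parse time
static void run_ideal_optimizations(CompileTask&)           {}  // C2 optimizes freely; barriers are just metadata
static void match_to_mach_nodes(CompileTask&)               {}  // Z* instruct rules selected by their predicates
static void global_code_motion_and_scheduling(CompileTask&) {}
static void barrier_set_late_analysis(CompileTask&)         {}  // e.g. elide barriers with a dominating access
static void emit_machine_code(CompileTask&)                 {}  // each load emits its barrier check inline
static void emit_barrier_slow_path_stubs(CompileTask&)      {}  // stubs appended after the normal insts()

void compile(CompileTask& task) {
  parse_and_tag_barrier_loads(task);
  run_ideal_optimizations(task);
  match_to_mach_nodes(task);
  global_code_motion_and_scheduling(task);
  barrier_set_late_analysis(task);   // only now does the barrier set get to look at the graph
  emit_machine_code(task);
  emit_barrier_slow_path_stubs(task);
}
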
>>>>>
>>>>> There are two optimizations that we would like to retain in this scheme.
>>>>>
>>>>> Optimization 1: Dominating barrier analysis
>>>>> Previously, we instantiated a PhaseIdealLoop instance to analyze dominating barriers. That was convenient because a dominator tree is available for finding load barriers that dominate other load barriers in the CFG. I built a new, more precise analysis on the PhaseCFG level instead, happening after the matching to mach nodes. The analysis now looks for dominating accesses instead of dominating load barriers, because any dominating access, including stores, will make sure that what is left behind in memory is "good". Another thing that makes the analysis more precise is that it doesn't require strict dominance in the CFG. If the earlier access is in the same Block as an access with barriers, we now also utilize knowledge about the scheduling of the instructions, which has completed at this point, so we can safely remove such pointless load barriers within a block as well. The analysis is performed right before machine code is emitted, so we can trust that the graph won't change afterwards due to optimizations.
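
[A heavily simplified, standalone sketch of that kind of analysis, with invented types and ignoring details such as safepoints between the two accesses; the real implementation works on PhaseCFG blocks and mach nodes:]

#include <vector>

// Hypothetical model of an access after matching and scheduling.
struct Access {
  int  block_id;        // block the access was scheduled into
  int  index_in_block;  // position in the block's final instruction order
  int  address_id;      // identity of the accessed memory location
  bool needs_barrier;   // set at parse time, possibly cleared here
};

// Placeholder dominance query; the real answer comes from the CFG's dominator
// information. Always answering false keeps only the same-block case active.
static bool block_strictly_dominates(int /*dom_block*/, int /*block*/) { return false; }

// Any earlier access to the same address -- load or store -- leaves a "good"
// oop behind, so a later load barrier for that address is redundant.
void elide_dominated_barriers(std::vector<Access>& accesses) {
  for (Access& later : accesses) {
    if (!later.needs_barrier) continue;
    for (const Access& earlier : accesses) {
      if (&earlier == &later || earlier.address_id != later.address_id) continue;
      bool dominates    = block_strictly_dominates(earlier.block_id, later.block_id);
      bool earlier_here = earlier.block_id == later.block_id &&
                          earlier.index_in_block < later.index_in_block;
      if (dominates || earlier_here) {
        later.needs_barrier = false;  // a dominating access already healed it
        break;
      }
    }
  }
}
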
>>>>>
>>>>> Optimization 2: Tight register spilling
>>>>> When we call the slow path of our barriers, we want to spill only the registers that are live. Previously, we had a special LoadBarrierSlowReg node corresponding to the slow path, which killed all XMM registers, and then all general-purpose registers were saved in the slow path. Now we instead perform explicit liveness analysis of our registers on MachNodes, including how large chunks of the vector registers are being used, and spill exactly the registers that are live (and only the part of the register that is live for XMM/YMM/ZMM registers).
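
[Again just a hedged sketch with invented helpers, to illustrate "spill only what is live"; in the real code the liveness information comes from the register masks on the MachNodes after register allocation:]

#include <vector>

// Hypothetical macro-assembler helpers; not real HotSpot APIs.
static void save_gpr(int /*reg*/)                    {}
static void save_vector(int /*reg*/, int /*bytes*/)  {}
static void call_barrier_runtime()                   {}

struct LiveReg {
  int  reg;         // register number
  bool is_vector;   // general-purpose vs. XMM/YMM/ZMM register
  int  live_bytes;  // for vector registers: how much of the register is live
};

// Spill exactly the registers that are live across the barrier slow-path
// call, instead of unconditionally killing all XMM registers as before.
void emit_barrier_slow_path(const std::vector<LiveReg>& live_at_barrier) {
  for (const LiveReg& r : live_at_barrier) {
    if (r.is_vector) {
      save_vector(r.reg, r.live_bytes);  // only the live portion of the vector register
    } else {
      save_gpr(r.reg);                   // only live general-purpose registers
    }
  }
  call_barrier_runtime();
  // ... restore the same registers in reverse order (omitted) ...
}
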
>>>>>
>>>>> Zooming out a bit, all complexity of pulling the barriers in the sea of nodes through various interesting phases while retaining GC invariants such as "don't put safepoints in my barrier", become trivial and no longer an issue. We simply tag our loads to need barriers, and let C2 do whatever it wants to in the sea of nodes. Once all scheduling is done, we do our thing. Hopefully this will make the barriers as stable and resilient as our C1 barriers, which cause trouble extremely rarely.
>>>>>
>>>>> We have run a number of benchmarks. We have observed a number of improvements, but never any regressions. There have been countless runs through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>>>>>
>>>>> Finally, I would like to thank Per and StefanK for the many hours spent on helping me with this patch, both in terms of spotting flaws in my prototypes, benchmarking, testing, and refactoring so the code looks nice and much more understandable. I will add both to the Contributed-by line.
>>>>>
>>>>> @Stuart: It would be awesome if you could provide some AArch64 bits for this patch so we do things the same way (ish).
>>>>>
>>>>> Bug:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>
>>>>> Thanks,
>>>>> /Erik