RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level
Nils Eliasson
nils.eliasson at oracle.com
Tue Sep 10 09:55:55 UTC 2019
Yes, that should solve it.
Great work!
Reviewed.
// Nils
On 2019-09-10 11:34, Per Liden wrote:
> Hi Nils,
>
> (Answering for Erik since he's on vacation)
>
> On 9/10/19 11:23 AM, Nils Eliasson wrote:
>> Hi,
>>
>> The code for propagating the barrier data when cloning has been
>> removed (node.cpp). How is that supposed to be handled now?
>
> Nothing special needs to be done. The barrier data is part of the
> MemNode and LoadStoreNode and those nodes have a size_of(), so it will
> be automatically propagated when doing node cloning.
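>
> To make that concrete, here is a minimal standalone sketch (simplified
> stand-ins, not the actual Node/MemNode classes) of why a size_of()-based
> raw copy carries the barrier data along for free:
>
> #include <cstddef>
> #include <cstdint>
> #include <cstdio>
> #include <cstring>
> #include <new>
>
> // Simplified model of Node::clone(): a raw copy of size_of() bytes, which
> // is why subclass fields such as the barrier data come along automatically.
> struct Node {
>   virtual size_t size_of() const { return sizeof(Node); }
>   Node* clone() const {
>     void* mem = ::operator new(size_of());
>     // Mirrors the raw copy HotSpot performs; relies on the same assumptions.
>     memcpy(mem, static_cast<const void*>(this), size_of());
>     return static_cast<Node*>(mem);
>   }
>   virtual ~Node() {}
> };
>
> struct LoadNode : Node {
>   uint8_t _barrier_data = 0;   // stand-in for the barrier data field
>   size_t size_of() const override { return sizeof(LoadNode); }
> };
>
> int main() {
>   LoadNode load;
>   load._barrier_data = 1;      // pretend this is ZLoadBarrierStrong
>   LoadNode* copy = static_cast<LoadNode*>(load.clone());
>   printf("cloned barrier_data = %d\n", copy->_barrier_data);  // prints 1
>   ::operator delete(copy);
> }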
>
>>
>> And the same thing for mach nodes - mach nodes can also be cloned.
>> Right now it looks like the barrier_data will be lost.
>
> Same here.
>
> cheers,
> Per
>
>>
>> // Nils
>>
>>
>> On 2019-09-04 21:10, Nils Eliasson wrote:
>>>
>>> On 2019-09-04 19:58, Erik Osterlund wrote:
>>>> Hi Nils,
>>>>
>>>>> On 4 Sep 2019, at 19:10, Nils Eliasson <nils.eliasson at oracle.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> On 2019-09-04 18:36, Per Liden wrote:
>>>>>>> On 9/4/19 5:22 PM, Nils Eliasson wrote:
>>>>>>> Hi Erik,
>>>>>>>
>>>>>>> In hotspot/cpu/x86/gc/z/z_x86_64.ad this pattern is quite common:
>>>>>>>
>>>>>>> // Load Pointer
>>>>>>> instruct ZLoadP(rRegP dst, memory mem, rFlagsReg cr)
>>>>>>> %{
>>>>>>>   predicate(UseZGC && n->as_Load()->barrier_data() == ZLoadBarrierStrong);
>>>>>>>   ...
>>>>>>>
>>>>>>>   ins_encode %{
>>>>>>>     __ movptr($dst$$Register, $mem$$Address);
>>>>>>>     if (barrier_data() != ZLoadBarrierElided) {
>>>>>>>       z_load_barrier(_masm, this, $mem$$Address, $dst$$Register, noreg /* tmp */, false /* weak */);
>>>>>>>     }
>>>>>>>   %}
>>>>>>>
>>>>>>> If you predicate on "n->as_Load()->barrier_data() ==
>>>>>>> ZLoadBarrierStrong" - how can "barrier_data() !=
>>>>>>> ZLoadBarrierElided" ever be false? When barrier_data() ==
>>>>>>> ZLoadBarrierElided it will have matched the non-Z rule.
>>>>>> The context (the "this" pointer) inside ins_encode is
>>>>>> MachNode-level. So, the original LoadNode had a barrier attached
>>>>>> to it, but later the barrier elision optimization elided it on
>>>>>> the MachNode.
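>>>>>>
>>>>>> To make the timing concrete, a minimal standalone sketch (the types,
>>>>>> constants and passes below are stand-ins, not the actual HotSpot
>>>>>> code): the predicate is evaluated on the ideal LoadNode at match
>>>>>> time, while the check in ins_encode runs after a later elision pass
>>>>>> may have downgraded the MachNode's barrier data:
>>>>>>
>>>>>> #include <cstdint>
>>>>>> #include <cstdio>
>>>>>>
>>>>>> // Stand-ins for the real ZGC constants; only the distinction matters.
>>>>>> const uint8_t ZLoadBarrierElided = 0;
>>>>>> const uint8_t ZLoadBarrierStrong = 1;
>>>>>>
>>>>>> struct MachLoadP {
>>>>>>   uint8_t barrier_data;   // copied from the ideal LoadNode at match time
>>>>>> };
>>>>>>
>>>>>> // Matching: the predicate saw ZLoadBarrierStrong on the ideal node, so
>>>>>> // the ZLoadP rule was selected and the MachNode starts out strong.
>>>>>> MachLoadP match_zloadp() { return MachLoadP{ZLoadBarrierStrong}; }
>>>>>>
>>>>>> // Later elision pass (runs on MachNodes, before code emission): it may
>>>>>> // decide the barrier is redundant and downgrade it.
>>>>>> void elide_if_redundant(MachLoadP& n, bool redundant) {
>>>>>>   if (redundant) n.barrier_data = ZLoadBarrierElided;
>>>>>> }
>>>>>>
>>>>>> // Emission: this is the ins_encode check from the .ad file -- by now
>>>>>> // the barrier may be elided even though the rule only matched strong loads.
>>>>>> void emit(const MachLoadP& n) {
>>>>>>   printf("load\n");
>>>>>>   if (n.barrier_data != ZLoadBarrierElided) {
>>>>>>     printf("load barrier\n");
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> int main() {
>>>>>>   MachLoadP n = match_zloadp();
>>>>>>   elide_if_redundant(n, /*redundant=*/true);
>>>>>>   emit(n);   // prints only "load": the predicate was true, the check is false
>>>>>> }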
>>>>> Right, the elision runs during codebuffer init.
>>>> Yup.
>>>>
>>>>> In PhaseX.cpp there are two cases of:
>>>>>
>>>>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>>>>> bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
>>>>> ...
>>>>> if (has_load_barrier_nodes) {"
>>>>>
>>>>> I know it was there before, but since you refactored the names.
>>>>> Why not change it to:
>>>>>
>>>>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>>>>> ...
>>>>> if (bs->has_load_barrier_nodes()) {"
>>>> I’m guessing the author of that code deliberately wanted to perform
>>>> the virtual call outside of the loops instead of inside, which
>>>> doesn’t bother me. Don’t like that?
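>>>>
>>>> In standalone form (illustrative types, not the real BarrierSetC2),
>>>> the two shapes look like this; the cached form is what the PhaseX
>>>> code does, since the compiler generally cannot hoist a virtual call
>>>> it cannot prove pure:
>>>>
>>>> #include <cstdio>
>>>>
>>>> // Illustrative types only, not the HotSpot classes.
>>>> struct BarrierSetC2 {
>>>>   virtual bool has_load_barrier_nodes() const = 0;
>>>>   virtual ~BarrierSetC2() {}
>>>> };
>>>>
>>>> struct ZBarrierSetC2 : BarrierSetC2 {
>>>>   bool has_load_barrier_nodes() const override { return true; }
>>>> };
>>>>
>>>> // Variant A: the virtual call sits inside the loop; the compiler usually
>>>> // cannot prove the callee is pure, so the dispatch repeats every iteration.
>>>> int count_inside(const BarrierSetC2* bs, int n) {
>>>>   int hits = 0;
>>>>   for (int i = 0; i < n; i++) {
>>>>     if (bs->has_load_barrier_nodes()) hits++;
>>>>   }
>>>>   return hits;
>>>> }
>>>>
>>>> // Variant B: the result is cached once before the loop -- the pattern
>>>> // the PhaseX code uses.
>>>> int count_hoisted(const BarrierSetC2* bs, int n) {
>>>>   const bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
>>>>   int hits = 0;
>>>>   for (int i = 0; i < n; i++) {
>>>>     if (has_load_barrier_nodes) hits++;
>>>>   }
>>>>   return hits;
>>>> }
>>>>
>>>> int main() {
>>>>   ZBarrierSetC2 z;
>>>>   printf("%d %d\n", count_inside(&z, 10), count_hoisted(&z, 10));  // 10 10
>>>> }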
>>>>
>>>> Thanks for reviewing!
>>>>
>>>> /Erik
>>>
>>> Right, my mind had folded together the facts that the barrier set
>>> doesn't change and that has_load_barrier_nodes() is pure. I should
>>> check the generated code before asking you to change it.
>>>
>>> //N
>>>
>>>
>>>>
>>>>> // N
>>>>>
>>>>>> cheers,
>>>>>> Per
>>>>>>
>>>>>>> // Nils
>>>>>>>
>>>>>>>
>>>>>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> For many years we have expanded load barriers in the C2 sea of
>>>>>>>> nodes. It has been a constant struggle to keep up with bugs due
>>>>>>>> to optimizations breaking our barriers. It has never truly
>>>>>>>> worked. One particular pain point that has never been handled
>>>>>>>> quite right up until now, is dealing with safepoints ending up
>>>>>>>> between a load and its load barrier. We have had workarounds
>>>>>>>> for that before, but they have never really been enough.
>>>>>>>>
>>>>>>>> In the end, our barrier is only a conditional branch to a slow
>>>>>>>> path, so there is really not much that the optimizer can do to
>>>>>>>> help us make that better. But it has many creative ways of
>>>>>>>> breaking our GC invariants.
>>>>>>>>
>>>>>>>> I think we have finally had enough of this, and want to move
>>>>>>>> the barrier expansion to the MachNode level instead. This way,
>>>>>>>> we can finally put an end to the load and its load barrier
>>>>>>>> being separated (and related even more complicated issues for
>>>>>>>> atomics).
>>>>>>>>
>>>>>>>> Our new solution is to tag accesses that want load barriers
>>>>>>>> during parsing, and then let C2 optimize whatever it wants to,
>>>>>>>> invariantly of GC barriers. Then it will match mach nodes, and
>>>>>>>> perform global code motion and scheduling. Only right before
>>>>>>>> dumping the machine code of the resulting graph do we call into
>>>>>>>> the barrier set to perform last second analysis of barriers,
>>>>>>>> and then during machine code dumping, we inject our load
>>>>>>>> barriers. After the normal insts() are dumped, we inject slow
>>>>>>>> path stubs for our barriers.
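>>>>>>>>
>>>>>>>> As a rough outline (every name below is an illustrative stand-in,
>>>>>>>> not an actual HotSpot entry point), the flow now looks like this:
>>>>>>>>
>>>>>>>> #include <cstdio>
>>>>>>>>
>>>>>>>> // Illustrative stand-ins only.
>>>>>>>> struct IdealGraph {};
>>>>>>>> struct MachGraph {};
>>>>>>>>
>>>>>>>> IdealGraph parse()                     { puts("parse: tag the loads that want barriers"); return {}; }
>>>>>>>> void optimize(IdealGraph&)             { puts("optimize: C2 transforms plain loads, barriers invisible"); }
>>>>>>>> MachGraph match(IdealGraph&)           { puts("match: barrier tag carried onto the MachNodes"); return {}; }
>>>>>>>> void schedule(MachGraph&)              { puts("global code motion + instruction scheduling"); }
>>>>>>>> void late_barrier_analysis(MachGraph&) { puts("barrier set: last-second analysis, elide redundant barriers"); }
>>>>>>>> void emit_code(MachGraph&)             { puts("emit: load barrier injected right after its load"); }
>>>>>>>> void emit_barrier_stubs()              { puts("emit: slow-path stubs appended after the normal insts()"); }
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>   IdealGraph ig = parse();
>>>>>>>>   optimize(ig);
>>>>>>>>   MachGraph mg = match(ig);
>>>>>>>>   schedule(mg);
>>>>>>>>   late_barrier_analysis(mg);
>>>>>>>>   emit_code(mg);
>>>>>>>>   emit_barrier_stubs();
>>>>>>>> }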
>>>>>>>>
>>>>>>>> There are two optimizations that we would like to retain in
>>>>>>>> this scheme.
>>>>>>>>
>>>>>>>> Optimization 1: Dominating barrier analysis
>>>>>>>> Previously, we instantiated a PhaseIdealLoop instance to
>>>>>>>> analyze dominating barriers. That was convenient because a
>>>>>>>> dominator tree is available for finding load barriers that
>>>>>>>> dominate other load barriers in the CFG. I built a new more
>>>>>>>> precise analysis on the PhaseCFG level instead, happening after
>>>>>>>> the matching to mach nodes. The analysis now looks for
>>>>>>>> dominating accesses instead of dominating load barriers, because
>>>>>>>> any dominating access, including stores, will make sure that what
>>>>>>>> is left behind in memory is "good". Another thing that makes the
>>>>>>>> analysis more precise is that it doesn't
>>>>>>>> require strict dominance in the CFG. If the earlier access is
>>>>>>>> in the same Block as an access with barriers, we now also
>>>>>>>> utilize knowledge about the scheduling of the instructions,
>>>>>>>> which has completed at this point. So we can safely remove such
>>>>>>>> pointless load barriers in the same block now. The analysis is
>>>>>>>> performed right before machine code is emitted, so we can trust
>>>>>>>> that it won't change after the analysis due to optimizations.
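>>>>>>>>
>>>>>>>> A minimal sketch of the idea (illustrative structs, not the real
>>>>>>>> PhaseCFG/MachNode types; the real analysis naturally has to be
>>>>>>>> more careful about what counts as the same address and about what
>>>>>>>> may happen between the two accesses):
>>>>>>>>
>>>>>>>> #include <cstddef>
>>>>>>>> #include <cstdio>
>>>>>>>> #include <vector>
>>>>>>>>
>>>>>>>> struct Access {
>>>>>>>>   int addr;            // stand-in for "same base + offset"
>>>>>>>>   bool needs_barrier;  // loads start out tagged, stores never need one
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct Block {
>>>>>>>>   Block* idom = nullptr;           // immediate dominator in the CFG
>>>>>>>>   std::vector<Access*> schedule;   // accesses in final scheduled order
>>>>>>>> };
>>>>>>>>
>>>>>>>> // Any earlier access to the same address, load or store, has already
>>>>>>>> // left a "good" value behind, so a later load barrier is pointless.
>>>>>>>> static bool has_dominating_access(const Block* b, size_t pos, int addr) {
>>>>>>>>   // Same block first: scheduling is final here, so earlier slots
>>>>>>>>   // really do execute earlier.
>>>>>>>>   for (size_t i = 0; i < pos; i++) {
>>>>>>>>     if (b->schedule[i]->addr == addr) return true;
>>>>>>>>   }
>>>>>>>>   // Then walk up the dominator tree.
>>>>>>>>   for (const Block* d = b->idom; d != nullptr; d = d->idom) {
>>>>>>>>     for (const Access* a : d->schedule) {
>>>>>>>>       if (a->addr == addr) return true;
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   return false;
>>>>>>>> }
>>>>>>>>
>>>>>>>> void elide_redundant_barriers(std::vector<Block*>& blocks) {
>>>>>>>>   for (Block* b : blocks) {
>>>>>>>>     for (size_t i = 0; i < b->schedule.size(); i++) {
>>>>>>>>       Access* a = b->schedule[i];
>>>>>>>>       if (a->needs_barrier && has_dominating_access(b, i, a->addr)) {
>>>>>>>>         a->needs_barrier = false;   // corresponds to marking it elided
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>   Access store{42, false}, load{42, true};
>>>>>>>>   Block dom, use;
>>>>>>>>   dom.schedule = {&store};
>>>>>>>>   use.idom = &dom;
>>>>>>>>   use.schedule = {&load};
>>>>>>>>   std::vector<Block*> blocks{&dom, &use};
>>>>>>>>   elide_redundant_barriers(blocks);
>>>>>>>>   printf("load needs barrier: %d\n", load.needs_barrier);  // prints 0
>>>>>>>> }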
>>>>>>>>
>>>>>>>> Optimization 2: Tight register spilling
>>>>>>>> When we call the slow path of our barriers, we want to spill
>>>>>>>> only the registers that are live. Previously, we had a special
>>>>>>>> LoadBarrierSlowReg node corresponding to the slow path, that
>>>>>>>> killed all XMM registers, and then all general purpose
>>>>>>>> registers were callee-saved in the slow path. Now we instead perform
>>>>>>>> explicit live analysis of our registers on MachNodes, including
>>>>>>>> how large chunks of vector registers are being used, and spill
>>>>>>>> only exactly the registers that are live (and only the part of
>>>>>>>> the register that is live for XMM/YMM/ZMM registers).
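>>>>>>>>
>>>>>>>> A sketch of the spill planning (illustrative only; the register
>>>>>>>> numbering and the LiveSet/SpillPlan types are made up for the
>>>>>>>> example):
>>>>>>>>
>>>>>>>> #include <cstdint>
>>>>>>>> #include <cstdio>
>>>>>>>> #include <utility>
>>>>>>>> #include <vector>
>>>>>>>>
>>>>>>>> // GPRs are 0..15, vector registers 0..15, and vec_bytes records how
>>>>>>>> // wide the live part of each vector register is (16 = XMM, 32 = YMM,
>>>>>>>> // 64 = ZMM).
>>>>>>>> struct LiveSet {
>>>>>>>>   uint32_t gpr_mask = 0;    // bit i set => GPR i is live across the call
>>>>>>>>   int vec_bytes[16] = {0};  // 0 => vector register i is not live at all
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct SpillPlan {
>>>>>>>>   std::vector<int> gprs_to_save;
>>>>>>>>   std::vector<std::pair<int, int>> vecs_to_save;  // (vector reg, bytes)
>>>>>>>> };
>>>>>>>>
>>>>>>>> // Build the minimal spill set for the barrier slow path: only what is
>>>>>>>> // live, and for vector registers only the live width, instead of
>>>>>>>> // killing everything. Assumption in this sketch: the slow path writes
>>>>>>>> // its result into result_gpr, so that register need not be preserved.
>>>>>>>> SpillPlan plan_spills(const LiveSet& live, int result_gpr) {
>>>>>>>>   SpillPlan plan;
>>>>>>>>   for (int r = 0; r < 16; r++) {
>>>>>>>>     if ((live.gpr_mask & (1u << r)) != 0 && r != result_gpr) {
>>>>>>>>       plan.gprs_to_save.push_back(r);
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   for (int v = 0; v < 16; v++) {
>>>>>>>>     if (live.vec_bytes[v] != 0) {
>>>>>>>>       plan.vecs_to_save.push_back({v, live.vec_bytes[v]});
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   return plan;
>>>>>>>> }
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>   LiveSet live;
>>>>>>>>   live.gpr_mask = (1u << 3) | (1u << 5);  // two GPRs live over the call
>>>>>>>>   live.vec_bytes[0] = 32;                 // only the YMM half of reg 0 is live
>>>>>>>>   SpillPlan p = plan_spills(live, /*result_gpr=*/3);
>>>>>>>>   printf("save %zu GPRs and %zu vector registers\n",
>>>>>>>>          p.gprs_to_save.size(), p.vecs_to_save.size());  // 1 and 1
>>>>>>>> }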
>>>>>>>>
>>>>>>>> Zooming out a bit, all complexity of pulling the barriers in
>>>>>>>> the sea of nodes through various interesting phases while
>>>>>>>> retaining GC invariants such as "don't put safepoints in my
>>>>>>>> barrier", become trivial and no longer an issue. We simply tag
>>>>>>>> our loads to need barriers, and let C2 do whatever it wants to
>>>>>>>> in the sea of nodes. Once all scheduling is done, we do our
>>>>>>>> thing. Hopefully this will make the barriers as stable and
>>>>>>>> resilient as our C1 barriers, which cause trouble extremely
>>>>>>>> rarely.
>>>>>>>>
>>>>>>>> We have run a number of benchmarks. We have observed a number
>>>>>>>> of improvements, but never any regressions. There have been
>>>>>>>> countless runs through gc-test-suite, and a few hs-tier1-6 and
>>>>>>>> hs-tier1-7 runs.
>>>>>>>>
>>>>>>>> Finally, I would like to thank Per and StefanK for the many
>>>>>>>> hours spent on helping me with this patch, both in terms of
>>>>>>>> spotting flaws in my prototypes, benchmarking, testing, and
>>>>>>>> refactoring so the code looks nice and much more
>>>>>>>> understandable. I will add both to the Contributed-by line.
>>>>>>>>
>>>>>>>> @Stuart: It would be awesome if you could provide some AArch64
>>>>>>>> bits for this patch so we do things the same way (ish).
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> /Erik
>>>