RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Per Liden per.liden at oracle.com
Tue Sep 10 09:34:14 UTC 2019


Hi Nils,

(Answering for Erik since he's on vacation)

On 9/10/19 11:23 AM, Nils Eliasson wrote:
> Hi,
> 
> The code for propagating the barrier data when cloning has been removed 
> (node.cpp). How is that supposed to be handled now?

Nothing special needs to be done. The barrier data is part of the 
MemNode and LoadStoreNode and those nodes have a size_of(), so it will 
be automatically propagated when doing node cloning.

> 
> And the same thing for mach nodes - mach nodes can also be cloned. Right 
> now it looks like that the barrier_data will be lost.

Same here.

cheers,
Per

> 
> // Nils
> 
> 
> On 2019-09-04 21:10, Nils Eliasson wrote:
>>
>> On 2019-09-04 19:58, Erik Osterlund wrote:
>>> Hi Nils,
>>>
>>>> On 4 Sep 2019, at 19:10, Nils Eliasson <nils.eliasson at oracle.com> 
>>>> wrote:
>>>>
>>>>
>>>>> On 2019-09-04 18:36, Per Liden wrote:
>>>>>> On 9/4/19 5:22 PM, Nils Eliasson wrote:
>>>>>> Hi Erik,
>>>>>>
>>>>>> In hotspot/cpu/x86/gc/z/z_x86_64.ad this pattern is quite common:
>>>>>>
>>>>>> // Load Pointer
>>>>>> instruct ZLoadP(rRegP dst, memory mem, rFlagsReg cr)
>>>>>> %{
>>>>>> predicate(UseZGC && n->as_Load()->barrier_data() == 
>>>>>> ZLoadBarrierStrong);
>>>>>> ...
>>>>>>
>>>>>>      ins_encode %{
>>>>>> __ movptr($dst$$Register, $mem$$Address);
>>>>>> if (barrier_data() != ZLoadBarrierElided) {
>>>>>> z_load_barrier(_masm, this, $mem$$Address, $dst$$Register, noreg 
>>>>>> /* tmp */, false /* weak */);
>>>>>> }
>>>>>>      %}
>>>>>>
>>>>>> If you predicate on "n->as_Load()->barrier_data() == 
>>>>>> ZLoadBarrierStrong" - how can "barrier_data() != 
>>>>>> ZLoadBarrierElided" ever be false? When barrier_data() == 
>>>>>> ZLoadBarrierElided it will have matched the non-Z rule.
>>>>> The context (the "this" pointer) inside ins_encode is 
>>>>> MachNode-level. So, the original LoadNode had a barrier attached to 
>>>>> it, but later the barrier elision optimization elided it on the 
>>>>> MachNode.
>>>> Right, the elision runs during codebuffer init.
>>> Yup.
>>>
>>>> In PhaseX.cpp there is two cases of:
>>>>
>>>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>>>> bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
>>>> ...
>>>> if (has_load_barrier_nodes) {"
>>>>
>>>> I know it was there before, but since you refactored the names. Why 
>>>> not change it to:
>>>>
>>>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>>>> ...
>>>> if (bs->has_load_barrier_nodes()) {"
>>> I’m guessing the author of that code deliberately wanted to perform 
>>> the virtual call outside of the loops instead of inside, which 
>>> doesn’t bother me. Don’t like that?
>>>
>>> Thanks for reviewing!
>>>
>>> /Erik
>>
>> Right, my mind that has folded the facts that the barrierset doesn't 
>> change, and that the has_load_barrier_nodes() is pure. I should check 
>> the generated code before you change it.
>>
>> //N
>>
>>
>>>
>>>> // N
>>>>
>>>>> cheers,
>>>>> Per
>>>>>
>>>>>> // Nils
>>>>>>
>>>>>>
>>>>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> For many years we have expanded load barriers in the C2 sea of 
>>>>>>> nodes. It has been a constant struggle to keep up with bugs due 
>>>>>>> to optimizations breaking our barriers. It has never truly 
>>>>>>> worked. One particular pain point that has never been handled 
>>>>>>> quite right up until now, is dealing with safepoints ending up 
>>>>>>> between a load and its load barrier. We have had workarounds for 
>>>>>>> that before, but they have never really been enough.
>>>>>>>
>>>>>>> In the end, our barrier is only a conditional branch to a slow 
>>>>>>> path, so there is really not much that the optimizer can do to 
>>>>>>> help us make that better. But it has many creative ways of 
>>>>>>> breaking our GC invariants.
>>>>>>>
>>>>>>> I think we have finally had enough of this, and want to move the 
>>>>>>> barrier expansion to the MachNode level instead. This way, we can 
>>>>>>> finally put an end to the load and its load barrier being 
>>>>>>> separated (and related even more complicated issues for atomics).
>>>>>>>
>>>>>>> Our new solution is to tag accesses that want load barriers 
>>>>>>> during parsing, and then let C2 optimize whatever it wants to, 
>>>>>>> invariantly of GC barriers. Then it will match mach nodes, and 
>>>>>>> perform global code motion and scheduling. Only right before 
>>>>>>> dumping the machine code of the resulting graph do we call into 
>>>>>>> the barrier set to perform last second analysis of barriers, and 
>>>>>>> then during machine code dumping, we inject our load barriers. 
>>>>>>> After the normal insts() are dumped, we inject slow path stubs 
>>>>>>> for our barriers.
>>>>>>>
>>>>>>> There are two optimizations that we would like to retain in this 
>>>>>>> scheme.
>>>>>>>
>>>>>>> Optimization 1: Dominating barrier analysis
>>>>>>> Previously, we instantiated a PhaseIdealLoop instance to analyze 
>>>>>>> dominating barriers. That was convenient because a dominator tree 
>>>>>>> is available for finding load barriers that dominate other load 
>>>>>>> barriers in the CFG. I built a new more precise analysis on the 
>>>>>>> PhaseCFG level instead, happening after the matching to mach 
>>>>>>> nodes. The analysis is now looking for dominating accesses, 
>>>>>>> instead of dominating load barriers. Because any dominating 
>>>>>>> access, including stores, will make sure that what is left behind 
>>>>>>> in memory is "good". Another thing that makes the analysis more 
>>>>>>> precise, is that it doesn't require strict dominance in the CFG. 
>>>>>>> If the earlier access is in the same Block as an access with 
>>>>>>> barriers, we now also utilize knowledge about the scheduling of 
>>>>>>> the instructions, which has completed at this point. So we can 
>>>>>>> safely remove such pointless load barriers in the same block now. 
>>>>>>> The analysis is performed right before machine code is emitted, 
>>>>>>> so we can trust that it won't change after the analysis due to 
>>>>>>> optimizations.
>>>>>>>
>>>>>>> Optimization 2: Tight register spilling
>>>>>>> When we call the slow path of our barriers, we want to spill only 
>>>>>>> the registers that are live. Previously, we had a special 
>>>>>>> LoadBarrierSlowReg node corresponding to the slow path, that 
>>>>>>> killed all XMM registers, and then all general purpose registers 
>>>>>>> were called in the slow path. Now we instead perform explicit 
>>>>>>> live analysis of our registers on MachNodes, including how large 
>>>>>>> chunks of vector registers are being used, and spill only exactly 
>>>>>>> the registers that are live (and only the part of the register 
>>>>>>> that is live for XMM/YMM/ZMM registers).
>>>>>>>
>>>>>>> Zooming out a bit, all complexity of pulling the barriers in the 
>>>>>>> sea of nodes through various interesting phases while retaining 
>>>>>>> GC invariants such as "don't put safepoints in my barrier", 
>>>>>>> become trivial and no longer an issue. We simply tag our loads to 
>>>>>>> need barriers, and let C2 do whatever it wants to in the sea of 
>>>>>>> nodes. Once all scheduling is done, we do our thing. Hopefully 
>>>>>>> this will make the barriers as stable and resilient as our C1 
>>>>>>> barriers, which cause trouble extremely rarely.
>>>>>>>
>>>>>>> We have run a number of benchmarks. We have observed a number of 
>>>>>>> improvements, but never any regressions. There has been countless 
>>>>>>> runs through gc-test-suite, and a few hs-tier1-6 and his tier1-7 
>>>>>>> runs.
>>>>>>>
>>>>>>> Finally, I would like to thank Per and StefanK for the many hours 
>>>>>>> spent on helping me with this patch, both in terms of spotting 
>>>>>>> flaws in my prototypes, benchmarking, testing, and refactoring so 
>>>>>>> the code looks nice and much more understandable. I will add both 
>>>>>>> to the Contributed-by line.
>>>>>>>
>>>>>>> @Stuart: It would be awesome if you could provide some AArch64 
>>>>>>> bits for this patch so we do things the same way (ish).
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> /Erik
>>


More information about the hotspot-compiler-dev mailing list