RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level
Nils Eliasson
nils.eliasson at oracle.com
Tue Sep 10 09:55:55 UTC 2019
Yes, that should solve it.
Great work!
Reviewed.
// Nils
On 2019-09-10 11:34, Per Liden wrote:
> Hi Nils,
>
> (Answering for Erik since he's on vacation)
>
> On 9/10/19 11:23 AM, Nils Eliasson wrote:
>> Hi,
>>
>> The code for propagating the barrier data when cloning has been
>> removed (node.cpp). How is that supposed to be handled now?
>
> Nothing special needs to be done. The barrier data is part of the
> MemNode and LoadStoreNode and those nodes have a size_of(), so it will
> be automatically propagated when doing node cloning.
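>
> To make that concrete, here is a minimal standalone sketch (simplified
> stand-ins, not the actual Node/MemNode classes) of why a size_of()-based
> raw copy carries the barrier data along for free:
>
> #include <cstddef>
> #include <cstdint>
> #include <cstdio>
> #include <cstring>
> #include <new>
>
> // Simplified model of Node::clone(): a raw copy of size_of() bytes, which
> // is why subclass fields such as the barrier data come along automatically.
> struct Node {
>   virtual size_t size_of() const { return sizeof(Node); }
>   Node* clone() const {
>     void* mem = ::operator new(size_of());
>     // Mirrors the raw copy HotSpot performs; relies on the same assumptions.
>     memcpy(mem, static_cast<const void*>(this), size_of());
>     return static_cast<Node*>(mem);
>   }
>   virtual ~Node() {}
> };
>
> struct LoadNode : Node {
>   uint8_t _barrier_data = 0;   // stand-in for the barrier data field
>   size_t size_of() const override { return sizeof(LoadNode); }
> };
>
> int main() {
>   LoadNode load;
>   load._barrier_data = 1;      // pretend this is ZLoadBarrierStrong
>   LoadNode* copy = static_cast<LoadNode*>(load.clone());
>   printf("cloned barrier_data = %d\n", copy->_barrier_data);  // prints 1
>   ::operator delete(copy);
> }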
>
>>
>> And the same thing for mach nodes - mach nodes can also be cloned.
>> Right now it looks like the barrier_data will be lost.
>
> Same here.
>
> cheers,
> Per
>
>>
>> // Nils
>>
>>
>> On 2019-09-04 21:10, Nils Eliasson wrote:
>>>
>>> On 2019-09-04 19:58, Erik Osterlund wrote:
>>>> Hi Nils,
>>>>
>>>>> On 4 Sep 2019, at 19:10, Nils Eliasson <nils.eliasson at oracle.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> On 2019-09-04 18:36, Per Liden wrote:
>>>>>>> On 9/4/19 5:22 PM, Nils Eliasson wrote:
>>>>>>> Hi Erik,
>>>>>>>
>>>>>>> In hotspot/cpu/x86/gc/z/z_x86_64.ad this pattern is quite common:
>>>>>>>
>>>>>>> // Load Pointer
>>>>>>> instruct ZLoadP(rRegP dst, memory mem, rFlagsReg cr)
>>>>>>> %{
>>>>>>>   predicate(UseZGC && n->as_Load()->barrier_data() == ZLoadBarrierStrong);
>>>>>>>   ...
>>>>>>>
>>>>>>>   ins_encode %{
>>>>>>>     __ movptr($dst$$Register, $mem$$Address);
>>>>>>>     if (barrier_data() != ZLoadBarrierElided) {
>>>>>>>       z_load_barrier(_masm, this, $mem$$Address, $dst$$Register, noreg /* tmp */, false /* weak */);
>>>>>>>     }
>>>>>>>   %}
>>>>>>>
>>>>>>> If you predicate on "n->as_Load()->barrier_data() ==
>>>>>>> ZLoadBarrierStrong" - how can "barrier_data() !=
>>>>>>> ZLoadBarrierElided" ever be false? When barrier_data() ==
>>>>>>> ZLoadBarrierElided it will have matched the non-Z rule.
>>>>>> The context (the "this" pointer) inside ins_encode is
>>>>>> MachNode-level. So, the original LoadNode had a barrier attached
>>>>>> to it, but later the barrier elision optimization elided it on
>>>>>> the MachNode.
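>>>>>>
>>>>>> To make the timing concrete, a minimal standalone sketch (the types,
>>>>>> constants and passes below are stand-ins, not the actual HotSpot
>>>>>> code): the predicate is evaluated on the ideal LoadNode at match
>>>>>> time, while the check in ins_encode runs after a later elision pass
>>>>>> may have downgraded the MachNode's barrier data:
>>>>>>
>>>>>> #include <cstdint>
>>>>>> #include <cstdio>
>>>>>>
>>>>>> // Stand-ins for the real ZGC constants; only the distinction matters.
>>>>>> const uint8_t ZLoadBarrierElided = 0;
>>>>>> const uint8_t ZLoadBarrierStrong = 1;
>>>>>>
>>>>>> struct MachLoadP {
>>>>>>   uint8_t barrier_data;   // copied from the ideal LoadNode at match time
>>>>>> };
>>>>>>
>>>>>> // Matching: the predicate saw ZLoadBarrierStrong on the ideal node, so
>>>>>> // the ZLoadP rule was selected and the MachNode starts out strong.
>>>>>> MachLoadP match_zloadp() { return MachLoadP{ZLoadBarrierStrong}; }
>>>>>>
>>>>>> // Later elision pass (runs on MachNodes, before code emission): it may
>>>>>> // decide the barrier is redundant and downgrade it.
>>>>>> void elide_if_redundant(MachLoadP& n, bool redundant) {
>>>>>>   if (redundant) n.barrier_data = ZLoadBarrierElided;
>>>>>> }
>>>>>>
>>>>>> // Emission: this is the ins_encode check from the .ad file -- by now
>>>>>> // the barrier may be elided even though the rule only matched strong loads.
>>>>>> void emit(const MachLoadP& n) {
>>>>>>   printf("load\n");
>>>>>>   if (n.barrier_data != ZLoadBarrierElided) {
>>>>>>     printf("load barrier\n");
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> int main() {
>>>>>>   MachLoadP n = match_zloadp();
>>>>>>   elide_if_redundant(n, /*redundant=*/true);
>>>>>>   emit(n);   // prints only "load": the predicate was true, the check is false
>>>>>> }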
>>>>> Right, the elision runs during codebuffer init.
>>>> Yup.
>>>>
>>>>> In PhaseX.cpp there are two cases of:
>>>>>
>>>>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>>>>> bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
>>>>> ...
>>>>> if (has_load_barrier_nodes) {"
>>>>>
>>>>> I know it was there before, but since you refactored the names.
>>>>> Why not change it to:
>>>>>
>>>>> "BarrierSetC2* bs = BarrierSet::barrier_set()->barrier_set_c2();
>>>>> ...
>>>>> if (bs->has_load_barrier_nodes()) {"
>>>> I’m guessing the author of that code deliberately wanted to perform
>>>> the virtual call outside of the loops instead of inside, which
>>>> doesn’t bother me. Don’t like that?
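>>>>
>>>> In standalone form (illustrative types, not the real BarrierSetC2),
>>>> the two shapes look like this; the cached form is what the PhaseX
>>>> code does, since the compiler generally cannot hoist a virtual call
>>>> it cannot prove pure:
>>>>
>>>> #include <cstdio>
>>>>
>>>> // Illustrative types only, not the HotSpot classes.
>>>> struct BarrierSetC2 {
>>>>   virtual bool has_load_barrier_nodes() const = 0;
>>>>   virtual ~BarrierSetC2() {}
>>>> };
>>>>
>>>> struct ZBarrierSetC2 : BarrierSetC2 {
>>>>   bool has_load_barrier_nodes() const override { return true; }
>>>> };
>>>>
>>>> // Variant A: the virtual call sits inside the loop; the compiler usually
>>>> // cannot prove the callee is pure, so the dispatch repeats every iteration.
>>>> int count_inside(const BarrierSetC2* bs, int n) {
>>>>   int hits = 0;
>>>>   for (int i = 0; i < n; i++) {
>>>>     if (bs->has_load_barrier_nodes()) hits++;
>>>>   }
>>>>   return hits;
>>>> }
>>>>
>>>> // Variant B: the result is cached once before the loop -- the pattern
>>>> // the PhaseX code uses.
>>>> int count_hoisted(const BarrierSetC2* bs, int n) {
>>>>   const bool has_load_barrier_nodes = bs->has_load_barrier_nodes();
>>>>   int hits = 0;
>>>>   for (int i = 0; i < n; i++) {
>>>>     if (has_load_barrier_nodes) hits++;
>>>>   }
>>>>   return hits;
>>>> }
>>>>
>>>> int main() {
>>>>   ZBarrierSetC2 z;
>>>>   printf("%d %d\n", count_inside(&z, 10), count_hoisted(&z, 10));  // 10 10
>>>> }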
>>>>
>>>> Thanks for reviewing!
>>>>
>>>> /Erik
>>>
>>> Right, my mind had folded together the facts that the barrier set
>>> doesn't change and that has_load_barrier_nodes() is pure. I should
>>> check the generated code before asking you to change it.
>>>
>>> //N
>>>
>>>
>>>>
>>>>> // N
>>>>>
>>>>>> cheers,
>>>>>> Per
>>>>>>
>>>>>>> // Nils
>>>>>>>
>>>>>>>
>>>>>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> For many years we have expanded load barriers in the C2 sea of
>>>>>>>> nodes. It has been a constant struggle to keep up with bugs due
>>>>>>>> to optimizations breaking our barriers. It has never truly
>>>>>>>> worked. One particular pain point that has never been handled
>>>>>>>> quite right up until now, is dealing with safepoints ending up
>>>>>>>> between a load and its load barrier. We have had workarounds
>>>>>>>> for that before, but they have never really been enough.
>>>>>>>>
>>>>>>>> In the end, our barrier is only a conditional branch to a slow
>>>>>>>> path, so there is really not much that the optimizer can do to
>>>>>>>> help us make that better. But it has many creative ways of
>>>>>>>> breaking our GC invariants.
>>>>>>>>
>>>>>>>> I think we have finally had enough of this, and want to move
>>>>>>>> the barrier expansion to the MachNode level instead. This way,
>>>>>>>> we can finally put an end to the load and its load barrier
>>>>>>>> being separated (and related even more complicated issues for
>>>>>>>> atomics).
>>>>>>>>
>>>>>>>> Our new solution is to tag accesses that want load barriers
>>>>>>>> during parsing, and then let C2 optimize whatever it wants to,
>>>>>>>> invariantly of GC barriers. Then it will match mach nodes, and
>>>>>>>> perform global code motion and scheduling. Only right before
>>>>>>>> dumping the machine code of the resulting graph do we call into
>>>>>>>> the barrier set to perform last second analysis of barriers,
>>>>>>>> and then during machine code dumping, we inject our load
>>>>>>>> barriers. After the normal insts() are dumped, we inject slow
>>>>>>>> path stubs for our barriers.
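>>>>>>>>
>>>>>>>> As a rough outline (every name below is an illustrative stand-in,
>>>>>>>> not an actual HotSpot entry point), the flow now looks like this:
>>>>>>>>
>>>>>>>> #include <cstdio>
>>>>>>>>
>>>>>>>> // Illustrative stand-ins only.
>>>>>>>> struct IdealGraph {};
>>>>>>>> struct MachGraph {};
>>>>>>>>
>>>>>>>> IdealGraph parse()                     { puts("parse: tag the loads that want barriers"); return {}; }
>>>>>>>> void optimize(IdealGraph&)             { puts("optimize: C2 transforms plain loads, barriers invisible"); }
>>>>>>>> MachGraph match(IdealGraph&)           { puts("match: barrier tag carried onto the MachNodes"); return {}; }
>>>>>>>> void schedule(MachGraph&)              { puts("global code motion + instruction scheduling"); }
>>>>>>>> void late_barrier_analysis(MachGraph&) { puts("barrier set: last-second analysis, elide redundant barriers"); }
>>>>>>>> void emit_code(MachGraph&)             { puts("emit: load barrier injected right after its load"); }
>>>>>>>> void emit_barrier_stubs()              { puts("emit: slow-path stubs appended after the normal insts()"); }
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>   IdealGraph ig = parse();
>>>>>>>>   optimize(ig);
>>>>>>>>   MachGraph mg = match(ig);
>>>>>>>>   schedule(mg);
>>>>>>>>   late_barrier_analysis(mg);
>>>>>>>>   emit_code(mg);
>>>>>>>>   emit_barrier_stubs();
>>>>>>>> }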
>>>>>>>>
>>>>>>>> There are two optimizations that we would like to retain in
>>>>>>>> this scheme.
>>>>>>>>
>>>>>>>> Optimization 1: Dominating barrier analysis
>>>>>>>> Previously, we instantiated a PhaseIdealLoop instance to
>>>>>>>> analyze dominating barriers. That was convenient because a
>>>>>>>> dominator tree is available for finding load barriers that
>>>>>>>> dominate other load barriers in the CFG. I built a new more
>>>>>>>> precise analysis on the PhaseCFG level instead, happening after
>>>>>>>> the matching to mach nodes. The analysis now looks for
>>>>>>>> dominating accesses instead of dominating load barriers, because
>>>>>>>> any dominating access, including stores, will make sure that what
>>>>>>>> is left behind in memory is "good". Another thing that makes the
>>>>>>>> analysis more precise is that it doesn't
>>>>>>>> require strict dominance in the CFG. If the earlier access is
>>>>>>>> in the same Block as an access with barriers, we now also
>>>>>>>> utilize knowledge about the scheduling of the instructions,
>>>>>>>> which has completed at this point. So we can safely remove such
>>>>>>>> pointless load barriers in the same block now. The analysis is
>>>>>>>> performed right before machine code is emitted, so we can trust
>>>>>>>> that it won't change after the analysis due to optimizations.
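>>>>>>>>
>>>>>>>> A minimal sketch of the idea (illustrative structs, not the real
>>>>>>>> PhaseCFG/MachNode types; the real analysis naturally has to be
>>>>>>>> more careful about what counts as the same address and about what
>>>>>>>> may happen between the two accesses):
>>>>>>>>
>>>>>>>> #include <cstddef>
>>>>>>>> #include <cstdio>
>>>>>>>> #include <vector>
>>>>>>>>
>>>>>>>> struct Access {
>>>>>>>>   int addr;            // stand-in for "same base + offset"
>>>>>>>>   bool needs_barrier;  // loads start out tagged, stores never need one
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct Block {
>>>>>>>>   Block* idom = nullptr;           // immediate dominator in the CFG
>>>>>>>>   std::vector<Access*> schedule;   // accesses in final scheduled order
>>>>>>>> };
>>>>>>>>
>>>>>>>> // Any earlier access to the same address, load or store, has already
>>>>>>>> // left a "good" value behind, so a later load barrier is pointless.
>>>>>>>> static bool has_dominating_access(const Block* b, size_t pos, int addr) {
>>>>>>>>   // Same block first: scheduling is final here, so earlier slots
>>>>>>>>   // really do execute earlier.
>>>>>>>>   for (size_t i = 0; i < pos; i++) {
>>>>>>>>     if (b->schedule[i]->addr == addr) return true;
>>>>>>>>   }
>>>>>>>>   // Then walk up the dominator tree.
>>>>>>>>   for (const Block* d = b->idom; d != nullptr; d = d->idom) {
>>>>>>>>     for (const Access* a : d->schedule) {
>>>>>>>>       if (a->addr == addr) return true;
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   return false;
>>>>>>>> }
>>>>>>>>
>>>>>>>> void elide_redundant_barriers(std::vector<Block*>& blocks) {
>>>>>>>>   for (Block* b : blocks) {
>>>>>>>>     for (size_t i = 0; i < b->schedule.size(); i++) {
>>>>>>>>       Access* a = b->schedule[i];
>>>>>>>>       if (a->needs_barrier && has_dominating_access(b, i, a->addr)) {
>>>>>>>>         a->needs_barrier = false;   // corresponds to marking it elided
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>   Access store{42, false}, load{42, true};
>>>>>>>>   Block dom, use;
>>>>>>>>   dom.schedule = {&store};
>>>>>>>>   use.idom = &dom;
>>>>>>>>   use.schedule = {&load};
>>>>>>>>   std::vector<Block*> blocks{&dom, &use};
>>>>>>>>   elide_redundant_barriers(blocks);
>>>>>>>>   printf("load needs barrier: %d\n", load.needs_barrier);  // prints 0
>>>>>>>> }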
>>>>>>>>
>>>>>>>> Optimization 2: Tight register spilling
>>>>>>>> When we call the slow path of our barriers, we want to spill
>>>>>>>> only the registers that are live. Previously, we had a special
>>>>>>>> LoadBarrierSlowReg node corresponding to the slow path, that
>>>>>>>> killed all XMM registers, and then all general purpose
>>>>>>>> registers were callee-saved in the slow path. Now we instead perform
>>>>>>>> explicit live analysis of our registers on MachNodes, including
>>>>>>>> how large chunks of vector registers are being used, and spill
>>>>>>>> only exactly the registers that are live (and only the part of
>>>>>>>> the register that is live for XMM/YMM/ZMM registers).
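>>>>>>>>
>>>>>>>> A sketch of the spill planning (illustrative only; the register
>>>>>>>> numbering and the LiveSet/SpillPlan types are made up for the
>>>>>>>> example):
>>>>>>>>
>>>>>>>> #include <cstdint>
>>>>>>>> #include <cstdio>
>>>>>>>> #include <utility>
>>>>>>>> #include <vector>
>>>>>>>>
>>>>>>>> // GPRs are 0..15, vector registers 0..15, and vec_bytes records how
>>>>>>>> // wide the live part of each vector register is (16 = XMM, 32 = YMM,
>>>>>>>> // 64 = ZMM).
>>>>>>>> struct LiveSet {
>>>>>>>>   uint32_t gpr_mask = 0;    // bit i set => GPR i is live across the call
>>>>>>>>   int vec_bytes[16] = {0};  // 0 => vector register i is not live at all
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct SpillPlan {
>>>>>>>>   std::vector<int> gprs_to_save;
>>>>>>>>   std::vector<std::pair<int, int>> vecs_to_save;  // (vector reg, bytes)
>>>>>>>> };
>>>>>>>>
>>>>>>>> // Build the minimal spill set for the barrier slow path: only what is
>>>>>>>> // live, and for vector registers only the live width, instead of
>>>>>>>> // killing everything. Assumption in this sketch: the slow path writes
>>>>>>>> // its result into result_gpr, so that register need not be preserved.
>>>>>>>> SpillPlan plan_spills(const LiveSet& live, int result_gpr) {
>>>>>>>>   SpillPlan plan;
>>>>>>>>   for (int r = 0; r < 16; r++) {
>>>>>>>>     if ((live.gpr_mask & (1u << r)) != 0 && r != result_gpr) {
>>>>>>>>       plan.gprs_to_save.push_back(r);
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   for (int v = 0; v < 16; v++) {
>>>>>>>>     if (live.vec_bytes[v] != 0) {
>>>>>>>>       plan.vecs_to_save.push_back({v, live.vec_bytes[v]});
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   return plan;
>>>>>>>> }
>>>>>>>>
>>>>>>>> int main() {
>>>>>>>>   LiveSet live;
>>>>>>>>   live.gpr_mask = (1u << 3) | (1u << 5);  // two GPRs live over the call
>>>>>>>>   live.vec_bytes[0] = 32;                 // only the YMM half of reg 0 is live
>>>>>>>>   SpillPlan p = plan_spills(live, /*result_gpr=*/3);
>>>>>>>>   printf("save %zu GPRs and %zu vector registers\n",
>>>>>>>>          p.gprs_to_save.size(), p.vecs_to_save.size());  // 1 and 1
>>>>>>>> }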
>>>>>>>>
>>>>>>>> Zooming out a bit, all complexity of pulling the barriers in
>>>>>>>> the sea of nodes through various interesting phases while
>>>>>>>> retaining GC invariants such as "don't put safepoints in my
>>>>>>>> barrier", become trivial and no longer an issue. We simply tag
>>>>>>>> our loads to need barriers, and let C2 do whatever it wants to
>>>>>>>> in the sea of nodes. Once all scheduling is done, we do our
>>>>>>>> thing. Hopefully this will make the barriers as stable and
>>>>>>>> resilient as our C1 barriers, which cause trouble extremely
>>>>>>>> rarely.
>>>>>>>>
>>>>>>>> We have run a number of benchmarks. We have observed a number
>>>>>>>> of improvements, but never any regressions. There have been
>>>>>>>> countless runs through gc-test-suite, and a few hs-tier1-6 and
>>>>>>>> hs-tier1-7 runs.
>>>>>>>>
>>>>>>>> Finally, I would like to thank Per and StefanK for the many
>>>>>>>> hours spent on helping me with this patch, both in terms of
>>>>>>>> spotting flaws in my prototypes, benchmarking, testing, and
>>>>>>>> refactoring so the code looks nice and much more
>>>>>>>> understandable. I will add both to the Contributed-by line.
>>>>>>>>
>>>>>>>> @Stuart: It would be awesome if you could provide some AArch64
>>>>>>>> bits for this patch so we do things the same way (ish).
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> /Erik
>>>