RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Thu Sep 26 14:26:44 UTC 2019

Hello,
   Yes, I'll try this out - I was in the middle of my own
implementation, but I haven't advanced it as far as Nils.

BR,
 Stuart

On Sat, 21 Sep 2019 at 09:11, Per Liden <per.liden at oracle.com> wrote:
>
> Thanks Nils!
>
> Stuart, could you take this for a spin and help Nils track down the last
> few issues?
>
> We're eager to push this as soon as we feel comfortable that the aarch64
> bits are good enough.
>
> cheers,
> Per
>
> On 9/20/19 4:50 PM, Nils Eliasson wrote:
> > Hi,
> >
> > This is an attempt at porting the redesigned C2 barriers to aarch64.
> >
> > Status
> >
> > * Mostly complete - ad-files should be complete although not fully
> > tested yet
> >
> > * Spilling is implemented but not optimized
> >
> > * Runs a bit, but far from stable.
> >
> >
> > Patch: http://cr.openjdk.java.net/~neliasso/aarch_barriers/
> >
> > Regards,
> >
> > Nils
> >
> >
> >
> > On 2019-09-04 14:58, Erik Österlund wrote:
> >
> >> Hi,
> >>
> >> For many years we have expanded load barriers in the C2 sea of nodes.
> >> It has been a constant struggle to keep up with bugs due to
> >> optimizations breaking our barriers. It has never truly worked. One
> >> particular pain point that has never been handled quite right up until
> >> now is dealing with safepoints ending up between a load and its load
> >> barrier. We have had workarounds for that before, but they have never
> >> really been enough.
> >>
> >> In the end, our barrier is only a conditional branch to a slow path,
> >> so there is really not much that the optimizer can do to help us make
> >> that better. But it has many creative ways of breaking our GC invariants.
> >>
> >> I think we have finally had enough of this, and want to move the
> >> barrier expansion to the MachNode level instead. This way, we can
> >> finally put an end to the load and its load barrier being separated
> >> (and related even more complicated issues for atomics).
> >>
> >> Our new solution is to tag accesses that want load barriers during
> >> parsing, and then let C2 optimize whatever it wants to, independently of
> >> GC barriers. Then it will match mach nodes, and perform global code
> >> motion and scheduling. Only right before dumping the machine code of
> >> the resulting graph do we call into the barrier set to perform last
> >> second analysis of barriers, and then during machine code dumping, we
> >> inject our load barriers. After the normal insts() are dumped, we
> >> inject slow path stubs for our barriers.
> >>
> >> There are two optimizations that we would like to retain in this scheme.
> >>
> >> Optimization 1: Dominating barrier analysis
> >> Previously, we instantiated a PhaseIdealLoop instance to analyze
> >> dominating barriers. That was convenient because a dominator tree is
> >> available for finding load barriers that dominate other load barriers
> >> in the CFG. I built a new more precise analysis on the PhaseCFG level
> >> instead, happening after the matching to mach nodes. The analysis now
> >> looks for dominating accesses instead of dominating load barriers,
> >> because any dominating access, including stores, will make sure that
> >> what is left behind in memory is "good". Another thing that makes the
> >> analysis more precise is that it doesn't require strict
> >> dominance in the CFG. If the earlier access is in the same Block as an
> >> access with barriers, we now also utilize knowledge about the
> >> scheduling of the instructions, which has completed at this point. So
> >> we can safely remove such pointless load barriers in the same block
> >> now. The analysis is performed right before machine code is emitted,
> >> so we can trust that it won't change after the analysis due to
> >> optimizations.
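
To illustrate the same-block case described above, here is a minimal sketch in C++. It is not code from the patch: the `Inst` type, its fields, and `elide_same_block_barriers` are hypothetical names invented for this example. It also assumes that a safepoint between two accesses invalidates the earlier access as a dominator, which the mail does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one scheduled instruction in a basic block.
enum class Kind { Load, Store, Safepoint };

struct Inst {
  Kind kind;
  int  base;     // id of the base-object value (assumption: same id == same object)
  long offset;   // field offset within the object
  bool barrier;  // tagged during parsing as wanting a load barrier
  bool elided;   // set by the analysis when the barrier is proven pointless
};

// Walks a block in schedule order. A tagged load loses its barrier when an
// earlier store, or an earlier tagged load, touches the same (base, offset)
// and no safepoint sits in between. Returns the number of barriers elided.
std::size_t elide_same_block_barriers(std::vector<Inst>& block) {
  std::size_t elided = 0;
  for (std::size_t i = 0; i < block.size(); ++i) {
    Inst& cur = block[i];
    if (cur.kind != Kind::Load || !cur.barrier) continue;
    for (std::size_t j = i; j-- > 0;) {  // scan backwards from i-1 down to 0
      const Inst& prev = block[j];
      if (prev.kind == Kind::Safepoint) break;  // assumption: GC state may change here
      const bool heals = prev.kind == Kind::Store ||
                         (prev.kind == Kind::Load && prev.barrier);
      if (heals && prev.base == cur.base && prev.offset == cur.offset) {
        cur.elided = true;
        ++elided;
        break;
      }
    }
  }
  return elided;
}
```

A real implementation would also consult dominating blocks in the CFG; this sketch only covers the intra-block scheduling-order case.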
> >>
> >> Optimization 2: Tight register spilling
> >> When we call the slow path of our barriers, we want to spill only the
> >> registers that are live. Previously, we had a special
> >> LoadBarrierSlowReg node corresponding to the slow path, that killed
> >> all XMM registers, and then all general purpose registers were saved
> >> in the slow path. Now we instead perform explicit live analysis of our
> >> registers on MachNodes, including how large chunks of vector registers
> >> are being used, and spill only exactly the registers that are live
> >> (and only the part of the register that is live for XMM/YMM/ZMM
> >> registers).
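
The effect of spilling only live registers, and only the live part of each vector register, can be sketched as a spill-area size calculation. This is an illustration rather than the patch's code: `LiveRegs`, its fields, and `spill_area_bytes` are hypothetical, and the 8-byte GPR slot and the 16/32/64-byte vector widths are assumptions matching x86_64 XMM/YMM/ZMM sizes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result of live analysis at one barrier's slow-path call.
struct LiveRegs {
  std::uint32_t gpr_mask = 0;                // bit i set => general purpose register i is live
  std::array<std::uint8_t, 32> vec_bytes{};  // live bytes per vector register: 0, 16, 32 or 64
};

// Bytes of stack needed to spill exactly the live state before the slow-path call.
std::size_t spill_area_bytes(const LiveRegs& live) {
  std::size_t bytes = 0;
  for (unsigned i = 0; i < 32; ++i) {
    if (live.gpr_mask & (1u << i)) bytes += 8;  // assumed 8-byte slot per live GPR
  }
  for (std::uint8_t b : live.vec_bytes) {
    assert(b == 0 || b == 16 || b == 32 || b == 64);
    bytes += b;  // spill only the live portion of each vector register
  }
  return bytes;
}
```

With two live general purpose registers, one register live as XMM (16 bytes) and one live as YMM (32 bytes), the stub would need 64 bytes under these assumptions, rather than space for every register at full ZMM width.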
> >>
> >> Zooming out a bit, all the complexity of pulling the barriers in the
> >> sea of nodes through various interesting phases while retaining GC
> >> invariants such as "don't put safepoints in my barrier" becomes
> >> trivial and is no longer an issue. We simply tag our loads to need
> >> barriers, and let C2 do whatever it wants to in the sea of nodes. Once
> >> all scheduling is done, we do our thing. Hopefully this will make the
> >> barriers as stable and resilient as our C1 barriers, which cause
> >> trouble extremely rarely.
> >>
> >> We have run a number of benchmarks. We have observed a number of
> >> improvements, but never any regressions. There have been countless runs
> >> through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
> >>
> >> Finally, I would like to thank Per and StefanK for the many hours
> >> spent on helping me with this patch, both in terms of spotting flaws
> >> in my prototypes, benchmarking, testing, and refactoring so the code
> >> looks nice and much more understandable. I will add both to the
> >> Contributed-by line.
> >>
> >> @Stuart: It would be awesome if you could provide some AArch64 bits
> >> for this patch so we do things the same way (ish).
> >>
> >> Bug:
> >> https://bugs.openjdk.java.net/browse/JDK-8230565
> >>
> >> Webrev:
> >> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
> >>
> >> Thanks,
> >> /Erik