RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level
Erik Österlund
erik.osterlund at oracle.com
Wed Sep 4 12:58:18 UTC 2019
Hi,
For many years we have expanded load barriers in the C2 sea of nodes. It
has been a constant struggle to keep up with bugs due to optimizations
breaking our barriers. It has never truly worked. One particular pain
point that has never been handled quite right up until now, is dealing
with safepoints ending up between a load and its load barrier. We have
had workarounds for that before, but they have never really been enough.
In the end, our barrier is only a conditional branch to a slow path, so
there is really not much that the optimizer can do to help us make that
better. But it has many creative ways of breaking our GC invariants.
I think we have finally had enough of this, and want to move the barrier
expansion to the MachNode level instead. This way, we can finally put an
end to the load and its load barrier being separated (and related even
more complicated issues for atomics).
Our new solution is to tag accesses that want load barriers during
parsing, and then let C2 optimize whatever it wants to, invariantly of
GC barriers. Then it will match mach nodes, and perform global code
motion and scheduling. Only right before dumping the machine code of the
resulting graph do we call into the barrier set to perform last second
analysis of barriers, and then during machine code dumping, we inject
our load barriers. After the normal insts() are dumped, we inject slow
path stubs for our barriers.
There are two optimizations that we would like to retain in this scheme.
Optimization 1: Dominating barrier analysis
Previously, we instantiated a PhaseIdealLoop instance to analyze
dominating barriers. That was convenient because a dominator tree is
available for finding load barriers that dominate other load barriers in
the CFG. I built a new more precise analysis on the PhaseCFG level
instead, happening after the matching to mach nodes. The analysis is now
looking for dominating accesses, instead of dominating load barriers.
Because any dominating access, including stores, will make sure that
what is left behind in memory is "good". Another thing that makes the
analysis more precise, is that it doesn't require strict dominance in
the CFG. If the earlier access is in the same Block as an access with
barriers, we now also utilize knowledge about the scheduling of the
instructions, which has completed at this point. So we can safely remove
such pointless load barriers in the same block now. The analysis is
performed right before machine code is emitted, so we can trust that it
won't change after the analysis due to optimizations.
Optimization 2: Tight register spilling
When we call the slow path of our barriers, we want to spill only the
registers that are live. Previously, we had a special LoadBarrierSlowReg
node corresponding to the slow path, that killed all XMM registers, and
then all general purpose registers were called in the slow path. Now we
instead perform explicit live analysis of our registers on MachNodes,
including how large chunks of vector registers are being used, and spill
only exactly the registers that are live (and only the part of the
register that is live for XMM/YMM/ZMM registers).
Zooming out a bit, all complexity of pulling the barriers in the sea of
nodes through various interesting phases while retaining GC invariants
such as "don't put safepoints in my barrier", become trivial and no
longer an issue. We simply tag our loads to need barriers, and let C2 do
whatever it wants to in the sea of nodes. Once all scheduling is done,
we do our thing. Hopefully this will make the barriers as stable and
resilient as our C1 barriers, which cause trouble extremely rarely.
We have run a number of benchmarks. We have observed a number of
improvements, but never any regressions. There has been countless runs
through gc-test-suite, and a few hs-tier1-6 and his tier1-7 runs.
Finally, I would like to thank Per and StefanK for the many hours spent
on helping me with this patch, both in terms of spotting flaws in my
prototypes, benchmarking, testing, and refactoring so the code looks
nice and much more understandable. I will add both to the Contributed-by
line.
@Stuart: It would be awesome if you could provide some AArch64 bits for
this patch so we do things the same way (ish).
Bug:
https://bugs.openjdk.java.net/browse/JDK-8230565
Webrev:
http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
Thanks,
/Erik
More information about the hotspot-compiler-dev
mailing list