RFR: 8225642: ZGC: Crash due to bad oops being spilled to stack in load barriers
Erik Österlund
erik.osterlund at oracle.com
Thu Jun 20 09:38:43 UTC 2019
Hi,
With the new late barrier expansion mechanism, we thought we could
finally disable some workarounds that fixed up bad oops in the stack,
that should never be allowed to be bad really if the compiler did what
we want it to do.
It turns out that it is still possible for bad oops to end up in the
stack. This is attributed to spills of the loaded bad oop that never
really got usedin the application. These spills could be observed in the
disassembly. While the oop isn't used in the nmethod, it might be used
by e.g. deoptimization, forcing it to stick around. The cause for the
spill causing the trouble is that the medium slow path
(LoadBarrierSlowRegNode) defines (DEF) a new value returned from the
slow path reloads the oop into that register, instead of using the
original oop as input. So the LoadBarrierSlowRegNode would sometimes get
a different register to the original load, causing the loaded bad value
to spill bad values for deopt sometimes, eventually causing crashes in
heap iteration code and asserts.
To solve the problem, the original oop was added as an input edge to
LoadBarrierSlowRegNode, so that we in the AD file could have the input
and output of the node be the same, matching the original load. That
attempt first produced a bad AD file because the ADLC parser doesn't
like that the LoadBarrierSlowRegNode had an additional edge when it is
modeled as a load, which is treated specially in places. It has been
originally modeled as a load because it reloads the value and hence
needs to quack like a load. However, with the new late barrier expansion
approach we can re-model it as a TypeNode instead because the slowreg
node can't really move around so much in the circumstances where they
are now expanded (late and tucked into tight control flow blocks).
Nevertheless, it is a bit iffy and a follow-up RFE will remove the
current practice of returning reloaded values from the barrier slowpaths
to make this increasingly sane. But I want this fix to be as minimal as
possible as it's going in to 13, and the risk of making such changes for
13 is not worth it.
Thanks goes to StefanK, Per and Nils for helping diagnose the problem,
finding good reproducers, helping out testing the fix, and good
discussions along the way.
This patch has made it through mach5 hs-tier1-5, many runs of
gc-testsuite, 100 kitchensink runs, and ~1 day of StefanK's quite
reliable reproducer with custom StefanK verification code that normally
crashes within an hour.
Webrev:
http://cr.openjdk.java.net/~eosterlund/8225642/webrev.00/
Bug:
https://bugs.openjdk.java.net/browse/JDK-8225642
Thanks,
/Erik
More information about the hotspot-compiler-dev
mailing list