RFR: 8225642: ZGC: Crash due to bad oops being spilled to stack in load barriers

Erik Österlund erik.osterlund at oracle.com
Thu Jun 20 09:38:43 UTC 2019


Hi,

With the new late barrier expansion mechanism, we thought we could 
finally disable some workarounds that fixed up bad oops in the stack, 
that should never be allowed to be bad really if the compiler did what 
we want it to do.
It turns out that it is still possible for bad oops to end up in the 
stack. This is attributed to spills of the loaded bad oop that never 
really got usedin the application. These spills could be observed in the 
disassembly. While the oop isn't used in the nmethod, it might be used 
by e.g. deoptimization, forcing it to stick around. The cause for the 
spill causing the trouble is that the medium slow path 
(LoadBarrierSlowRegNode) defines (DEF) a new value returned from the 
slow path reloads the oop into that register, instead of using the 
original oop as input. So the LoadBarrierSlowRegNode would sometimes get 
a different register to the original load, causing the loaded bad value 
to spill bad values for deopt sometimes, eventually causing crashes in 
heap iteration code and asserts.

To solve the problem, the original oop was added as an input edge to 
LoadBarrierSlowRegNode, so that we in the AD file could have the input 
and output of the node be the same, matching the original load. That 
attempt first produced a bad AD file because the ADLC parser doesn't 
like that the LoadBarrierSlowRegNode had an additional edge when it is 
modeled as a load, which is treated specially in places. It has been 
originally modeled as a load because it reloads the value and hence 
needs to quack like a load. However, with the new late barrier expansion 
approach we can re-model it as a TypeNode instead because the slowreg 
node can't really move around so much in the circumstances where they 
are now expanded (late and tucked into tight control flow blocks). 
Nevertheless, it is a bit iffy and a follow-up RFE will remove the 
current practice of returning reloaded values from the barrier slowpaths 
to make this increasingly sane. But I want this fix to be as minimal as 
possible as it's going in to 13, and the risk of making such changes for 
13 is not worth it.

Thanks goes to StefanK, Per and Nils for helping diagnose the problem, 
finding good reproducers, helping out testing the fix, and good 
discussions along the way.

This patch has made it through mach5 hs-tier1-5, many runs of 
gc-testsuite, 100 kitchensink runs, and ~1 day of StefanK's quite 
reliable reproducer with custom StefanK verification code that normally 
crashes within an hour.

Webrev:
http://cr.openjdk.java.net/~eosterlund/8225642/webrev.00/

Bug:
https://bugs.openjdk.java.net/browse/JDK-8225642

Thanks,
/Erik


More information about the hotspot-compiler-dev mailing list