RFR: 8225642: ZGC: Crash due to bad oops being spilled to stack in load barriers

Nils Eliasson nils.eliasson at oracle.com
Thu Jun 20 09:45:52 UTC 2019


Hi Erik,

Looks great!

// Nils

On 2019-06-20 11:38, Erik Österlund wrote:
> Hi,
>
> With the new late barrier expansion mechanism, we thought we could 
> finally disable some workarounds that fixed up bad oops in the stack, 
> that should never be allowed to be bad really if the compiler did what 
> we want it to do.
> It turns out that it is still possible for bad oops to end up in the 
> stack. This is attributed to spills of the loaded bad oop that never 
> really got usedin the application. These spills could be observed in 
> the disassembly. While the oop isn't used in the nmethod, it might be 
> used by e.g. deoptimization, forcing it to stick around. The cause for 
> the spill causing the trouble is that the medium slow path 
> (LoadBarrierSlowRegNode) defines (DEF) a new value returned from the 
> slow path reloads the oop into that register, instead of using the 
> original oop as input. So the LoadBarrierSlowRegNode would sometimes 
> get a different register to the original load, causing the loaded bad 
> value to spill bad values for deopt sometimes, eventually causing 
> crashes in heap iteration code and asserts.
>
> To solve the problem, the original oop was added as an input edge to 
> LoadBarrierSlowRegNode, so that we in the AD file could have the input 
> and output of the node be the same, matching the original load. That 
> attempt first produced a bad AD file because the ADLC parser doesn't 
> like that the LoadBarrierSlowRegNode had an additional edge when it is 
> modeled as a load, which is treated specially in places. It has been 
> originally modeled as a load because it reloads the value and hence 
> needs to quack like a load. However, with the new late barrier 
> expansion approach we can re-model it as a TypeNode instead because 
> the slowreg node can't really move around so much in the circumstances 
> where they are now expanded (late and tucked into tight control flow 
> blocks). Nevertheless, it is a bit iffy and a follow-up RFE will 
> remove the current practice of returning reloaded values from the 
> barrier slowpaths to make this increasingly sane. But I want this fix 
> to be as minimal as possible as it's going in to 13, and the risk of 
> making such changes for 13 is not worth it.
>
> Thanks goes to StefanK, Per and Nils for helping diagnose the problem, 
> finding good reproducers, helping out testing the fix, and good 
> discussions along the way.
>
> This patch has made it through mach5 hs-tier1-5, many runs of 
> gc-testsuite, 100 kitchensink runs, and ~1 day of StefanK's quite 
> reliable reproducer with custom StefanK verification code that 
> normally crashes within an hour.
>
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8225642/webrev.00/
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8225642
>
> Thanks,
> /Erik


More information about the hotspot-compiler-dev mailing list