RFR: 8225642: ZGC: Crash due to bad oops being spilled to stack in load barriers

Erik Österlund erik.osterlund at oracle.com
Thu Jun 20 10:02:24 UTC 2019


Hi Nils,

Thanks for the review.

/Erik

On 2019-06-20 11:45, Nils Eliasson wrote:
> Hi Erik,
>
> Looks great!
>
> // Nils
>
> On 2019-06-20 11:38, Erik Österlund wrote:
>> Hi,
>>
>> With the new late barrier expansion mechanism, we thought we could 
>> finally disable some workarounds that fixed up bad oops in the stack, 
>> that should never be allowed to be bad really if the compiler did 
>> what we want it to do.
>> It turns out that it is still possible for bad oops to end up in the 
>> stack. This is attributed to spills of the loaded bad oop that never 
>> really got usedin the application. These spills could be observed in 
>> the disassembly. While the oop isn't used in the nmethod, it might be 
>> used by e.g. deoptimization, forcing it to stick around. The cause 
>> for the spill causing the trouble is that the medium slow path 
>> (LoadBarrierSlowRegNode) defines (DEF) a new value returned from the 
>> slow path reloads the oop into that register, instead of using the 
>> original oop as input. So the LoadBarrierSlowRegNode would sometimes 
>> get a different register to the original load, causing the loaded bad 
>> value to spill bad values for deopt sometimes, eventually causing 
>> crashes in heap iteration code and asserts.
>>
>> To solve the problem, the original oop was added as an input edge to 
>> LoadBarrierSlowRegNode, so that we in the AD file could have the 
>> input and output of the node be the same, matching the original load. 
>> That attempt first produced a bad AD file because the ADLC parser 
>> doesn't like that the LoadBarrierSlowRegNode had an additional edge 
>> when it is modeled as a load, which is treated specially in places. 
>> It has been originally modeled as a load because it reloads the value 
>> and hence needs to quack like a load. However, with the new late 
>> barrier expansion approach we can re-model it as a TypeNode instead 
>> because the slowreg node can't really move around so much in the 
>> circumstances where they are now expanded (late and tucked into tight 
>> control flow blocks). Nevertheless, it is a bit iffy and a follow-up 
>> RFE will remove the current practice of returning reloaded values 
>> from the barrier slowpaths to make this increasingly sane. But I want 
>> this fix to be as minimal as possible as it's going in to 13, and the 
>> risk of making such changes for 13 is not worth it.
>>
>> Thanks goes to StefanK, Per and Nils for helping diagnose the 
>> problem, finding good reproducers, helping out testing the fix, and 
>> good discussions along the way.
>>
>> This patch has made it through mach5 hs-tier1-5, many runs of 
>> gc-testsuite, 100 kitchensink runs, and ~1 day of StefanK's quite 
>> reliable reproducer with custom StefanK verification code that 
>> normally crashes within an hour.
>>
>> Webrev:
>> http://cr.openjdk.java.net/~eosterlund/8225642/webrev.00/
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8225642
>>
>> Thanks,
>> /Erik



More information about the hotspot-compiler-dev mailing list