RFR: 8225642: ZGC: Crash due to bad oops being spilled to stack in load barriers
Erik Österlund
erik.osterlund at oracle.com
Thu Jun 20 10:02:24 UTC 2019
Hi Nils,
Thanks for the review.
/Erik
On 2019-06-20 11:45, Nils Eliasson wrote:
> Hi Erik,
>
> Looks great!
>
> // Nils
>
> On 2019-06-20 11:38, Erik Österlund wrote:
>> Hi,
>>
>> With the new late barrier expansion mechanism, we thought we could
>> finally disable some workarounds that fixed up bad oops in the stack,
>> that should never be allowed to be bad really if the compiler did
>> what we want it to do.
>> It turns out that it is still possible for bad oops to end up in the
>> stack. This is attributed to spills of the loaded bad oop that never
>> really got usedin the application. These spills could be observed in
>> the disassembly. While the oop isn't used in the nmethod, it might be
>> used by e.g. deoptimization, forcing it to stick around. The cause
>> for the spill causing the trouble is that the medium slow path
>> (LoadBarrierSlowRegNode) defines (DEF) a new value returned from the
>> slow path reloads the oop into that register, instead of using the
>> original oop as input. So the LoadBarrierSlowRegNode would sometimes
>> get a different register to the original load, causing the loaded bad
>> value to spill bad values for deopt sometimes, eventually causing
>> crashes in heap iteration code and asserts.
>>
>> To solve the problem, the original oop was added as an input edge to
>> LoadBarrierSlowRegNode, so that we in the AD file could have the
>> input and output of the node be the same, matching the original load.
>> That attempt first produced a bad AD file because the ADLC parser
>> doesn't like that the LoadBarrierSlowRegNode had an additional edge
>> when it is modeled as a load, which is treated specially in places.
>> It has been originally modeled as a load because it reloads the value
>> and hence needs to quack like a load. However, with the new late
>> barrier expansion approach we can re-model it as a TypeNode instead
>> because the slowreg node can't really move around so much in the
>> circumstances where they are now expanded (late and tucked into tight
>> control flow blocks). Nevertheless, it is a bit iffy and a follow-up
>> RFE will remove the current practice of returning reloaded values
>> from the barrier slowpaths to make this increasingly sane. But I want
>> this fix to be as minimal as possible as it's going in to 13, and the
>> risk of making such changes for 13 is not worth it.
>>
>> Thanks goes to StefanK, Per and Nils for helping diagnose the
>> problem, finding good reproducers, helping out testing the fix, and
>> good discussions along the way.
>>
>> This patch has made it through mach5 hs-tier1-5, many runs of
>> gc-testsuite, 100 kitchensink runs, and ~1 day of StefanK's quite
>> reliable reproducer with custom StefanK verification code that
>> normally crashes within an hour.
>>
>> Webrev:
>> http://cr.openjdk.java.net/~eosterlund/8225642/webrev.00/
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8225642
>>
>> Thanks,
>> /Erik
More information about the hotspot-compiler-dev
mailing list