RFR: 8240693: Sweeper should not examine dying metadata in is_unloading() nmethod during static call stub cleaning

Erik Österlund erik.osterlund at oracle.com
Tue Apr 7 16:30:00 UTC 2020


Hi Per,

Thanks for the review.

/Erik

> On 7 Apr 2020, at 12:46, Per Liden <per.liden at oracle.com> wrote:
> 
> Looks good!
> 
> /Per
> 
>> On 4/2/20 6:28 PM, Erik Österlund wrote:
>> Hi,
>> There is currently an unfortunate race condition between concurrent code cache unloading and the Sweeper
>> (affecting ZGC and Shenandoah).
>> First of all, big thanks to Nils for the initial analysis, requiring downloading a 90 GB core file. Ouch.
>> He discovered that we crashed accessing a freed metadata relocation in an unloaded nmethod from the sweeper.
>> I am fairly certain it is due to the following race condition that can indeed happen today:
>> 1. GC terminates marking in a safepoint
>> 2. GC performs concurrent code cache unlinking. In this step, inline caches are cleaned for is_alive() &&
>>    !is_unloading() nmethods. The key is that it is not performed for is_alive() && is_unloading() nmethods.
>>    So metadata relocations of their static/opt virtual call stubs remain in the nmethod, and could potentially
>>    be dying.
>> 3. A handshake operation is issued to separate unlinking from purging. The sweeper replies to the handshake.
>> 4. GC performs concurrent code cache purging, making all nmethods unloaded. But just before one nmethod is unloaded...
>> 5. The sweeper starts processing the nmethod. It is is_alive() && is_unloading() and the sweeper correctly
>>    decides to clean its inline caches. As part of that inline cache cleaning, it examines if the metadata
>>    relocations of static/opt virtual call stubs need clearing (conditional of the holder of the target method being unloading).
>> 6. Just before examining the metadata relocation, the sweeper gets preempted.
>> 7. The GC finishes concurrent code cache purging.
>> 8. The GC finishes concurrent metaspace purging (the Method* is now dead).
>> 9. The sweeper wakes up from its slumber and starts processing the now freed Method* of the metadata relocation.
>> 10. Crash.
>> The key here is that the reason why the sweeper clears metadata relocations of static/opt virtual call stubs,
>> is to unlink the metadata of is_alive() && !is_unloading() nmethods, so that they are not used after the class
>> unloading handshake. However, the GC does not unlink said metadata from is_alive() && is_unloading() nmethods,
>> because it is assumed that nobody should follow metadata from is_unloading() nmethods. This violation in the
>> sweeper is the exception to that rule.
>> The fix is to skip the processing of metadata relocations of is_unloading() nmethods, during inline cache cleaning.
>> I also noticed another related but harmless violation of this rule by class redefinition, where metadata only filters
>> is_alive(), not && !is_unloading(). I figured that this is actually harmless, because class redefinition happens in
>> a safepoint, and code cache purging is done with the suspendible thread set taken, blocking out safepoints. Therefore
>> it will happen either completely before or after code cache purging. If it happens before, then the metadata is not
>> yet deleted, and if it happens after, then it is no longer is_alive(). However, I would like to fix that anyway,
>> just to make the model more consistent.
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8240693
>> Webrev:
>> http://cr.openjdk.java.net/~eosterlund/8240693/webrev.00/
>> Sanity checked with hs-tier1-8 with ZGC.
>> Thanks,
>> /Erik



More information about the hotspot-dev mailing list