RFR: 8240693: Sweeper should not examine dying metadata in is_unloading() nmethod during static call stub cleaning

Per Liden per.liden at oracle.com
Tue Apr 7 10:46:36 UTC 2020


Looks good!

/Per

On 4/2/20 6:28 PM, Erik Ă–sterlund wrote:
> Hi,
> 
> There is currently an unfortunate race condition between concurrent code 
> cache unloading and the Sweeper
> (affecting ZGC and Shenandoah).
> 
> First of all, big thanks to Nils for the initial analysis, requiring 
> downloading a 90 GB core file. Ouch.
> He discovered that we crashed accessing a freed metadata relocation in 
> an unloaded nmethod from the sweeper.
> 
> I am fairly certain it is due to the following race condition that can 
> indeed happen today:
> 
> 1. GC terminates marking in a safepoint
> 2. GC performs concurrent code cache unlinking. In this step, inline 
> caches are cleaned for is_alive() &&
>     !is_unloading() nmethods. The key is that it is not performed for 
> is_alive() && is_unloading() nmethods.
>     So metadata relocations of their static/opt virtual call stubs 
> remain in the nmethod, and could potentially
>     be dying.
> 3. A handshake operation is issued to separate unlinking from purging. 
> The sweeper replies to the handshake.
> 4. GC performs concurrent code cache purging, making all nmethods 
> unloaded. But just before one nmethod is unloaded...
> 5. The sweeper starts processing the nmethod. It is is_alive() && 
> is_unloading() and the sweeper correctly
>     decides to clean its inline caches. As part of that inline cache 
> cleaning, it examines if the metadata
>     relocations of static/opt virtual call stubs need clearing 
> (conditional of the holder of the target method being unloading).
> 6. Just before examining the metadata relocation, the sweeper gets 
> preempted.
> 7. The GC finishes concurrent code cache purging.
> 8. The GC finishes concurrent metaspace purging (the Method* is now dead).
> 9. The sweeper wakes up from its slumber and starts processing the now 
> freed Method* of the metadata relocation.
> 10. Crash.
> 
> The key here is that the reason why the sweeper clears metadata 
> relocations of static/opt virtual call stubs,
> is to unlink the metadata of is_alive() && !is_unloading() nmethods, so 
> that they are not used after the class
> unloading handshake. However, the GC does not unlink said metadata from 
> is_alive() && is_unloading() nmethods,
> because it is assumed that nobody should follow metadata from 
> is_unloading() nmethods. This violation in the
> sweeper is the exception to that rule.
> 
> The fix is to skip the processing of metadata relocations of 
> is_unloading() nmethods, during inline cache cleaning.
> I also noticed another related but harmless violation of this rule by 
> class redefinition, where metadata only filters
> is_alive(), not && !is_unloading(). I figured that this is actually 
> harmless, because class redefinition happens in
> a safepoint, and code cache purging is done with the suspendible thread 
> set taken, blocking out safepoints. Therefore
> it will happen either completely before or after code cache purging. If 
> it happens before, then the metadata is not
> yet deleted, and if it happens after, then it is no longer is_alive(). 
> However, I would like to fix that anyway,
> just to make the model more consistent.
> 
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8240693
> 
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8240693/webrev.00/
> 
> Sanity checked with hs-tier1-8 with ZGC.
> 
> Thanks,
> /Erik


More information about the hotspot-dev mailing list