RFR: 8240693: Sweeper should not examine dying metadata in is_unloading() nmethod during static call stub cleaning

Thu Apr 2 18:59:40 UTC 2020

Erik,

The redefinition part looks good to me.  It should have been is_alive 
and not unloading, like the code cache walk to determine evol dependent 
nmethods.

thanks,
Coleen

On 4/2/20 12:28 PM, Erik Österlund wrote:
> Hi,
>
> There is currently an unfortunate race condition between concurrent 
> code cache unloading and the Sweeper
> (affecting ZGC and Shenandoah).
>
> First of all, big thanks to Nils for the initial analysis, requiring 
> downloading a 90 GB core file. Ouch.
> He discovered that we crashed accessing a freed metadata relocation in 
> an unloaded nmethod from the sweeper.
>
> I am fairly certain it is due to the following race condition that can 
> indeed happen today:
>
> 1. GC terminates marking in a safepoint
> 2. GC performs concurrent code cache unlinking. In this step, inline 
> caches are cleaned for is_alive() &&
>    !is_unloading() nmethods. The key is that it is not performed for 
> is_alive() && is_unloading() nmethods.
>    So metadata relocations of their static/opt virtual call stubs 
> remain in the nmethod, and could potentially
>    be dying.
> 3. A handshake operation is issued to separate unlinking from purging. 
> The sweeper replies to the handshake.
> 4. GC performs concurrent code cache purging, making all nmethods 
> unloaded. But just before one nmethod is unloaded...
> 5. The sweeper starts processing the nmethod. It is is_alive() && 
> is_unloading() and the sweeper correctly
>    decides to clean its inline caches. As part of that inline cache 
> cleaning, it examines if the metadata
>    relocations of static/opt virtual call stubs need clearing 
> (conditional of the holder of the target method being unloading).
> 6. Just before examining the metadata relocation, the sweeper gets 
> preempted.
> 7. The GC finishes concurrent code cache purging.
> 8. The GC finishes concurrent metaspace purging (the Method* is now 
> dead).
> 9. The sweeper wakes up from its slumber and starts processing the now 
> freed Method* of the metadata relocation.
> 10. Crash.
>
> The key here is that the reason why the sweeper clears metadata 
> relocations of static/opt virtual call stubs,
> is to unlink the metadata of is_alive() && !is_unloading() nmethods, 
> so that they are not used after the class
> unloading handshake. However, the GC does not unlink said metadata 
> from is_alive() && is_unloading() nmethods,
> because it is assumed that nobody should follow metadata from 
> is_unloading() nmethods. This violation in the
> sweeper is the exception to that rule.
>
> The fix is to skip the processing of metadata relocations of 
> is_unloading() nmethods, during inline cache cleaning.
> I also noticed another related but harmless violation of this rule by 
> class redefinition, where metadata only filters
> is_alive(), not && !is_unloading(). I figured that this is actually 
> harmless, because class redefinition happens in
> a safepoint, and code cache purging is done with the suspendible 
> thread set taken, blocking out safepoints. Therefore
> it will happen either completely before or after code cache purging. 
> If it happens before, then the metadata is not
> yet deleted, and if it happens after, then it is no longer is_alive(). 
> However, I would like to fix that anyway,
> just to make the model more consistent.
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8240693
>
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8240693/webrev.00/
>
> Sanity checked with hs-tier1-8 with ZGC.
>
> Thanks,
> /Erik