RFR: 8240693: Sweeper should not examine dying metadata in is_unloading() nmethod during static call stub cleaning

Thu Apr 2 16:28:51 UTC 2020

Hi,

There is currently an unfortunate race condition between concurrent code 
cache unloading and the Sweeper
(affecting ZGC and Shenandoah).

First of all, big thanks to Nils for the initial analysis, requiring 
downloading a 90 GB core file. Ouch.
He discovered that we crashed accessing a freed metadata relocation in 
an unloaded nmethod from the sweeper.

I am fairly certain it is due to the following race condition that can 
indeed happen today:

1. GC terminates marking in a safepoint
2. GC performs concurrent code cache unlinking. In this step, inline 
caches are cleaned for is_alive() &&
    !is_unloading() nmethods. The key is that it is not performed for 
is_alive() && is_unloading() nmethods.
    So metadata relocations of their static/opt virtual call stubs 
remain in the nmethod, and could potentially
    be dying.
3. A handshake operation is issued to separate unlinking from purging. 
The sweeper replies to the handshake.
4. GC performs concurrent code cache purging, making all nmethods 
unloaded. But just before one nmethod is unloaded...
5. The sweeper starts processing the nmethod. It is is_alive() && 
is_unloading() and the sweeper correctly
    decides to clean its inline caches. As part of that inline cache 
cleaning, it examines if the metadata
    relocations of static/opt virtual call stubs need clearing 
(conditional of the holder of the target method being unloading).
6. Just before examining the metadata relocation, the sweeper gets 
preempted.
7. The GC finishes concurrent code cache purging.
8. The GC finishes concurrent metaspace purging (the Method* is now dead).
9. The sweeper wakes up from its slumber and starts processing the now 
freed Method* of the metadata relocation.
10. Crash.

The key here is that the reason why the sweeper clears metadata 
relocations of static/opt virtual call stubs,
is to unlink the metadata of is_alive() && !is_unloading() nmethods, so 
that they are not used after the class
unloading handshake. However, the GC does not unlink said metadata from 
is_alive() && is_unloading() nmethods,
because it is assumed that nobody should follow metadata from 
is_unloading() nmethods. This violation in the
sweeper is the exception to that rule.

The fix is to skip the processing of metadata relocations of 
is_unloading() nmethods, during inline cache cleaning.
I also noticed another related but harmless violation of this rule by 
class redefinition, where metadata only filters
is_alive(), not && !is_unloading(). I figured that this is actually 
harmless, because class redefinition happens in
a safepoint, and code cache purging is done with the suspendible thread 
set taken, blocking out safepoints. Therefore
it will happen either completely before or after code cache purging. If 
it happens before, then the metadata is not
yet deleted, and if it happens after, then it is no longer is_alive(). 
However, I would like to fix that anyway,
just to make the model more consistent.

Bug:
https://bugs.openjdk.java.net/browse/JDK-8240693

Webrev:
http://cr.openjdk.java.net/~eosterlund/8240693/webrev.00/

Sanity checked with hs-tier1-8 with ZGC.

Thanks,
/Erik