RFR: 8240693: Sweeper should not examine dying metadata in is_unloading() nmethod during static call stub cleaning
Erik Österlund
erik.osterlund at oracle.com
Thu Apr 2 16:28:51 UTC 2020
Hi,
There is currently an unfortunate race condition between concurrent code
cache unloading and the Sweeper
(affecting ZGC and Shenandoah).
First of all, big thanks to Nils for the initial analysis, requiring
downloading a 90 GB core file. Ouch.
He discovered that we crashed accessing a freed metadata relocation in
an unloaded nmethod from the sweeper.
I am fairly certain it is due to the following race condition that can
indeed happen today:
1. GC terminates marking in a safepoint
2. GC performs concurrent code cache unlinking. In this step, inline
caches are cleaned for is_alive() &&
!is_unloading() nmethods. The key is that it is not performed for
is_alive() && is_unloading() nmethods.
So metadata relocations of their static/opt virtual call stubs
remain in the nmethod, and could potentially
be dying.
3. A handshake operation is issued to separate unlinking from purging.
The sweeper replies to the handshake.
4. GC performs concurrent code cache purging, making all nmethods
unloaded. But just before one nmethod is unloaded...
5. The sweeper starts processing the nmethod. It is is_alive() &&
is_unloading() and the sweeper correctly
decides to clean its inline caches. As part of that inline cache
cleaning, it examines if the metadata
relocations of static/opt virtual call stubs need clearing
(conditional of the holder of the target method being unloading).
6. Just before examining the metadata relocation, the sweeper gets
preempted.
7. The GC finishes concurrent code cache purging.
8. The GC finishes concurrent metaspace purging (the Method* is now dead).
9. The sweeper wakes up from its slumber and starts processing the now
freed Method* of the metadata relocation.
10. Crash.
The key here is that the reason why the sweeper clears metadata
relocations of static/opt virtual call stubs,
is to unlink the metadata of is_alive() && !is_unloading() nmethods, so
that they are not used after the class
unloading handshake. However, the GC does not unlink said metadata from
is_alive() && is_unloading() nmethods,
because it is assumed that nobody should follow metadata from
is_unloading() nmethods. This violation in the
sweeper is the exception to that rule.
The fix is to skip the processing of metadata relocations of
is_unloading() nmethods, during inline cache cleaning.
I also noticed another related but harmless violation of this rule by
class redefinition, where metadata only filters
is_alive(), not && !is_unloading(). I figured that this is actually
harmless, because class redefinition happens in
a safepoint, and code cache purging is done with the suspendible thread
set taken, blocking out safepoints. Therefore
it will happen either completely before or after code cache purging. If
it happens before, then the metadata is not
yet deleted, and if it happens after, then it is no longer is_alive().
However, I would like to fix that anyway,
just to make the model more consistent.
Bug:
https://bugs.openjdk.java.net/browse/JDK-8240693
Webrev:
http://cr.openjdk.java.net/~eosterlund/8240693/webrev.00/
Sanity checked with hs-tier1-8 with ZGC.
Thanks,
/Erik
More information about the hotspot-dev
mailing list