RFR: 8240693: Sweeper should not examine dying metadata in is_unloading() nmethod during static call stub cleaning
Erik Österlund
erik.osterlund at oracle.com
Tue Apr 7 20:56:18 UTC 2020
Hi Vladimir,
Thanks for the review!
/Erik
> On 7 Apr 2020, at 19:34, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
>
> +1
>
> Thanks,
> Vladimir
>
>>> On 4/7/20 3:46 AM, Per Liden wrote:
>> Looks good!
>> /Per
>>> On 4/2/20 6:28 PM, Erik Österlund wrote:
>>> Hi,
>>> There is currently an unfortunate race condition between concurrent code cache unloading and the Sweeper
>>> (affecting ZGC and Shenandoah).
>>> First of all, big thanks to Nils for the initial analysis, requiring downloading a 90 GB core file. Ouch.
>>> He discovered that we crashed accessing a freed metadata relocation in an unloaded nmethod from the sweeper.
>>> I am fairly certain it is due to the following race condition that can indeed happen today:
>>> 1. GC terminates marking in a safepoint
>>> 2. GC performs concurrent code cache unlinking. In this step, inline caches are cleaned for is_alive() &&
>>> !is_unloading() nmethods. The key is that it is not performed for is_alive() && is_unloading() nmethods.
>>> So metadata relocations of their static/opt virtual call stubs remain in the nmethod, and could potentially
>>> be dying.
>>> 3. A handshake operation is issued to separate unlinking from purging. The sweeper replies to the handshake.
>>> 4. GC performs concurrent code cache purging, making all nmethods unloaded. But just before one nmethod is unloaded...
>>> 5. The sweeper starts processing the nmethod. It is is_alive() && is_unloading() and the sweeper correctly
>>> decides to clean its inline caches. As part of that inline cache cleaning, it examines if the metadata
>>> relocations of static/opt virtual call stubs need clearing (conditional of the holder of the target method being unloading).
>>> 6. Just before examining the metadata relocation, the sweeper gets preempted.
>>> 7. The GC finishes concurrent code cache purging.
>>> 8. The GC finishes concurrent metaspace purging (the Method* is now dead).
>>> 9. The sweeper wakes up from its slumber and starts processing the now freed Method* of the metadata relocation.
>>> 10. Crash.
>>> The key here is that the reason why the sweeper clears metadata relocations of static/opt virtual call stubs,
>>> is to unlink the metadata of is_alive() && !is_unloading() nmethods, so that they are not used after the class
>>> unloading handshake. However, the GC does not unlink said metadata from is_alive() && is_unloading() nmethods,
>>> because it is assumed that nobody should follow metadata from is_unloading() nmethods. This violation in the
>>> sweeper is the exception to that rule.
>>> The fix is to skip the processing of metadata relocations of is_unloading() nmethods, during inline cache cleaning.
>>> I also noticed another related but harmless violation of this rule by class redefinition, where metadata only filters
>>> is_alive(), not && !is_unloading(). I figured that this is actually harmless, because class redefinition happens in
>>> a safepoint, and code cache purging is done with the suspendible thread set taken, blocking out safepoints. Therefore
>>> it will happen either completely before or after code cache purging. If it happens before, then the metadata is not
>>> yet deleted, and if it happens after, then it is no longer is_alive(). However, I would like to fix that anyway,
>>> just to make the model more consistent.
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8240693
>>> Webrev:
>>> http://cr.openjdk.java.net/~eosterlund/8240693/webrev.00/
>>> Sanity checked with hs-tier1-8 with ZGC.
>>> Thanks,
>>> /Erik
More information about the hotspot-dev
mailing list