RFR: 8373100: Genshen: Control thread can miss allocation failure notification [v2]

Thu Dec 11 00:15:11 UTC 2025

On Wed, 10 Dec 2025 23:35:45 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 145:
>> 
>>> 143:   // Notifies the control thread, but does not update the requested cause or generation.
>>> 144:   // The overloaded variant should be used when the _control_lock is already held.
>>> 145:   void notify_cancellation(GCCause::Cause cause);
>> 
>> These methods were the root cause here. `ShenandoahHeap::_cancelled_gc` is read/written atomically, but `ShenandoahGenerationalControlThread::_requested_gc_cause` is read/written under a lock. These `notify_cancellation` methods did _not_ update `_requested_gc_cause` at all. So, in the failure I observed, we had:
>> 1. Control thread finishes cycle and sees no cancellation is requested (no lock used).
>> 2. Mutator thread fails allocation, cancels GC (again, no lock used), and does _not_ change `_requested_gc_cause`.
>> 3. Control thread takes `_control_lock`, checks `_requested_gc_cause`, sees `_no_gc` (because `notify_cancellation` didn't change it), and now waits forever.
>> 
>> 
>> The fix here is to replace `notify_cancellation` with `notify_control_thread`, which serializes updates to `_requested_gc_cause` under `_control_lock`.
>
> I was looking at the places where `ShenandoahHeap::clear_cancelled_gc` is called, and I feel the problem is more likely to come from `op_final_update_refs`:
> 
> 
> void ShenandoahConcurrentGC::op_final_update_refs() {
>   ShenandoahHeap* const heap = ShenandoahHeap::heap();
>    ... 
>   ...
>   // Clear cancelled GC, if set. On cancellation path, the block before would handle
>   // everything.
>   if (heap->cancelled_gc()) {
>     heap->clear_cancelled_gc();
>   }
>   ...
>   ...
> }
> 
> 
> Let's say a concurrent GC is running and, right before the final update refs safepoint, a mutator allocation fails:
> 1. The mutator tries to cancel the concurrent GC and notify the control thread.
> 2. The mutator blocks itself on `_alloc_failure_waiters_lock`, which also makes it safepoint-safe.
> 3. The concurrent GC enters final update refs (a VM operation).
> 4. In final update refs, the VMThread sees `cancelled_gc` and clears it.
> 5. The concurrent GC finishes, but `cancelled_gc` has been cleared, so it won't notify the mutator.
> 
> The fix seems to work in generational mode, but may not work in non-generational mode.

While staring at the code in `ShenandoahController::handle_alloc_failure` today, I found a discrepancy between `ShenandoahGenerationalControlThread` and `ShenandoahControlThread`. I created a [bug](https://bugs.openjdk.org/browse/JDK-8373468) to unify the behavior; we could fix the issue in `ShenandoahControlThread` there.
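
For reference, the lost notification described in the first quoted comment can be modeled outside HotSpot. The sketch below is not the Shenandoah code: `ControlThreadModel`, `Cause`, `notify_cancellation_old`, and `replay` are hypothetical names, `std::mutex` and `std::condition_variable` stand in for `_control_lock`, and a 50 ms timeout stands in for waiting forever. It only assumes what the quoted analysis states, namely that the cancellation flag is atomic while the requested cause is guarded by the lock.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the relevant GCCause values.
enum class Cause { no_gc, allocation_failure };

// Toy model of the control thread state; not the HotSpot classes.
struct ControlThreadModel {
  std::atomic<bool> cancelled_gc{false};  // models the atomic cancellation flag
  std::mutex control_lock;                // models _control_lock
  std::condition_variable cv;
  Cause requested_cause = Cause::no_gc;   // models _requested_gc_cause (lock-guarded)

  // Old pattern: cancel and notify, but never record the cause.
  void notify_cancellation_old() {
    cancelled_gc.store(true);
    std::lock_guard<std::mutex> guard(control_lock);
    cv.notify_all();
  }

  // Fixed pattern: the cause is updated under the same lock the waiter checks.
  void notify_control_thread(Cause cause) {
    cancelled_gc.store(true);
    std::lock_guard<std::mutex> guard(control_lock);
    requested_cause = cause;
    cv.notify_all();
  }

  // Control thread side: returns true if a request is visible before the timeout.
  bool wait_for_request(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(control_lock);
    return cv.wait_for(lock, timeout,
                       [this] { return requested_cause != Cause::no_gc; });
  }
};

// Replays steps 1-3 of the quoted failure in order, on a single thread.
// Returns true if the control thread ends up seeing the request.
bool replay(bool use_fix) {
  ControlThreadModel model;
  // Step 1: control thread finishes its cycle and sees no cancellation.
  bool saw_cancel = model.cancelled_gc.load();
  // Step 2: mutator fails an allocation and cancels the GC.
  if (use_fix) {
    model.notify_control_thread(Cause::allocation_failure);
  } else {
    model.notify_cancellation_old();
  }
  // Step 3: control thread takes the lock and checks the requested cause.
  return !saw_cancel && model.wait_for_request(std::chrono::milliseconds(50));
}
```

In this model `replay(false)` times out because the cause is still `no_gc` when the waiter checks it, while `replay(true)` returns immediately. It does not cover the `op_final_update_refs` scenario from the second quoted comment.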

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28665#discussion_r2608651279