RFR: 8373100: Genshen: Control thread can miss allocation failure notification [v2]

Thu Dec 11 00:00:37 UTC 2025

On Fri, 5 Dec 2025 18:47:56 GMT, William Kemper <wkemper at openjdk.org> wrote:

>> William Kemper has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Set requested gc cause under a lock when allocation fails
>
> src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 145:
> 
>> 143:   // Notifies the control thread, but does not update the requested cause or generation.
>> 144:   // The overloaded variant should be used when the _control_lock is already held.
>> 145:   void notify_cancellation(GCCause::Cause cause);
> 
> These methods were the root cause here. `ShenandoahHeap::_canceled_gc` is read/written atomically, but `ShenandoahGenerationalControlThread::_requested_gc_cause` is read/written under a lock. These `notify_cancellation` methods did _not_ update `_requested_gc_cause` at all. So, in the failure I observed we had:
> 1. Control thread finishes cycle and sees no cancellation is requested (no lock used).
> 2. Mutator thread fails allocation, cancels GC (again, no lock used), and does _not_ change `_requested_gc_cause`.
> 3. Control thread takes `_control_lock` and checks `_requested_gc_cause` and sees  `_no_gc`  (because `notify_cancellation` didn't change it) and `waits` forever now.
> 
> 
> The fix here is to replace `notify_cancellation` with `notify_control_thread` which serializes updates to `_requested_gc_cause` under  `_control_lock`.
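
A minimal single-threaded sketch of the interleaving quoted above, replayed deterministically. `ControlModel`, `Cause`, `notify_cancellation_old`, and `replay` are hypothetical stand-ins and not the real HotSpot classes; only the shape (a lock-free cancellation flag next to a lock-guarded requested cause) is taken from the description.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for GCCause::Cause; only two values are modelled.
enum class Cause { no_gc, allocation_failure };

// Hypothetical model of the state described in the quoted comment.
struct ControlModel {
  std::atomic<bool> cancelled{false};    // cancellation flag, accessed without a lock
  std::mutex control_lock;               // models _control_lock
  Cause requested_cause = Cause::no_gc;  // models _requested_gc_cause, guarded by control_lock

  // Old behaviour as described: cancel, but leave the requested cause untouched.
  void notify_cancellation_old() { cancelled.store(true); }

  // Behaviour described for the fix: publish the cause under the lock.
  // (The real code would also notify the monitor; omitted in this model.)
  void notify_control_thread(Cause cause) {
    cancelled.store(true);
    std::lock_guard<std::mutex> guard(control_lock);
    requested_cause = cause;
  }

  // True if the control thread would find no request and go to sleep.
  bool control_thread_would_wait() {
    std::lock_guard<std::mutex> guard(control_lock);
    return requested_cause == Cause::no_gc;
  }
};

// Replays steps 1-3 in order; returns true if the control thread ends up waiting.
bool replay(bool use_fix) {
  ControlModel model;
  bool saw_cancel = model.cancelled.load();  // step 1: no cancellation seen, no lock
  assert(!saw_cancel);
  if (use_fix) {
    model.notify_control_thread(Cause::allocation_failure);
  } else {
    model.notify_cancellation_old();         // step 2: cancel without updating the cause
  }
  return model.control_thread_would_wait();  // step 3: check the cause under the lock
}
```

In this model `replay(false)` leaves the control thread waiting even though a cancellation was requested, while `replay(true)` does not.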

I was looking at the places where `ShenandoahHeap::clear_cancelled_gc` is called, and I think the problem is more likely to come from `op_final_update_refs`:

void ShenandoahConcurrentGC::op_final_update_refs() {
  ShenandoahHeap* const heap = ShenandoahHeap::heap();
  ...
  ...
  // Clear cancelled GC, if set. On cancellation path, the block before would handle
  // everything.
  if (heap->cancelled_gc()) {
    heap->clear_cancelled_gc();
  }
  ...
  ...
}

Let's say a concurrent GC is running and a mutator allocation fails right before the final update refs safepoint:
1. The mutator tries to cancel the concurrent GC and notifies the control thread.
2. The mutator blocks itself on `_alloc_failure_waiters_lock`, which also makes it safepoint-safe.
3. The concurrent GC enters final update refs (a VM operation).
4. In final update refs, the VMThread sees `cancelled_gc` and clears it.
5. The concurrent GC finishes, but `cancelled_gc` has been cleared, so it won't notify the mutator.
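
The five steps can be replayed in a small deterministic model. This is only a sketch of the hypothesis above, not HotSpot code: `HeapModel`, `mutator_alloc_failure`, `finish_cycle`, and `replay_final_update_refs_race` are hypothetical names, and the assumption that waiters are released only when the cycle still sees the cancellation flag is exactly the part that needs confirming.

```cpp
#include <cassert>

// Hypothetical model of the heap state relevant to the scenario.
struct HeapModel {
  bool cancelled = false;   // models the cancelled-GC flag
  int blocked_waiters = 0;  // mutators parked on the allocation-failure lock

  // Steps 1-2: the mutator cancels the GC, then blocks waiting to be notified.
  void mutator_alloc_failure() {
    cancelled = true;
    ++blocked_waiters;
  }

  // Steps 3-4: final update refs clears the cancellation if it is set,
  // mirroring the snippet from op_final_update_refs above.
  void final_update_refs() {
    if (cancelled) {
      cancelled = false;
    }
  }

  // Step 5 (assumed behaviour): waiters are released only if the cycle
  // still observes the cancellation when it finishes.
  void finish_cycle() {
    if (cancelled) {
      blocked_waiters = 0;
      cancelled = false;
    }
  }
};

// Runs the steps in order and returns how many mutators remain blocked.
int replay_final_update_refs_race() {
  HeapModel heap;
  heap.mutator_alloc_failure();
  heap.final_update_refs();
  heap.finish_cycle();
  assert(!heap.cancelled);  // the flag was consumed by final update refs
  return heap.blocked_waiters;
}
```

Under these assumptions the replay ends with one mutator still blocked, which matches the hang described in step 5.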

The fix seems to work in generational mode, but it may not work in non-generational mode.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28665#discussion_r2608573677