RFR: 8351464: Shenandoah: Hang on ShenandoahController::handle_alloc_failure when run test TestAllocHumongousFragment#generational [v4]

Wed Mar 12 17:27:55 UTC 2025

On Wed, 12 Mar 2025 16:40:40 GMT, William Kemper <wkemper at openjdk.org> wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 171:
>> 
>>> 169: 
>>> 170:       // If this cycle completed without being cancelled, notify waiters about it
>>> 171:       if (!heap->cancelled_gc()) {
>> 
>> I feel we should remove the test `!heap->cancelled_gc()` here, if is fine if there is single mutator thread, but in most cases there are mutator threads, then the following case could happen:
>> 1. **Mutator A** try to cancel GC and notify control thread, it will wait with `_alloc_failure_waiters_lock`, `_cancelled_cause` is set to `_allocation_failure` 
>> 2. Concurrent GC clear `_cancelled_cause` and set it to `_no_gc` in op_final_update_refs 
>> 3.  **Mutator B** try to cancel GC and successfully set  `_cancelled_cause` to `_allocation_failure` again.
>> 4. Concurrent GC finishes.
>> 5. Control thread check `!heap->cancelled_gc()` which is false, and won't wake up mutators.
>> 
>> In this case, it will delay the wake up for mutator A & B to next cycle.
>
> I believe that is the correct behavior. The mutators are waiting until there is memory available. If mutator B cannot allocate, there is no reason to believe mutator A would be able to allocate. In this case, it is fine for both mutators to wait (even if it means A has to wait an extra cycle).

Thanks for the explanation, re-read the relevant codes I think it make sense, when Mutator B fails to allocate when Concurrent GC is at `op_final_update_refs`, very unlikely there is enough space for Mutator A.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23997#discussion_r1991968117