RFR: 8351464: Shenandoah: Hang on ShenandoahController::handle_alloc_failure when run test TestAllocHumongousFragment#generational [v4]
Xiaolong Peng
xpeng at openjdk.org
Wed Mar 12 00:29:53 UTC 2025
On Wed, 12 Mar 2025 00:05:05 GMT, William Kemper <wkemper at openjdk.org> wrote:
>> Failed allocations may race to cancel the GC with the collector who is working to clear the cancelled GC. When the GC wins this race, it will fail to notify threads that are waiting for the failed GC cycle to complete. This change directly tracks the number of threads waiting due to an allocation failure, rather than indirectly tracking them through the cancelled gc state.
>>
>> # Testing
>> Ran TestAllocHumongousFragment#generational 6,500 times without failures.
>
> William Kemper has updated the pull request incrementally with one additional commit since the last revision:
>
> The non-generational modes may also fail to notify waiters
src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 171:
> 169:
> 170: // If this cycle completed without being cancelled, notify waiters about it
> 171: if (!heap->cancelled_gc()) {
I feel we should remove the test `!heap->cancelled_gc()` here, if is fine if there is single mutator thread, but in most cases there are mutator threads, then the following case could happen:
1. **Mutator A** try to cancel GC and notify control thread, it will wait with `_alloc_failure_waiters_lock`, `_cancelled_cause` is set to `_allocation_failure`
2. Concurrent GC clear `_cancelled_cause` and set it to `_no_gc` in op_final_update_refs
3. **Mutator B** try to cancel GC and successfully set `_cancelled_cause` to `_allocation_failure` again.
4. Concurrent GC finishes.
5. Control thread check `!heap->cancelled_gc()` which is false, and won't wake up mutators.
In this case, it will delay the wake up for mutator A & B to next cycle.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23997#discussion_r1990344257
More information about the shenandoah-dev
mailing list