RFR: 8305403: Shenandoah evacuation workers may deadlock
William Kemper
wkemper at openjdk.org
Fri Apr 7 22:02:44 UTC 2023
On Fri, 7 Apr 2023 20:01:52 GMT, Roman Kennke <rkennke at openjdk.org> wrote:
>> Shenandoah evacuation workers may deadlock if a safepoint begins during concurrent evacuation _and_ a GC worker thread experiences an out of memory error during the evacuation.
>>
>> This situation occurs because some of the workers have set the cancelled state to `NOT_CANCELLED` and then yielded to the suspendible thread set because a safepoint is starting. Other workers who then experience an OOM during evacuation attempt to transition the heap from `CANCELLABLE` to `CANCELLED` in a CAS loop that will never succeed (because the cancelled state is `NOT_CANCELLED`). These workers are unable to join the suspendible thread set, so the threads which have yielded are unable to resume and reset the heap to `CANCELLABLE`. The VM thread cannot enter the safepoint and eventually all of the mutator threads block when they are unable to allocate.
>>
>> The changes here remove the `NOT_CANCELLED` state and remove the CAS loop used to transition from `CANCELLABLE` to `CANCELLED`. Additionally, worker threads that need to suspend no longer put the heap in the `NOT_CANCELLED` state before suspending.
>>
>> In order to test this, the behavior of the diagnostic flag was modified to provoke this scenario:
>>
>> diff --git a/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp b/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
>> index 4158f4bee22..e261dd3a81b 100644
>> --- a/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
>> +++ b/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
>> @@ -292,8 +292,7 @@ inline oop ShenandoahHeap::evacuate_object(oop p, Thread* thread) {
>> HeapWord* copy = nullptr;
>>
>> #ifdef ASSERT
>> - if (ShenandoahOOMDuringEvacALot &&
>> - (os::random() & 1) == 0) { // Simulate OOM every ~2nd slow-path call
>> + if (ShenandoahOOMDuringEvacALot && thread->is_Worker_thread() && SuspendibleThreadSet::should_yield() && uint(os::random()) % 1000 < 1) {
>> copy = nullptr;
>> } else {
>> #endif
>>
>>
>> This applies the failure only to worker threads and only when they are expected to suspend and only with a 1/1000 chance of failing for each evacuation. Without these fixes, the modification to `ShenandoahOOMDuringEvacALot` causes a deadlock in under two minutes of execution on [the extremem benchmark](https://github.com/corretto/heapothesys/tree/master/Extremem) with these additional diagnostic flags:
>>
>> -XX:+SafepointALot -XX:+ShenandoahOOMDuringEvacALot -XX:ConcGCThreads=12 -XX:GuaranteedSafepointInterval=50
>>
>> After these changes, the benchmark no longer deadlocks.
>>
>> Additional testing includes jtreg `test/hotspot/jtreg:hotspot_gc_shenandoah`, the Dacapo benchmark suite, HyperAlloc and specjbb 2015.
>
> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1743:
>
>> 1741:
>> 1742: bool ShenandoahHeap::try_cancel_gc() {
>> 1743: jbyte prev = _cancelled_gc.cmpxchg(CANCELLED, CANCELLABLE);
>
> Is there even a need to CAS here? The transition here can only go from CANCELLABLE to CANCELLED, is that correct? If that is so, then the CAS would only fail if another thread made that transition before us. Is that a relevant scenario that we must know about, or could we simply store CANCELLED?
The boolean result of this method is used to have only one thread log the cancellation, create an entry in the event log and start the timer for tracking how long the cancellation takes. I think this is okay how it is. We could consider just CAS'ing a boolean instead of the enum, but the enum is arguably more readable.
> src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 337:
>
>> 335: "end of concurrent marking.") \
>> 336: \
>> 337: product(bool, ShenandoahSuspendibleWorkers, true, EXPERIMENTAL, \
>
> Given that turning this off would most likely result in correctness issues, or certainly will in the future when Lilliput arrives, we should consider removing this option altogether.
I agree - should be a separate PR?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/13309#discussion_r1160989260
PR Review Comment: https://git.openjdk.org/jdk/pull/13309#discussion_r1160989917
More information about the hotspot-gc-dev
mailing list