RFR: 8305403: Shenandoah evacuation workers may deadlock

William Kemper wkemper at openjdk.org
Tue Apr 4 00:03:07 UTC 2023


On Mon, 3 Apr 2023 23:11:30 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

>> Shenandoah evacuation workers may deadlock if a safepoint begins during concurrent evacuation _and_ a GC worker thread experiences an out of memory error during the evacuation.
>> 
>> This situation occurs because some of the workers have set the cancelled state to `NOT_CANCELLED` and then yielded to the suspendible thread set because a safepoint is starting. Other workers who then experience an OOM during evacuation attempt to transition the heap from `CANCELLABLE` to `CANCELLED` in a CAS loop that will never succeed (because the cancelled state is `NOT_CANCELLED`). These workers are unable to join the suspendible thread set, so the threads which have yielded are unable to resume and reset the heap to `CANCELLABLE`. The VM thread cannot enter the safepoint and eventually all of the mutator threads block when they are unable to allocate.
>> 
>> The changes here remove the `NOT_CANCELLED` state and remove the CAS loop used to transition from `CANCELLABLE` to `CANCELLED`. Additionally, worker threads that need to suspend no longer put the heap in the `NOT_CANCELLED` state before suspending.
>> 
>> In order to test this, the behavior of the diagnostic flag was modified to provoke this scenario:
>> 
>> diff --git a/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp b/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
>> index 4158f4bee22..e261dd3a81b 100644
>> --- a/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
>> +++ b/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
>> @@ -292,8 +292,7 @@ inline oop ShenandoahHeap::evacuate_object(oop p, Thread* thread) {
>>    HeapWord* copy = nullptr;
>>  
>>  #ifdef ASSERT
>> -  if (ShenandoahOOMDuringEvacALot &&
>> -      (os::random() & 1) == 0) { // Simulate OOM every ~2nd slow-path call
>> +  if (ShenandoahOOMDuringEvacALot && thread->is_Worker_thread() && SuspendibleThreadSet::should_yield() && uint(os::random()) % 1000 < 1) {
>>          copy = nullptr;
>>    } else {
>>  #endif
>> 
>> 
>> This applies the failure only to worker threads and only when they are expected to suspend and only with a 1/1000 chance of failing for each evacuation. Without these fixes, the modification to `ShenandoahOOMDuringEvacALot` causes a deadlock in under two minutes of execution on [the extremem benchmark](https://github.com/corretto/heapothesys/tree/master/Extremem) with these additional diagnostic flags:
>> 
>> -XX:+SafepointALot -XX:+ShenandoahOOMDuringEvacALot -XX:ConcGCThreads=12 -XX:GuaranteedSafepointInterval=50
>> 
>> After these changes, the benchmark no longer deadlocks.
>> 
>> Additional testing includes jtreg `test/hotspot/jtreg:hotspot_gc_shenandoah`, the Dacapo benchmark suite, HyperAlloc and specjbb 2015.
>
> src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 337:
> 
>> 335:           "end of concurrent marking.")                                     \
>> 336:                                                                             \
>> 337:   product(bool, ShenandoahSuspendibleWorkers, true, EXPERIMENTAL,           \
> 
> Can we provide some more information under help as to why workers should or should not be suspendible?

I would like to know this myself ;) There are phases of class unloading that unconditionally suspend the control and worker threads. It's not clear to me why marking or evacuation threads would need to suspend.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13309#discussion_r1156564586


More information about the hotspot-gc-dev mailing list