RFR: 8305403: Shenandoah evacuation workers may deadlock
Roman Kennke
rkennke at openjdk.org
Fri Apr 7 20:07:46 UTC 2023
On Mon, 3 Apr 2023 21:45:39 GMT, William Kemper <wkemper at openjdk.org> wrote:
> Shenandoah evacuation workers may deadlock if a safepoint begins during concurrent evacuation _and_ a GC worker thread experiences an out of memory error during the evacuation.
>
> This situation occurs because some of the workers have set the cancelled state to `NOT_CANCELLED` and then yielded to the suspendible thread set because a safepoint is starting. Other workers who then experience an OOM during evacuation attempt to transition the heap from `CANCELLABLE` to `CANCELLED` in a CAS loop that will never succeed (because the cancelled state is `NOT_CANCELLED`). These workers are unable to join the suspendible thread set, so the threads which have yielded are unable to resume and reset the heap to `CANCELLABLE`. The VM thread cannot enter the safepoint and eventually all of the mutator threads block when they are unable to allocate.
>
> The changes here remove the `NOT_CANCELLED` state and remove the CAS loop used to transition from `CANCELLABLE` to `CANCELLED`. Additionally, worker threads that need to suspend no longer put the heap in the `NOT_CANCELLED` state before suspending.
>
> In order to test this, the behavior of the diagnostic flag was modified to provoke this scenario:
>
> diff --git a/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp b/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
> index 4158f4bee22..e261dd3a81b 100644
> --- a/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
> +++ b/src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp
> @@ -292,8 +292,7 @@ inline oop ShenandoahHeap::evacuate_object(oop p, Thread* thread) {
> HeapWord* copy = nullptr;
>
> #ifdef ASSERT
> - if (ShenandoahOOMDuringEvacALot &&
> - (os::random() & 1) == 0) { // Simulate OOM every ~2nd slow-path call
> + if (ShenandoahOOMDuringEvacALot && thread->is_Worker_thread() && SuspendibleThreadSet::should_yield() && uint(os::random()) % 1000 < 1) {
> copy = nullptr;
> } else {
> #endif
>
>
> This applies the failure only to worker threads and only when they are expected to suspend and only with a 1/1000 chance of failing for each evacuation. Without these fixes, the modification to `ShenandoahOOMDuringEvacALot` causes a deadlock in under two minutes of execution on [the extremem benchmark](https://github.com/corretto/heapothesys/tree/master/Extremem) with these additional diagnostic flags:
>
> -XX:+SafepointALot -XX:+ShenandoahOOMDuringEvacALot -XX:ConcGCThreads=12 -XX:GuaranteedSafepointInterval=50
>
> After these changes, the benchmark no longer deadlocks.
>
> Additional testing includes jtreg `test/hotspot/jtreg:hotspot_gc_shenandoah`, the Dacapo benchmark suite, HyperAlloc and specjbb 2015.
Given that I can't remember why we (I) did the complex transition in the first place, let's get rid of it and see what explodes (if anything). There's good chances that the original reasons are no longer relevant. We might even consider a more aggressive approach, see my comments (your call).
src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1743:
> 1741:
> 1742: bool ShenandoahHeap::try_cancel_gc() {
> 1743: jbyte prev = _cancelled_gc.cmpxchg(CANCELLED, CANCELLABLE);
Is there even a need to CAS here? The transition here can only go from CANCELLABLE to CANCELLED, is that correct? If that is so, then the CAS would only fail if another thread made that transition before us. Is that a relevant scenario that we must know about, or could we simply store CANCELLED?
src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 337:
> 335: "end of concurrent marking.") \
> 336: \
> 337: product(bool, ShenandoahSuspendibleWorkers, true, EXPERIMENTAL, \
Given that turning this off would most likely result in correctness issues, or certainly will in the future when Lilliput arrives, we should consider removing this option altogether.
-------------
Marked as reviewed by rkennke (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13309#pullrequestreview-1376540055
PR Review Comment: https://git.openjdk.org/jdk/pull/13309#discussion_r1160933835
PR Review Comment: https://git.openjdk.org/jdk/pull/13309#discussion_r1160934484
More information about the hotspot-gc-dev
mailing list