RFR: 8308507: G1: GClocker induced GCs can starve threads requiring memory leading to OOME [v9]

Mon Jun 12 10:40:00 UTC 2023

On Sat, 10 Jun 2023 11:00:30 GMT, Ivan Walulya <iwalulya at openjdk.org> wrote:

>> Please review this change which fixes the thread starvation problem during allocation for G1.
>> 
>> The starvation problem is not limited to GCLocker, however, currently, it manifests as an OOME only when GCLocker is active. In other cases, the starvation only affects the "starved" thread as it may loop indefinitely. 
>> 
>> Starvation with an active GCLocker happens as below:
>> 
>> 1. Thread A tries to allocate memory as normal, and tries to start a GC; the GCLocker is active and so the thread gets stalled waiting for the GC.
>> 2. GCLocker induced GC executes and frees some memory.
>> 3. Thread A does not get any of that memory, but other threads also waiting for memory.
>> 4. Goto 1 until the gclocker retry count has been reached.
>> 
>> In this change, we take the general approach to solving starvation problems with announcement tables (request queues). On slow allocation, a thread that wishes to complete an Allocation GC and then attempt an allocation, announces its allocation request before proceeding to participate in a race to execute a GC safepoint. Whichever thread succeeds in executing the Allocation GC safepoint will be tasked with completing all allocation requests that were announced before the safepoint. This guarantees that all announced allocation requests are either satisfied during the safepoint, or failed in case there is not enough memory to complete all requests. This effectively deals with the starvation issue and reduces the number of allocation GCs triggered.
>> 
>> Note: The change also adopts ZList from ZGC and makes it available under utilities as DoublyLinkedList with slight modifications. 
>> 
>> Testing: Tier 1-7
>
> Ivan Walulya has updated the pull request incrementally with one additional commit since the last revision:
> 
>   reset attempt_allocation_after_gc to word_size parameter

src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 486:

> 484:       // VM successfully scheduled a collection which failed to allocate. No
> 485:       // point in trying to allocate further. We'll just return null.
> 486:       log_debug(gc, alloc)("%s: Failed to allocate %zu words", Thread::current()->name(), request.size());

I think non-pending requests logic can be moved outside the while-loop. Combined with `if (!request.pending()){ break; }` inside the heap-lock-critical-region, one can get consistent msg for alloc-failure.

src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 220:

> 218:     void set_state(AllocationState state, HeapWord* result = nullptr) {
> 219:       _state = state;
> 220:       _result = result;

These two fields are accessed by diff threads. I wonder if they should be marked as `volatile`.

src/hotspot/share/gc/g1/g1VMOperations.cpp line 136:

> 134: 
> 135:   // Only if operation was really triggered by an allocation.
> 136:   if (_word_size) {

Better not leave out `> 0`.

src/hotspot/share/utilities/doublyLinkedList.hpp line 71:

> 69: class DoublyLinkedList {
> 70:   static_assert(std::is_default_constructible<T>::value,
> 71:                 "DoublyLinkedList requires default-constructible elements");

It seems a strange requirement on the elem type. Can this be relaxed if `_head` take `DoublyLinkedListNode` type instead (and maybe some other small fixes needed)?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1225580834
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1225581811
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1225582646
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1225578291