RFR: 8308507: G1: GClocker induced GCs can starve threads requiring memory leading to OOME

Tue May 23 09:44:58 UTC 2023

On Tue, 23 May 2023 08:40:21 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Please review this change which fixes the thread starvation problem during allocation for G1.
>> 
>> The starvation problem is not limited to GCLocker; however, it currently manifests as an OOME only when GCLocker is active. In other cases, the starvation only affects the "starved" thread, which may loop indefinitely.
>> 
>> Starvation with an active GCLocker happens as below:
>> 
>> 1. Thread A tries to allocate memory as normal, and tries to start a GC; the GCLocker is active and so the thread gets stalled waiting for the GC.
>> 2. GCLocker induced GC executes and frees some memory.
>> 3. Thread A does not get any of that memory, but other threads also waiting for memory do.
>> 4. Go to step 1 until the GCLocker retry count has been reached.
>> 
>> In this change, we take the general approach to solving starvation problems with announcement tables (request queues). On slow allocation, a thread that wishes to complete an Allocation GC and then attempt an allocation announces its allocation request before proceeding to participate in a race to execute a GC safepoint. Whichever thread succeeds in executing the Allocation GC safepoint is tasked with completing all allocation requests that were announced before the safepoint. This guarantees that every announced allocation request is either satisfied during the safepoint or failed if there is not enough memory to complete all requests. This effectively deals with the starvation issue and reduces the number of allocation GCs triggered.
>> 
>> Note: The change also adopts ZList from ZGC and makes it available under utilities as DoublyLinkedList with slight modifications. 
>> 
>> Testing: Tier 1-7
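
For readers following the thread, the announcement-table idea described above can be sketched roughly as follows. This is a minimal single-threaded model, not the code in the PR: `AllocRequest`, `RequestQueue`, `announce`, and `handle_all` are hypothetical names, and the free memory after a collection is modeled as a plain word count.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical names throughout; these are not the identifiers used in the PR.
enum class RequestState { Pending, Satisfied, Failed };

struct AllocRequest {
  int thread_id;
  std::size_t words;
  RequestState state;

  AllocRequest(int id, std::size_t w)
    : thread_id(id), words(w), state(RequestState::Pending) {}
};

class RequestQueue {
  std::deque<AllocRequest*> _announced;

public:
  // A thread on the slow allocation path announces its request
  // before racing to execute the GC safepoint.
  void announce(AllocRequest* r) {
    assert(r != nullptr);
    _announced.push_back(r);
  }

  // Run by whichever thread wins the race. Every request announced
  // before the safepoint ends up either Satisfied or Failed.
  // free_words models the memory available after the collection.
  void handle_all(std::size_t free_words) {
    while (!_announced.empty()) {
      AllocRequest* r = _announced.front();
      _announced.pop_front();
      if (r->words <= free_words) {
        free_words -= r->words;
        r->state = RequestState::Satisfied;
      } else {
        r->state = RequestState::Failed;
      }
    }
  }

  bool empty() const { return _announced.empty(); }
};
```

Because the winning thread serves the queue in announcement order, no waiting thread can have its request repeatedly overtaken by later arrivals, which is the starvation pattern in steps 1 to 4 above.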
>
> src/hotspot/share/gc/g1/g1VMOperations.cpp line 132:
> 
>> 130:   // Any allocation requests that were handled during a previous GC safepoint but have not been observed
>> 131:   // by the requesting mutator thread should be reset to pending. This makes it easier for the current GC to
>> 132:   // treat the unclaimed memory as garbage.
> 
> Suggestion:
> 
>   // by the requesting mutator thread should be reset to pending. This makes it easier for the current GC to
>   // treat the unclaimed memory as garbage. It also simplifies the initial allocation in the safepoint next.
> 
> This might cause additional GCs. What would happen if `handle_allocation_requests` just skipped already satisfied allocations (as successful) and only if that fails reset all requests (i.e. around line 148)?

This would create a dependency between `handle_allocation_requests` and any collections that happen before it. We would then have to deal with how these `fillerObjects` are treated by those collections; some could be invalidated.
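
To illustrate the reset described in the quoted comment, here is a rough sketch under stated assumptions. `Request`, `State`, the `observed` flag, and `reset_unobserved` are all hypothetical names chosen for illustration and do not come from the PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: a request handled in an earlier safepoint may not
// yet have been observed by the mutator thread that announced it.
enum class State { Pending, Satisfied };

struct Request {
  std::size_t words;
  State state;
  bool observed;  // has the requesting thread picked up its result?
  void* result;   // memory handed out in the earlier safepoint, if any
};

// Before the current GC runs, put satisfied-but-unobserved requests back
// to Pending and drop their result pointers, so the collector can treat
// the unclaimed memory as garbage. Returns how many requests were reset.
std::size_t reset_unobserved(std::vector<Request>& requests) {
  std::size_t reset_count = 0;
  for (Request& r : requests) {
    if (r.state == State::Satisfied && !r.observed) {
      r.state = State::Pending;
      r.result = nullptr;
      ++reset_count;
    }
  }
  return reset_count;
}
```

With every unobserved request back in the pending state, the handler does not need to know what an earlier collection did with memory it had already handed out.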

> src/hotspot/share/utilities/doublyLinkedList.hpp line 2:
> 
>> 1: /*
>> 2:  * Copyright (c) 2015, 2020, Oracle and/or its affiliates. All rights reserved.
> 
> Suggestion:
> 
>  * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> 
> (This is a new file, isn't it? An alternative would be to use `2015, 2023,` here.)

I wasn't sure how to approach this; it's a new file, but the code is taken from ZList.

> test/hotspot/jtreg/gc/TestAllocHumongousFragment.java line 175:
> 
>> 173:  * @library /test/lib
>> 174:  *
>> 175:  * @run main/othervm -Xlog:gc -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -Xmx1g -Xms1g
> 
> Is this change intentional?

This should be moved to a different PR.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201886847
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201882088
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201880343