RFR: 8308507: G1: GClocker induced GCs can starve threads requiring memory leading to OOME
Thomas Schatzl
tschatzl at openjdk.org
Tue May 23 08:51:18 UTC 2023
On Mon, 22 May 2023 11:44:24 GMT, Ivan Walulya <iwalulya at openjdk.org> wrote:
> Please review this change which fixes the thread starvation problem during allocation for G1.
>
> The starvation problem is not limited to GCLocker, however, currently, it manifests as an OOME only when GCLocker is active. In other cases, the starvation only affects the "starved" thread as it may loop indefinitely.
>
> Starvation with an active GCLocker happens as below:
>
> 1. Thread A tries to allocate memory as normal, and tries to start a GC; the GCLocker is active and so the thread gets stalled waiting for the GC.
> 2. GCLocker induced GC executes and frees some memory.
> 3. Thread A does not get any of that memory, but other threads also waiting for memory.
> 4. Goto 1 until the gclocker retry count has been reached.
>
> In this change, we take the general approach to solving starvation problems with announcement tables (request queues). On slow allocation, a thread that wishes to complete an Allocation GC and then attempt an allocation, announces its allocation request before proceeding to participate in a race to execute a GC safepoint. Whichever thread succeeds in executing the Allocation GC safepoint will be tasked with completing all allocation requests that were announced before the safepoint. This guarantees that all announced allocation requests are either satisfied during the safepoint, or failed in case there is not enough memory to complete all requests. This effectively deals with the starvation issue and reduces the number of allocation GCs triggered.
>
> Note: The change also adopts ZList from ZGC and makes it available under utilities as DoublyLinkedList with slight modifications.
>
> Testing: Tier 1-7
Initial pass/comments.
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 906:
> 904: bool G1CollectedHeap::upgrade_to_full_collection() {
> 905: GCCauseSetter compaction(this, GCCause::_g1_compaction_pause);
> 906: // Reset any allocated but yet claimed allocation requests.
Suggestion:
// Reset any allocated but yet unclaimed allocation requests.
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 961:
> 959: assert_at_safepoint_on_vm_thread();
> 960:
> 961: bool success = handle_allocation_requests(false /* expect_null_mutator_alloc_region*/);
Maybe it would be good to rename `success` to something more specific to avoid confusion with `gc_succeeded`. Something like `alloc_succeeded` (compared to succeeding the gc and succeeding the entire method).
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 964:
> 962:
> 963: if (success) {
> 964: return success;
Suggestion:
return success;
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 980:
> 978:
> 979: if (success) {
> 980: return success;
Suggestion:
return success;
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 985:
> 983: // Attempt to satisfy allocation requests after full-gc also failed. We reset the allocation requests
> 984: // then execute a maximal compaction full-gc before retrying the allocations
> 985:
Superfluous newline (compared to other occurrences of similar comments)
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1006:
> 1004: for (StalledAllocReq* alloc_req; iter.next(&alloc_req);) {
> 1005: _satisfied_allocations.insert_last(alloc_req);
> 1006: alloc_req->set_state(StalledAllocReq::AllocationState::Failed);
Maybe that's just me but I prefer code changing the request first and then inserting it. I.e. process then pass on to something else instead of the other way around.
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1042:
> 1040:
> 1041: const uint active_numa_nodes = G1NUMA::numa()->num_active_nodes();
> 1042: bool *expect_null_alloc_regions = (bool *)alloca(active_numa_nodes * sizeof(bool));
Suggestion:
bool *expect_null_alloc_regions = (bool*)alloca(active_numa_nodes * sizeof(bool));
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1047:
> 1045: }
> 1046:
> 1047: while(true) {
Suggestion:
while (true) {
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1058:
> 1056: alloc_req->node_index(),
> 1057: expect_null_alloc_regions[alloc_req->node_index()]
> 1058: );
Suggestion:
expect_null_alloc_regions[alloc_req->node_index()]);
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1067:
> 1065: expect_null_alloc_regions[alloc_req->node_index()] = false;
> 1066:
> 1067: // Allocation succeeded, update the state and result of the allocation request
Suggestion:
// Allocation succeeded, update the state and result of the allocation request.
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1071:
> 1069:
> 1070: if (is_humongous(alloc_req->size())) {
> 1071: // Calculate payload size and initialize the humongous object with a fillerArray
Suggestion:
// Calculate payload size and initialize the humongous object with a fillerArray.
It is nowhere explained why this is needed.
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1089:
> 1087: }
> 1088:
> 1089: // Move the allocation request from stalled to satisfied list
Suggestion:
// Move the allocation request from stalled to satisfied list.
src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1112:
> 1110: }
> 1111:
> 1112: // Attempting to expand the heap sufficiently
Suggestion:
// Attempts to expand the heap sufficiently
* the formatting of this comment paragraph is very "jagged"
* this looks like a comment that should go into the .hpp file
src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 172:
> 170: friend class G1CheckRegionAttrTableClosure;
> 171:
> 172: class StalledAllocReq : public DoublyLinkedListNode {
Please add a short comment about it; maybe it would even be useful to describe the lifecycle (state transitions) of the allocation request here.
src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 174:
> 172: class StalledAllocReq : public DoublyLinkedListNode {
> 173: public:
> 174: enum class AllocationState {
Suggestion:
enum class AllocationState {
src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 179:
> 177: Pending,
> 178: };
> 179: StalledAllocReq(size_t size, uint numa_node) :
Suggestion:
StalledAllocReq(size_t size, uint numa_node) :
src/hotspot/share/gc/g1/g1CollectedHeap.hpp line 190:
> 188: size_t size() {
> 189: return _size;
> 190: }
These trivial getters should probably be written in a single line.
src/hotspot/share/gc/g1/g1ConcurrentMark.cpp line 1110:
> 1108: if (hr->is_humongous()) {
> 1109: oop obj = cast_to_oop(hr->humongous_start_region()->bottom());
> 1110: if (G1CollectedHeap::is_obj_filler(obj)) { // Object allocated, but not well-formed
Suggestion:
if (G1CollectedHeap::is_obj_filler(obj)) { // Object allocated, but not well-formed.
src/hotspot/share/gc/g1/g1ConcurrentMark.cpp line 1163:
> 1161: if (hr->is_starts_humongous()) {
> 1162: oop obj = cast_to_oop(hr->bottom());
> 1163: if (G1CollectedHeap::is_obj_filler(obj)) { // Object allocated, but not well-formed
Same as above.
src/hotspot/share/gc/g1/g1VMOperations.cpp line 132:
> 130: // Any allocation requests that were handled during a previous GC safepoint but have not been observed
> 131: // by the requesting mutator thread should be reset to pending. This makes it easier for the current GC to
> 132: // treat the unclaimed memory as garbage.
Suggestion:
// by the requesting mutator thread should be reset to pending. This makes it easier for the current GC to
// treat the unclaimed memory as garbage. It also simplifies the initial allocation in the safepoint next.
This might cause additional gcs. What would happen if `handle_allocation_requests` just skipped already satisfied allocations (as successful) and only if that fails reset all requests (i.e. around line 148)?
src/hotspot/share/gc/g1/g1VMOperations.cpp line 139:
> 137:
> 138: if (has_pending_allocations) {
> 139: bool success = g1h->handle_allocation_requests(false /* expect_null_mutator_alloc_region*/);
Same comment about the `success` name as elsewhere.
src/hotspot/share/gc/g1/g1VMOperations.cpp line 139:
> 137:
> 138: if (has_pending_allocations) {
> 139: bool success = g1h->handle_allocation_requests(false /* expect_null_mutator_alloc_region*/);
Suggestion:
bool gc_succeeded = false;
// If any allocation has been requested, try to do that first.
bool has_pending_allocations = !g1h->_stalled_allocations.is_empty();
if (has_pending_allocations) {
bool success = g1h->handle_allocation_requests(false /* expect_null_mutator_alloc_region*/);
src/hotspot/share/gc/g1/g1VMOperations.cpp line 143:
> 141: if (success) {
> 142: return;
> 143: }
Suggestion:
if (success) {
return;
}
src/hotspot/share/gc/shared/collectedHeap.inline.hpp line 50:
> 48:
> 49: size_t CollectedHeap::filler_array_hdr_size() {
> 50: return align_object_offset(arrayOopDesc::header_size(T_INT)); // align to Long
Suggestion:
return align_object_offset(arrayOopDesc::header_size(T_INT)); // Align to INT.
(pre-existing)
src/hotspot/share/gc/shared/collectedHeap.inline.hpp line 54:
> 52:
> 53: size_t CollectedHeap::filler_array_min_size() {
> 54: return align_object_size(filler_array_hdr_size()); // align to MinObjAlignment
Suggestion:
return align_object_size(filler_array_hdr_size()); // Align to MinObjAlignment.
(pre-existing)
src/hotspot/share/utilities/doublyLinkedList.hpp line 2:
> 1: /*
> 2: * Copyright (c) 2015, 2020, Oracle and/or its affiliates. All rights reserved.
Suggestion:
* Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
(This is a new file, isn't it? An alternative would be to use `2015, 2023,` here.)
test/hotspot/jtreg/gc/TestAllocHumongousFragment.java line 175:
> 173: * @library /test/lib
> 174: *
> 175: * @run main/othervm -Xlog:gc -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -Xmx1g -Xms1g
Is this change intentional?
-------------
Changes requested by tschatzl (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/14077#pullrequestreview-1438928751
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201740872
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201745207
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201741950
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201742801
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201746341
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201751478
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201757214
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201756543
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201759971
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201760757
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201761316
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201764617
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201766512
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201769172
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201769417
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201769892
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201770800
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201776160
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201791853
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201813308
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201803997
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201807696
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201803304
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201821648
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201822329
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201822647
PR Review Comment: https://git.openjdk.org/jdk/pull/14077#discussion_r1201827147
More information about the hotspot-gc-dev
mailing list