RFR: 8192647: GClocker induced GCs can starve threads requiring memory leading to OOME
Thomas Schatzl
tschatzl at openjdk.org
Tue Feb 4 09:50:16 UTC 2025
On Thu, 30 Jan 2025 12:12:29 GMT, Albert Mingkun Yang <ayang at openjdk.org> wrote:
> Here is an attempt to simplify GCLocker implementation for Serial and Parallel.
>
> GCLocker prevents GC when Java threads are in a critical region (i.e., calling JNI critical APIs). JDK-7129164 introduces an optimization that updates a shared variable (used to track the number of threads in the critical region) only if there is a pending GC request. However, this also means that after reaching a GC safepoint, we may discover that GCLocker is active, preventing a GC cycle from being invoked. The inability to perform GC at a safepoint adds complexity -- for example, a caller must retry allocation if the request fails due to GC being inhibited by GCLocker.
>
> The proposed patch uses a readers-writer lock to ensure that all Java threads exit the critical region before reaching a GC safepoint. This guarantees that once inside the safepoint, we can successfully invoke a GC cycle. The approach takes inspiration from `ZJNICritical`, but some regressions were observed in j2dbench (on Windows) and the micro-benchmark in [JDK-8232575](https://bugs.openjdk.org/browse/JDK-8232575). Therefore, instead of relying on atomic operations on a global variable when entering or leaving the critical region, this PR uses an existing thread-local variable with a store-load barrier for synchronization.
>
> Performance is neutral for all benchmarks tested: DaCapo, SPECjbb2005, SPECjbb2015, SPECjvm2008, j2dbench, and CacheStress.
>
> Test: tier1-8
* Idk if GCLocker JFR events need to be available in metadata.xml if the VM does not actually ever send it. I think it does not.
Maybe it is used to decode from old recordings, may be worth asking e.g. @egahlin .
* the bot shows a failure that this PR's CR number shows up in the problemlist, that line needs to be deleted as well. Further it would be interesting to see how many retries there are in the allocation loop with these jnilock* stress test.
* another issue, probably todo is that while Parallel GC has the emergency bailout via GC Overhead limit after excessive retries, Serial does not. Which means that it might retry for a long time, which isn't good (while it did earlier if the number of retries due to gclocker exceed that threshold)
src/hotspot/share/gc/parallel/parallelScavengeHeap.cpp line 323:
> 321: }
> 322:
> 323: if (result == nullptr) {
pre-existing: is it actually possible that `result` is not `nullptr` here? The code above always returns with a non-null result.
Maybe assert this instead.
src/hotspot/share/gc/shared/gcLocker.cpp line 86:
> 84: void GCLocker::block() {
> 85: assert(_lock->is_locked(), "precondition");
> 86: assert(Atomic::load(&_is_gc_request_pending) == false, "precondition");
Suggestion:
assert(!Atomic::load(&_is_gc_request_pending), "precondition");
src/hotspot/share/gc/shared/gcLocker.cpp line 106:
> 104:
> 105: #ifdef ASSERT
> 106: // Matching the storestore in GCLocker::exit
Suggestion:
// Matching the storestore in GCLocker::exit.
src/hotspot/share/gc/shared/gcLocker.cpp line 114:
> 112: void GCLocker::unblock() {
> 113: assert(_lock->is_locked(), "precondition");
> 114: assert(Atomic::load(&_is_gc_request_pending) == true, "precondition");
Suggestion:
assert(Atomic::load(&_is_gc_request_pending), "precondition");
src/hotspot/share/gc/shared/gcLocker.hpp line 31:
> 29: #include "memory/allStatic.hpp"
> 30: #include "runtime/mutex.hpp"
> 31:
Documentation how GCLocker works/is supposed to work is missing here. It's not exactly trivial.
src/hotspot/share/gc/shared/gcLocker.hpp line 33:
> 31:
> 32: class GCLocker: public AllStatic {
> 33: static Monitor* _lock;
Not sure if having this copy/reference to `Heap_lock` makes the code more clear than referencing `Heap_lock` directly. It needs to be `Heap_lock` anyway.
src/hotspot/share/gc/shared/gcLocker.hpp line 37:
> 35:
> 36: #ifdef ASSERT
> 37: static uint64_t _debug_count;
Maybe the variable could be named something less generic, indicating what it is counting. Or add a comment.
src/hotspot/share/gc/shared/gcLocker.inline.hpp line 40:
> 38: if (Atomic::load(&_is_gc_request_pending)) {
> 39: thread->exit_critical();
> 40: // slow-path
Suggestion:
Not sure what this `slow-path` comment helps with. Maybe it is describing the next method (but it is named very similarly), or this is an attempt to describe the true-block of the if.
In the latter case, it would maybe be better to put this comment at the start of the true-block of the if, and say something more descriptive like `// Another thread is requesting gc, enter slow path.`
Not sure, feel free to ignore, it's just that to me the comment should either be removed or put upwards a line.
src/hotspot/share/gc/shared/gcLocker.inline.hpp line 56:
> 54: if (thread->in_last_critical()) {
> 55: Atomic::add(&_debug_count, (uint64_t)-1);
> 56: // Matching the loadload in GCLocker::block
Suggestion:
// Matching the loadload in GCLocker::block.
src/hotspot/share/gc/shared/gcTraceSend.cpp line 364:
> 362: #if INCLUDE_JFR
> 363:
> 364: #endif
Please remove this empty `#if/#endif` block.
src/hotspot/share/gc/shared/gc_globals.hpp line 162:
> 160: "blocked by the GC locker") \
> 161: range(0, max_uintx) \
> 162: \
This removal should warrant a release note; while it's a diagnostic option and we can remove at a whim, it is in use to workaround issues.
src/hotspot/share/prims/whitebox.cpp line 48:
> 46: #include "gc/shared/concurrentGCBreakpoints.hpp"
> 47: #include "gc/shared/gcConfig.hpp"
> 48: #include "gc/shared/gcLocker.hpp"
Suggestion:
The file does not seem to use the `GCLocker` class anymore, please remove this line as well.
-------------
Changes requested by tschatzl (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23367#pullrequestreview-2592106484
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940732531
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940775211
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940813063
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940779840
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940770235
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940769765
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940796501
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940793704
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940812598
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940746077
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940748992
PR Review Comment: https://git.openjdk.org/jdk/pull/23367#discussion_r1940752118
More information about the serviceability-dev
mailing list