RFR: 8373819: GenShen: Requested generation may be null [v4]

Y. Srinivas Ramakrishna ysr at openjdk.org
Sat Jan 10 00:17:42 UTC 2026


On Fri, 9 Jan 2026 17:52:53 GMT, William Kemper <wkemper at openjdk.org> wrote:

>> This PR simplifies the generational control thread by decoupling it somewhat from the heap/gc cancellation mechanism. This is meant to prevent the control thread from seeing inconsistencies between `shHeap::_cancelled_gc` and `shGenControlThread::_requested_gc_cause`.
>
> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits:
> 
>  - Merge remote-tracking branch 'jdk/master' into fix-null-generation-crash
>  - Merge remote-tracking branch 'jdk/master' into fix-null-generation-crash
>  - Fix typo in assertion message
>  - Take regulator thread out of STS before requesting GC
>    
>    The request may block while it waits for control thread to stop old marking. If workers are already in the STS, and the regulator thread is still in the STS, but cannot yield, the safepoint will not run. Control, worker and regulator threads deadlock each other.
>  - Add comments
>  - Revert back to what should be on this branch
>  - Merge remote-tracking branch 'jdk/master' into fix-null-generation-crash
>  - Don't know how this file got deleted
>  - Carry over gc cancellation to gc request
>  - Do not let allocation failure requests be overwritten by other requests
>  - ... and 4 more: https://git.openjdk.org/jdk/compare/f5fa9e40...2e57f0ac

I left a few comments and my uneasiness with the code as structured, mainly because I am not sure I understand the interaction protocol between the interacting threads clearly enough or the state being protected by the locks to be certain that this all works correctly. Piecemeal these all make sense, but I don't have a sufficient overall understanding to say confidently that this code looks good.

However, since this has been shown to fix an existing production issue, I'll go ahead and approve it.

I would really like this interaction protocol clearly written down and reasoned through to ensure that it's right, and to structure it in a manner that makes it easier to reason about and maintain.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 95:

> 93:   notify_gc_waiters();
> 94:   notify_alloc_failure_waiters();
> 95:   set_gc_mode(stopped);

Will they "observe the shutdown" if mode isn't "stopped"? Should line 95 move before line 93? Also is `gc_mode`'s state transitions protected in any manner, so they are consistent with any other observable state from the standpoint of threads that may interact with the controller? I see for example that the RegulatorThread looks at the controller thread's gc_mode to make certain GC triggering decisions, but it doesn't look like the reading of the mode and the requesting of a GC are protected by a lock that prevents races/glitches. In general such interactions lead to an increase in the state-space of interactions that the parties need to deal with correctly.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 708:

> 706: 
> 707: void ShenandoahGenerationalControlThread::notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation) {
> 708:   assert(_control_lock.is_locked(), "Request lock must be held here");

What is MonitorLocker, and what is `_control_lock`? Why are we checking `_control_lock` here? If ml is being passed here for the purposes of notification on it, it must be the case that it's locked and the assert at line 708 is a leakage of abstraction? I see that your problem here is that along one path you do nothing and and don't post a notification and along another you do some work and post a notification. It sounds like what you want instead is an API of the following shape:


   bool should_notify_control_thread(cause, generation) { ... };


and callers might do:


   MonitorLocker ml(...);
   if (should_notify_control_thread(cause, generation)) {
      ml.notify();
   }
   ml.wait();

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 721:

> 719: }
> 720: 
> 721: void ShenandoahGenerationalControlThread::notify_control_thread(GCCause::Cause cause) {

Apropos my comment above, this code has a bad smell that some versions of the method expect the lock to be held, and other methods acquire the lock.

It makes reasoning about the code at an abstract level very error-prone, and potentially difficult to maintain correctly over time.

-------------

Marked as reviewed by ysr (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/28932#pullrequestreview-3645834579
PR Review Comment: https://git.openjdk.org/jdk/pull/28932#discussion_r2677970564
PR Review Comment: https://git.openjdk.org/jdk/pull/28932#discussion_r2677861000
PR Review Comment: https://git.openjdk.org/jdk/pull/28932#discussion_r2677882435


More information about the hotspot-gc-dev mailing list