RFR: 8349094: GenShen: Race between control and regulator threads may violate assertions [v14]

Y. Srinivas Ramakrishna ysr at openjdk.org
Wed Feb 26 01:32:03 UTC 2025


On Tue, 25 Feb 2025 22:06:19 GMT, William Kemper <wkemper at openjdk.org> wrote:

>> There are several changes to the operation of Shenandoah's control threads here.
>> * The reason for cancellation is now recorded in `ShenandoahHeap::_cancelled_gc` as a `GCCause`, instead of various member variables in the control thread.
>>   *  The cancellation handling is driven entirely by the cancellation cause
>>   * The graceful shutdown, alloc failure, humongous alloc failure and preemption requested flags are all removed
>>   * The shutdown sequence is simpler
>> * The generational control thread uses a lock to coordinate updates to the requested cause and generation
>>   * APIs have been simplified to avoid converting between the generation `type` and the actual generation instance
>>   * The old heuristic, rather than the control thread itself, is now responsible for resuming old generation cycles
>>   *  The control thread doesn't loop on its own (unless the pacer is enabled).
>> 
>> ## Testing
>> * jtreg hotspot_gc_shenandoah
>> * dacapo, extremem, diluvian, specjbb2015, specjvm2018, heapothesys
>
> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 32 commits:
> 
>  - Merge tag 'jdk-25+11' into fix-control-regulator-threads
>    
>    Added tag jdk-25+11 for changeset 0131c1bf
>  - Address review feedback (better comments, better names)
>  - Merge remote-tracking branch 'jdk/master' into fix-control-regulator-threads
>  - Old gen bootstrap cycle must make it to init mark
>  - Merge remote-tracking branch 'jdk/master' into fix-control-regulator-threads
>  - Improve message for assertion
>  - Make shutdown safer for threads requesting (or expecting) gc
>  - Do not accept requests if control thread is terminating
>  - Notify waiters when control thread terminates
>  - Add event for control thread state changes
>  - ... and 22 more: https://git.openjdk.org/jdk/compare/0131c1bf...d7858deb

A few random comments, mostly on documentation, and a few assertion suggestions.

Rest looks good. Thanks for tightening up the code via your changes in this PR.

I don't think I should need to re-review any changes you make stemming from my (rally quite minor) comments.

Reviewed; 🚢

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 114:

> 112: }
> 113: 
> 114: void ShenandoahGenerationalControlThread::check_for_request(ShenandoahGCRequest& request) {

Comment:

// Fill in the cause, generation requested, and set gc_mode, all under the lock.
// Make any relevant changes to the control state of heuristics or policy objects.


Just as an aside, these internal work methods hve cascading effects on a bunch of states, all covered by the control lock. It would be interesting to see which threads contend for this lock and whether there are circumstances under which the lock might become hot of contended.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 125:

> 123:     if (request.cause == GCCause::_shenandoah_concurrent_gc) {
> 124:       request.generation = _heap->young_generation();
> 125:       _heap->clear_cancelled_gc(false);

label name of parameter to the call to help the reader.  `/* clear_oom_handler */`

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 717:

> 715: 
> 716: void ShenandoahGenerationalControlThread::notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation) {
> 717:   assert(_control_lock.is_locked(), "Request lock must be held here");

Somewhat paranoid suggestion: I'd use the stronger, if slightly more expensive, `owned_by_self()` rather than the weaker `is_locked()`. You can alternatively use `assert_lock_strong(...)`.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 720:

> 718:   log_debug(gc, thread)("Notify control (%s): %s, %s", gc_mode_name(gc_mode()), GCCause::to_string(cause), generation->name());
> 719:   _requested_gc_cause = cause;
> 720:   _requested_generation = generation;

This _may_ be a good spot to add any potential invariant (sanity) checks for the consistency of `cause` and `generation` to catch any potential issues.

For example, I might check I am not overwriting legit previous requests (?), and that any new request I am creating are sensible.

Such assertion/invariant checks may be relegated into a work method to avoid cluttering this method.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 730:

> 728: 
> 729: void ShenandoahGenerationalControlThread::notify_cancellation(MonitorLocker& ml, GCCause::Cause cause) {
> 730:   assert(_heap->cancelled_gc(), "GC should already be cancelled");

Not sure about this, but if cancellations can happen only when `request*` fields are clear (or may be not?), then this would be a good place to do such invariant checks (much as the suggestion above  in `notify_control_thread()`.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.cpp line 795:

> 793:   if (_mode != new_mode) {
> 794:     log_debug(gc, thread)("Transition from: %s to: %s", gc_mode_name(_mode), gc_mode_name(new_mode));
> 795:     EventMark event("Control thread transition from: %s, to %s", gc_mode_name(_mode), gc_mode_name(new_mode));

As in my suggestions above for `notify_control_thread` and `notify_cancellation`, this might be an opportune time to do any sanity/consistency checks for `gc_mode` updates (wrt the other two `_request*` fields.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 75:

> 73: 
> 74:   // The mode is read frequently by requesting threads and only ever written by the control thread.
> 75:   volatile GCMode _mode;

A bit of a nit: Any reason not to just call it `_gc_mode` which seems now to be its _de facto_ name due to its accessor method name and how it's referred to in comments?

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 132:

> 130: 
> 131:   void set_gc_mode(GCMode new_mode);
> 132:   void set_gc_mode(MonitorLocker& ml, GCMode new_mode);

// Set gc mode under lock and post a notification. The second variant is called from
// contexts where the lock is already held.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 135:

> 133:   static const char* gc_mode_name(GCMode mode);
> 134: 
> 135:   // Takes the request lock and updates the requested cause and generation, then notifies the control thread.

// The second variant is used from contexts where the lock is already held.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 139:

> 137:   void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation);
> 138: 
> 139:   // Notifies the control thread, but does not update the requested cause or generation.

```// The second variant ...```

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 143:

> 141:   void notify_cancellation(MonitorLocker& ml, GCCause::Cause cause);
> 142: 
> 143:   void maybe_set_aging_cycle();

1 line documentation comment for the private APIs.

src/hotspot/share/gc/shenandoah/shenandoahGenerationalControlThread.hpp line 150:

> 148:   GCMode prepare_for_explicit_gc_request(ShenandoahGCRequest &request);
> 149: 
> 150:   GCMode prepare_for_concurrent_gc_request(ShenandoahGCRequest &request);

Documentation of private APIs.

Bit of a nit: These all seem to take a request type, fill in some yet unfilled fields, and return a GC mode. So they look to me like `prepare_request_for_<action>` rather than `pepare_for_<action>_request`.

src/hotspot/share/gc/shenandoah/shenandoahHeap.hpp line 443:

> 441: 
> 442: public:
> 443:   // Returns true if and only if cancellation request was successfully communicated.

This comment should probably go at line 455, and here you could say ```// Has GC been cancelled?```

src/hotspot/share/gc/shenandoah/shenandoahHeap.hpp line 687:

> 685:   HeapWord* allocate_from_gclab_slow(Thread* thread, size_t size);
> 686:   HeapWord* allocate_new_gclab(size_t min_size, size_t word_size, size_t* actual_size);
> 687:   bool retry_allocation(size_t original_full_gc_count) const;

It feels like this should be called `should_retry_allocation()`. Also a 1-line documentation comment: ``` // We want to retry an unsuccessful attempt at allocation until at least a full gc. ```

-------------

Marked as reviewed by ysr (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23475#pullrequestreview-2642645029
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970692572
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970689728
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970730527
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970747110
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970750529
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970752575
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970669392
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970673227
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970754002
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970754196
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970674398
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970678568
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970758140
PR Review Comment: https://git.openjdk.org/jdk/pull/23475#discussion_r1970761571


More information about the hotspot-gc-dev mailing list