RFR: 8350621: Code cache stops scheduling GC [v2]

Thu Jul 10 20:56:43 UTC 2025

On Thu, 10 Jul 2025 13:05:30 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:

>> Hi all,
>> 
>>   please review this change to avoid CodeCache triggered GCs temporarily being ignored.
>> 
>> In particular, G1 does not make sure when its `collect()` method is called during a concurrent cycle, that a `Remark` pause that does code unloading etc. actually occurs after that request. This makes it so that some internal flag is not reset appropriately (via some callback to the caller). After this event, there will never be another attempt by the compiler threads to issue a garbage collection. The compiler threads are stuck until the next code unloading (caused by e.g. a regular concurrent cycle being triggered).
>> 
>> The [original PR](https://github.com/openjdk/jdk/pull/23656) also stated a similar with Parallel GC - however due to recent changes (https://bugs.openjdk.org/browse/JDK-8192647 and others) I do not think it is possible any more that the `collect()`call can silently be ignored. (I.e. gclocker could abort a full gc that is the only reason why code cache unloading and that callback will not occur as expected).
>> 
>> So for G1, the change makes the caller of `collect()` (i.e. the compiler thread) wait until that situation passes and retry. With that the compiler thread is sure that the callback the compiler threads are waiting for occurs. This does have the disadvantage that that compiler thread may not be available for compilation any more for a bit.
>> 
>> Testing: tier1-5, test case passing, failing before
>> 
>> Thanks,
>>   Thomas
>
> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision:
> 
>   * kbarret review:
>     - restructure code in `try_collect_concurrently`
>     - fix asserts - they only worked in the test because of whitebox being active
>     - fix comments
>   * threalmdoerr review:
>     - fix test to be more lenient

Changes requested by kbarrett (Reviewer).

src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1755:

> 1753:         old_marking_started_before = old_marking_started_after;
> 1754:       }
> 1755:     } else if (GCCause::is_codecache_requested_gc(cause) && op.marking_in_progress()) {

I wonder if this might be simpler if the codecache-request case were
completely handled by it's own clause. This would involve "duplicating" some
of the user-request case. But I think that could be done by chunking that into
a couple helper functions, bounded by the existing assertions in the
user-request case. So something like this (I don't love the names of the
helpers here):

if (is_codecache_requested_gc(cause)) {
  if (op.marking_in_progress()) {
    LOG... ;
    return true;
  }
  // is there some assert similar to the user-request assert that makes sense here?
  if (wait_for_completion(old_marking_started_before,
                          old_marking_started_after,
                          old_marking_completed_after)) {
    return true;
  }
  // is there some assert similar to the user-request assert that makes sense here?
  if (need_normal_retry(op)) {
    continue;
  }
}

And similarly in the user-request case.  And obviously with more comments :)

While looking at that, I noticed that the whitebox control stall in the
user-request case is doing exactly what the comment about it says not to do.
It is waiting for control to be released, while the comment says not to do
that. This of course means there could be multiple complete collections (and
not just STW full collections) before it can proceed. This seems to be in the
initial published webrev for JDK-8240239. The immediately preceding (draft,
non-public) webrev has that comment but different code, so it looks like the
code was changed but the comment wasn't updated. Bad Kim!

I will file a bug for that and assign it to myself. If you decide to adopt
something like the above suggested refactoring, just retain that code and
comment as-is. I'll see if I can recall what happened there, and update the
comment accordingly. Unfortunately, the (non-public) email discussion between
those versions doesn't make any mention of this change.

src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1757:

> 1755:     } else if (GCCause::is_codecache_requested_gc(cause) && op.marking_in_progress()) {
> 1756:         // For a CodeCache requested GC, before marking, progress is ensured as the
> 1757:         // following Remark pause unloads code (and signals the requestr such).

spelling: requestr

src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 1768:

> 1766:       // (1) the VMOp successfully performed a GC,
> 1767:       // (2) a concurrent cycle was already in progress and we were
> 1768:       //     before the Remark pause for CodeCache requested GCs,

I think this comment change is no longer needed.  Which is good, because I find it confusingly
ambiguous, but had trouble figuring out what it should say instead. I think the original is fine,
because we're in a non-codecache request context.

-------------

PR Review: https://git.openjdk.org/jdk/pull/26189#pullrequestreview-3006657505
PR Review Comment: https://git.openjdk.org/jdk/pull/26189#discussion_r2198721164
PR Review Comment: https://git.openjdk.org/jdk/pull/26189#discussion_r2198307729
PR Review Comment: https://git.openjdk.org/jdk/pull/26189#discussion_r2198312027