RFR: 8350621: Code cache stops scheduling GC
Alexandre Jacob
duke at openjdk.org
Mon Apr 28 19:50:26 UTC 2025
The purpose of this PR is to fix a bug where `CodeCache` can end up in a state where it no longer schedules the GC.
This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC, and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish`, which in turn calls `CodeCache::update_cold_gc_count`, which resets `_unloading_threshold_gc_requested` and allows further GC scheduling.
Unfortunately, this does not work properly under certain circumstances.
For example, with G1GC, calling `G1CollectedHeap::collect` does not guarantee that the GC will actually run, as a collection can already be in progress (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)).
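To make the failure mode concrete, below is a minimal C++ sketch of the intended flag round trip; the names are hypothetical stand-ins for the HotSpot internals mentioned above, not the actual sources:

    #include <atomic>

    // Hypothetical stand-ins for the real HotSpot pieces, just to make the
    // sketch self-contained.
    static std::atomic<bool> unloading_threshold_gc_requested{false};

    static bool code_cache_needs_gc() { return true; }  // free-space check
    static void request_gc() {}  // Universe::heap()->collect(...) in HotSpot

    // Called on code cache allocation: request one GC, then wait for the
    // marking cycle to clear the flag before requesting another.
    void gc_on_allocation() {
      if (code_cache_needs_gc() &&
          !unloading_threshold_gc_requested.exchange(true)) {
        request_gc();  // if the GC silently declines, nobody clears the flag
      }
    }

    // Expected to run when a GC marking cycle finishes. If the GC never
    // starts a cycle (e.g. G1 was already collecting), this reset never
    // happens and no further GC is ever requested.
    void on_gc_marking_cycle_finish() {
      unloading_threshold_gc_requested.store(false);
    }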
I have observed this behavior on JVMs running version 21 that were recently migrated from Java 17.
Those JVMs have some pressure on the code cache and quite a large heap relative to their allocation rate, which means that objects are mostly collected by young collections and full GCs take a long time to happen.
I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be affected as well.
In order to reproduce this issue, I found a very simple and convenient way:
    public class CodeCacheMain {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                Thread.sleep(100);
            }
        }
    }
Run this simple app with the following JVM flags:
    -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15
- 512m for the heap, just to make the intent clear that we don't want to be bothered by a full GC
- a low `ReservedCodeCacheSize` to put pressure on the code cache quickly
- `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction
By itself, the program will hardly put any pressure on the code cache, but the good news is that attaching a jconsole to it is sufficient, which will:
- allow us to monitor the code cache
- indirectly generate activity on the code cache, which is just what we need to reproduce the bug
Some logs related to the code cache will show up at some point, along with GC activity:
    [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory
Then it will stop, and we end up with the following message:
    [672.714s][info][codecache ] Code cache is full - disabling compilation
This leaves the JVM in an unstable state.
I considered a few different options before making this change:
1) Always call `Universe::heap()->collect(...)` without making any check (the GC impl should handle the situation)
2) Fix all GC implementations to ensure `_unloading_threshold_gc_requested` goes back to `false` at some point (probably what is supposed to happen today)
3) Change `CollectedHeap::collect` to return a `bool` instead of `void` to indicate if GC was run or scheduled
But I discarded them:
1) A blunt option that I used to verify the bug would indeed be fixed, but it would likely put some pressure on resources when allocations need to be performed at the code cache level (as it would be called on each allocation attempt). In addition, the log line indicating that we are triggering a GC gets spammed, and it is not easy to decide how to handle that logging correctly.
2) This option is possible and was my favorite up to a point. GC implementations can have quite a lot of branches, and it can be difficult to ensure we don't miss a case where the flag must be reset. It could eventually be explored in addition to the solution I propose in this PR: we could introduce a static method in `CodeCache` that lets a GC implementation reset the flag when it knows the requested GC will not actually run (to be discussed; see the sketch after this list).
3) I explored this solution, but it requires quite a lot of changes and is risky in the long term (in my opinion). G1GC already has a [G1CollectedHeap::try_collect](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1870) method that returns a `bool`, but it returns `true` even when the GC is not run.
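For illustration, here is a hypothetical shape of the option-2 hook; the method name is invented, it is not part of this PR, and this is not the actual HotSpot class:

    #include <atomic>

    // Sketch only: a GC implementation that decides not to run a requested
    // cycle would call gc_request_not_serviced() so that the code cache can
    // request a GC again later.
    class CodeCache {
     public:
      static void gc_request_not_serviced() {
        _unloading_threshold_gc_requested.store(false);
      }
     private:
      static std::atomic<bool> _unloading_threshold_gc_requested;
    };

    std::atomic<bool> CodeCache::_unloading_threshold_gc_requested{false};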
As a result, I decided to simply add a way for `CodeCache` to recover from this situation. The idea is to leave the GC code as-is, but remember the time of the last GC request and reset the flag to `false` if it has not been reset within a certain amount of time (250ms in my PR). This should only kick in in corner cases where the GC implementation has not reset the flag itself.
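A minimal sketch of that recovery idea follows, with hypothetical names; the actual change lives in `CodeCache` and uses HotSpot primitives rather than the standard library:

    #include <atomic>
    #include <chrono>

    using Clock = std::chrono::steady_clock;

    // The flag plus the time of the last GC request are the only state needed.
    static std::atomic<bool> gc_requested{false};
    static std::atomic<Clock::time_point> last_gc_request{Clock::time_point{}};
    constexpr auto kGcRequestTimeout = std::chrono::milliseconds(250);

    // Returns true if a new GC may be requested. If the previous request was
    // never acknowledged (the marking cycle never reset the flag) and the
    // grace period has elapsed, assume the request was dropped and recover.
    bool may_request_gc() {
      if (!gc_requested.load()) {
        return true;
      }
      if (Clock::now() - last_gc_request.load() > kGcRequestTimeout) {
        gc_requested.store(false);  // recover from a dropped request
        return true;
      }
      return false;
    }

    // On a successful request, record the time so the timeout can be measured.
    void record_gc_request() {
      last_gc_request.store(Clock::now());
      gc_requested.store(true);
    }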
Among the advantages of this solution: it provides a safety net for recovering from situations that future changes to a GC implementation might create, should someone forget to take care of the code cache.
I took a lot of time investigating this issue and exploring solutions, and I welcome any input on it, as this is my first PR on the project.
-------------
Commit messages:
- _unloading_gc_requested should remain volatile
- remove early returns from gc_on_allocation
- fix race condition in try_to_gc
- log before GC
- fix log message
- XXXXXXX: Fix code cache GC
Changes: https://git.openjdk.org/jdk/pull/23656/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23656&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8350621
Stats: 77 lines in 2 files changed: 45 ins; 10 del; 22 mod
Patch: https://git.openjdk.org/jdk/pull/23656.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23656/head:pull/23656
PR: https://git.openjdk.org/jdk/pull/23656