RFR: 8350621: Code cache stops scheduling GC

Thomas Schatzl tschatzl at openjdk.org
Tue Jun 24 08:57:31 UTC 2025


On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob <duke at openjdk.org> wrote:

> The purpose of this PR is to fix a bug where we can end up in a situation where the GC is no longer scheduled by `CodeCache`.
> 
> This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when the GC is triggered, and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish`, which in turn calls `CodeCache::update_cold_gc_count`, which resets the `_unloading_threshold_gc_requested` flag and allows further GC scheduling.
> 
> Unfortunately, this doesn't work reliably under certain circumstances.
> For example, when using G1GC, calling `G1CollectedHeap::collect` does not guarantee that the GC will actually run, as one may already be running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)).
> 
> I have observed this behavior on JVMs running version 21 that were recently migrated from Java 17.
> Those JVMs have some pressure on the code cache and quite a large heap compared to their allocation rate, which means that objects are mostly collected by young collections and full GCs take a long time to happen.
> 
> I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be affected as well.
> 
> I found a very simple and convenient way to reproduce this issue:
> 
> 
> public class CodeCacheMain {
>     public static void main(String[] args) throws InterruptedException {
>         while (true) {
>             Thread.sleep(100);
>         }
>     }
> }
> 
> 
> Run this simple app with the following JVM flags:
> 
> 
> -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15
> 
> 
> - 512m for the heap, just to make the intent clear: we don't want to be bothered by full GCs
> - low `ReservedCodeCacheSize` to put pressure on code cache quickly
> - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction
> 
> By itself, the program will hardly put any pressure on the code cache, but the good news is that it is sufficient to attach a jconsole to it, which will:
> - allow us to monitor the code cache
> - indirectly generate activity on the code cache, which is just what we need to reproduce the bug
> 
> At some point, logs related to the code cache will show up along with GC activity:
> 
> 
> [648.733s][info][codecache      ] Triggering aggressive GC due to having only 14.970% free memory
> 
> 
> Then the GC activity stops, and we end up with the following message:
> 
> 
> [672.714s][info][codecache      ] Code cache is full - disabling compilation
> 
> 
> L...
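
To make the failure mode concrete, here is a small self-contained C++ model of the request/reset handshake described above. The names mirror the ones in `codeCache.cpp`, but the bodies are illustrative only and not the actual HotSpot code:

```c++
#include <atomic>
#include <cstdio>

// Illustrative model of the code cache's GC-request handshake.
std::atomic<bool> unloading_threshold_gc_requested{false};

// Request side: called when a code cache allocation crosses the threshold.
void request_unloading_gc(double free_ratio) {
  bool expected = false;
  // Only the thread that wins the CAS requests the GC; everyone else
  // relies on the GC itself to reset the flag later.
  if (unloading_threshold_gc_requested.compare_exchange_strong(expected, true)) {
    std::printf("Triggering aggressive GC due to having only %.3f%% free memory\n",
                free_ratio * 100.0);
    // In HotSpot: Universe::heap()->collect(GCCause::_codecache_GC_aggressive).
    // With G1, that call may return without starting a new cycle if a
    // collection is already in progress -- the request is silently dropped.
  }
}

// Reset side: corresponds to CodeCache::on_gc_marking_cycle_finish calling
// CodeCache::update_cold_gc_count. It only runs if a marking cycle actually
// starts and finishes after the request was made.
void on_gc_marking_cycle_finish() {
  unloading_threshold_gc_requested.store(false);
}

// Failure mode: the CAS succeeds, collect() turns out to be a no-op, and no
// subsequent marking cycle resets the flag. Every later CAS then fails, so
// the code cache never requests a GC again.
```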

> I have a question regarding the existing code/logic.
> 
> ```
>     // In case the GC is concurrent, we make sure only one thread requests the GC.
>     if (Atomic::cmpxchg(&_unloading_threshold_gc_requested, false, true) == false) {
>       log_info(codecache)("Triggering aggressive GC due to having only %.3f%% free memory", free_ratio * 100.0);
>       Universe::heap()->collect(GCCause::_codecache_GC_aggressive);
>     }
> ```
> 
> Why make sure that only one thread calls `collect(...)`? I believe this API can be invoked concurrently.
> 
> Would removing `_unloading_threshold_gc_requested` resolve this problem?

It does, at the cost of many log messages:


[0.047s][info][gc          ] GC(0) Pause Young (Concurrent Start) (CodeCache GC Threshold) 2M->1M(512M) 4.087ms
[0.047s][info][gc,cpu      ] GC(0) User=0.01s Sys=0.00s Real=0.00s
[0.047s][info][gc          ] GC(1) Concurrent Mark Cycle
[0.047s][info][gc,marking  ] GC(1) Concurrent Scan Root Regions
[0.048s][info][codecache   ] Triggering threshold (7.654%) GC due to allocating 48.973% since last unloading (0.000% used -> 48.973% used)
[0.048s][info][gc,marking  ] GC(1) Concurrent Scan Root Regions 0.147ms
[0.048s][info][gc,marking  ] GC(1) Concurrent Mark
[0.048s][info][gc,marking  ] GC(1) Concurrent Mark From Roots
[0.048s][info][codecache   ] Triggering threshold (7.646%) GC due to allocating 49.028% since last unloading (0.000% used -> 49.028% used)
[0.048s][info][codecache   ] Triggering threshold (7.646%) GC due to allocating 49.028% since last unloading (0.000% used -> 49.028% used)
[0.048s][info][codecache   ] Triggering threshold (7.633%) GC due to allocating 49.114% since last unloading (0.000% used -> 49.114% used)
[0.049s][info][gc,task     ] GC(1) Using 6 workers of 6 for marking
[0.049s][info][codecache   ] Triggering threshold (7.625%) GC due to allocating 49.169% since last unloading (0.000% used -> 49.169% used)
[0.049s][info][codecache   ] Triggering threshold (7.616%) GC due to allocating 49.224% since last unloading (0.000% used -> 49.224% used)

[...repeated 15 times...]

[0.063s][info][codecache   ] Triggering threshold (7.527%) GC due to allocating 49.820% since last unloading (0.000% used -> 49.820% used)
[0.065s][info][codecache   ] Triggering threshold (7.519%) GC due to allocating 49.875% since last unloading (0.000% used -> 49.875% used)
[0.067s][info][codecache   ] Triggering threshold (7.511%) GC due to allocating 49.930% since last unloading (0.000% used -> 49.930% used)
[0.068s][info][gc,marking  ] GC(1) Concurrent Mark From Roots 20.256ms
[0.068s][info][gc,marking  ] GC(1) Concurrent Preclean
[0.068s][info][gc,marking  ] GC(1) Concurrent Preclean 0.016ms
[0.068s][info][gc,start    ] GC(1) Pause Remark



As you can see, this is very annoying, particularly if marking takes seconds while compilation is in progress.
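
Purely as an illustration of one possible middle ground (independent of what this PR actually does), the single-log-line behavior could be kept without a flag that a dropped request leaves stuck, e.g. by de-duplicating requests per marking cycle. A hypothetical sketch, assuming a monotonically increasing cycle counter maintained by the GC:

```c++
#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical sketch: at most one request (and one log line) per observed
// marking cycle. A new cycle re-arms the request, so nothing can get stuck.
std::atomic<uint64_t> marking_cycle_count{1};   // bumped by the GC each cycle
std::atomic<uint64_t> last_requested_cycle{0};

void maybe_request_gc(double free_ratio) {
  uint64_t cycle = marking_cycle_count.load();
  uint64_t prev  = last_requested_cycle.load();
  // Request only if no request has been made for the current cycle yet.
  if (prev < cycle && last_requested_cycle.compare_exchange_strong(prev, cycle)) {
    std::printf("Triggering aggressive GC due to having only %.3f%% free memory\n",
                free_ratio * 100.0);
    // In HotSpot: Universe::heap()->collect(GCCause::_codecache_GC_aggressive).
  }
}
```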

> 
> > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well.
> 
> For ParallelGC, `ParallelScavengeHeap::collect` contains the following check to ensure that the `System.gc` GC cause and similar explicit ones guarantee a full GC.
> 
> ```
>     if (!GCCause::is_explicit_full_gc(cause)) {
>       return;
>     }
> ```
> 
> However, the current logic, where a young GC can cancel a full GC (`_codecache_GC_aggressive` in this case), also seems surprising.

That's a different issue.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-2999414442

