RFR: 8350621: Code cache stops scheduling GC
Albert Mingkun Yang
ayang at openjdk.org
Fri May 2 18:41:53 UTC 2025
On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob <duke at openjdk.org> wrote:
> The purpose of this PR is to fix a bug where we can end up in a situation where `CodeCache` no longer schedules the GC.
>
> This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC. We expect the GC to call `CodeCache::on_gc_marking_cycle_finish`, which in turn calls `CodeCache::update_cold_gc_count`, which resets `_unloading_threshold_gc_requested` and allows further GC scheduling.
>
> Unfortunately, this cannot work properly under certain circumstances.
> For example, with G1GC, calling `G1CollectedHeap::collect` does not guarantee that the GC will actually run, because a collection may already be in progress (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)).
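>
> To make the failure sequence concrete, here is a rough timeline (a simplified sketch, assuming no other marking cycle finishes in the meantime; the real code is in codeCache.cpp):
>
> // T1: Atomic::cmpxchg(&_unloading_threshold_gc_requested, false, true)
> //     -> succeeds, the flag is now true
> // T1: Universe::heap()->collect(GCCause::_codecache_GC_aggressive)
> //     -> returns without actually running the requested GC
> //        (e.g. a collection is already in progress)
> // => CodeCache::on_gc_marking_cycle_finish() -> update_cold_gc_count()
> //    is never reached for this request, the flag stays true,
> //    and no further GC is ever requested by CodeCache.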
>
> I have observed this behavior on version 21 JVMs that were recently migrated from Java 17.
> Those JVMs have some pressure on the code cache and quite a large heap compared to the allocation rate, which means that objects are mostly collected by young GCs and full GCs take a long time to happen.
>
> I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be impacted as well.
>
> I found a very simple and convenient way to reproduce this issue:
>
>
> public class CodeCacheMain {
>     public static void main(String[] args) throws InterruptedException {
>         while (true) {
>             Thread.sleep(100);
>         }
>     }
> }
>
>
> Run this simple app with the following JVM flags:
>
>
> -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15
>
>
> - 512m for the heap, to make the intent clear that we do not want to be bothered by a full GC
> - a low `ReservedCodeCacheSize` to put pressure on the code cache quickly
> - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction
>
> By itself, the program will hardly put pressure on the code cache, but the good news is that it is sufficient to attach a jconsole to it, which will:
> - allow us to monitor the code cache
> - indirectly generate activity on the code cache, which is just what we need to reproduce the bug
>
> At some point, code-cache-related logs will show up along with GC activity:
>
>
> [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory
>
>
> Then it will stop, and we will end up with the following message:
>
>
> [672.714s][info][codecache ] Code cache is full - disabling compilation
>
>
> L...
I have a question regarding the existing code/logic.
// In case the GC is concurrent, we make sure only one thread requests the GC.
if (Atomic::cmpxchg(&_unloading_threshold_gc_requested, false, true) == false) {
  log_info(codecache)("Triggering aggressive GC due to having only %.3f%% free memory", free_ratio * 100.0);
  Universe::heap()->collect(GCCause::_codecache_GC_aggressive);
}
Why make sure only one thread calls `collect(...)`? I believe this API can be invoked concurrently.
Would removing `_unloading_threshold_gc_requested` resolve this problem?
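For illustration, here is a minimal standalone model of this one-shot pattern (hypothetical names; std::atomic stands in for HotSpot's Atomic):

#include <atomic>
#include <cstdio>

static std::atomic<bool> gc_requested{false};

// Stand-in for Universe::heap()->collect(); pretend the request is
// dropped because a collection is already in progress.
static void collect_maybe_dropped() { /* no-op: request dropped */ }

static void on_threshold_crossed() {
  bool expected = false;
  // Equivalent of Atomic::cmpxchg(&flag, false, true) == false.
  if (gc_requested.compare_exchange_strong(expected, true)) {
    collect_maybe_dropped();
    // Nothing resets gc_requested here; in HotSpot the reset happens in
    // on_gc_marking_cycle_finish(), which is never reached if the
    // collection never runs.
  } else {
    std::puts("request suppressed: flag still set");
  }
}

int main() {
  on_threshold_crossed();  // first request: flag set, collect dropped
  on_threshold_crossed();  // all later requests are suppressed forever
}

Without the flag, every threshold crossing would issue its own collect() call, at the cost of possibly redundant requests.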
> I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well.
For ParallelGC, `ParallelScavengeHeap::collect` contains the following check to ensure that the `System.gc` GC cause and similar ones guarantee a full GC.
if (!GCCause::is_explicit_full_gc(cause)) {
  return;
}
However, the current logic, whereby a young GC can cancel a full GC (`_codecache_GC_aggressive` in this case), also seems surprising.
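As a toy model of that cancellation (invented names; not the actual ParallelScavengeHeap code):

#include <cstdio>

enum class Cause { system_gc, codecache_gc_aggressive };

static bool is_explicit_full_gc(Cause c) { return c == Cause::system_gc; }

static void collect(Cause cause) {
  std::puts("young collection runs and may satisfy the request");
  if (!is_explicit_full_gc(cause)) {
    return;  // _codecache_GC_aggressive stops here: no full gc guaranteed
  }
  std::puts("explicit full-gc cause: retry until a full collection really ran");
}

int main() {
  collect(Cause::codecache_gc_aggressive);
}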
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-2847860414