RFR: 8350621: Code cache stops scheduling GC
Alexandre Jacob
duke at openjdk.org
Mon Apr 28 19:50:26 UTC 2025
On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob <duke at openjdk.org> wrote:
> The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`.
>
> This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling.
>
> Unfortunately, this can't work properly under certain circumstances.
> For example, when using G1GC, calling `G1CollectedHeap::collect` does not guarantee that the GC will actually run, as one can already be running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)).
>
> I have observed this behavior on JVMs running version 21 that were recently migrated from Java 17.
> Those JVMs have some pressure on the code cache and quite a large heap compared to the allocation rate, which means that objects are mostly collected by young collections and full GCs take a long time to happen.
>
> I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be impacted as well.
>
> In order to reproduce this issue, I found a very simple and convenient way:
>
>
> public class CodeCacheMain {
>     public static void main(String[] args) throws InterruptedException {
>         while (true) {
>             Thread.sleep(100);
>         }
>     }
> }
>
>
> Run this simple app with the following JVM flags:
>
>
> -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15
>
>
> - 512m for the heap, just to make clear that we don't want to be disturbed by a full GC
> - a low `ReservedCodeCacheSize` to quickly put pressure on the code cache
> - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction
>
> On its own, the program will hardly put any pressure on the code cache, but the good news is that attaching a jconsole to it is sufficient, as it will:
> - allow us to monitor the code cache
> - indirectly generate activity on the code cache, which is just what we need to reproduce the bug
>
> At some point, some logs related to the code cache will show up along with GC activity:
>
>
> [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory
>
>
> And then it will stop and we'll end up with the following message:
>
>
> [672.714s][info][codecache ] Code cache is full - disabling compilation
>
>
> L...
Here is a log sample that shows how it behaves when the bug occurs. Lines starting with `>>>` are logs I added while working on the steps to reproduce.
The bug occurs at ~648.762s: the GC has reset the flag to `false` but is still running, and CodeCache has called `Universe::heap()->collect(...)`, which was discarded because of the GC already in progress.
Note that during this test `-XX:StartAggressiveSweepingAt` was set to `25` instead of `15`, but I confirm I can reproduce with `15` as well (as explained in the description of the PR).
[648.733s][info][codecache ] >>> should start GC (_unloading_threshold_gc_requested = 0)
[648.733s][info][codecache ] Triggering aggressive GC due to having only 24.970% free memory
[648.733s][info][gc,start ] GC(6210) Pause Young (CodeCache GC Aggressive)
[648.733s][info][gc,heap ] GC(6210) PSYoungGen: 2851K(132096K)->224K(132096K) Eden: 2851K(131584K)->0K(131584K) From: 0K(512K)->224K(512K)
[648.733s][info][gc,heap ] GC(6210) ParOldGen: 11691K(349696K)->11691K(349696K)
[648.733s][info][gc,metaspace ] GC(6210) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.733s][info][gc ] GC(6210) Pause Young (CodeCache GC Aggressive) 14M->11M(470M) 0.238ms
[648.733s][info][gc,cpu ] GC(6210) User=0.00s Sys=0.00s Real=0.00s
[648.733s][info][gc,start ] GC(6211) Pause Full (CodeCache GC Aggressive)
[648.733s][info][gc,phases,start] GC(6211) Marking Phase
[648.742s][info][gc,phases ] GC(6211) Marking Phase 8.585ms
[648.742s][info][gc,phases,start] GC(6211) Summary Phase
[648.742s][info][gc,phases ] GC(6211) Summary Phase 0.009ms
[648.742s][info][gc,phases,start] GC(6211) Adjust Roots
[648.742s][info][gc,phases ] GC(6211) Adjust Roots 0.311ms
[648.742s][info][gc,phases,start] GC(6211) Compaction Phase
[648.747s][info][gc,phases ] GC(6211) Compaction Phase 4.701ms
[648.747s][info][gc,phases,start] GC(6211) Post Compact
[648.747s][info][codecache ] >>> CodeCache::update_cold_gc_count _unloading_threshold_gc_requested = false
[648.747s][info][codecache ] Code cache critically low; use aggressive aging
[648.747s][info][gc,phases ] GC(6211) Post Compact 0.106ms
[648.747s][info][gc,heap ] GC(6211) PSYoungGen: 224K(132096K)->0K(132096K) Eden: 0K(131584K)->0K(131584K) From: 224K(512K)->0K(512K)
[648.747s][info][gc,heap ] GC(6211) ParOldGen: 11691K(349696K)->11688K(349696K)
[648.747s][info][gc,metaspace ] GC(6211) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.747s][info][gc ] GC(6211) Pause Full (CodeCache GC Aggressive) 11M->11M(470M) 13.799ms
[648.747s][info][gc,cpu ] GC(6211) User=0.11s Sys=0.00s Real=0.01s
[648.747s][info][codecache ] >>> should start GC (_unloading_threshold_gc_requested = 0)
[648.747s][info][codecache ] Triggering aggressive GC due to having only 24.865% free memory
[648.747s][info][gc,start ] GC(6212) Pause Young (CodeCache GC Aggressive)
[648.748s][info][gc,heap ] GC(6212) PSYoungGen: 2851K(132096K)->224K(132096K) Eden: 2851K(131584K)->0K(131584K) From: 0K(512K)->224K(512K)
[648.748s][info][gc,heap ] GC(6212) ParOldGen: 11688K(349696K)->11688K(349696K)
[648.748s][info][gc,metaspace ] GC(6212) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.748s][info][gc ] GC(6212) Pause Young (CodeCache GC Aggressive) 14M->11M(470M) 0.257ms
[648.748s][info][gc,cpu ] GC(6212) User=0.00s Sys=0.00s Real=0.00s
[648.748s][info][gc,start ] GC(6213) Pause Full (CodeCache GC Aggressive)
[648.748s][info][gc,phases,start] GC(6213) Marking Phase
[648.756s][info][gc,phases ] GC(6213) Marking Phase 8.512ms
[648.756s][info][gc,phases,start] GC(6213) Summary Phase
[648.756s][info][gc,phases ] GC(6213) Summary Phase 0.007ms
[648.756s][info][gc,phases,start] GC(6213) Adjust Roots
[648.757s][info][gc,phases ] GC(6213) Adjust Roots 0.331ms
[648.757s][info][gc,phases,start] GC(6213) Compaction Phase
[648.761s][info][gc,phases ] GC(6213) Compaction Phase 4.734ms
[648.761s][info][gc,phases,start] GC(6213) Post Compact
[648.761s][info][codecache ] >>> CodeCache::update_cold_gc_count _unloading_threshold_gc_requested = false
[648.761s][info][codecache ] Code cache critically low; use aggressive aging
[648.761s][info][gc,phases ] GC(6213) Post Compact 0.059ms
[648.761s][info][gc,heap ] GC(6213) PSYoungGen: 224K(132096K)->0K(132096K) Eden: 0K(131584K)->0K(131584K) From: 224K(512K)->0K(512K)
[648.761s][info][gc,heap ] GC(6213) ParOldGen: 11688K(349696K)->11689K(349696K)
[648.761s][info][gc,metaspace ] GC(6213) Metaspace: 7690K(8192K)->7690K(8192K) NonClass: 6904K(7168K)->6904K(7168K) Class: 786K(1024K)->786K(1024K)
[648.761s][info][gc ] GC(6213) Pause Full (CodeCache GC Aggressive) 11M->11M(470M) 13.725ms
[648.761s][info][gc,cpu ] GC(6213) User=0.09s Sys=0.02s Real=0.01s
[648.762s][info][codecache ] >>> should start GC (_unloading_threshold_gc_requested = 0)
[648.762s][info][codecache ] Triggering aggressive GC due to having only 24.895% free memory
[648.762s][info][codecache ] >>> should start GC (_unloading_threshold_gc_requested = 1)
[648.762s][info][gc,start ] GC(6214) Pause Young (GCLocker Initiated GC)
[648.762s][info][gc,heap ] GC(6214) PSYoungGen: 1973K(132096K)->224K(132096K) Eden: 1973K(131584K)->0K(131584K) From: 0K(512K)->224K(512K)
[648.762s][info][gc,heap ] GC(6214) ParOldGen: 11689K(349696K)->11689K(349696K)
[648.762s][info][gc,metaspace ] GC(6214) Metaspace: 7691K(8192K)->7691K(8192K) NonClass: 6905K(7168K)->6905K(7168K) Class: 786K(1024K)->786K(1024K)
[648.762s][info][gc ] GC(6214) Pause Young (GCLocker Initiated GC) 13M->11M(470M) 0.310ms
[648.762s][info][gc,cpu ] GC(6214) User=0.00s Sys=0.00s Real=0.00s
[648.762s][info][codecache ] >>> should start GC (_unloading_threshold_gc_requested = 1)
** removed 278 occurrences of the same log **
[672.714s][info][codecache ] >>> should start GC (_unloading_threshold_gc_requested = 1)
[672.714s][info][codecache ] Code cache is full - disabling compilation
[672.714s][warning][codecache ] CodeCache is full. Compiler has been disabled.
[672.714s][warning][codecache ] Try increasing the code cache size using -XX:ReservedCodeCacheSize=
OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
CodeCache: size=2496Kb used=2479Kb max_used=2479Kb free=16Kb
bounds [0x00007923ed490000, 0x00007923ed700000, 0x00007923ed700000]
total_blobs=1127 nmethods=640 adapters=399
compilation: disabled (not enough contiguous free space left)
stopped_count=1, restarted_count=0
full_count=1
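The lifecycle that gets stuck in these logs (flag set on request, request discarded by the running GC, flag never reset) can be modeled in a few lines of plain Java. This is a hypothetical simplification for illustration only, not the actual HotSpot code; the field name just mirrors `_unloading_threshold_gc_requested`:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal model of the one-shot GC request flag described in the PR.
class CodeCacheModel {
    final AtomicBoolean unloadingThresholdGcRequested = new AtomicBoolean(false);

    // Called when free code cache drops below the threshold.
    void gcOnAllocation(boolean heapAcceptsRequest) {
        if (unloadingThresholdGcRequested.compareAndSet(false, true)) {
            // In the VM this is Universe::heap()->collect(...). If a GC is
            // already in progress, the request can be silently discarded:
            if (heapAcceptsRequest) {
                runGcCycle();
            }
            // Discarded request: nobody ever calls onGcMarkingCycleFinish(),
            // so the flag stays true and no further GC is ever requested.
        }
    }

    // Stands in for a GC cycle that completes normally.
    void runGcCycle() {
        onGcMarkingCycleFinish();
    }

    // Called by the GC at the end of a marking cycle; resets the flag.
    void onGcMarkingCycleFinish() {
        unloadingThresholdGcRequested.set(false);
    }
}
```

Once a single request is discarded, every later `gcOnAllocation` call finds the flag already `true` and returns without doing anything, which is exactly the "Code cache is full - disabling compilation" end state above.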
Both JVMs ran for ~20 minutes.
### jconsole (reproducing the bug)

Started to misbehave at ~315.181s
### jconsole (with the fix from the PR)

[13.078s][debug][codecache ] Previous GC request has not been reset after 13.018797s, force auto-reset
[412.985s][debug][codecache ] Previous GC request has not been reset after 23.985252s, force auto-reset
[464.974s][debug][codecache ] Previous GC request has not been reset after 7.970082s, force auto-reset
[524.953s][debug][codecache ] Previous GC request has not been reset after 3.937477s, force auto-reset
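The behavior in these logs (force auto-reset when a previous request was never acknowledged) can be sketched as follows. This is a hypothetical Java model of the idea, not the actual patch; the current time is passed in as a parameter purely to make the logic testable:

```java
// Hypothetical model of a timeout-based auto-reset for a stuck GC request flag.
class GcRequestGuard {
    private boolean gcRequested = false;
    private long requestTimeNanos;
    private final long timeoutNanos;

    GcRequestGuard(long timeoutMillis) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    /** Returns true if the caller should trigger a GC now. */
    synchronized boolean tryToGc(long nowNanos) {
        if (gcRequested && nowNanos - requestTimeNanos > timeoutNanos) {
            // Previous request was never acknowledged by the GC: force auto-reset.
            gcRequested = false;
        }
        if (!gcRequested) {
            gcRequested = true;
            requestTimeNanos = nowNanos;
            return true; // caller triggers the GC
        }
        return false; // a request is already pending
    }

    /** Called by the GC at the end of its cycle; re-enables future requests. */
    synchronized void onGcFinished() {
        gcRequested = false;
    }
}
```

With this shape, a discarded request can only suppress further requests until the timeout elapses, after which scheduling resumes, matching the "force auto-reset" lines above.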
Converted to draft: I would like to change it to ensure we log before calling `Universe::heap()->collect(...)` (the same way as before).
Performing more tests on this (different configurations, different GCs, ...), I noticed a race condition when multiple threads enter the `try_to_gc` method I introduced.
The impact of the race condition was:
- an unwanted auto-reset of the flag
- an invalid "duration since last GC request" log
- an unneeded GC request
It was possible under the following conditions:
- thread1: reads `_unloading_gc_requested_time` (with `elapsed_since_last_gc_request` > 250ms)
- thread2: has `_unloading_gc_requested == false` → it requests a GC
- thread1: has `_unloading_gc_requested == true` → it resets `_unloading_gc_requested` + logs + requests a GC 💥
To avoid that, I propose to simply ensure that we don't have multiple threads performing the checks in `gc_on_allocation`.
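One minimal way to guarantee that only one thread at a time performs the check/reset/request sequence, while other threads skip it instead of blocking, is a single atomic gate. This is a hypothetical sketch of that idea in Java, not the actual patch; all names are illustrative, and the 250 ms threshold is taken from the scenario above:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: serialize the GC-request checks so only one thread
// at a time can read the timestamp, auto-reset the flag, and request a GC.
class SerializedGcCheck {
    private final AtomicBoolean checking = new AtomicBoolean(false);
    private boolean gcRequested = false;
    private long requestTimeNanos;
    private static final long TIMEOUT_NANOS = 250_000_000L; // 250 ms

    /** Returns true if the calling thread should trigger a GC. */
    boolean gcOnAllocation() {
        // Only one thread may perform the checks; others skip them, which is
        // safe because a request is (or is about to be) pending anyway.
        if (!checking.compareAndSet(false, true)) {
            return false;
        }
        try {
            long now = System.nanoTime();
            if (gcRequested && now - requestTimeNanos > TIMEOUT_NANOS) {
                gcRequested = false; // stale request: force auto-reset
            }
            if (!gcRequested) {
                gcRequested = true;
                requestTimeNanos = now;
                return true; // caller triggers the GC
            }
            return false;
        } finally {
            checking.set(false);
        }
    }
}
```

Because the timestamp read, the auto-reset, and the request all happen inside the gate, the thread1/thread2 interleaving described above can no longer reset a freshly-set flag.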
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-2661567656
PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-2661578204
PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-2662439874
PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-2664920007
More information about the hotspot-compiler-dev
mailing list