RFR: 8359348: G1: Improve cpu usage measurements for heap sizing [v3]
Ivan Walulya
iwalulya at openjdk.org
Tue Aug 19 12:57:40 UTC 2025
On Mon, 28 Jul 2025 10:17:46 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:
>> The assumption here is that `update_gc_pause_time_ratios` is called at the end of all GCs. The only issue I notice is that we do not call it in the case of allocation failures, which is problematic. But perhaps I do not understand the issue you are highlighting.
>
> Actually I think there are two issues, let me copy from `G1Analytics::compute_pause_time_ratios`:
>
>
> void G1Analytics::compute_pause_time_ratios(double end_time_sec, double pause_time_ms) {
>   [...]
>   double short_interval_ms = (end_time_sec - most_recent_gc_end_time_sec()) * 1000.0;
>
>   // This estimates the wall-clock time "lost" by application mutator threads due to concurrent GC
>   // activity. We do not account for contention on other shared resources such as memory bandwidth and
>   // caches, therefore underestimate the impact of the concurrent GC activity on mutator threads.
>   uint num_cpus = (uint)os::active_processor_count();
>   num_cpus = MIN2(num_cpus, ConcGCThreads);
>   double concurrent_gc_impact_time = _concurrent_gc_cpu_time_ms / num_cpus;
>
>   _short_term_pause_time_ratio = (pause_time_ms + concurrent_gc_impact_time) / short_interval_ms;
>   _short_term_pause_time_ratio = clamp(_short_term_pause_time_ratio, 0.0, 1.0);
> }
>
> * the `_short_term_pause_time_ratio` only takes the most recent `_concurrent_gc_cpu_time_ms` into account in this calculation. Between two Young GCs there may have been a complete concurrent cycle, which means that `concurrent_gc_impact_time` only covers the concurrent work from the end of the Cleanup pause to the current Young GC. It does not seem to contain the concurrent work from the previous Young GC to the Cleanup pause, as `G1Policy::record_pause_start_time` _overwrites_ the last `G1Analytics::_concurrent_gc_cpu_time_ms`.
>
> Just giving an idea how much concurrent time is missing here, running BRT (obviously this is to highlight the issue), printing the measured concurrent CPU time in `G1Policy::record_pause_start_time`:
>
> [45.341s][info ][gc,start ] GC(11) Pause Young (Concurrent Start) (G1 Evacuation Pause)
> [45.341s][debug][gc ] GC(11) conc cpu time: 8319.61 // This one is not really relevant here.
> [...]
> [48.821s][info ][gc,start ] GC(12) Pause Remark
> [48.821s][debug][gc ] GC(12) conc cpu time: 11763.96
> [...]
> [50.146s][info ][gc,start ] GC(12) Pause Cleanup
> [50.146s][debug][gc ] GC(12) conc cpu time: 5830.75
> [...]
> [54.987s][info ][gc,start ] GC(13) Pause Young (Prepare Mixed) (G1 Evacuation Pause)
> [54.987s][debug][gc ] GC(13) conc cpu time: 2780.92
>
>
> So in that last Young GC pause, only around 14% of the actual concurrent CPU time between the most recent two Young GCs is used to calculate the short term pause time ratio afaict.
>
> Looking a bit into the code, I thi...
Fixed so that `long_term_pause_time_ratio` accumulation takes into account concurrent work
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26351#discussion_r2285167442
More information about the hotspot-gc-dev
mailing list