RFR: 8359348: G1: Improve cpu usage measurements for heap sizing [v3]
Thomas Schatzl
tschatzl at openjdk.org
Mon Jul 28 11:29:59 UTC 2025
On Thu, 24 Jul 2025 12:05:31 GMT, Ivan Walulya <iwalulya at openjdk.org> wrote:
>> Another issue: since both `record_pause_start()` and `record_pause()` are called for all pauses, it seems that this implementation only considers the concurrent CPU usage between Remark/Cleanup and Young pauses, as G1 currently only resizes at Young collections.
>>
>> The bulk of concurrent work is from Young -> Remark and Remark -> Cleanup though. If e.g. the whole concurrent cycle occurs between two Young GCs, that concurrent cycle seems to be lost in the calculation (or at least parts of the concurrent CPU usage)?
>>
>> Or am I missing something?
>
> The assumption here is that `update_gc_pause_time_ratios` is called at the end of all GCs. The only issue I notice is that we don't call it in case of allocation_failures, which is problematic. But probably I don't understand the issue you are highlighting.
Actually, I think there are two issues; let me copy from `G1Analytics::compute_pause_time_ratios`:
void G1Analytics::compute_pause_time_ratios(double end_time_sec, double pause_time_ms) {
  [...]
  double short_interval_ms = (end_time_sec - most_recent_gc_end_time_sec()) * 1000.0;
  // This estimates the wall-clock time "lost" by application mutator threads due to concurrent GC
  // activity. We do not account for contention on other shared resources such as memory bandwidth and
  // caches, therefore underestimate the impact of the concurrent GC activity on mutator threads.
  uint num_cpus = (uint)os::active_processor_count();
  num_cpus = MIN2(num_cpus, ConcGCThreads);
  double concurrent_gc_impact_time = _concurrent_gc_cpu_time_ms / num_cpus;
  _short_term_pause_time_ratio = (pause_time_ms + concurrent_gc_impact_time) / short_interval_ms;
  _short_term_pause_time_ratio = clamp(_short_term_pause_time_ratio, 0.0, 1.0);
}
* the `_short_term_pause_time_ratio` only takes the most recent `_concurrent_gc_cpu_time_ms` into account in this calculation. Between two Young GCs there may have been a complete concurrent cycle, which means that `concurrent_gc_impact_time` only contains the concurrent work from the end of the Cleanup pause to the current Young GC. It does not seem to contain the concurrent work from the previous Young GC to the Cleanup pause, as `G1Policy::record_pause_start_time` _overwrites_ the last `G1Analytics::_concurrent_gc_cpu_time_ms`.
Just to give an idea of how much concurrent time is missing here, running BRT (obviously chosen to highlight the issue) and printing the measured concurrent CPU time in `G1Policy::record_pause_start_time`:
[45.341s][info ][gc,start ] GC(11) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[45.341s][debug][gc ] GC(11) conc cpu time: 8319.61 // This one is not really relevant here.
[...]
[48.821s][info ][gc,start ] GC(12) Pause Remark
[48.821s][debug][gc ] GC(12) conc cpu time: 11763.96
[...]
[50.146s][info ][gc,start ] GC(12) Pause Cleanup
[50.146s][debug][gc ] GC(12) conc cpu time: 5830.75
[...]
[54.987s][info ][gc,start ] GC(13) Pause Young (Prepare Mixed) (G1 Evacuation Pause)
[54.987s][debug][gc ] GC(13) conc cpu time: 2780.92
So in that last Young GC pause, only around 14% of the actual concurrent CPU time between the most recent two Young GCs is used to calculate the short term pause time ratio afaict.
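If I read the measurements above correctly, the concurrent work between the two Young GCs is the sum of the three values printed at Remark, Cleanup and GC(13), so only this fraction is used:

  2780.92 / (11763.96 + 5830.75 + 2780.92) = 2780.92 / 20375.63 ≈ 0.14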
Looking a bit into the code, I think this is somewhat of a pre-existing issue. We always calculate the short term pause time ratio based on the last GC pause regardless of type, i.e. in this case from Cleanup to the next Young. I am kind of doubting that this is correct right now.
Maybe this should be improved before this change?
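Just to illustrate the direction I have in mind (a sketch only, not actual HotSpot code; the method names are made up): accumulate the concurrent CPU time across pauses and only consume it when the ratios are recomputed, so the Young -> Remark and Remark -> Cleanup portions are not dropped.

class G1Analytics /* illustrative fragment only */ {
  double _concurrent_gc_cpu_time_ms = 0.0; // concurrent GC CPU time accumulated since last use

public:
  // Hypothetical: called from record_pause_start_time(); add instead of overwrite,
  // so earlier concurrent work in the same mutator interval is kept.
  void add_concurrent_gc_cpu_time_ms(double ms) {
    _concurrent_gc_cpu_time_ms += ms;
  }

  // Hypothetical: called from compute_pause_time_ratios(); take everything accumulated
  // since the ratios were last computed, then start over.
  double take_concurrent_gc_cpu_time_ms() {
    double result = _concurrent_gc_cpu_time_ms;
    _concurrent_gc_cpu_time_ms = 0.0;
    return result;
  }
};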
One additional issue:
* the `_long_term_pause_time_ratio` does not take the time spent on concurrent work into account afaict. Maybe it's hard to do, but there should be an explanation why.
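For illustration, something mirroring the short term calculation above could look like this (again a sketch only; all names here are placeholders, not actual G1Analytics members, and whether dividing by the same num_cpus is appropriate over a long interval is part of what would need explaining):

static double long_term_ratio_with_concurrent_impact(double total_pause_time_ms,
                                                     double concurrent_gc_cpu_time_ms,
                                                     double long_interval_ms,
                                                     uint num_cpus) {
  // Same idea as in compute_pause_time_ratios(): approximate the mutator wall-clock
  // time lost to concurrent GC activity by spreading the CPU time over the CPUs used.
  double concurrent_gc_impact_time = concurrent_gc_cpu_time_ms / num_cpus;
  double ratio = (total_pause_time_ms + concurrent_gc_impact_time) / long_interval_ms;
  return clamp(ratio, 0.0, 1.0);
}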
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26351#discussion_r2235731832