G1 unexpectedly clamping eden length for pause time
Thomas Schatzl
thomas.schatzl at oracle.com
Wed Feb 7 09:25:56 UTC 2024
Hi Danny,
On 07.02.24 03:40, Danny Thomas wrote:
> Hi folks,
>
> I'm looking into a report of unexpected reductions in the number of G1
> eden regions on our compute platform, where we use CFS on our shared
> compute tier.
>
> Reading through G1Policy it occurred to me that the
> measured cost_per_byte_ms is critical, and that if the measured workers
> were throttled, it would cause this weighted average to be artificially
> inflated, blowing out pause time estimates and clamping the eden length.
>
> Does my reading of the code and the log snippet below support that
> hypothesis, or can a trained eye spot something I'm missing?
indeed, cost_per_byte_ms has a significant impact on G1's estimate of
the cost of evacuating eden: if the copy cost goes up, fewer eden
regions will be taken next time (it is a weighted average). The copy
cost is multiplied by the survival rates. There are also other costs
related to remembered sets, but I am not sure they are very relevant
here.
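To make the mechanism concrete, here is a minimal C++ sketch of the
idea, not the actual G1Policy code: the names, the 0.7 decay weight and
the region loop are illustrative assumptions only.

#include <cstddef>

// Decaying average: newer samples dominate, so one slow (e.g. CFS-
// throttled) collection can inflate the prediction for the next few GCs.
struct DecayingAverage {
  double _value = 0.0;
  bool _seeded = false;
  void add(double sample, double alpha = 0.7) {
    _value = _seeded ? alpha * sample + (1.0 - alpha) * _value : sample;
    _seeded = true;
  }
  double predict() const { return _value; }
};

// Predicted copy time for one eden region: bytes expected to survive
// (region size times survival rate) times the predicted cost per byte.
double predict_region_evac_time_ms(size_t region_bytes,
                                   double survival_rate,
                                   const DecayingAverage& cost_per_byte_ms) {
  return (double)region_bytes * survival_rate * cost_per_byte_ms.predict();
}

// Grow eden while the predicted pause (base time plus per-region copy
// time) still fits in the pause time goal. A higher cost_per_byte_ms or
// a higher survival rate therefore directly clamps the eden length.
size_t max_eden_regions(double base_time_ms,
                        double pause_goal_ms,
                        double region_evac_time_ms,
                        size_t hard_cap) {
  size_t regions = 0;
  double predicted = base_time_ms;
  while (regions < hard_cap &&
         predicted + region_evac_time_ms <= pause_goal_ms) {
    predicted += region_evac_time_ms;
    regions++;
  }
  return regions;
}

With inputs like these, a roughly 4x jump in the predicted per-region
copy time cuts the number of eden regions that fit under the same pause
goal to roughly a quarter.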
Since there is only one GC in that log it is a bit hard to diagnose
what's happening, but in that particular collection the number of
survivor regions went up compared to the previous one (from 11 to 36),
which indicates that survival rates from eden changed significantly
(without more context I can't say whether this is a one-off or not).
More survivors cause the base time to go up. The
[93958.042s][1152][trace][gc,ergo,heap ] GC(2837) Predicted base time:
total 164.552708 lb_cards 86379 rs_length 50370 effective_scanned_cards
88850 card_merge_time 0.930841 card_scan_time 18.267386
constant_other_time 6.197170 survivor_evac_time 139.157311
line shows that the "predicted base time" increased a lot from the
start of the GC, mostly due to "survivor_evac_time", i.e. the time to
evacuate survivors (32ms -> 139ms; the number of survivor regions also
went from 11 -> 36).
There is a discrepancy between the increase in survivor_evac_time and
the increase in the number of survivor regions (the former grows by
4.3x, the latter by 3.2x), but that does not seem too bad.
This is a transition from the space-reclamation phase (this is the
last mixed GC in that phase) to the young-only phase
("[93958.042s][1152][debug][gc,ergo ] GC(2837) do not continue
mixed GCs (candidate old regions not available)"), so the predictors
for the young-only phase are used, which might explain that a bit.
(Some predictors depend on the phase being predicted for; see
G1Analytics and look for G1PhaseDependentSeq.)
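For illustration, a minimal sketch of the phase-dependent idea, not the
actual G1PhaseDependentSeq implementation; the bounded window and the
plain average are assumptions:

#include <deque>
#include <numeric>

enum class GCPhase { YoungOnly, Mixed };

class PhaseDependentSeq {
  std::deque<double> _young_only;
  std::deque<double> _mixed;

  static double average(const std::deque<double>& s) {
    if (s.empty()) return 0.0;
    return std::accumulate(s.begin(), s.end(), 0.0) / s.size();
  }

public:
  void add(GCPhase phase, double sample) {
    auto& seq = (phase == GCPhase::YoungOnly) ? _young_only : _mixed;
    seq.push_back(sample);
    if (seq.size() > 10) seq.pop_front();  // keep a bounded window
  }

  // Predictions are taken from the samples matching the phase being
  // predicted for, not from whatever ran most recently.
  double predict(GCPhase phase) const {
    const auto& seq = (phase == GCPhase::YoungOnly) ? _young_only : _mixed;
    return average(seq);
  }
};

Right after a transition like the one above, the sequence for the new
phase may be based on samples from many collections ago, so its
predictions can differ noticeably from the recent mixed-phase behavior.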
So overall, given this information, I think the issue is more likely
an unexpected number of survivors during that collection than
throttling (which may also be in play here, but one would need to
somehow correlate that with CFS scheduling decisions).
One could drill down into the changes between this collection and the
previous one by analyzing the collection a bit more (with
gc+phases=debug), looking at the "Copied Bytes" item in the "Merge
Per-Thread State" section and the sub-phases of "Evacuate Collection
Set".
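For example, a logging configuration along these lines should surface
both the per-phase timings and the ergonomics output seen above (the
exact tag and decorator selection is only a suggestion, adjust to
taste):

-Xlog:gc*=info,gc+phases=debug,gc+ergo*=trace:file=gc.log:uptime,tid,level,tags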
Hth,
Thomas