G1 unexpectedly clamping eden length for pause time
Thomas Schatzl
thomas.schatzl at oracle.com
Wed Feb 7 09:25:56 UTC 2024
Hi Danny,
On 07.02.24 03:40, Danny Thomas wrote:
> Hi folks,
>
> I'm looking into a report of unexpected reductions in the number of G1
> eden regions on our compute platform, where we use CFS on our shared
> compute tier.
>
> Reading through G1Policy it occurred to me that the
> measured cost_per_byte_ms is critical, and that if the measured workers
> were throttled, it would cause this weighted average to be artificially
> inflated, blowing out pause time estimates and clamping the eden length.
>
> Does my reading of the code and the log snippet below support that
> hypothesis, or can a trained eye spot something I'm missing?
indeed, cost_per_byte_ms has a significant impact on G1's estimate of
the cost of evacuating eden: if the copy cost goes up, fewer eden
regions will be taken next time (it is a weighted average). The copy
cost is multiplied by the survival rates. There are also other costs
related to remembered sets, but I am not sure they are very relevant
here.
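To make the mechanism concrete, here is a minimal C++ sketch of the
idea, not the actual G1Policy code: the names, the 0.7 decay weight and
the region loop are illustrative assumptions only.

#include <cstddef>

// Decaying average: newer samples dominate, so one slow (e.g. CFS-
// throttled) collection can inflate the prediction for the next few GCs.
struct DecayingAverage {
  double _value = 0.0;
  bool _seeded = false;
  void add(double sample, double alpha = 0.7) {
    _value = _seeded ? alpha * sample + (1.0 - alpha) * _value : sample;
    _seeded = true;
  }
  double predict() const { return _value; }
};

// Predicted copy time for one eden region: bytes expected to survive
// (region size times survival rate) times the predicted cost per byte.
double predict_region_evac_time_ms(size_t region_bytes,
                                   double survival_rate,
                                   const DecayingAverage& cost_per_byte_ms) {
  return (double)region_bytes * survival_rate * cost_per_byte_ms.predict();
}

// Grow eden while the predicted pause (base time plus per-region copy
// time) still fits in the pause time goal. A higher cost_per_byte_ms or
// a higher survival rate therefore directly clamps the eden length.
size_t max_eden_regions(double base_time_ms,
                        double pause_goal_ms,
                        double region_evac_time_ms,
                        size_t hard_cap) {
  size_t regions = 0;
  double predicted = base_time_ms;
  while (regions < hard_cap &&
         predicted + region_evac_time_ms <= pause_goal_ms) {
    predicted += region_evac_time_ms;
    regions++;
  }
  return regions;
}

With inputs like these, a roughly 4x jump in the predicted per-region
copy time cuts the number of eden regions that fit under the same pause
goal to roughly a quarter.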
Since there is only one GC in that log it is a bit hard to diagnose
what's happening, but in that particular collection the number of
survivor regions went up compared to the previous one (from 11 to 36),
which indicates that survival rates from eden changed significantly
(without more context I can't say whether this is a one-off or not).
More survivors cause the base time to go up. The
[93958.042s][1152][trace][gc,ergo,heap ] GC(2837) Predicted base time:
total 164.552708 lb_cards 86379 rs_length 50370 effective_scanned_cards
88850 card_merge_time 0.930841 card_scan_time 18.267386
constant_other_time 6.197170 survivor_evac_time 139.157311
line shows that the "predicted base time" increased a lot from the
start of the GC, mostly due to "survivor_evac_time", i.e. the time to
evacuate survivors (32ms -> 139ms; the number of survivor regions also
went from 11 -> 36).
There is a discrepancy between the increase in survivor_evac_time and
the increase in the number of survivor regions (the former grows by
4.3x, the latter by 3.2x), but that does not seem too bad.
This is a transition from the space-reclamation phase (this is the
last mixed GC in that phase) to the young-only phase
("[93958.042s][1152][debug][gc,ergo ] GC(2837) do not continue
mixed GCs (candidate old regions not available)"), so the predictors
for the young-only phase are used, which might explain that a bit.
(Some predictors depend on the phase being predicted for; see
G1Analytics and look for G1PhaseDependentSeq.)
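For illustration, a minimal sketch of the phase-dependent idea, not the
actual G1PhaseDependentSeq implementation; the bounded window and the
plain average are assumptions:

#include <deque>
#include <numeric>

enum class GCPhase { YoungOnly, Mixed };

class PhaseDependentSeq {
  std::deque<double> _young_only;
  std::deque<double> _mixed;

  static double average(const std::deque<double>& s) {
    if (s.empty()) return 0.0;
    return std::accumulate(s.begin(), s.end(), 0.0) / s.size();
  }

public:
  void add(GCPhase phase, double sample) {
    auto& seq = (phase == GCPhase::YoungOnly) ? _young_only : _mixed;
    seq.push_back(sample);
    if (seq.size() > 10) seq.pop_front();  // keep a bounded window
  }

  // Predictions are taken from the samples matching the phase being
  // predicted for, not from whatever ran most recently.
  double predict(GCPhase phase) const {
    const auto& seq = (phase == GCPhase::YoungOnly) ? _young_only : _mixed;
    return average(seq);
  }
};

Right after a transition like the one above, the sequence for the new
phase may be based on samples from many collections ago, so its
predictions can differ noticeably from the recent mixed-phase behavior.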
So overall, given this information, I think the issue is more likely
an unexpected number of survivors during that collection than
throttling (which may also be in play here, but one would need to
somehow correlate that with CFS scheduling decisions).
One could drill down into the changes between this collection and the
previous one by analyzing the collection a bit more (with
gc+phases=debug), looking at the "Copied Bytes" item in the "Merge
Per-Thread State" section and the sub-phases of "Evacuate Collection
Set".
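For example, a logging configuration along these lines should surface
both the per-phase timings and the ergonomics output seen above (the
exact tag and decorator selection is only a suggestion, adjust to
taste):

-Xlog:gc*=info,gc+phases=debug,gc+ergo*=trace:file=gc.log:uptime,tid,level,tags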
Hth,
Thomas