RFR: 8327000: GenShen: Integrate updated Shenandoah implementation of FreeSet into GenShen [v8]
Kelvin Nilsen
kdnilsen at openjdk.org
Tue Jun 25 15:31:35 UTC 2024
On Wed, 19 Jun 2024 20:56:04 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:
>> Hmm, interesting, it looks like this does indeed increment only at a full gc, or at least that's the intent. Should check that this is also respected in the generational case. Vide:
>>
>>
>> product(uintx, ShenandoahNoProgressThreshold, 5, EXPERIMENTAL, \
>> "After this number of consecutive Full GCs fail to make " \
>> "progress, Shenandoah will raise out of memory errors. Note " \
>> "that progress is determined by ShenandoahCriticalFreeThreshold") \
>> \
>
> I'm testing now a configuration that honors the intent of this comment, to only increment gc_no_progress count if "consecutive Full GCs" fail to make progress. This is not how it was implemented before, because our implementation has been also incrementing gc_no_progress count if degenerated fails to make progress, and we have not been resetting the count when we experience a productive concurrent GC. My interpretation of the intent is that a concurrent GC happening between two unproductive Full GCs does not count as two "consecutive" unproductive full GCs.
>
> If this change behaves well, we may be able to remove the ShenandoahNoProgressThreshold override on TestThreadFailure#generational.
In general, this new configuration works well and passes all GHA and all CI/CD pipeline tests. However, I tried removing the ShenandoahNoProgressThreshold override from this test, and it fails. In one execution that I carefully analyzed, the behavior is:
1. For NastyThread-0 through NastyThread-12, we perform a FullGC (which has good progress) but the good progress is not enough to satisfy the failed allocation request so we throw OOM.
2. With NastyThread-12, we do not fail fast. GC(127) is concurrent young. GC(128) through GC(132) are Full GCs, each with Bad Progress, but each yielding enough free memory to satisfy at least one additional allocation by NastyThread-12.
3. GC(133) is a full GC also with bad progress. This time, the bad progress is not enough to satisfy the pending alloc request (for 4112 bytes), so we throw OOM.
4. At this point, we have experienced 5 (the default value of ShenandoahNoProgressThreshold) consecutive full GCs with no progress, so when the main thread attempts to allocate NastyThread-13 after joining with NastyThread-12, it does not even bother to attempt a Full GC. It just immediately throws OOM.
5. This causes the test to fail, because main is not "supposed" to experience OOM.
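The fail-fast behavior in steps 4 and 5 can be modeled roughly as follows. This is a minimal sketch, not the HotSpot code: NoProgressTracker, GcKind, record() and should_fail_fast() are hypothetical names, and both the reset rule (any productive cycle clears the count) and the `>=` comparison against the threshold are assumptions made for illustration.

```cpp
#include <cstddef>

// Hypothetical model of the "consecutive unproductive Full GCs" counter.
// None of these names come from the Shenandoah sources.
enum class GcKind { Concurrent, Degenerated, Full };

class NoProgressTracker {
public:
  explicit NoProgressTracker(std::size_t threshold) : _threshold(threshold) {}

  // Record the outcome of one GC cycle.
  void record(GcKind kind, bool good_progress) {
    if (good_progress) {
      // Assumption: any productive cycle breaks the streak.
      _count = 0;
    } else if (kind == GcKind::Full) {
      // Only an unproductive Full GC extends the streak.
      _count++;
    }
    // Assumption: an unproductive concurrent or degenerated cycle
    // leaves the count unchanged.
  }

  // Once the streak reaches the threshold, an allocation failure is
  // reported as OOM without attempting another Full GC.
  bool should_fail_fast() const { return _count >= _threshold; }

  std::size_t count() const { return _count; }

private:
  std::size_t _threshold;
  std::size_t _count = 0;
};
```

With a threshold of 5, five calls to record(GcKind::Full, false) in a row leave should_fail_fast() returning true. Nothing in this model clears the count when one thread finishes and the next starts, which is how a streak built up by one NastyThread can carry over to the main thread.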
Another complication is that the failure doesn't always happen with NastyThread-13. Sometimes it happens with NastyThread-5. GC degradation is not cumulative. Each NastyThread is supposed to start with a clean slate (after a Full GC reclaims all previously allocated memory).
And finally, I have observed that this test will still occasionally fail even with the ShenandoahNoProgressThreshold=24 override.
So I'm puzzling a bit over why GenShen inherently needs a larger value of ShenandoahNoProgressThreshold than traditional Shenandoah in order to pass this test. I think the explanation is that GenShen introduces more "heap fragmentation" between OLD and YOUNG generations. Full GC can help sift out these fragments of memory so that "smaller" allocation requests can still succeed, even though Full GC reports "bad progress".
My current thought is to apply one more tweak to the GenShen behavior: If a pending allocation succeeds following a Full GC, I am inclined to count this as "good progress", regardless of what the other metrics think about progress. That we were able to allocate following Full GC and were not able to allocate before Full GC is the ultimate measure of "good progress". Will experiment with this.
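A minimal sketch of the proposed tweak, under stated assumptions: FullGcOutcome, counts_as_good_progress() and update_no_progress_count() are hypothetical names for illustration only, and the existing progress metrics are reduced to a single boolean.

```cpp
#include <cstddef>

// Hypothetical summary of one Full GC, not a HotSpot type.
struct FullGcOutcome {
  bool metrics_good_progress;    // what the existing progress metrics report
  bool pending_alloc_succeeded;  // did the allocation that triggered the GC succeed afterwards?
};

// Proposed rule: a successful pending allocation overrides the metrics.
inline bool counts_as_good_progress(const FullGcOutcome& outcome) {
  return outcome.pending_alloc_succeeded || outcome.metrics_good_progress;
}

// Returns the new no-progress count after one Full GC.
inline std::size_t update_no_progress_count(std::size_t count,
                                            const FullGcOutcome& outcome) {
  return counts_as_good_progress(outcome) ? 0 : count + 1;
}
```

Under this model, a run of Full GCs that each report bad progress but still satisfy the pending allocation keeps the count at 0. A following Full GC that fails both tests raises it only to 1, well below the default threshold of 5.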
-------------
PR Review Comment: https://git.openjdk.org/shenandoah/pull/440#discussion_r1653045650
More information about the shenandoah-dev mailing list