RFR: 8317755: G1: Periodic GC interval should test for the last whole heap GC

Kirk Pepperdine kirk at kodewerk.com
Fri Oct 20 16:01:56 UTC 2023


Don’t we already have a GCInterval with a default value of Long.MAX_VALUE?


When I hear the word idle I immediately start thinking CPU idle. In this case, however, I quickly shifted to memory idle, which I think translates nicely into how idle the allocators are. Thus basing heap sizing ergonomics on allocation rates seems like a reasonable metric until you consider the edge cases. The most significant “edge case” IME is when GC overheads start exceeding 20%. In those cases GC will throttle allocations, and that in turn causes ergonomics to reduce heap sizes instead of increasing them (to reduce GC overhead). My conclusion from this is that ergonomics should consider both allocation rates and GC overhead when deciding how to resize the heap at the end of a collection.

Fortunately, there is a steady stream of GC events that create convenient points in time to make an ergonomic adjustment. Not having allocations, and as a result not having the collector run, implies one has to manufacture a convenient point in time to make an ergonomic sizing decision. Unfortunately, time-based triggers are speculative, and the history of speculatively triggered GC cycles has been less than wonderful (think DGC as but one case). My last consulting engagement prior to joining MS involved me tuning an application where the OEM’s recommended configuration (set out of the box) was to run a full collection every two minutes. As one can imagine, the results were devastating. It took 3 days of meetings with various stakeholders and managers to get permission to turn that setting off.
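Coming back to the sizing point: here is a toy sketch in plain Java of what I mean by weighing both signals. All names and thresholds are invented for illustration; this is not the actual HotSpot ergonomics code.

// Toy sketch only: made-up names and thresholds, not HotSpot code.
final class ResizePolicySketch {

    // Decide a resize direction at the end of a collection.
    // A pure allocation-rate policy would shrink whenever the rate drops,
    // even when the drop is only caused by GC throttling the allocators.
    static int resizeDirection(double allocRateMBps, double gcOverhead) {
        final double LOW_ALLOC_MBPS = 1.0;  // invented threshold
        final double HIGH_OVERHEAD  = 0.20; // the ~20% figure mentioned above

        if (gcOverhead >= HIGH_OVERHEAD) {
            return +1; // grow: GC is already dominating, shrinking only makes it worse
        }
        if (allocRateMBps < LOW_ALLOC_MBPS) {
            return -1; // shrink: genuinely memory-idle
        }
        return 0;      // leave the heap alone
    }

    public static void main(String[] args) {
        // Allocation rate looks "idle", but only because GC overhead sits at 25%.
        System.out.println(resizeDirection(0.5, 0.25)); // prints 1 (grow), not -1
    }
}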

If the application is truly idle then it’s a no harm, no foul situation. However, there are entire classes of applications that are light allocators, and the question is: what would be the impact of speculative collections on those applications?

As for returning memory, there are two issues. First, there appears to be no definition of “unused memory”. Secondly, what I can say after looking at thousands of GC logs is that the amount of floating garbage that G1 leaves behind, even after several concurrent cycles, is not insignificant. I also wrote a G1 heap fragmentation viewer, and what it revealed is that the heap remains highly fragmented and scattered after each GC cycle. All this suggests that the heap will need to be compacted with a full collection in order to return a large enough block of memory to make the entire effort worthwhile. Again, if the application is idle, then no harm, no foul. However, for those applications that are memory-idle but not CPU-idle, this might not be a great course of action.

In my mind, any trigger for a speculative collection would need to take into consideration allocation rates, GC overhead, and mutator busyness (for cases when GC and allocator activity is low to zero).
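Something like the following, again only a sketch with invented inputs and thresholds, not a proposal for actual flag names or code:

// Toy sketch only: invented inputs and thresholds.
final class PeriodicGCTriggerSketch {

    // Would a speculative whole-heap collection be justified right now?
    static boolean shouldTriggerWholeHeapGC(long millisSinceLastWholeHeapGC,
                                            long intervalMillis,
                                            double allocRateMBps,
                                            double gcOverhead,
                                            double mutatorCpuLoad) {
        if (millisSinceLastWholeHeapGC < intervalMillis) {
            return false;             // interval has not elapsed yet
        }
        if (gcOverhead > 0.05) {
            return false;             // GC is not idle; let the normal cycle handle it
        }
        if (allocRateMBps > 1.0) {
            return false;             // allocators are not idle
        }
        // Memory-idle, but possibly not CPU-idle: only fire when the mutators are quiet too.
        return mutatorCpuLoad < 0.30;
    }

    public static void main(String[] args) {
        // Memory-idle application that is still CPU-busy: do not trigger.
        System.out.println(shouldTriggerWholeHeapGC(600_000, 300_000, 0.1, 0.01, 0.95));
    }
}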

Kind regards,
Kirk


> On Oct 19, 2023, at 10:43 AM, Aleksey Shipilev <shade at openjdk.org> wrote:
> 
> On Wed, 18 Oct 2023 14:24:44 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> 
>>> See the description in the bug. Fortunately, we already track the last whole-heap GC. The new regression test verifies the behavior.
>>> 
>>> Additional testing:
>>> - [x] Linux x86_64 fastdebug `tier1 tier2 tier3`
>> 
>> Thanks for looking at it!
>> 
>> Re-reading JEP 346: “Promptly Return Unused Committed Memory from G1”...
>> 
>> In the scenario we are seeing, we do have lots of unused committed memory that would not be reclaimed promptly until a concurrent cycle executes. The need for that cleanup is in the worst case driven by heap connectivity changes that do not readily reflect in other observable heap metrics. A poster example would be a cache sitting perfectly in "oldgen", eventually dropping the entries, producing a huge garbage patch. We would only discover this after whole heap marking. In some cases, tuning based on occupancy (e.g. soft max heap size, heap reserve, etc.) would help if we promote enough stuff to actually trigger the concurrent cycle. But if we keep churning very efficient young collections, we would get there very slowly.
>> 
>> Therefore, I’d argue the current behavior is against _the spirit_ of JEP 346, even though _the letter_ says that we track “any” GC for periodic GC. There is no explicit mention of why young GCs should actually be treated as recent GCs, even though — with the benefit of hindsight — they throw away promptness guarantees. Aside: Shenandoah periodic GC does not make any claims it would run only in idle phases, albeit the story is simpler without young GCs. But Generational Shenandoah would follow the same route: the periodic whole heap GC would start periodically regardless of whether young collections run in between.
>> 
>> The current use of “idle” is also awkward in JEP 346. If the user enables `G1PeriodicGCInterval` without doing anything else, they would get a GC even when the application is churning at 100% CPU, but without any recent GC in sight. I guess we can think about “idle” as “GC is idle”, but then arguably not figuring out the whole heap situation _for hours_ can be described as “GC is idle”. I think the much larger point of the JEP is to reclaim memory promptly, which in turn requires whole heap GC. Looks like the JEP somewhat painted itself into a corner by considering all GCs, including young.
>> 
>> I doubt that users would mind if we change the behavior of `G1PeriodicGCInterval` like this: the option is explicitly opt-in, the configurations I see in prod are running with huge intervals, etc. So we are asking for a relatively rare concurrent GC even when the application is doing young GCs. But I agree that departing from the current behavior might still have undesired consequences, for which we need to plan an escape route. There is also a need to ...
> 
>> @shipilev : I have not made up my mind about the other parts of your proposal, but:
>> 
>>> The current use of “idle” is also awkward in JEP 346. If user enables G1PeriodicGCInterval without doing anything else, they would get a GC even when application is churning at 100% CPU, but without any recent GC in sight.
>> 
>> This is the reason for the `G1PeriodicGCSystemLoadThreshold` option and is handled by the feature/JEP.
> 
> Yes, that is why I said "without doing anything else". With that example, I wanted to point out that the definition of "idle" is already quite murky even with the current JEP, where we have an additional option to tell if "idle" includes the actual system load. In this view, having another option that would tell if "idle" includes young GCs fits well, I think.
> 
> -------------
> 
> PR Comment: https://git.openjdk.org/jdk/pull/16107#issuecomment-1771405208
