RFR: 8357445: G1: Time-Based Heap Uncommit During Idle Periods [v4]

Tue Sep 2 10:10:02 UTC 2025

On Thu, 31 Jul 2025 21:13:37 GMT, Monica Beckwith <mbeckwit at openjdk.org> wrote:

>> One reason why I do not see the need for these flags to be manageable is that this heap shrinking is a supportive function if the application is "idle". Main heap sizing should be done by the existing heap sizing/AHS at GC events.
>> 
>> From the CR text:
>>>  Doing heap sizing based on garbage collections has a big disadvantage: if there are no garbage collections (due to no application activity) there is a risk that a large amount of heap is kept committed unnecessarily *for a long time*. 
>> (emphasis mine)
>> 
>> If we are talking about "a long time", there does not seem to be a need for changing it during runtime (or changing it at all). It should not matter if that "long time" is "long time +- small epsilon", so allowing dynamic change of "a long time" to another "long time" seems unnecessary without a very good use case.
>> 
>> Please consider not necessarily the current situation, but with "full" AHS.
>> 
>> Another question is whether you had thoughts about the interaction with [JDK-8213198](https://bugs.openjdk.org/browse/JDK-8213198), as this change seems to be a subset of the other. (That's just curiosity from me, I think this feature is useful as is, and if the other ever materializes, we can always reconsider).
>> 
>> Otoh decreasing the heap by this mechanism will eventually trigger a marking.
>
> @tschatzl Thanks for the detailed feedback! 
> Flag Changes (per your feedback):
> 
> - G1UseTimeBasedHeapSizing is now diagnostic and enabled by default (was experimental/disabled).
> - G1MinRegionsToUncommit is now diagnostic (was experimental).
> - Timing flags (G1UncommitDelayMillis, G1TimeBasedEvaluationIntervalMillis) remain manageable to support operational use cases.
> 
> 
> ## Manageable Flag Use Cases
> I can see 3 scenarios where runtime adjustment is valuable:
> 
> **High-Availability Services**
> - 24/7 operations cannot restart for tuning adjustments  
> - Memory pressure events require immediate response
> - Cost optimization demands dynamic resource adaptation
> 
> **Cloud & Container Platforms**  
> - Resource limits change dynamically (auto-scaling)
> - Multi-tenancy requires per-workload optimization
> - Cost efficiency drives aggressive memory reclaim
> 
> **DevOps & SRE Teams**
> - Incident response needs immediate memory reclaim
> - Performance testing requires runtime comparison of settings  
> - Capacity planning benefits from live tuning experiments
> 
> ## "Long Time" Consideration
> While the feature targets 'long idle periods,' production shows varied patterns where the difference between 5 minutes vs 30 seconds becomes critical - especially in container environments where exceeding memory limits means process termination, not just performance degradation.
> 
> ## JDK-8213198 Interaction  
> After reviewing that issue, I see they address **orthogonal problems**:
> 
> **JDK-8213198**: Active application, young GCs happening, needs mixed GCs for string table cleanup
> **JDK-8357445**: Idle application, no GCs happening, needs memory uncommit
> 
> **Operational States:**
> - String table issue: Active allocation + insufficient mixed GCs  
> - Time-based uncommit: Complete inactivity + no allocations
> 
> **Complementary Solutions:**
> - JDK-8213198: Triggers concurrent cycles when string table grows
> - JDK-8357445: Uncommits memory during idle periods
> - Future full AHS: Would orchestrate both mechanisms
> 
> The manageable flags become critical in the idle scenario where container memory limits create immediate pressure, unlike the string table scenario where growth can be tolerated for longer periods.
> 
> Would this clarification help with the flag classification decision?

> I can see 3 scenarios where runtime adjustment is valuable:
>
>High-Availability Services
>
>    24/7 operations cannot restart for tuning adjustments
>    Cost optimization demands dynamic resource adaptation
>    Memory pressure events require immediate response

Observing and reacting to outside memory pressure events is the purpose of the AHS JEP implementation PR. 

We should not introduce duplicate functionality and additional shared responsibilities (i.e. between end user and the AHS system) if possible.

The end user can always change reactivity via the `*GCIntensity` flags and/or `GCTimeRatio`/`SoftMaxHeapSize`. I do not see the point of adding another layer of complexity here.

Let's first determine that the existing measures are insufficient before adding new knobs that are very hard to remove. It's easier to make things manageable later.
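
To make the "existing knobs" point concrete, here is a minimal sketch that asks a running VM whether a given flag exists and whether it can be changed at runtime. It uses the standard `HotSpotDiagnosticMXBean`; the class name `FlagProbe` and the helper `describe` are made up for illustration, and which flags turn out to be writeable depends on the JDK build being probed.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the PR: reports value and writeability of a VM flag.
final class FlagProbe {
    static String describe(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            // Assumption: a non-HotSpot VM may not expose this MXBean.
            return name + " not available in this VM";
        }
        try {
            VMOption opt = bean.getVMOption(name);
            // isWriteable() is true only for flags the VM allows changing at runtime.
            return name + "=" + opt.getValue() + " writeable=" + opt.isWriteable();
        } catch (IllegalArgumentException e) {
            // getVMOption throws this when the flag does not exist in this VM.
            return name + " not available in this VM";
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"GCTimeRatio", "SoftMaxHeapSize"}) {
            System.out.println(describe(flag));
        }
    }
}
```

A flag reported as writeable could then be changed with `setVMOption` on the same bean; whether G1 acts on the new value is a separate question that this sketch does not answer.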

>
>Cloud & Container Platforms
>
>    Resource limits change dynamically (auto-scaling)

That's covered by the AHS change.

>    Multi-tenancy requires per-workload optimization
>    Cost efficiency drives aggressive memory reclaim

The AHS change/existing flags should cover that. Use different GCIntensity/GCTimeRatio/SoftMaxHeapSize.

>
>DevOps & SRE Teams
>
>    Incident response needs immediate memory reclaim

Use `jmap` and trigger a (concurrent) full GC. That's much more immediate and responsive. The functionality in this change is not even guaranteed to free anything ever.
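
As an illustration only, the same idea can be sketched from inside the process rather than with an external tool: request a collection and compare the committed heap size before and after. This swaps in `MemoryMXBean.gc()` (equivalent to `System.gc()`) for the external trigger; `ExplicitGcProbe` and `committedAroundGc` are hypothetical names. Whether the request runs as a full or a concurrent collection, and whether any memory is uncommitted afterwards, depends on the collector and its flags.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper, not part of the PR: observes committed heap around an explicit GC.
final class ExplicitGcProbe {
    // Returns {committed bytes before the GC request, committed bytes after it}.
    static long[] committedAroundGc() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long before = memory.getHeapMemoryUsage().getCommitted();
        memory.gc(); // same effect as System.gc(); the VM may ignore it if explicit GC is disabled
        long after = memory.getHeapMemoryUsage().getCommitted();
        return new long[] {before, after};
    }

    public static void main(String[] args) {
        long[] committed = committedAroundGc();
        System.out.println("committed before: " + committed[0] + " bytes");
        System.out.println("committed after:  " + committed[1] + " bytes");
    }
}
```

From outside the process, `jcmd <pid> GC.run` requests the same kind of collection.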

>    Performance testing requires runtime comparison of settings
>    Capacity planning benefits from live tuning experiments

If one only considers this change in isolation, this is true. However, the plan is to make this work in conjunction with "full" AHS, which has its own knobs.

>
>"Long Time" Consideration
>
>While the feature targets 'long idle periods,' production shows varied patterns where the difference between 5 minutes vs 30 seconds becomes critical - especially in container environments where exceeding memory limits means process termination, not just performance degradation.

AHS will continuously monitor free memory and container limits and adjust GC aggressiveness. One of its measures is uncommitting free regions.

>JDK-8213198 Interaction
>
>After reviewing that issue, I see they address orthogonal problems:
>
>JDK-8213198: Active application, young GCs happening, needs mixed GCs for string table cleanup
>JDK-8357445: Idle application, no GCs happening, needs memory uncommit
>
>Operational States:
>
>    String table issue: Active allocation + insufficient mixed GCs
>    Time-based uncommit: Complete inactivity + no allocations

They may not be completely related, but the string table issue (and non-Java-heap resources in general) requires some form of garbage collection. Any garbage collection now triggers re-evaluation of heap sizes. So they are equal in impact (they adjust the heap) - JDK-8213198 subsumes this one in practice.

Actually, you will get better results with JDK-8213198 because it can easily happen that the application is idle with very few free regions left. In this case this change will do nothing.

Also, for the pure idle case, there is the existing periodic GC functionality.

>Complementary Solutions:
>
>    JDK-8213198: Triggers concurrent cycles when string table grows

Yeah, I think that CR is defined too tightly for this use case. Together with [JDK-8317755](https://bugs.openjdk.org/browse/JDK-8317755) "G1: Periodic GC interval should test for the last whole heap GC", they show that there is a general problem with resource management on idle or close-to-idle machines.

If one takes that second issue into account, it becomes clear that JDK-8213198 should be about not leaving collectable memory in general hanging around, not just the string table (and clearing the string table requires a GC, which resizes the heap; if the machine is idle, it will most likely shrink it).

Actually, the more I think about this change, the more attractive it becomes to me to implement JDK-8317755/JDK-8213198 as a more well-rounded solution: in terms of cost, we are talking about an extra GC every "long idle time", with presumably much less code complexity (piggybacking on existing periodic GCs) and higher gains in footprint.

While this change is interesting because it is a very low-cost way of freeing unused memory, it is insufficient.

>    JDK-8357445: Uncommits memory during idle periods
>    Future full AHS: Would orchestrate both mechanisms
>
>The manageable flags become critical in the idle scenario where container memory limits create immediate pressure, unlike the string table scenario where growth can be tolerated for longer periods.

* This change does not guarantee any shrinking.
* AHS will continuously monitor free memory; if that changes by any means (additional allocation, change of container limits), it needs to and will react regardless of idle state or not. Freeing free regions like this is one option it has.

>
>Would this clarification help with the flag classification decision?

The problem is that future full AHS is right around the corner according to the planning (either this release or the next), so I would not like to introduce lots of legacy functionality that takes multiple releases to remove.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26240#discussion_r2315496095