Configurable G1 heap expansion aggressiveness
Jaroslaw Odzga
jarek.odzga at gmail.com
Thu Feb 13 20:36:52 UTC 2025
Hi Kirk,
Thanks for the detailed answer, I appreciate your time on this.
Would you mind sharing more details on the work you are referring to
(Serial collector changes, the ongoing G1 work) so I could learn more
about it?
It sounds like the "hibernation" feature you are talking about is
different from Azul's CRaC, which uses CRIU? Can you elaborate?
> That said, a tuning strategy for G1 is more complicated because the costs of transients are quite different in G1 than they are with the Serial/Parallel collectors. But I believe it is achievable using existing flags/structures and the addition of the SoftMaxHeapSize.
Can you share more on how SoftMaxHeapSize fits into this strategy?
Doesn't it require some "controller" that would dynamically adjust
SoftMaxHeapSize at runtime based on signals like GC CPU usage, VM
memory pressure, etc.?
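To make my question concrete, the kind of controller I have in mind
would look roughly like the sketch below. It assumes SoftMaxHeapSize
ends up as a manageable flag for G1 (as it already is for ZGC) so it
can be updated at runtime via HotSpotDiagnosticMXBean.setVMOption; the
10% GC CPU threshold and the 4/6 GiB targets are made-up numbers.

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of an in-process "controller". setVMOption() will throw unless
// SoftMaxHeapSize is actually a writeable/manageable flag in the JVM it
// runs in; thresholds and targets are illustrative only.
public class SoftMaxHeapSizeController {
    private static final long TIGHT = 4L << 30;   // 4 GiB soft goal when GC is cheap
    private static final long RELAXED = 6L << 30; // 6 GiB soft goal when GC is busy

    public static void main(String[] args) {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        long[] last = { totalGcTimeMs(), System.currentTimeMillis() };

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            long gcMs = totalGcTimeMs();
            long now = System.currentTimeMillis();
            // Rough proxy for GC CPU usage: accumulated GC pause time / wall clock.
            double gcCpu = (double) (gcMs - last[0]) / Math.max(1, now - last[1]);
            last[0] = gcMs;
            last[1] = now;
            // Relax the soft goal when GC overhead is high, tighten it when GC is cheap.
            long target = gcCpu > 0.10 ? RELAXED : TIGHT;
            diag.setVMOption("SoftMaxHeapSize", Long.toString(target));
        }, 1, 1, TimeUnit.MINUTES);
    }

    private static long totalGcTimeMs() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .mapToLong(GarbageCollectorMXBean::getCollectionTime).sum();
    }
}

Something along these lines could also live outside the process (e.g.
driving jcmd), but either way it is a separate controller on top of the
GC rather than something the collector does by itself today.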
> If I might add, in large homogeneous deployments, you’d think you’d see a one-size-fits-all optimal GC configuration. Unfortunately, my look into this has shown that there are often multiple optimal configurations. The only way to combat this is with smarter ergonomics in the runtime.
Thanks for the insight. I believe this to be true. My claim is that,
for the majority of applications in certain domains (e.g. backend
services in multi-tenant environments running in the cloud), the
existing default G1 configuration and ergonomics work well only if the
max heap size is correctly sized, because of the greedy heap
expansion.
Sizing heaps is challenging at scale, and the most common outcome is
setting the max heap too high. This leads to a lot of resource waste
that is hard to detect, and hard for many to recognize as waste,
because "the heap is used".
I guess the question boils down to: is it worth exposing two more
internal G1 parameters in the short term as experimental tunables to
allow some high-leverage optimizations?
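(Concretely, the two parameters in question are
-XX:G1ScaleWithHeapPauseTimeThreshold and
-XX:G1MinPausesOverThresholdForGrowth, the same knobs that appear in
the example configuration quoted further down.)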
I think we agree on the cost of doing it (although it might be hard to
quantify): maintenance, potential misuse, additional complexity.
On the benefit side, the initial results suggest we could significantly
increase the bin-packing density of JVMs per VM, because when
bin-packing we have to account for memory usage spikes caused by
temporary, aggressive heap expansions. Rough estimates suggest 30%-60%
smaller memory spikes. At large scale this could lead to big cost
savings for little effort.
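To make that concrete with made-up numbers: if a service needs roughly
4 GB at steady state but an eager expansion can spike it to 8 GB,
bin-packing has to reserve 8 GB per JVM, i.e. 16 JVMs on a 128 GB host.
Cutting the spike in half (a 6 GB peak) lets the same host fit 21 JVMs,
about 30% more, without changing steady-state behavior.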
Given that the additional flags do not change existing default
behavior, and so would be completely transparent unless someone decides
to go down the rabbit hole of tuning experimental GC flags, maybe this
is something worth considering?
Best regards,
Jaroslaw
On Thu, Feb 13, 2025 at 6:35 AM Kirk Pepperdine <kirk at kodewerk.com> wrote:
>
> Hi Jaroslaw,
>
>
>
> On Feb 13, 2025, at 5:24 AM, Jaroslaw Odzga <jarek.odzga at gmail.com> wrote:
>
> Thank you Kirk and Thomas for your answers!
>
> What Kirk describes sounds great, is the right long-term approach, and
> I can't wait for it to be shipped. It also sounds like a feature we
> might need to wait for a while (please correct me if I am wrong).
>
>
> If you look at the ZGC code as a model I believe you’ll find that it’s something that can be achieved by making the appropriate adjustments to the ergonomics. So while the knowledge needed to make the changes is non-trivial, the actual coding effort isn’t something that makes this a “long term approach”.
>
> Our decision to focus on Serial was twofold. First, work on G1 is already taking place and given the progress there we thought it best to focus on the Serial collector. This is because the Serial collector is the default for small deployments, which are fairly common. I personally see AHS as a stepping stone to being able to “hibernate” idle JVMs, something that isn’t really possible at the moment. Being able to wake up a hibernated JVM should be far cheaper than spinning up a new one, taking into account all of the container costs. The data that I’ve collected suggests that starting a JVM is only a small fraction of the total costs of spinning up a new container. And that doesn’t include warmup.
>
> The complication with the Serial collector is in how the heap is structured and, consequently, where data resides in memory after a collection cycle. We have rearranged where the generations reside so that ergonomics has the freedom to resize individual generational spaces without having to take on the cost of copying data about to accommodate that resizing. This work will land as soon as I address Thomas’s concerns in the JBS.
>
> This work sets us up for the next steps, which I believe should come more quickly now that we’ve set the foundation for it. What we’re looking to do is safely resize each generation according to its current needs while taking into account global memory pressure. In my experience, a lot more memory than is needed gets committed to Java heap simply to accommodate the current sizing policies. Resizing generational spaces individually allows us to end up with heap configurations that are currently unsafe. For example, it is common that GC log data tells me that Eden should be 2 or 3x the size of tenured. Currently, configuring Java heap to accommodate this need risks OOME being thrown or unnecessarily enlarging heap (Tenured) to safely allow for a much larger Eden. Getting this internal tuning right reduces both GC overhead and memory footprint. This also allows us to easily collapse the heap completely should a JVM become idle.
>
> While there are significant differences between G1 and the Serial collector, there are also similarities with the tuning strategies. In my opinion, the work needed for G1 is easier than it is for the Serial collector simply because of how Java heap is structured. That said, a tuning strategy for G1 is more complicated because the costs of transients are quite different in G1 than they are with the Serial/Parallel collectors. But I believe it is achievable using existing flags/structures and the addition of the SoftMaxHeapSize.
>
> If I might add, in large homogeneous deployments, you’d think you’d see a one-size-fits-all optimal GC configuration. Unfortunately, my look into this has shown that there are often multiple optimal configurations. The only way to combat this is with smarter ergonomics in the runtime.
>
>
> My proposal is just a tiny stopgap that might help alleviate some of
> the problems but does not attempt to be a holistic solution and, as
> you pointed out, has downsides.
> I totally agree with your assessment: it is just exposing internal
> constants but the fact that these are constants is part of the problem
> because they bake in an eager heap expansion behavior which is not
> necessarily desired.
> I share your reluctance to add more obscure tuning flags: it has a
> maintenance cost and a risk of misuse. I would not recommend that
> anyone tune these flags without reading the source code and
> understanding the tradeoffs.
> These are not silver bullets and, as you pointed out, probably would
> have to be used together with other tuning parameters to achieve
> reasonable results.
> To clarify, the way we plan to use these flags is to establish a
> constant set of tuning parameters that achieve a good tradeoff between
> latency, throughput and footprint and apply it to a large number of
> services.
> We want to avoid tuning each service individually because it is hard
> to scale. Example configuration (used with jdk17):
> -XX:+UnlockExperimentalVMOptions -XX:+G1PeriodicGCInvokesConcurrent
> -XX:G1PeriodicGCInterval=60000 -XX:G1PeriodicGCSystemLoadThreshold=0
> -XX:GCTimeRatio=9 -XX:G1MixedGCLiveThresholdPercent=85
> -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=60
> -XX:MaxGCPauseMillis=200 -XX:GCPauseIntervalMillis=1000
> -XX:-G1UsePreventiveGC -XX:-G1ScaleWithHeapPauseTimeThreshold
> -XX:G1MinPausesOverThresholdForGrowth=10
>
>
> A nightmare that can be avoided with smarter ergonomics.
>
>
> From experiments so far it seems that we can leave the adaptive IHOP
> on because even if it mispredicts, e.g. due to allocation spikes, the
> heap is not aggressively expanded.
>
> On the plus side, the change itself is tiny, very localized and could
> be trivially backported e.g. all the way to jdk17. Most importantly,
> it seems to enable significant cost savings.
>
> At the end of the day it is a tradeoff. Would it help if I provided
> examples of the impact this change had on real life applications? At
> Databricks we run hundreds of JVM services and initial results are
> very promising. Or should I treat this proposal as officially
> rejected?
>
> Wouldn't the option to make G1 keep GCTimeRatio better (e.g.
> https://bugs.openjdk.org/browse/JDK-8238687), and/or some configurable
> soft heap size goal (https://bugs.openjdk.org/browse/JDK-8236073) that
> the collector will keep, also solve your issue while being easier to
> configure?
>
> Thanks for sharing these. JDK-8238687 focuses on uncommit, while it is
> the heap expansion that hurts the most.
> The SoftMaxHeapSize could be used as a building block towards a
> solution. I think there still would have to be some controller that
> adjusts the value of SoftMaxHeapSize based on GC behavior e.g.
> increase it when GC pressure is too high.
>
>
> Having more data is always a good thing so I would welcome anything you can share.
>
> I pub’ed a table that suggests that GC CPU utilization, and not allocation rates, is a key metric to drive heap sizing. The other key metric is availability of RAM. Again, ZGC has this worked out so we’re integrating that work into ours.
>
> Kind regards,
> Kirk
>