[External] : Re: RFC: JEP: ZGC: Automatic Heap Sizing

Erik Osterlund erik.osterlund at oracle.com
Thu Jun 13 16:48:58 UTC 2024


Hi Kirk,

Serial does have two different reserved spaces, but they form a single contiguous heap (to get compressed oops), separated by a static boundary set up at boot time. It’s the same for Parallel. Parallel used to have a JVM flag for dynamically changing the boundaries called UseAdaptiveGCBoundary. It was never stable enough and was removed because of that (cf. https://bugs.openjdk.org/browse/JDK-8228991). Having said that, perhaps initial static boundaries based on the largest heap size you can grow to are enough to allow the generations to be sized reasonably well without moving the boundaries. Especially since Serial is only chosen heuristically when the maximum heap size is bound to be fairly small, so even the maximum size will then use the cheap compressed oops encoding.

It’s worth mentioning that churning around committing and uncommitting memory can cause a substantial CPU cost. So having a fluctuating heap capacity might cause CPU overhead, which is itself used to drive changes to the heap size. I’d be a bit wary of “echo” effects there. In my implementation, I’m more swift when it comes to increasing the heap size, because a heap that is too small can be a major performance bottleneck. But shrinking the heap is usually less urgent and is, under normal circumstances, done more gradually and carefully. First of all, I normally shrink the heap only in major collections, as only major collections have a full understanding of the total CPU impact of the GC system. But I also separate shrinking the target capacity from actually uncommitting memory, which is further delayed and done concurrently, in order to avoid premature decisions to uncommit memory. Uncommitting causes TLB shootdowns and other overheads, and is avoided as much as possible. The exception is when memory pressure on the machine is high, causing actual urgency for uncommitting. The uncommit delay is scaled by the memory pressure, which is continuously monitored. When it gets critical, we are ready to detect the high memory pressure and uncommit memory back to the OS very rapidly (within a couple of milliseconds).

I guess what I’m trying to say is that having an unstable heap capacity can be contagious and might risk making the CPU overhead unstable (and unnecessarily high) as well, and that uncommitting is less urgent if there is more memory available, but more urgent when resources get depleted. I have found it beneficial to have a policy that reflects that.
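
To make that asymmetry concrete, the policy amounts to roughly the following (a heavily simplified sketch with made-up names and numbers, not the actual ZGC code):

    // Heavily simplified sketch of the grow-fast / shrink-carefully policy
    // described above. All names and numbers are invented for illustration.
    struct SizingInput {
      bool is_major_collection;  // only major GCs see the full CPU picture
      bool gc_is_bottleneck;     // e.g. allocation stalls or high GC CPU
      bool gc_is_cheap;          // CPU spent on GC is below target
    };

    struct SizingDecision {
      double capacity_factor;         // multiplier for the target heap capacity
      bool   schedule_async_uncommit; // let the uncommitter reclaim memory later
    };

    static SizingDecision decide_capacity(const SizingInput& in) {
      SizingDecision d{1.0, false};
      if (in.gc_is_bottleneck) {
        // Growing is urgent: a too-small heap is a major performance bottleneck.
        d.capacity_factor = 1.25;
      } else if (in.is_major_collection && in.gc_is_cheap) {
        // Shrinking is not urgent: only the *target* capacity is lowered here,
        // and only at major collections. The actual uncommit is deferred and
        // done concurrently, so a premature decision can be undone before
        // memory is handed back to the OS.
        d.capacity_factor = 0.95;
        d.schedule_async_uncommit = true;
      }
      return d;
    }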

Oh, and cool tool you linked! :-)

/Erik

On 13 Jun 2024, at 05:40, Kirk Pepperdine <kirk at kodewerk.com> wrote:

Hi Erik,

Thank you very much for the very detailed analysis. It matches my experiences when tuning GC.

For the Serial collector there appears to be memory reserved for Young and memory reserved for Tenured. The current algorithm resizes both spaces separately. My proposal wouldn’t change that. Resizing would take place at the end of a collection, where for the case of a Young collection, Eden and the from space will be empty, and for a Full collection, Young will be empty and Tenured should be compacted (sans anything being pinned). The heuristic is to use GC overhead’s distance from a target GC overhead value to drive the decision to expand or contract. My current thinking is to use 1% as an initial target. The aim is always to minimize the maximum amount of memory committed for all spaces such that GC overhead approaches this 1% target. As I think you mentioned before, I’m looking for stable CPU saturation while tolerating an unstable heap size. The heuristic should also be aware of the amount of memory available so that it doesn’t commit to the point of triggering the OOM killer. Finally, the heuristic needs to consider how to minimize premature promotion, as this is one pathology that can significantly drive up GC overhead. The take-away: I don’t want to be too aggressive in reducing Young, as that may leave the Survivors undersized, which will trigger overt and covert premature promotion, driving up the frequency of Full collections.
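
Concretely, the end-of-collection decision I have in mind would be along these lines (only a sketch; the names, thresholds and resize step below are placeholders, not an implementation):

    // Sketch of the end-of-collection resize decision described above.
    // The 1% target, the dead band and the 10% step are illustrative only.
    #include <algorithm>
    #include <cstddef>

    struct HeapState {
      double gc_overhead;        // fraction of CPU time spent in GC, e.g. 0.013
      size_t committed_bytes;    // currently committed heap
      size_t max_heap_bytes;     // -Xmx
      size_t os_available_bytes; // headroom before risking the OOM killer
    };

    // Returns the committed size to aim for after a collection.
    static size_t resize_after_gc(const HeapState& s) {
      const double target    = 0.01;    // 1% GC overhead target
      const double dead_band = 0.0025;  // don't react to small deviations
      const double step      = 0.10;    // resize in 10% increments

      size_t desired = s.committed_bytes;
      if (s.gc_overhead > target + dead_band) {
        // GC is too expensive: expand to reduce collection frequency.
        desired = (size_t)(s.committed_bytes * (1.0 + step));
      } else if (s.gc_overhead < target - dead_band) {
        // GC is cheap: give memory back, but less aggressively, so that
        // Survivors don't shrink and trigger premature promotion.
        desired = (size_t)(s.committed_bytes * (1.0 - step / 2));
      }
      // Never expand past -Xmx or into memory the OS doesn't have.
      desired = std::min(desired, s.max_heap_bytes);
      desired = std::min(desired, s.committed_bytes + s.os_available_bytes);
      return desired;
    }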

As for Microservice latencies, you might be interested in https://github.com/kcpeppe/microservicepausemodel. It gives four different views on expected latency given Microservice chain length.



On Jun 12, 2024, at 12:52 AM, Erik Osterlund <erik.osterlund at oracle.com> wrote:

Hi Bruno,

Changing the boundaries between the old and young generations can significantly impact the overall cost of doing GC. Let’s consider a hypothetical GC running a hypothetical program to demonstrate this.

In our hypothetical and simplified system, we have constrained the old generation to 20M, and at any given point, 10M is live in the old generation. We start with a 30M heap, meaning that we give 10M to the young generation. Each young collection processes 1M live bytes, and promotes half of it to the old generation. We can now estimate the CPU cost as how many bytes we have to process in the system, and compare that to how much memory we manage to recycle.

Since 0.5M is promoted per YC, it will take 20 YCs until we need to perform a full GC due to promoting memory up to the 20M old gen capacity (half was already resident). These YCs will process a total of 20M. The reclaimed bytes due to the young collections are 9.5M * 20 = 190M

The total reclaimed bytes due to the YCs and the resetting full GC is 190 + 20 = 210M
The total processed bytes due to the YCs and the resetting full GC is 20 + 10 = 30M

So the memory reclamation cost is now 7 reclaimed bytes per processed byte, and about 1/3 of that overall cost is due to walking the old generation, which had an order of magnitude larger residency than the young generation did. So even though its frequency of collection is relatively low due to giving the old generation a majority of the memory, it still has a significant impact on the cost of doing GC.

Let's expand the heap to, say, 1000M. If the old generation sticks to its previous 20M, then the young generation gets to use 980M. We can now do the calculation again.
We still need a full GC every 20 YCs, because we didn’t give the old generation any more memory.

Total reclaimed bytes due to YCs and the resetting full GC is 979.5 * 20 + 990 = 20 580M
Total processed bytes due to YCs and the resetting full GC is still 20 + 10 = 30M

So now the memory reclamation cost has improved massively to 686 reclaimed bytes per processed byte, but it’s still the case that about 1/3 of that overall cost is due to processing the old generation which now constitutes 2% of our heap. It would be tempting to increase the capacity for the old generation to improve our CPU efficiency of doing GC further. Let’s see what happens if we move another 20M from the young generation to the old generation.

We now have a 1000M heap, 40M capacity in the old generation, and 960M capacity in the young generation. In other words, we made a 100% increase of the old generation capacity, at the cost of a ~2% smaller young generation. Since the old generation still contributed a significant cost, this should be favourable. Let’s see what happens.

Total reclaimed bytes due to YCs and the resetting full GC is 959.5 * 40 + 990 = 39 370M
Total processed bytes due to YCs and the resetting full GC is 40 + 10 = 50M

The overall cost per reclaimed byte has now improved to 787.4 reclaimed bytes per processed byte; a ~15% improvement to the CPU cost efficiency of doing GC. We can still improve CPU cost efficiency by further expanding the old generation.

As you can imagine, if we were to now shrink the heap to, say, 42M, and we keep 40M capacity for the old generation, the efficiency of the young collector is going to take a significant hit, as the reclaimed bytes per YC drop significantly to 1.5M. So now it would be desirable to shrink the capacity of the old generation as well, to improve the efficiency of YCs.
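
For anyone who wants to play with these numbers, here is a small standalone program that reproduces them. The number of young collections between full GCs is taken directly from the figures above rather than derived, and all names are made up:

    // Standalone sketch of the cost model above. The YC counts per full GC
    // cycle are copied from the figures in this mail, not derived.
    #include <cstdio>

    struct Scenario {
      double heap_mb;        // total heap capacity
      double old_cap_mb;     // old generation capacity
      double old_live_mb;    // live bytes resident in the old generation
      double yc_live_mb;     // live bytes processed per young collection
      double promoted_mb;    // bytes promoted per young collection
      double ycs_per_cycle;  // young collections between resetting full GCs
    };

    static double reclaimed_per_processed(const Scenario& s) {
      double young_cap = s.heap_mb - s.old_cap_mb;
      // Each YC frees the young generation except the survivors it keeps.
      double reclaimed_per_yc = young_cap - (s.yc_live_mb - s.promoted_mb);
      double reclaimed = s.ycs_per_cycle * reclaimed_per_yc
                       + (s.heap_mb - s.old_live_mb);   // resetting full GC
      double processed = s.ycs_per_cycle * s.yc_live_mb
                       + s.old_live_mb;                 // full GC walks old live
      return reclaimed / processed;
    }

    int main() {
      Scenario scenarios[] = {
        {   30, 20, 10, 1, 0.5, 20 },  // ~7 reclaimed bytes per processed byte
        { 1000, 20, 10, 1, 0.5, 20 },  // ~686
        { 1000, 40, 10, 1, 0.5, 40 },  // ~787.4
      };
      for (const Scenario& s : scenarios) {
        printf("heap %.0fM, old %.0fM: %.1f reclaimed bytes per processed byte\n",
               s.heap_mb, s.old_cap_mb, reclaimed_per_processed(s));
      }
      return 0;
    }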

In generational ZGC we don’t have a notion of capacity for generations; the pool of memory is shared between them. It automatically triggers major collections at frequencies calculated to minimize the overall CPU cost of performing GC in total. So when I increase or decrease the heap, it will automatically rebalance the generations in a CPU cost minimizing way, adapting to the new circumstances.

Naturally, for a STW GC, there is also a fun latency component to consider. Sometimes it might be more CPU efficient to increase the frequency of old generation collections. But that might start degrading the responsiveness of the system. Similarly, increasing the young generation size can also lead to worse latencies, in a system where there is more churn between the generations.

In a system with micro services where each external request forks out to 20 internal requests, a 1% chance of a bad GC pause per internal request leads to a ~18.2% chance (1 - 0.99^20) that the external request is impacted by at least one of the internal requests hitting a bad GC pause. A 10% chance of hitting a bad GC pause per internal request, however, leads to an 87.8% chance (1 - 0.9^20) that an external request gets hit by a bad GC pause.

Hope this helps you understand the problem space a bit. I still think this is very cool and I encourage you to investigate it further. But I thought I’d help by pointing out, beforehand, some pitfalls you will probably encounter. ;-)

Thanks,
/Erik


On 11 Jun 2024, at 19:35, Bruno Borges <Bruno.Borges at microsoft.com> wrote:

Erik,

Would it be reasonable to assume that only Young(er) generation(s) should be shrunk following a heap shrink event? If not, why?
________________________________
From: hotspot-gc-dev <hotspot-gc-dev-retn at openjdk.org> on behalf of Erik Osterlund <erik.osterlund at oracle.com>
Sent: June 11, 2024 7:50 AM
To: Kirk Pepperdine <kirk at kodewerk.com>
Cc: hotspot-gc-dev at openjdk.org <hotspot-gc-dev at openjdk.org>
Subject: Re: [External] : Re: RFC: JEP: ZGC: Automatic Heap Sizing

Hi Kirk,

Do I read the table right that the idea is to basically shrink when CPU saturation is low, and expand when it’s high?
If so, it’s worth keeping in mind that the high-level problem I’m trying to solve is that we use a static amount of memory whether it’s needed or not. This smells like it might instead take a static amount of CPU (whatever we define as moderate), whether it’s needed or not, as we hold a CPU magnet around whatever level is “moderate”. As a Swedish GC designer, I think it’s important to also teach the heuristics the notion of lagom (cf. https://en.wikipedia.org/wiki/Lagom), to prevent a process that could have gotten away with using 1% CPU from ending up using, say, 50% CPU, just so it could shrink the last couple of bytes of heap, when memory maybe wasn’t really a problem at all in the first place.

Another thing worth thinking about when changing the heap size in a GC like Serial, is that changing the heap size will change the size of the generations, and hence we probably need to teach Serial how to move the boundaries between the young and old generations. I suspect that will be an interesting experience. Just a heads up!

Kind regards,
/Erik

On 4 Jun 2024, at 21:18, Kirk Pepperdine <kirk at kodewerk.com> wrote:

Hi Erik,

Thanks for the link. This all reads very nicely, all but your favorite line, because it involves Pi, and we all know cake is square and Pi is round.

The current ergonomic sizing for the serial collector is a function of the number of non-Daemon threads. I don’t think that the serial collector needs logic as complex as what you’ve put in play here. I think I can get away with a simple GC overhead calculation. I based that assessment on the following chart.


Allocation Pressure | CPU Saturation | Action
--------------------+----------------+-------
Low                 | Low            | Shrink
Moderate            | Low            | Shrink
High                | Low            | Shrink
Low                 | Moderate       | None
Moderate            | Moderate       | None
High                | Moderate       | None
Low                 | High           | Expand
Moderate            | High           | Expand
High                | High           | Expand


If we take GC overhead ~= CPU saturation, then this chart implies that setting a GC throughput target and then resizing the Java heap as needed to meet that target is all that is needed. I intend to start by experimenting with this logic.

Kind regards,
Kirk

On May 29, 2024, at 5:10 PM, Erik Osterlund <erik.osterlund at oracle.com> wrote:

Hi Kirk,

I have a prototype here if you are interested in having a peek at what I’m cooking: https://github.com/fisk/jdk/tree/zgc_auto_heap_v3

I have some additional things I would like to try out, but it’s starting to shape up pretty well I think.

The exponential backoff logic is here: https://github.com/fisk/jdk/blob/22ca93998b7018394338b07f51659815faf69bfa/src/hotspot/share/gc/z/zAdaptiveHeap.cpp#L191
The best line in the entire patch is here: https://github.com/fisk/jdk/blob/22ca93998b7018394338b07f51659815faf69bfa/src/hotspot/share/gc/z/zAdaptiveHeap.cpp#L101

When it comes to reducing footprint for a memory-idle app, ZGC has an uncommitter thread that uncommits memory regions when they “time out”. Basically, if you haven’t used a region for X amount of time, then it will get uncommitted. Normally this X is 5 minutes. However, in this patch I scale this timeout by the reciprocal of the memory pressure, which is continuously monitored. A consequence of that is that when memory gets increasingly scarce, we will start desperately dumping all memory we can get rid of, which will cause the heap to shrink, and GC to kick in, which can lead to more opportunities to shrink further, should the situation be desperate.
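
In rough terms, the timeout scaling amounts to something like this (a sketch with invented names, assuming the memory pressure is normalized so that 1.0 means “no pressure”; not the actual code):

    // Sketch of scaling the uncommit timeout by the reciprocal of the
    // monitored memory pressure. Names and normalization are assumptions.
    #include <algorithm>

    static double uncommit_timeout_seconds(double memory_pressure) {
      // memory_pressure: 1.0 means the machine has plenty of free memory;
      // larger values mean memory is getting increasingly scarce.
      const double base_timeout = 5.0 * 60.0;  // the normal 5 minute timeout
      double pressure = std::max(memory_pressure, 1.0);
      // Scaling by the reciprocal: at 10x pressure the timeout drops to 30s,
      // at 1000x pressure it is a few hundred milliseconds, so unused regions
      // are dumped back to the OS almost immediately.
      return base_timeout / pressure;
    }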

Anyway, hope you enjoy the reading!

Kind regards,
/Erik

On 29 May 2024, at 16:03, Kirk Pepperdine <kirk at kodewerk.com> wrote:

Hi Erik,

I’ve started looking at the serial collector and I’m interested in experimenting with the exponential function also. I’d also like to add a part where a memory-idle app could reduce its memory footprint. That said, to achieve this I think one will need a speculative full collection, and I’m not so happy about introducing yet another speculative collection tactic given that all past attempts to do good speculatively have not ended well.

I look forward to looking at your code.

Kind regards,
Kirk


On May 28, 2024, at 7:07 PM, Erik Osterlund <erik.osterlund at oracle.com> wrote:

Hi Kirk,

Yeah, I think the approach itself should work well for Serial and Parallel as well for the most part. I personally welcome any work on this in the context of Serial and Parallel, if your team would like to have a look at that.

Kind regards,
/Erik

On 16 May 2024, at 17:19, Kirk Pepperdine <kirk at kodewerk.com> wrote:

 Hi Erik,

I’m glad this worked. I’d like to see a solution that works across all collectors. We’re looking to experiment with the serial/parallel collector for the cases when applications will be running in smaller containers where these collectors are a more suitable choice.

Kind regards,
Kirk


On May 13, 2024, at 5:54 PM, Erik Osterlund <erik.osterlund at oracle.com> wrote:

Hi Kirk,

I experimented with setting aside a small portion of the machine memory and saying it is “critical”. The GC pressure is then scaled by an exponential function over what fraction of the critical memory reserve is used on the machine. Unless memory pressure on the machine is very high, it changes nothing in the behaviour. But as it gets critically high, it makes the GC try a lot harder as we run out of memory, and at the same time gives all processes a unified view of what the pressure is, and you get into an equilibrium situation where all processes apply similar GC pressure to avoid running out of memory. I also scale the delay for uncommitting memory by said memory pressure, which causes us to uncommit memory increasingly aggressively as memory levels get increasingly critical.
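
As a sketch, with an invented name and a guessed curve shape (only the idea of an exponential over the used fraction of the critical reserve comes from the description above):

    // Sketch only: the actual curve shape and names are assumptions. The idea
    // is that pressure is unchanged while the critical reserve is untouched,
    // and grows rapidly as the reserve is eaten into.
    #include <cmath>

    static double scaled_gc_pressure(double base_pressure,
                                     double critical_reserve_used_fraction) {
      // critical_reserve_used_fraction is 0.0 while the machine stays out of
      // the reserved slice of memory, and approaches 1.0 as it is used up.
      double boost = std::exp(4.0 * critical_reserve_used_fraction); // ~1x..~55x
      return base_pressure * boost;
    }

Since every process computes the boost from the same machine-wide fraction, they all see the same scaled pressure, which is what gives the equilibrium behaviour.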

Of course, you can always allocate something that uses exactly the amount of memory that is left on the machine, and you will have a problem because we don’t have enough time to shrink. But in practice it seems to reduce the amount of trouble due to coordination with other processes significantly. I have run both homogeneous and heterogeneous experiments that normally just should absolutely not work, but that work deterministically fine with this mechanism. I think it’s worth adding this mechanism to the scope of the JEP, to further reduce the probability that users need to mess with heap sizing. So thanks for the suggestion to have a look at this.

Kind regards,
/Erik

On 3 May 2024, at 15:25, Erik Osterlund <erik.osterlund at oracle.com> wrote:

Hi Kirk,

On 2 May 2024, at 18:59, Kirk Pepperdine <kirk at kodewerk.com> wrote:

Hi Erik,

Some questions.

On the question of allocation stalls, couldn’t/shouldn’t one just start the collection cycles sooner?

Yes - if nothing unexpected happens, then we will indeed just start earlier. But when unpredicted things happen, such as large surges in allocation rate or a sudden increase in residency compared to earlier collections, there is always a risk of stalling. Just taking some extra memory is typically preferred over stalling.

On the question of sharing in containers, I’m aware of two different experiments on how to resolve the issue of allowing the Java heap to consume more of the available memory. Of the two, the most interesting is a modification of G1 that uses GC thread CPU overhead as a signal to decide if the Java heap should be expanded or contracted. The idea is to keep GC thread utilization within a band (of ~20%). The other component of this experiment is to fail any attempts to expand should that expansion risk triggering an OOM killer. In this case the failed expansion will result in a hotter CPU. That said, this is a significantly more graceful degradation than having the process terminated. My question is, would you consider taking a similar approach for ZGC?

The first part of it sounds indeed quite similar to what I’m planning.

Regarding reacting to some “dangerously high” watermark, I have thought a bit about that, and have so far decided to stay out of it. Mostly because I haven’t thought of any promising way of doing that without getting strange echo effects where multiple JVMs are shrinking and expanding continuously based on reactions, and reactions to said reactions, rather than a carefully planned global policy from an external source controlling them.

What might work reasonably well though, is to compute the GC pressure, if it hasn’t been specified, as something like 1.0 / MIN(0.2, memory_left_fraction). This way, the GC pressure is “medium” until you get to the last 20% of memory, and then increases proportionally as the last 20% of memory on the machine starts getting used up. The nice thing with such an approach is that all JVMs agree about the GC pressure, and will do their fair share of extra work to keep it down, without oscillating up and down or having a single scapegoat JVM that takes all the burden. I’ll try that and see if that can work well.
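
Spelled out, with a made-up name for the free-memory fraction:

    // Direct transcription of the formula above; memory_left_fraction is the
    // fraction of the machine's memory that is still free (an invented name).
    #include <algorithm>

    static double default_gc_pressure(double memory_left_fraction) {
      // 1.0 / 0.2 = 5.0 ("medium") while at least 20% of the machine's memory
      // is free; below that the pressure rises proportionally, e.g. 5% free
      // gives pressure 20, and 1% free gives pressure 100.
      return 1.0 / std::min(0.2, memory_left_fraction);
    }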

Finally, on the flag ZGCPressure, could you speak a little more about how it balances CPU vs memory? Specifically, what does the value 5 represent? I understand that if one were to pull on that lever you would affect the level of aggressiveness, but is this the aggressiveness of how quickly the heap would be expanded?

I have intentionally been a bit vague here, as I am hoping to be able to change the implementation over time without being tied down to a specific contract that becomes impossible to conform to in the future when this functionality becomes more complex.
Having said that, 5 is a sort of medium level of aggressiveness. Considering CPU utilization alone, if a process is utilizing 100% of the available CPU resources on a machine, then it will use approximately 1/8 of the available CPU for doing GC. But it will still perform GC less aggressively if the frequency seems too high.

Kind regards,
/Erik

Kind regards,
Kirk


On May 2, 2024, at 7:44 AM, Erik Osterlund <erik.osterlund at oracle.com> wrote:

Hi,

I have written a draft JEP for automatic heap sizing when using ZGC.
JEP description is available here: https://bugs.openjdk.org/browse/JDK-8329758

Comments and feedback are welcome.

Thanks,
/Erik


