[External] : Re: RFC: JEP: ZGC: Automatic Heap Sizing
Erik Osterlund
erik.osterlund at oracle.com
Wed May 29 15:10:50 UTC 2024
Hi Kirk,
I have a prototype here if you are interested in having a peek at what I’m cooking: https://github.com/fisk/jdk/tree/zgc_auto_heap_v3
I have some additional things I would like to try out, but it’s starting to shape up pretty well I think.
The exponential backoff logic is here: https://github.com/fisk/jdk/blob/22ca93998b7018394338b07f51659815faf69bfa/src/hotspot/share/gc/z/zAdaptiveHeap.cpp#L191
The best line in the entire patch is here: https://github.com/fisk/jdk/blob/22ca93998b7018394338b07f51659815faf69bfa/src/hotspot/share/gc/z/zAdaptiveHeap.cpp#L101
When it comes to reducing footprint for a memory-idle app, ZGC has an uncommitter thread that uncommits memory regions when they “time out”. Basically, if you haven’t used a region for X amount of time, it will get uncommitted. Normally this X is 5 minutes. However, in this patch I scale this timeout by the reciprocal of the memory pressure, which is continuously monitored. A consequence is that when memory gets increasingly scarce, we start aggressively dumping all the memory we can get rid of. That causes the heap to shrink and GC to kick in, which can open up further opportunities to shrink, should the situation be desperate.
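In rough pseudo-C++, the scaling looks something like this (a simplified sketch only; the names are illustrative, not the actual fields in the patch):

  // Illustrative sketch: the effective uncommit delay shrinks as machine-wide
  // memory pressure rises (a pressure of 1.0 roughly means "relaxed").
  #include <algorithm>
  #include <cstdint>

  static uint64_t effective_uncommit_delay_ms(uint64_t base_delay_ms,   // normally ~5 minutes
                                              double memory_pressure) { // grows as memory gets scarce
    // Scale the timeout by the reciprocal of the pressure: doubling the pressure
    // halves how long an unused region may sit before it is uncommitted.
    return static_cast<uint64_t>(base_delay_ms / std::max(memory_pressure, 1.0));
  }

With the pressure at 1.0 the 5 minute default is unchanged; at a pressure of 10, an unused region would already be uncommitted after 30 seconds.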
Anyway, hope you enjoy the reading!
Kind regards,
/Erik
On 29 May 2024, at 16:03, Kirk Pepperdine <kirk at kodewerk.com> wrote:
Hi Erik,
I’ve started looking at the serial collector and I’m interested in experimenting with the exponential function as well. I’d also like to add a part where a memory-idle app could reduce its memory footprint. That said, to achieve this I think one will need a speculative full collection, and I’m not so happy about introducing yet another speculative collection tactic, given that past attempts to do good speculatively have not ended well.
I look forward to looking at your code.
Kind regards,
Kirk
On May 28, 2024, at 7:07 PM, Erik Osterlund <erik.osterlund at oracle.com> wrote:
Hi Kirk,
Yeah, I think the approach itself should, for the most part, work well for Serial and Parallel as well. I personally welcome any work on this in the context of Serial and Parallel, if your team would like to have a look at that.
Kind regards,
/Erik
On 16 May 2024, at 17:19, Kirk Pepperdine <kirk at kodewerk.com> wrote:
Hi Erik,
I’m glad this worked. I’d like to see a solution that works across all collectors. We’re looking to experiment with the serial and parallel collectors for cases where applications run in smaller containers, in which these collectors are a more suitable choice.
Kind regards,
Kirk
On May 13, 2024, at 5:54 PM, Erik Osterlund <erik.osterlund at oracle.com> wrote:
Hi Kirk,
I experimented with setting aside a small portion of the machine memory and designating it as “critical”. The GC pressure is then scaled by an exponential function over the fraction of that critical memory reserve that is used on the machine. Unless memory pressure on the machine is very high, this changes nothing in the behaviour. But as the pressure gets critically high, it makes the GC try a lot harder as we run out of memory. At the same time it gives all processes a unified view of what the pressure is, so you get into an equilibrium where all processes apply similar GC pressure to avoid running out of memory. I also scale the delay for uncommitting memory by said memory pressure, which causes us to uncommit memory increasingly aggressively as memory levels get increasingly critical.
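The shape of the scaling is roughly the following (a simplified sketch; the constant and names are illustrative rather than what the prototype actually uses):

  // Illustrative sketch: GC effort grows exponentially with how much of the
  // critical reserve has been consumed. Near 0 the multiplier is ~1.0, i.e.
  // behaviour is unchanged; near 1.0 it climbs steeply.
  #include <cmath>

  static double gc_pressure_multiplier(double critical_used_fraction) { // 0.0 .. 1.0
    const double steepness = 8.0; // illustrative tuning constant
    return std::exp(steepness * critical_used_fraction);
  }

Because every JVM on the machine observes the same fraction, they all compute the same multiplier, which is what gives the equilibrium described above.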
Of course, you can always allocate something that uses exactly the amount of memory that is left on the machine, and then you will have a problem because we don’t have enough time to shrink. But in practice it seems to significantly reduce the amount of trouble due to coordination with other processes. I have run both homogeneous and heterogeneous experiments that normally should absolutely not work, yet work deterministically fine with this mechanism. I think it’s worth adding this mechanism to the scope of the JEP, to further reduce the probability that users need to mess with heap sizing. So thanks for the suggestion to have a look at this.
Kind regards,
/Erik
On 3 May 2024, at 15:25, Erik Osterlund <erik.osterlund at oracle.com> wrote:
Hi Kirk,
On 2 May 2024, at 18:59, Kirk Pepperdine <kirk at kodewerk.com> wrote:
Hi Erik,
Some questions.
On the question of allocation stalls, couldn’t/shouldn’t one just start the collection cycles sooner?
Yes - if nothing unexpected happens, then we will indeed just start earlier. But when unpredicted things happen, such as large surges in allocation rate or a sudden increase in residency compared to earlier collections, there is always a risk of stalling. Taking some extra memory is typically preferable to stalling.
On the question of sharing in containers, I’m aware of two different experiments on how to resolve the issue of allowing the Java heap to consume more of the available memory. Of the two, the most interesting is a modification of G1 that uses GC thread CPU overhead as a signal to decide whether the Java heap should be expanded or contracted. The idea is to keep GC thread utilization within a band (of ~20%). The other component of this experiment is to fail any attempt to expand should that expansion risk triggering the OOM killer. In this case the failed expansion will result in a hotter CPU. That said, this is a significantly more graceful degradation than having the process terminated. My question is, would you consider taking a similar approach for ZGC?
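To make the experiment above concrete, the control loop is roughly the following (a simplified sketch with made-up names and thresholds, not the actual G1 modification):

  #include <cstddef>

  // Illustrative sketch: keep GC thread CPU utilization inside a band, and
  // refuse to expand if the expansion itself could trip the OOM killer.
  static size_t propose_heap_size(size_t current_heap,
                                  double gc_cpu_utilization,   // fraction of CPU spent in GC threads
                                  size_t machine_memory_free) {
    const double band_low  = 0.10;
    const double band_high = 0.20;  // the ~20% band mentioned above
    const size_t step      = current_heap / 10;

    if (gc_cpu_utilization > band_high) {
      // GC is working too hard: expand, but only if that cannot trigger the OOM killer.
      if (step < machine_memory_free) {
        return current_heap + step;
      }
      return current_heap;          // expansion denied; accept a hotter CPU instead
    }
    if (gc_cpu_utilization < band_low) {
      return current_heap - step;   // GC is nearly idle: give memory back
    }
    return current_heap;            // within the band: keep the current size
  }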
The first part of it does indeed sound quite similar to what I’m planning.
Regarding reacting to some “dangerously high” watermark, I have thought a bit about that, and have so far decided to stay out of it. This is mostly because I haven’t thought of any promising way of doing it without getting strange echo effects, where multiple JVMs shrink and expand continuously based on reactions, and reactions to said reactions, rather than following a carefully planned global policy from an external source controlling them.
What might work reasonably well, though, is to compute the GC pressure, when it hasn’t been specified, as something like 1.0 / MIN(0.2, memory_left_fraction). This way, the GC pressure is “medium” until you get to the last 20% of memory, and then increases proportionally as the last 20% of memory on the machine gets used up. The nice thing with such an approach is that all JVMs agree about the GC pressure, and will do their fair share of extra work to keep it down, without oscillating up and down or having a single scapegoat JVM that takes all the burden. I’ll try that and see if it can work well.
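As a small worked sketch of that formula (the helper name is mine; a real implementation would also need to guard against memory_left_fraction reaching zero):

  #include <algorithm>

  // Illustrative sketch. memory_left_fraction is the free machine memory as a
  // fraction of total machine memory.
  static double default_gc_pressure(double memory_left_fraction) {
    // 1.0 / MIN(0.2, memory_left_fraction):
    //   more than 20% free -> pressure stays at 5.0 ("medium")
    //   10% free -> 10.0, 5% free -> 20.0, 1% free -> 100.0
    return 1.0 / std::min(0.2, memory_left_fraction);
  }

Note that 1.0 / 0.2 works out to 5, the same sort of medium level discussed further down.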
Finally, on the flag ZGCPressure, could you speak a little more about how it balances CPU vs memory? Specifically, what does the value 5 represent? I understand that pulling on that lever affects the level of aggressiveness, but is this the aggressiveness of how quickly the heap would be expanded?
I have intentionally been a bit vague here, as I am hoping to be able to change the implementation over time without being tied down to a specific contract that becomes impossible to conform to in the future, as this functionality becomes more complex.
Having said that, 5 is a sort of medium level of aggressiveness. Considering CPU utilization alone, if a process is utilizing 100% of the available CPU resources on a machine, then it will use approximately 1/8 of the available CPU for doing GC. But it will still perform GC less aggressively if the GC frequency seems too high.
Kind regards,
/Erik
Kind regards,
Kirk
On May 2, 2024, at 7:44 AM, Erik Osterlund <erik.osterlund at oracle.com> wrote:
Hi,
I have written a draft JEP for automatic heap sizing when using ZGC.
JEP description is available here: https://bugs.openjdk.org/browse/JDK-8329758
Comments and feedback are welcome.
Thanks,
/Erik