Container-aware heap sizing for OpenJDK
Ioi Lam
ioi.lam at oracle.com
Thu Sep 15 05:19:13 UTC 2022

Hi Jonathan,
Thanks for starting this discussion. This is a topic that I am 
interested in as well.
I think the general question is:
"How do I use as much memory as possible for the JVM without getting
OOM-killed?"
Traditionally, the JVM automatically picks a default maximum heap size
(Xmx) of about 1/4 of the total physical RAM on the host. This is OK for
plain old servers or desktop environments that run lots of processes.
However, in containers, which typically have a very small number of
processes, such a default is too conservative.
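
As a rough illustration of the arithmetic (this is not HotSpot's actual
ergonomics code, which has more special cases): the 1/4 default
corresponds to -XX:MaxRAMPercentage=25, and a container with few
processes might reasonably use a larger percentage. The class name, the
4 GiB limit and the 75% figure below are made up for the example.

```java
/** Back-of-the-envelope arithmetic for the default maximum heap size. */
final class DefaultHeapEstimate {
    /** Heap ceiling for a given amount of RAM and a percentage of it. */
    static long maxHeapBytes(long ramBytes, double maxRamPercentage) {
        return (long) (ramBytes * maxRamPercentage / 100.0);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long containerLimit = 4096 * mib; // hypothetical 4 GiB container limit

        // 25% mirrors the traditional "1/4 of RAM" default.
        System.out.println("25% of limit: " + maxHeapBytes(containerLimit, 25.0) / mib + " MiB");
        // 75% is an arbitrary, more aggressive choice for illustration only.
        System.out.println("75% of limit: " + maxHeapBytes(containerLimit, 75.0) / mib + " MiB");
        // What this JVM actually picked (or was given via -Xmx).
        System.out.println("this JVM:     " + Runtime.getRuntime().maxMemory() / mib + " MiB");
    }
}
```

Comparing the last line with and without -XX:MaxRAMPercentage on the
command line shows how far the default is from the container limit.
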
However, if you set -Xmx too high, the total amount of memory used
by the JVM can be unpredictable, because there are allocations both
inside and outside of the Java heap. Also, in your case, you have other
processes running inside the same container that the JVM has no direct
control over.
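
To make the heap vs. non-heap split concrete, here is a small sketch
using the standard java.lang.management API. Note the assumption: the
"non-heap" figure reported by MemoryMXBean covers JVM-managed pools such
as metaspace and the code cache, not every native allocation (thread
stacks, malloc'ed buffers, etc.), so the real footprint can be larger
than what it prints. The class and method names are made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/** Prints the JVM's own view of heap and non-heap memory. */
final class HeapVsNonHeap {
    /** Returns {heap used, heap committed, non-heap used} in bytes. */
    static long[] snapshot() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        return new long[] { heap.getUsed(), heap.getCommitted(), nonHeap.getUsed() };
    }

    public static void main(String[] args) {
        long[] s = snapshot();
        long mib = 1024L * 1024;
        System.out.println("heap used:      " + s[0] / mib + " MiB");
        System.out.println("heap committed: " + s[1] / mib + " MiB");
        // Only JVM-tracked non-heap pools; native allocations are not included.
        System.out.println("non-heap used:  " + s[2] / mib + " MiB");
    }
}
```
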
It makes sense for the JVM to dynamically adjust its size, but we should
think about different scenarios to see what our goals should be:
- In the simplest case, you have a single JVM process running inside the 
container. How do you balance its Java heap vs non-Java heap usage?
- If you have two JVM processes running inside the container, how do 
they coordinate?
- If the fluctuation is caused by other processes, can the JVM react
quickly (run GC and free up caches) to respond to quick spikes? Do we
need to configure the container to temporarily allow going over budget
(something like "you can be 100MB over budget for less than 20ms") so
the JVM has time to shrink itself?
- Conversely, how can a spiky process request the JVM to temporarily 
give up some memory?
It seems to me that for the more complex scenarios, it's not enough for
each individual JVM to make decisions on its own. We may need some sort
of inter-process coordination.
Thanks
- Ioi
On 9/13/2022 12:52 PM, Jonathan Joo wrote:
>
> Hello hotspot-dev and hotspot-gc-dev,
>
>
> My name is Jonathan, and I'm working on the Java Platform Team at 
> Google. Here, we are working on a project to address Java container 
> memory issues, as we noticed that a significant number of Java servers 
> hit container OOM issues due to people incorrectly tuning their heap 
> size with respect to the container size. Because our containers have 
> other RAM consumers which fluctuate over time, it is often difficult 
> to determine a priori what is an appropriate Xmx to set for a 
> particular server.
>
>
> We set about trying to solve this by dynamically adjusting the Java 
> heap/gc behavior based on the container usage information that we pass 
> into the JVM. We have seen promising results so far, reducing 
> container OOMs by a significant amount, and oftentimes also reducing 
> average heap usage (with the tradeoff of more CPU time spent doing GC).
>
>
> Below (under the dotted line) is a more detailed explanation of our 
> initial approach. Does this sound like something that may be useful 
> for the general OpenJDK community? If so, would some of you be open to 
> further discussion? I would also like to better understand what 
> container environments look like outside of Google, to see how we 
> could modify our approach for the more general case.
>
>
> Thank you!
>
> Jonathan
>
>
>       ------------------------------------------------------------------------
>
>
>       Introduction:
>
> Adaptable Heap Sizing (AHS) is a project internal to Google that is 
> meant to simplify configuration and improve the stability of 
> applications in container environments. The key is that in a 
> containerized environment, we have access to container usage and limit 
> information. This can be used as a signal to modify Java heap 
> behavior, helping prevent container OOMs.
>
>
>       Problem:
>
>  * Containers at Google must be properly sized to not only the JVM
>    heap, but other memory consumers as well. These consumers include
>    non-heap Java (e.g. native code allocations), and simultaneously
>    running non-Java processes.
>
>  * Common antipattern we see here at Google:
>
>     o We have an application running into container OOMs.
>
>     o An engineer raises both container memory limit and Xmx by the
>       same amount, since there appears to be insufficient memory.
>
>     o The application has reduced container OOMs, but is still prone
>       to them, since G1 continues to use most of Xmx.
>
>  * This results in many jobs being configured with much more RAM than
>    they need, but still running into container OOM issues.
>
>
>       Hypothesis:
>
>  * For preventing container OOM: Why can't heap expansions be bounded
>    by the remaining free space in the container?
>
>  * For preventing the `unnecessarily high Xmx` antipattern: Why can't
>    target heap size be set based on GC CPU overhead?
>
>  * From our work on Adaptable Heap Sizing, it appears they can!
>
>
>       Design:
>
>  * We add two manageable flags in the JVM
>
>     o Current maximum heap expansion size
>
>     o Current target heap size
>
>  * A separate thread runs alongside the JVM, querying:
>
>     o Container memory usage/limits
>
>     o GC CPU overhead metrics from the JVM.
>
>  * This thread then uses this information to calculate new values for
>    the two new JVM flags, and continually updates them at runtime.
>
>  * The `Current maximum heap expansion size` informs the JVM what is
>    the maximum amount we can expand the heap by, while staying within
>    container limits. This is a hard limit, and trying to expand more
>    than this amount results in behavior equivalent to hitting the Xmx
>    limit.
>
>  * The `Current target heap size` is a soft target value, which is
>    used to resize the heap (when possible) so as to bring GC CPU
>    overhead toward its target value.
>
>
>       Results:
>
>  * At Google, we have found that this design works incredibly well in
>    our initial rollout, even for large and complex workloads.
>
>  * After deploying this to dozens of applications:
>
>     o Significant memory savings for previously misconfigured jobs
>       (many of which reduced their heap usage by 50% or more)
>
>     o Significantly reduced occurrences of container OOM (100%
>       reduction in vast majority of cases)
>
>     o No correctness issues
>
>     o No latency regressions*
>
>     o We plan to deploy AHS across a much wider subset of
>       applications by EOY '22.
>
>
>       *Caveats:
>
>  * Enabling this feature might require tuning of the newly
>    introduced default GC CPU overhead target to avoid regressions.
>
>  * Time spent doing GC for an application may increase significantly
>    (though generally we've seen in practice that even if this is the
>    case, end-to-end latency does not increase a noticeable amount)
>
>  * Enabling AHS results in frequent heap resizings, but we have not
>    seen evidence of any negative effects as a result of these more
>    frequent heap resizings.
>
>  * AHS is not necessarily a replacement for proper JVM tuning, but
>    should generally work better than an untuned or improperly tuned
>    configuration.
>
>  * AHS is not intended for every possible workload, and there could
>    be pathological cases where AHS results in worse behavior.
>
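
For readers who want something concrete, the Design section in the
quoted message (poll container usage/limits and GC overhead, then derive
a hard expansion bound and a soft heap target) could be sketched roughly
as below. Everything here is an assumption made for illustration: the
class and method names are invented, the cgroup v2 file paths are only
one common layout, GC overhead is approximated by wall-clock GC time
from GarbageCollectorMXBean rather than true CPU time, and the
proportional rule for the target is a placeholder, not the rule AHS
uses (the message does not describe it). The sketch computes the two
values once and prints them; the real design updates JVM flags
continuously from a separate thread.

```java
import java.io.IOException;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: one polling step of an AHS-like calculation. */
final class AdaptiveHeapSketch {
    // cgroup v2 files as commonly seen inside a container; other setups differ.
    static final Path LIMIT = Path.of("/sys/fs/cgroup/memory.max");
    static final Path USAGE = Path.of("/sys/fs/cgroup/memory.current");

    /** Parses a cgroup v2 memory value; the literal "max" means "no limit". */
    static long parseCgroupValue(String raw) {
        String s = raw.trim();
        return s.equals("max") ? Long.MAX_VALUE : Long.parseLong(s);
    }

    /** Hard bound: the heap may grow by at most the free space in the container. */
    static long maxHeapExpansionBytes(long limit, long usage) {
        return Math.max(0L, limit - usage);
    }

    /** Soft target (placeholder rule): scale the heap by observed/target GC overhead. */
    static long targetHeapBytes(long currentHeap, double gcOverhead,
                                double targetOverhead, long minHeap, long maxHeap) {
        long proposed = (long) (currentHeap * (gcOverhead / targetOverhead));
        return Math.max(minHeap, Math.min(maxHeap, proposed));
    }

    /** Fraction of JVM uptime spent in GC (a wall-clock stand-in for GC CPU overhead). */
    static double gcTimeFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcMillis += Math.max(0L, gc.getCollectionTime()); // -1 means "undefined"
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime > 0 ? (double) gcMillis / uptime : 0.0;
    }

    public static void main(String[] args) throws IOException {
        if (!Files.isReadable(LIMIT) || !Files.isReadable(USAGE)) {
            System.out.println("cgroup v2 memory files not found; nothing to do");
            return;
        }
        long limit = parseCgroupValue(Files.readString(LIMIT));
        long usage = parseCgroupValue(Files.readString(USAGE));
        Runtime rt = Runtime.getRuntime();
        double targetOverhead = 0.05; // hypothetical 5% GC overhead target
        long minHeap = 16L << 20;     // hypothetical 16 MiB floor

        System.out.println("max heap expansion: " + maxHeapExpansionBytes(limit, usage));
        System.out.println("target heap size:   " + targetHeapBytes(
                rt.totalMemory(), gcTimeFraction(), targetOverhead, minHeap, rt.maxMemory()));
    }
}
```

The two functions are kept separate on purpose, mirroring the hard
limit vs. soft target distinction in the description: the expansion
bound depends only on the container, while the target depends only on
GC behavior.
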