Container-aware heap sizing for OpenJDK

Severin Gehwolf sgehwolf at redhat.com
Wed Sep 14 09:53:46 UTC 2022


Hi,

On Tue, 2022-09-13 at 15:16 -0400, Jonathan Joo wrote:
> Hello hotspot-dev and hotspot-gc-dev,
> 
> My name is Jonathan, and I'm working on the Java Platform Team at
> Google. Here, we are working on a project to address Java container
> memory issues, as we noticed that a significant number of Java
> servers hit container OOM issues due to people incorrectly tuning
> their heap size with respect to the container size. Because our
> containers have other RAM consumers which fluctuate over time, it is
> often difficult to determine a priori what is an appropriate Xmx to
> set for a particular server. 
> 
> We set about trying to solve this by dynamically adjusting the Java
> heap/gc behavior based on the container usage information that we
> pass into the JVM. We have seen promising results so far, reducing
> container OOMs by a significant amount, and oftentimes also reducing
> average heap usage (with the tradeoff of more CPU time spent doing
> GC). 
> 
> Below (under the dotted line) is a more detailed explanation of our
> initial approach. Does this sound like something that may be useful
> for the general OpenJDK community? If so, would some of you be open
> to further discussion? I would also like to better understand what
> container environments look like outside of Google, to see how we
> could modify our approach for the more general case.

This seems like an interesting proposal and I'd be interested in your
work. A few questions:

   1. How is AHS enabled? Is it on by default or is it opt-in?
   2. Is the prototype working for all GCs available in OpenJDK or
      specific to G1?
   3. Would this be a Linux only feature?

Thanks,
Severin

> Thank you!
> 
> Jonathan
> ----------------------------------------------------------------------
> 
> Introduction:
> Adaptable Heap Sizing (AHS) is a project internal to Google that is
> meant to simplify configuration and improve the stability of
> applications in container environments. The key is that in a
> containerized environment, we have access to container usage and
> limit information. This can be used as a signal to modify Java heap
> behavior, helping prevent container OOMs.
> 
> Problem:
>  * Containers at Google must be sized to accommodate not only the JVM
>    heap, but other memory consumers as well. These consumers include
>    non-heap Java memory (e.g. native code allocations) and
>    simultaneously running non-Java processes.
>  * A common antipattern we see here at Google:
>     - We have an application running into container OOMs.
>     - An engineer raises both the container memory limit and Xmx by
>       the same amount, since there appears to be insufficient memory.
>     - The application hits fewer container OOMs, but remains prone to
>       them, since G1 continues to use most of Xmx.
>  * This results in many jobs being configured with much more RAM than
>    they need, but still running into container OOM issues.
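
To put hypothetical numbers on that antipattern: with a 4 GiB container
limit and -Xmx3g, G1 will sooner or later grow the heap toward 3 GiB,
leaving roughly 1 GiB for native allocations and sibling processes.
Raising both values by 2 GiB (a 6 GiB limit and -Xmx5g) leaves that
~1 GiB of headroom unchanged, so once the heap fills up again the
container is just as close to its limit as before.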
> 
> Hypothesis:
>  * For preventing container OOM: Why can't heap expansions be bounded
>    by the remaining free space in the container?
>  * For preventing the `unnecessarily high Xmx` antipattern: Why can't
>    target heap size be set based on GC CPU overhead?
>  * From our work on Adaptable Heap Sizing, it appears they can!
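
Reading those two ideas as rough rules (my paraphrase, not wording from
the proposal):

    heap expansion bound ≈ container memory limit - current container usage
    target heap size     : grow while observed GC CPU overhead is above
                           its target, shrink while it is below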
> 
> Design:
>  * We add two manageable flags in the JVM:
>     - Current maximum heap expansion size
>     - Current target heap size
>  * A separate thread runs alongside the JVM, querying:
>     - Container memory usage/limits
>     - GC CPU overhead metrics from the JVM
>  * This thread then uses this information to calculate new values for
>    the two new JVM flags, and continually updates them at runtime.
>  * The `Current maximum heap expansion size` tells the JVM the
>    maximum amount by which it may expand the heap while staying
>    within container limits. This is a hard limit; trying to expand by
>    more than this amount results in behavior equivalent to hitting
>    the Xmx limit.
>  * The `Current target heap size` is a soft target value, which is
>    used to resize the heap (when possible) so as to bring GC CPU
>    overhead toward its target value.
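
For readers who want to picture the moving parts, below is a minimal,
unofficial sketch of what such a controller could look like if written
against existing Java APIs: HotSpotDiagnosticMXBean for updating
manageable flags and the cgroup v2 files for container accounting. The
flag names MaxHeapExpansionSize and TargetHeapSize are placeholders of
my own (the proposal does not name the actual flags), and this is only
my guess at the shape of the thread described above.

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class AhsControllerSketch implements Runnable {
        // Safety margin kept free inside the container; my assumption,
        // not a number from the proposal.
        private static final long HEADROOM = 64L * 1024 * 1024;

        @Override
        public void run() {
            HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            while (true) {
                try {
                    // cgroup v2 accounting (Linux-only; paths assume the
                    // unified hierarchy mounted at /sys/fs/cgroup).
                    long usage = readBytes("/sys/fs/cgroup/memory.current");
                    long limit = readBytes("/sys/fs/cgroup/memory.max");

                    // Hard bound: never let the heap expand past what the
                    // container has left, minus some headroom.
                    long maxExpansion = Math.max(0, limit - usage - HEADROOM);
                    diag.setVMOption("MaxHeapExpansionSize",
                                     Long.toString(maxExpansion));

                    // Soft target: a real controller would also compare the
                    // JVM's GC CPU overhead metric against its target here
                    // and nudge "TargetHeapSize" up or down accordingly.

                    Thread.sleep(1_000);
                } catch (Exception e) {
                    return; // sketch only: give up on any error
                }
            }
        }

        private static long readBytes(String path) throws Exception {
            String s = Files.readString(Path.of(path)).trim();
            return "max".equals(s) ? Long.MAX_VALUE : Long.parseLong(s);
        }
    }

Since both flags would be manageable, an out-of-process agent could
presumably drive them via `jcmd <pid> VM.set_flag` just as well as an
in-process thread.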
> 
> Results:
>  * At Google, we have found that this design works incredibly well in
>    our initial rollout, even for large and complex workloads.
>  * After deploying this to dozens of applications:
>     - Significant memory savings for previously misconfigured jobs
>       (many of which reduced their heap usage by 50% or more)
>     - Significantly reduced occurrences of container OOMs (a 100%
>       reduction in the vast majority of cases)
>     - No correctness issues
>     - No latency regressions*
>  * We plan to deploy AHS across a much wider subset of applications
>    by EOY '22.
> 
> *Caveats:
>  * Enabling this feature might require tuning of the newly introduced
>    default GC CPU overhead target to avoid regressions.
>  * Time spent doing GC for an application may increase significantly
>    (though in practice we have generally seen that, even when this is
>    the case, end-to-end latency does not increase by a noticeable
>    amount).
>  * Enabling AHS results in frequent heap resizings, but we have not
>    seen evidence of any negative effects from these more frequent
>    resizings.
>  * AHS is not necessarily a replacement for proper JVM tuning, but it
>    should generally work better than an untuned or improperly tuned
>    configuration.
>  * AHS is not intended for every possible workload, and there could
>    be pathological cases where AHS results in worse behavior.


