Container-aware heap sizing for OpenJDK
fazil mohamed
fazil.mes53 at gmail.com
Wed Sep 14 11:10:03 UTC 2022
Interesting topic, Adaptable Heap Sizing (AHS). Any chance of it being
considered as a JEP in OpenJDK?
On Wed 14 Sep, 2022, 12:47 AM Jonathan Joo, <jonathanjoo at google.com> wrote:
> Hello hotspot-dev and hotspot-gc-dev,
>
> My name is Jonathan, and I'm working on the Java Platform Team at Google.
> Here, we are working on a project to address Java container memory issues,
> as we noticed that a significant number of Java servers hit container OOM
> issues due to people incorrectly tuning their heap size with respect to the
> container size. Because our containers have other RAM consumers which
> fluctuate over time, it is often difficult to determine a priori what is an
> appropriate Xmx to set for a particular server.
>
> We set about trying to solve this by dynamically adjusting the Java
> heap/gc behavior based on the container usage information that we pass into
> the JVM. We have seen promising results so far, reducing container OOMs by
> a significant amount, and oftentimes also reducing average heap usage (with
> the tradeoff of more CPU time spent doing GC).
>
> Below (under the dotted line) is a more detailed explanation of our
> initial approach. Does this sound like something that may be useful for the
> general OpenJDK community? If so, would some of you be open to further
> discussion? I would also like to better understand what container
> environments look like outside of Google, to see how we could modify our
> approach for the more general case.
>
> Thank you!
>
>
> Jonathan
> ------------------------------------------------------------------------
> Introduction:
>
> Adaptable Heap Sizing (AHS) is a project internal to Google that is meant
> to simplify configuration and improve the stability of applications in
> container environments. The key is that in a containerized environment, we
> have access to container usage and limit information. This can be used as a
> signal to modify Java heap behavior, helping prevent container OOMs.
> Problem:
>
> - Containers at Google must be properly sized to fit not only the JVM
>   heap but other memory consumers as well. These consumers include
>   non-heap Java memory (e.g. native code allocations) and simultaneously
>   running non-Java processes.
> - A common antipattern we see here at Google:
>   - We have an application running into container OOMs.
>   - An engineer raises both the container memory limit and Xmx by the
>     same amount, since there appears to be insufficient memory.
>   - The application hits fewer container OOMs, but is still prone to
>     them, since G1 continues to use most of Xmx.
>   - This results in many jobs being configured with much more RAM than
>     they need, yet still running into container OOM issues.
>
> Hypothesis:
>
> - For preventing container OOMs: why can't heap expansions be bounded by
>   the remaining free space in the container?
> - For preventing the `unnecessarily high Xmx` antipattern: why can't the
>   target heap size be set based on GC CPU overhead?
> - From our work on Adaptable Heap Sizing, it appears they can!
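> The first idea above amounts to simple arithmetic. As a minimal sketch
> (the class name and the 64 MiB safety margin are illustrative assumptions
> on my part, not the actual Google implementation):

```java
// Sketch: bound any heap expansion by the memory still free in the
// container, keeping a safety margin for non-heap consumers.
// Names and the margin value are illustrative assumptions.
public class HeapExpansionBound {
    // Headroom reserved for non-heap consumers (illustrative).
    static final long SAFETY_MARGIN = 64L * 1024 * 1024; // 64 MiB

    /** Maximum number of bytes the heap may grow by right now. */
    static long maxHeapExpansion(long containerLimitBytes, long containerUsageBytes) {
        long free = containerLimitBytes - containerUsageBytes;
        return Math.max(0, free - SAFETY_MARGIN);
    }

    public static void main(String[] args) {
        long limit = 1024L * 1024 * 1024; // 1 GiB container limit
        long usage = 900L * 1024 * 1024;  // 900 MiB currently in use
        System.out.println(maxHeapExpansion(limit, usage));
    }
}
```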
>
> Design:
>
> - We add two manageable flags to the JVM:
>   - Current maximum heap expansion size
>   - Current target heap size
> - A separate thread runs alongside the JVM, querying:
>   - Container memory usage/limits
>   - GC CPU overhead metrics from the JVM
> - This thread uses this information to calculate new values for the two
>   flags and continually updates them at runtime.
> - The `Current maximum heap expansion size` tells the JVM the maximum
>   amount by which it can expand the heap while staying within container
>   limits. This is a hard limit; trying to expand beyond it results in
>   behavior equivalent to hitting the Xmx limit.
> - The `Current target heap size` is a soft target, used to resize the
>   heap (when possible) so as to bring GC CPU overhead toward its target
>   value.
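> As a rough sketch of what the controller thread's update step for the
> soft target might look like (the proportional 5% step, method names, and
> parameters are assumptions on my part, not the actual AHS design):

```java
// Sketch of one AHS control step: grow the soft heap target when GC CPU
// overhead exceeds its target (to reduce GC pressure), shrink it when
// overhead is below target (to return memory to the container).
// The 5% step size is an illustrative assumption.
public class HeapTargetController {
    static final double STEP = 0.05; // 5% resize step (illustrative)

    /**
     * @param currentTargetBytes current soft heap target, in bytes
     * @param gcCpuOverhead      observed fraction of CPU spent in GC (0..1)
     * @param overheadTarget     desired GC CPU overhead fraction (0..1)
     * @param maxExpansionBytes  hard cap derived from container free space
     */
    static long nextTarget(long currentTargetBytes, double gcCpuOverhead,
                           double overheadTarget, long maxExpansionBytes) {
        if (gcCpuOverhead > overheadTarget) {
            // GC is too expensive: grow, but never beyond the container cap.
            long grown = (long) (currentTargetBytes * (1 + STEP));
            return Math.min(grown, currentTargetBytes + maxExpansionBytes);
        } else if (gcCpuOverhead < overheadTarget) {
            // GC is cheap: shrink and give memory back to the container.
            return (long) (currentTargetBytes * (1 - STEP));
        }
        return currentTargetBytes;
    }

    public static void main(String[] args) {
        long target = 512L * 1024 * 1024; // 512 MiB current soft target
        // Overhead above target with ample container headroom: grow by 5%.
        System.out.println(nextTarget(target, 0.10, 0.02, 256L * 1024 * 1024));
        // Overhead below target: shrink by 5%.
        System.out.println(nextTarget(target, 0.01, 0.02, 256L * 1024 * 1024));
    }
}
```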
>
>
> Results:
>
> - At Google, we have found that this design works incredibly well in our
>   initial rollout, even for large and complex workloads.
> - After deploying it to dozens of applications:
>   - Significant memory savings for previously misconfigured jobs (many
>     of which reduced their heap usage by 50% or more)
>   - Significantly fewer container OOMs (a 100% reduction in the vast
>     majority of cases)
>   - No correctness issues
>   - No latency regressions*
> - We plan to deploy AHS across a much wider subset of applications by
>   EOY '22.
>
>
> *Caveats:
>
> - Enabling this feature might require tuning the newly introduced default
>   GC CPU overhead target to avoid regressions.
> - Time spent doing GC for an application may increase significantly
>   (though in practice we have generally seen that even when it does,
>   end-to-end latency does not increase noticeably).
> - Enabling AHS results in frequent heap resizings, but we have not seen
>   evidence of any negative effects from these more frequent resizings.
> - AHS is not necessarily a replacement for proper JVM tuning, but it
>   should generally work better than an untuned or improperly tuned
>   configuration.
> - AHS is not intended for every possible workload, and there could be
>   pathological cases where AHS results in worse behavior.
>