Container-aware heap sizing for OpenJDK

Tue Sep 13 19:52:59 UTC 2022

Hello hotspot-dev and hotspot-gc-dev,

My name is Jonathan, and I'm working on the Java Platform Team at Google.
Here, we are working on a project to address Java container memory issues,
as we noticed that a significant number of Java servers hit container OOM
issues due to people incorrectly tuning their heap size with respect to the
container size. Because our containers have other RAM consumers which
fluctuate over time, it is often difficult to determine a priori what is an
appropriate Xmx to set for a particular server.

We set about trying to solve this by dynamically adjusting the Java heap/GC
behavior based on the container usage information that we pass into the
JVM. We have seen promising results so far, reducing container OOMs by a
significant amount, and oftentimes also reducing average heap usage (with
the tradeoff of more CPU time spent doing GC).

Below (under the dotted line) is a more detailed explanation of our initial
approach. Does this sound like something that may be useful for the general
OpenJDK community? If so, would some of you be open to further discussion?
I would also like to better understand what container environments look
like outside of Google, to see how we could modify our approach for the
more general case.

Thank you!

Jonathan
------------------------------------------------------------------------
Introduction:

Adaptable Heap Sizing (AHS) is a project internal to Google that is meant
to simplify configuration and improve the stability of applications in
container environments. The key is that in a containerized environment, we
have access to container usage and limit information. This can be used as a
signal to modify Java heap behavior, helping prevent container OOMs.

Problem:

   - Containers at Google must be sized to accommodate not only the JVM heap,
     but other memory consumers as well. These consumers include non-heap
     Java memory (e.g. native code allocations) and simultaneously running
     non-Java processes.
   - A common antipattern we see here at Google:
      - An application runs into container OOMs.
      - An engineer raises both the container memory limit and Xmx by the
        same amount, since there appears to be insufficient memory.
      - The application hits fewer container OOMs, but is still prone to
        them, since G1 continues to use most of Xmx.
   - This results in many jobs being configured with much more RAM than they
     need, but still running into container OOM issues.
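
To make the antipattern concrete, here is a small arithmetic sketch. The
numbers and the class and method names are hypothetical, chosen only for
illustration; the point is that raising the container limit and Xmx by the
same amount leaves the memory available to non-heap consumers unchanged once
the heap grows to Xmx.

```java
// Hypothetical numbers, chosen only to illustrate the antipattern.
final class HeadroomExample {
    /** Memory left for non-heap consumers if the heap grows all the way to Xmx. */
    static long nonHeapHeadroomMiB(long containerLimitMiB, long xmxMiB) {
        return containerLimitMiB - xmxMiB;
    }

    public static void main(String[] args) {
        // Before: 4096 MiB container, Xmx of 3584 MiB.
        long before = nonHeapHeadroomMiB(4096, 3584);
        // After: both the limit and Xmx are raised by 2048 MiB.
        long after = nonHeapHeadroomMiB(4096 + 2048, 3584 + 2048);
        // prints "512 MiB before, 512 MiB after"
        System.out.println(before + " MiB before, " + after + " MiB after");
    }
}
```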

Hypothesis:

   - For preventing container OOM: Why can't heap expansions be bounded by
     the remaining free space in the container?
   - For preventing the `unnecessarily high Xmx` antipattern: Why can't
     target heap size be set based on GC CPU overhead?
   - From our work on Adaptable Heap Sizing, it appears they can!
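
The two questions above can be sketched as two small functions. This is a
minimal illustration under stated assumptions, not the AHS implementation:
the class and method names are hypothetical, and both the proportional
update rule and the 0.5x to 2x step clamp are assumptions made here to keep
the example simple.

```java
// Illustrative sketch only; names and the update rule are assumptions.
final class HeapSizingSketch {
    /**
     * Caps a requested heap expansion at the free space left in the container.
     * All values are in bytes. A result of 0 means "do not expand".
     */
    static long boundExpansion(long requestedBytes, long containerLimitBytes,
                               long containerUsageBytes) {
        long free = Math.max(0L, containerLimitBytes - containerUsageBytes);
        return Math.max(0L, Math.min(requestedBytes, free));
    }

    /**
     * Scales the target heap size by the ratio of observed to desired GC
     * overhead: more GC work than desired grows the target, less shrinks it.
     * The per-update clamp (assumed here) damps oscillation.
     */
    static long nextTargetHeap(long currentTargetBytes, double observedOverhead,
                               double targetOverhead) {
        if (targetOverhead <= 0.0) {
            throw new IllegalArgumentException("targetOverhead must be positive");
        }
        double ratio = observedOverhead / targetOverhead;
        double clamped = Math.max(0.5, Math.min(2.0, ratio));
        return Math.round(currentTargetBytes * clamped);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Container has 256 MiB free, so a 512 MiB expansion request is capped.
        System.out.println(boundExpansion(512 * mib, 4096 * mib, 3840 * mib) / mib);
        // Observed overhead is twice the target, so the heap target doubles.
        System.out.println(nextTargetHeap(1024 * mib, 0.10, 0.05) / mib);
    }
}
```

Treating the expansion bound as a hard cap and the target as a soft,
incrementally adjusted value mirrors the distinction drawn in the design
below.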

Design:

   - We add two manageable flags to the JVM:
      - Current maximum heap expansion size
      - Current target heap size
   - A separate thread runs alongside the JVM, querying:
      - Container memory usage/limits
      - GC CPU overhead metrics from the JVM
   - This thread then uses this information to calculate new values for the
     two new JVM flags, and continually updates them at runtime.
   - The `Current maximum heap expansion size` tells the JVM the maximum
     amount by which the heap can be expanded while staying within container
     limits. This is a hard limit, and trying to expand by more than this
     amount results in behavior equivalent to hitting the Xmx limit.
   - The `Current target heap size` is a soft target value, which is used to
     resize the heap (when possible) so as to bring GC CPU overhead toward
     its target value.
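
For readers outside Google, the inputs such a thread needs can be
approximated with standard interfaces. The sketch below is an assumption
about how that might look elsewhere, not a description of our internal
implementation: it reads the Linux cgroup v2 files memory.max and
memory.current, uses accumulated GarbageCollectorMXBean collection time as a
rough proxy for GC CPU overhead, and lists the flags that stock HotSpot
already exposes as manageable. The class and method names are hypothetical,
and the two AHS flags are not named in this message, so none is set here.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.io.IOException;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

// Illustrative sketch only; not the AHS implementation.
final class ContainerSignals {
    /** Parses the contents of a cgroup v2 memory file; "max" means no limit. */
    static OptionalLong parseCgroupValue(String contents) {
        String s = contents.trim();
        if (s.isEmpty() || s.equals("max")) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(s));
    }

    /** Reads a cgroup v2 memory file, or returns empty if it is unavailable. */
    static OptionalLong readCgroupValue(Path file) {
        try {
            return parseCgroupValue(Files.readString(file));
        } catch (IOException | NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Accumulated collection time in milliseconds, summed over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a sampling interval spent in GC, clamped to [0, 1]. */
    static double gcOverhead(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        return Math.min(1.0, Math.max(0.0, (double) gcMillisDelta / wallMillisDelta));
    }

    public static void main(String[] args) {
        // Standard cgroup v2 mount point; other environments may differ.
        Path root = Path.of("/sys/fs/cgroup");
        System.out.println("limit: " + readCgroupValue(root.resolve("memory.max")));
        System.out.println("usage: " + readCgroupValue(root.resolve("memory.current")));
        System.out.println("GC ms so far: " + totalGcTimeMillis());

        // Manageable flags can be changed at runtime via setVMOption(name, value).
        HotSpotDiagnosticMXBean hs =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (VMOption opt : hs.getDiagnosticOptions()) {
            System.out.println(opt.getName() + "=" + opt.getValue());
        }
    }
}
```

A controller thread would sample these values periodically, take deltas
between samples to compute the overhead, and write the resulting values back
through the manageable-flag mechanism.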

Results:

   - At Google, we have found that this design works incredibly well in our
     initial rollout, even for large and complex workloads.
   - After deploying this to dozens of applications:
      - Significant memory savings for previously misconfigured jobs (many
        of which reduced their heap usage by 50% or more)
      - Significantly reduced occurrences of container OOM (100% reduction
        in the vast majority of cases)
      - No correctness issues
      - No latency regressions*
      - We plan to deploy AHS across a much wider subset of applications by
        EOY '22.

*Caveats:

   - Enabling this feature might require tuning of the newly introduced
     default GC CPU overhead target to avoid regressions.
   - Time spent doing GC for an application may increase significantly
     (though in practice we have generally seen that, even when this is the
     case, end-to-end latency does not increase noticeably).
   - Enabling AHS results in frequent heap resizings, but we have not seen
     evidence of any negative effects as a result.
   - AHS is not necessarily a replacement for proper JVM tuning, but should
     generally work better than an untuned or improperly tuned configuration.
   - AHS is not intended for every possible workload, and there could be
     pathological cases where AHS results in worse behavior.