Container-aware heap sizing for OpenJDK
Jonathan Joo
jonathanjoo at google.com
Thu Sep 15 20:51:55 UTC 2022
Hi Ashutosh,
Thank you for the questions!
How is the container memory limit being determined? Does that process take
> into account non-Java processes running in the container as well?
>
In general, the container memory limit at Google is determined through a
trial-and-error process that takes into account all memory consumers. We
generally start with lower limits and increase them as traffic grows. We
also have some features that automatically estimate these values based on
the workloads.
It makes sense to raise the container memory limit, but what is the need to
> raise the Xmx by the same amount?
>
It can be difficult for someone not experienced in Java configuration to
tell whether the memory issues they are seeing are solved by providing more
Java heap, or by keeping the Java heap the same while increasing the
container limit. The way we configure Java memory by default is by setting
`Xmx = container_limit - non_heap_memory`, so if someone raises only the
container limit without also increasing `non_heap_memory`, Xmx increases as
well. So I would say that increasing Xmx is not always the intended action,
but in general it demonstrates that it can be tricky to set everything up
the right way.
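To make that arithmetic concrete, here is a minimal sketch of the default
sizing rule. This is not Google's actual configuration code; the class name,
the `NON_HEAP_BYTES` reservation, and all constants are illustrative
assumptions.

```java
// Minimal sketch of the default sizing rule described above:
//   Xmx = container_limit - non_heap_memory
// All names and constants here are illustrative, not Google's actual values.
public class HeapSizing {
    static final long MIB = 1024 * 1024;

    // Hypothetical reservation for non-heap consumers (native allocations,
    // metaspace, thread stacks, and so on).
    static final long NON_HEAP_BYTES = 512 * MIB;

    static long xmxFor(long containerLimitBytes) {
        return containerLimitBytes - NON_HEAP_BYTES;
    }

    public static void main(String[] args) {
        long limit = 4096 * MIB;
        System.out.println("-Xmx" + (xmxFor(limit) / MIB) + "m"); // -Xmx3584m

        // Raising only the container limit (leaving the non-heap reservation
        // unchanged) raises Xmx by the same amount:
        long raisedLimit = limit + 1024 * MIB;
        System.out.println("-Xmx" + (xmxFor(raisedLimit) / MIB) + "m"); // -Xmx4608m
    }
}
```

The point is just that Xmx is derived from the container limit, so changing
the limit alone moves Xmx with it.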
If so, then bounding the heap expansion would not cover all the cases.
>
Yes, that is true - a sudden increase in non-heap memory usage can still
result in a container OOM. However, in this scenario, AHS should decrease
the proposed heap size (based on seeing that the container is getting full),
and therefore run more GCs to try to keep the heap as low as possible to
accommodate this. But yes, there are definitely pathological cases (for
example, a sudden large spike in non-heap usage) where AHS cannot shrink
the heap fast enough and we still run into container OOMs.
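As a rough illustration of that feedback, the bound on heap expansion might
be computed from container headroom along these lines. This is only a sketch
under assumed names and constants (the 5% safety margin in particular is
made up), not the actual AHS implementation.

```java
// Rough sketch of the feedback described above: as the container fills up,
// the bound on heap expansion shrinks toward zero, which forces the GC to
// collect more aggressively instead of growing the heap. This is NOT the
// actual AHS code; all names and constants are illustrative.
public class AhsSketch {
    static final long MIB = 1024 * 1024;

    /**
     * Maximum amount the heap may expand by while staying inside the
     * container: whatever room is left after current container usage,
     * minus a small safety margin.
     */
    static long maxHeapExpansion(long containerLimit, long containerUsage) {
        long safetyMargin = containerLimit / 20; // hypothetical 5% margin
        return Math.max(0, containerLimit - containerUsage - safetyMargin);
    }

    public static void main(String[] args) {
        long limit = 4096 * MIB;
        // Plenty of headroom: expansion is generous.
        System.out.println(maxHeapExpansion(limit, 2048 * MIB) / MIB + " MiB");
        // Container nearly full: the expansion bound drops to zero, so
        // further allocation pressure triggers GC rather than heap growth.
        System.out.println(maxHeapExpansion(limit, 3994 * MIB) / MIB + " MiB"); // 0 MiB
    }
}
```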
I guess it really depends on how much room is left for Java heap expansion,
> which brings us back to right-sizing the container memory limit.
>
Agreed - once AHS is active, then we only really have to worry about
right-sizing the container memory limit. But this is easier to do than
trying to right-size both the container memory limit as well as the
distribution of Java heap/non-heap!
Thank you for taking the time to look through this and spark some
discussion. Please feel free to ask any other questions you may have!
~ Jonathan
On Wed, Sep 14, 2022 at 10:35 AM Ashutosh Mehra <asmehra at redhat.com> wrote:
> Hi Jonathan,
>
> Thanks for sharing your work here.
> I have a few questions to understand the idea better.
>
>
>> -
>>
>> Containers at Google must be properly sized to not only the JVM heap,
>> but other memory consumers as well. These consumers include non-heap Java
>> (e.g. native code allocations), and simultaneously running non-Java
>> processes.
>>
>>
> How is the container memory limit being determined? Does that process take
> into account non-Java processes running in the container as well?
>
>
>> -
>>
>> We have an application running into container OOMs.
>> -
>>
>> An engineer raises both container memory limit and Xmx by the same
>> amount, since there appears to be insufficient memory.
>>
>>
> If I understand it correctly, the problem appears to be that when the JVM
> tries to expand the heap within Xmx limits, as there are other non-Java
> processes consuming memory,
> the total used memory of the container reaches the container limit and
> results in container OOMs.
> It makes sense to raise the container memory limit, but what is the need
> to raise the Xmx by the same amount?
>
>
>> -
>>
>> For preventing container OOM: Why can't heap expansions be bounded by
>> the remaining free space in the container?
>>
>>
> I am wondering if Java heap expansion is always the cause of container
> OOM?
> As you mentioned earlier, there are other non-Java processes and other
> components in Java that consume native heap. I believe they too can be the
> source of container OOM.
> If so, then bounding the heap expansion would not cover all the cases.
>
> Time spent doing GC for an application may increase significantly (though
>> generally we've seen in practice that even if this is the case, end-to-end
>> latency does not increase a noticeable amount)
>>
>
> I guess it really depends on how much room is left for Java heap
> expansion, which brings us back to right-sizing the container memory limit.
>
> Regards,
> Ashutosh Mehra
>
> On Tue, Sep 13, 2022 at 3:54 PM Jonathan Joo <jonathanjoo at google.com>
> wrote:
>
>> Hello hotspot-dev and hotspot-gc-dev,
>>
>> My name is Jonathan, and I'm working on the Java Platform Team at Google.
>> Here, we are working on a project to address Java container memory issues,
>> as we noticed that a significant number of Java servers hit container OOM
>> issues due to people incorrectly tuning their heap size with respect to the
>> container size. Because our containers have other RAM consumers which
>> fluctuate over time, it is often difficult to determine a priori what is an
>> appropriate Xmx to set for a particular server.
>>
>> We set about trying to solve this by dynamically adjusting the Java
>> heap/gc behavior based on the container usage information that we pass into
>> the JVM. We have seen promising results so far, reducing container OOMs by
>> a significant amount, and oftentimes also reducing average heap usage (with
>> the tradeoff of more CPU time spent doing GC).
>>
>> Below (under the dotted line) is a more detailed explanation of our
>> initial approach. Does this sound like something that may be useful for the
>> general OpenJDK community? If so, would some of you be open to further
>> discussion? I would also like to better understand what container
>> environments look like outside of Google, to see how we could modify our
>> approach for the more general case.
>>
>> Thank you!
>>
>>
>> Jonathan
>> ------------------------------------------------------------------------
>> Introduction:
>>
>> Adaptable Heap Sizing (AHS) is a project internal to Google that is meant
>> to simplify configuration and improve the stability of applications in
>> container environments. The key is that in a containerized environment, we
>> have access to container usage and limit information. This can be used as a
>> signal to modify Java heap behavior, helping prevent container OOMs.
>> Problem:
>>
>> -
>>
>> Containers at Google must be properly sized to not only the JVM heap,
>> but other memory consumers as well. These consumers include non-heap Java
>> (e.g. native code allocations), and simultaneously running non-Java
>> processes.
>> -
>>
>> Common antipattern we see here at Google:
>> -
>>
>> We have an application running into container OOMs.
>> -
>>
>> An engineer raises both container memory limit and Xmx by the same
>> amount, since there appears to be insufficient memory.
>> -
>>
>> The application has reduced container OOMs, but is still prone to
>> them, since G1 continues to use most of Xmx.
>> -
>>
>> This results in many jobs being configured with much more RAM than
>> they need, but still running into container OOM issues.
>>
>> Hypothesis:
>>
>> -
>>
>> For preventing container OOM: Why can't heap expansions be bounded by
>> the remaining free space in the container?
>> -
>>
>> For preventing the `unnecessarily high Xmx` antipattern: Why can't
>> target heap size be set based on GC CPU overhead?
>> -
>>
>> From our work on Adaptable Heap Sizing, it appears they can!
>>
>> Design:
>>
>> -
>>
>> We add two manageable flags in the JVM
>> -
>>
>> Current maximum heap expansion size
>> -
>>
>> Current target heap size
>> -
>>
>> A separate thread runs alongside the JVM, querying:
>> -
>>
>> Container memory usage/limits
>> -
>>
>> GC CPU overhead metrics from the JVM.
>> -
>>
>> This thread then uses this information to calculate new values for
>> the two new JVM flags, and continually updates them at runtime.
>> -
>>
>> The `Current maximum heap expansion size` informs the JVM of the
>> maximum amount by which the heap can expand while staying within container
>> limits. This is a hard limit: trying to expand beyond it results in
>> behavior equivalent to hitting the Xmx limit.
>> -
>>
>> The `Current target heap size` is a soft target value, which is used to
>> resize the heap (when possible) so as to bring GC CPU overhead toward its
>> target value.
>>
>>
>> Results:
>>
>> -
>>
>> At Google, we have found that this design works incredibly well in
>> our initial rollout, even for large and complex workloads.
>> -
>>
>> After deploying this to dozens of applications:
>> -
>>
>> Significant memory savings for previously misconfigured jobs (many
>> of which reduced their heap usage by 50% or more)
>> -
>>
>> Significantly reduced occurrences of container OOM (a 100% reduction
>> in the vast majority of cases)
>> -
>>
>> No correctness issues
>> -
>>
>> No latency regressions*
>> -
>>
>> We plan to deploy AHS across a much wider subset of applications
>> by EOY '22.
>>
>>
>> *Caveats:
>>
>> - Enabling this feature might require tuning of the newly introduced
>> default GC CPU overhead target to avoid regressions.
>> -
>>
>> Time spent doing GC for an application may increase significantly
>> (though generally we've seen in practice that even if this is the case,
>> end-to-end latency does not increase a noticeable amount)
>> -
>>
>> Enabling AHS results in frequent heap resizings, but we have not seen
>> evidence of any negative effects as a result of these more frequent heap
>> resizings.
>> -
>>
>> AHS is not necessarily a replacement for proper JVM tuning, but
>> should generally work better than an untuned or improperly tuned
>> configuration.
>> -
>>
>> AHS is not intended for every possible workload, and there could be
>> pathological cases where AHS results in worse behavior.
>>
>>