[RFC 8285277] - How should the JVM handle container memory limits

Thomas Stüfe thomas.stuefe at gmail.com
Thu Apr 28 07:50:05 UTC 2022


>
>
> >>
> >> - If we want to efficiently use all the memory we paid for in a
> >> container, we need to tune the VM parameters manually
> >>
> >> - But tuning is very difficult and error prone. With the 100mb heap
> >> setting, my app could easily be killed if it uses too much memory in
> >> malloc, native buffers, or even the JIT. Currently we don't fail early
> >> (no check of -Xms, -Xmx, etc, against available memory), and don't fail
> >> predictably.
> > +1 to doing some more sanity checks on JVM startup when some of those
> > settings exceed physical memory.
>
> I think this makes sense. We need to decide what the behavior should be:
>
> - what flags to check
> - print warnings?
> - abort the VM?
> - override the settings to conform to available memory?
> - always do this, or only when running inside a container? If the
> latter, we will have the problem of JDK-8261242
>
>
Please only a warning!

There may be valid reasons for starting with a massively overextended
(uncommitted) heap, e.g. sparse databases.

Maybe this is just a documentation problem? -Xmx only reserves address
space. In default overcommit mode, you can happily run with -Xmx10000G on a
modern Linux box. -Xms is subject to the overcommit heuristics of the
underlying system, which by default allow somewhat more than mem+swap but
not endlessly so.

If you want an early error when running with too large a heap, you can get
one today by switching off the overcommit heuristics
(vm.overcommit_memory=2), setting overcommit_ratio to something like 100
(if you are the only big process in that container), and starting the VM
with -Xmx == -Xms. Am I overlooking something?

--

Some more thoughts:

One problem is correctly estimating the amount of native memory we need.
That is very difficult. It depends on many things, e.g.:
- number of threads, and average depth of thread stacks
- number of classes loaded, and class load churn (Metaspace fragmentation)
- the collector used (each collector has additional data structures)
- malloc granularity and frequency. malloc overhead can be abysmal,
starting with the NMT malloc header down to the glibc overhead. I have
observed ratios of 4-6x between malloc payload and actual malloc cost in
poorly written applications.

Most of this is known, of course; this is nothing new.

The problem with all these scenarios is that there is no fallback. In most
cases with native memory, if you run out, you are done. What would be nice
is to pre-allocate and pre-commit certain areas which are normally
allocated on demand, to trigger certain OOMs early. We cannot do this for
every allocation, but some ideas are:
1) Modify Metaspace to preallocate memory - at the moment it is tuned to
allocate on demand only, but that behavior may be counterproductive in
containers. One example could be moving away from the current
growable-list-of-mappings model to something simpler, like what the class
space is today (one large preallocated mapping).
2) For C-heap, you could change os::malloc() to not use the C-heap but an
internal, pre-allocated buffer, which you could then allocate and commit
upfront. The problem with that approach is that you would need to roll your
own allocator, which is far from trivial if you want production-grade
quality (reliable, fast in multithreaded scenarios, reasonably
memory-conservative - although glibc does not shine there either, to be
honest).
3) A simpler alternative to (2) would be to estimate the C-heap usage (we
would have to do this anyway) and, on VM start, preallocate that amount to
test whether we can, then release it again. Or keep it and release it on
demand, like a balloon, as memory pressure grows - though that would incur
unnecessary memory costs.

Cheers, Thomas


More information about the hotspot-dev mailing list